Crawling Spiders with Ken Alger
Let's use the Python library Scrapy to create a spider to crawl the web.
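For context, this video picks up inside an existing Scrapy project. A project generated with `scrapy startproject` typically looks something like the layout below; the project name here is an assumption, but the spiders folder is where our new file will go.

```
horseland/               # project root (name assumed for illustration)
├── scrapy.cfg           # project configuration file
└── horseland/           # the project's Python module
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/         # our new spider file goes in here
        └── __init__.py
```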
Inside the spiders folder, let's create a new file to crawl our sample horse site. So spiders > File > New File > Python, horse.py. To start with, we need to import scrapy. Then we'll write a class, which we'll call HorseSpider, and it will inherit from scrapy.Spider. Now we need to give our HorseSpider a name; let's call it ike, after the horse in Charlotte's Web. Spider names must be unique within a Scrapy project so that Scrapy knows which spider to run. We'll use the name to run our spider in just a bit.

There are two methods we need to write in here. The first is start_requests, which defines the initial requests to be made and, if applicable, how to follow links. So we'll define start_requests; for now, we'll just pass. The other method is parse, which tells the spider how extracted data is to be parsed.

Inside start_requests, we provide a list of URLs that we want to process. So urls, and we pass in a list containing our index.html and our mustang.html pages. The whole URL is treehouse-projects.github.io/horseland/index.html. We'll paste that in and change it to mustang. Then we need to return a scrapy.Request for each URL. We'll use a list comprehension, which creates a new list of requests by looping through each of our URLs; there's more on list comprehensions in the teacher's notes. So we return scrapy.Request with url set to url and callback set to self.parse, for url in urls. This line loops through our urls list and registers the parse method as the callback for each request.

Let's update that parse method to do something. We could do a lot of things inside this method; how you parse the data on a site will be highly dependent on the purpose of your project, since every use case can be a little different. For now, let's just save the entire HTML file. So we'll grab the url from response.url. This response object represents an HTTP response from the request we made in start_requests; it's downloaded by Scrapy's downloader and fed to the spider for processing. See the teacher's notes for additional documentation on Scrapy's Response object. From the url, we want the name of the specific page, so we'll split it on the last slash and use that to build a file name for our saved horse pages. We'll format the file name with our page, print out what the URL is, and then save our page by writing the entire response body to the file. Then we'll print out the saved file name.

Nice. Now, in a terminal window, we navigate to our spider's directory and tell Scrapy to crawl using our spider's name. So we do scrapy crawl ike. If we look at the output in our terminal, we can see that the spider looked for our robots.txt file, which it didn't find, since the site doesn't have one; see the 404 code there. After the robots.txt check, the pages that we included in our urls list were fetched and saved by the parse method. There are the URLs, and there are the file names. Very nice.

Great work on writing your first spider. We saw the two methods that a Scrapy spider needs: start_requests and parse. We put a list of URLs in the start_requests method and have it loop through that list and process each URL with the parse method.
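Putting all of those pieces together, here's a sketch of what the finished horse.py might look like. The class, spider name, URLs, and overall flow come straight from the video; the exact file-name pattern and the print messages are my assumptions and may differ slightly from the instructor's code.

```python
import scrapy


class HorseSpider(scrapy.Spider):
    # Spider names must be unique within a Scrapy project;
    # this is the name we pass to `scrapy crawl`.
    name = "ike"

    def start_requests(self):
        # The two pages on our sample horse site that we want to process.
        urls = [
            "https://treehouse-projects.github.io/horseland/index.html",
            "https://treehouse-projects.github.io/horseland/mustang.html",
        ]
        # List comprehension: build one scrapy.Request per URL and
        # register self.parse as the callback for each response.
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        # Split on the last slash to get the page name, e.g. "index.html".
        page = url.split("/")[-1]
        # Assumed file-name pattern; the video's exact name may differ.
        filename = "horses-{}".format(page)
        print("URL is: {}".format(url))
        # Save the entire raw response body to disk.
        with open(filename, "wb") as f:
            f.write(response.body)
        print("Saved file {}".format(filename))
```

With this file saved in the spiders folder, running scrapy crawl ike from the project directory starts the spider, and the saved HTML files land in whatever directory the command was run from.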
We could have our parse method do something more powerful than just saving the entire file, but this is a nice start. Next up, though, let's see how to write a spider that will crawl more URLs than just the ones we give it.