The Endless Web
6:58 with Ken Alger
Let's further explore how to crawl the web.
With our first spider, Ike, we saw how to process a static list of URLs. This is great if you know all the URLs of the pages you want to scrape. What happens, though, when you want to start following links that are included on the page itself? Scrapy has some helpful classes for handling these situations: LinkExtractor and CrawlSpider.

A word of caution here. Before we crawl down this path, we need to be aware of the overwhelming amount of data and sites that are connected on the web. Writing a spider that gets and follows all the links on each followed web page can lead to a program that never ends. Also, with the idea that any given site is only six clicks away from any other site, sending a spider on a massive crawl can potentially lead to some sites that are way off our originally intended topic. We should look at setting up some rules for our spider to follow as well.

The CrawlSpider class from Scrapy is set up a bit differently than the spider we wrote in the last video. It has the same overall concept, but instead of a start_requests method, we define allowed_domains and start_urls. Then we'll define a set of rules for the spider to follow. This lets us tell the spider which links to match or not, whether to follow them or not, and how to parse the information.

Let's take a look at how to implement these concepts in a new spider. Let's create a new file in our spiders folder and call it crawler.py. We need a few imports. So from scrapy.linkextractors, import LinkExtractor. And from scrapy.spiders, we want to import CrawlSpider and Rule.

Next, we define our class, this time inheriting from CrawlSpider. We'll name this one after another famous horse, Whirlaway. Perhaps not quite the same as Ike from Charlotte's Web, but a winner in his own right.
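To make the caution about endless crawls concrete, here is a minimal sketch in plain Python (not Scrapy) of the two safeguards that keep a link-following crawl finite: remembering which pages have been seen, and refusing to leave an allowed domain. The link graph, the example.com URLs, and the crawl function are all hypothetical stand-ins for real pages. The exact-match domain check is a simplification of what Scrapy's allowed_domains does.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages:
# each URL maps to the links "found" on that page.
TOY_WEB = {
    "https://example.com/": ["https://example.com/a", "https://other.example.org/"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": ["https://example.com/a"],
    "https://other.example.org/": ["https://example.com/"],
}


def crawl(start_url, allowed_domain, web):
    """Breadth-first walk that stays on one domain and visits each page once."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in web.get(url, []):
            if urlparse(link).netloc != allowed_domain:
                continue  # off-site link: skip it instead of wandering away
            if link in seen:
                continue  # already queued or visited: avoids looping forever
            seen.add(link)
            queue.append(link)
    return order


visited = crawl("https://example.com/", "example.com", TOY_WEB)
print(visited)
```

Without the seen set, the cycle between the home page and /a would keep the loop running indefinitely. Without the domain check, the crawl would drift onto other.example.org.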
When using the CrawlSpider class, we can set a few parameters for it to follow. Let's start with an allowed_domains limit, to prevent our spider from getting too far out of control. So we do allowed_domains, and we can pass in a list. For ours we'll just do treehouse-projects.github.io.

Next, we define a place to start. So we do start_urls, and we want treehouse-projects.github.io/horse-land.

Now we can define our rules. We'll use the LinkExtractor class and pass in a regular expression of links to follow or ignore. So our rules will be a Rule, with a LinkExtractor and our regular expression. Then we tell our rule how to parse the information by assigning the callback parameter to the method name. Let's use parse_horses, so callback='parse_horses'. Then we tell the rule if it's okay to follow the links: follow=True. And let's clean this up a little bit and drop these down onto new lines.

Now we can define our parsing method, parse_horses, which takes self and the response. We'll grab the page URL and the page title. We can use CSS to select specific page elements. The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML or HTML elements and allow you to run further queries to fine-tune the selection or extract the data. For this example, let's just print out the URL and title. We'll print Page URL, with format(url), and we'll print the page title.

Go to the terminal, and we'll ask Scrapy to crawl our site: scrapy crawl whirlaway. We need to be in the right directory. scrapy crawl whirlaway. And there's our information. Scroll up here. So again, we see that we got a 404 when it was looking for robots.txt. Page URL, page title, it's kind of messy. We can clean that up a little bit.
We only want to extract the text directly inside the title element, so let's change that up here. So title, we want text, and we want to extract it. And we'll run it again. Come up here, there's our page title. That's much better. Also note here in the output that Scrapy found those external links but filtered them out. Thanks, Scrapy.

Well done, you've written two different spiders now: one that follows links that we provide, and one that extracts links from a site and follows them based on rules we set. These are both very powerful tools for scraping data from the web. Being able to get the information is a major task, and we've seen how easy Scrapy makes it. In the next stage, let's take a look at how to handle some other common tasks, such as handling forms and interacting with APIs.