The Endless Web

Name: The Endless Web
Uploaded: 2018-08-22T07:36:01-04:00
Duration: 418 s
Description: Let's further explore how to crawl the web.

6:58 with Ken Alger

With our first spider, Ike, we saw how to process a static list of URLs. 0:00

This is great if you know all the URLs of the pages you want to scrape. 0:05

What happens though, 0:10

when you want to start following links that are included on the page itself? 0:11

Scrapy has some helpful tools for handling these situations, 0:14

with the LinkExtractor and CrawlSpider classes. 0:18

A word of caution here: before we crawl down this path, we need to be aware of 0:22

the overwhelming amount of data and sites that are connected on the web. 0:26

Writing a spider that gets and follows all the links on each followed web page 0:30

can lead to a program that never ends. 0:35

Also, with the idea that any given site is only six clicks away from any other site, 0:38

sending a spider on a massive crawling task can potentially 0:45

lead to some sites that are way off our originally intended topic. 0:49

We should look at setting up some rules for our spider to follow as well. 0:53
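To make the "never ends" risk concrete, here is a minimal sketch that uses only the Python standard library. The in-memory link graph, the page names, and the `crawl` function are all hypothetical and are not part of Scrapy. It shows two simple rules, remembering visited pages and capping the crawl depth, that keep a link-following loop from running forever.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
# Pages "a" and "b" link to each other, so a naive crawler would loop forever.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": ["a"],
}


def crawl(start, max_depth):
    """Breadth-first crawl that never revisits a page or exceeds max_depth."""
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # rule: do not follow links past the depth limit
        for link in TOY_WEB.get(page, []):
            if link not in visited:  # rule: skip pages we have already seen
                visited.add(link)
                queue.append((link, depth + 1))
    return order


print(crawl("a", max_depth=1))   # ['a', 'b', 'c']
print(crawl("a", max_depth=10))  # ['a', 'b', 'c', 'd', 'e']
```

Scrapy already filters duplicate requests for us, and it has a DEPTH_LIMIT setting for capping depth, so this sketch only illustrates the idea.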

The CrawlSpider class from Scrapy is set up a bit differently than the spider we wrote 0:57

in the last video. 1:02

It has the same overall concept, but instead of a start_requests method, 1:03

we define allowed_domains and start_urls. 1:08

Then we'll define a set of rules for a spider to follow. 1:12

This lets us tell the spider which links to match or not, follow or not, and 1:15

how to parse the information. 1:19

Let's take a look at how to implement these concepts in a new spider. 1:21

Let's create a new file in our spider's folder and call it crawler.py. 1:26

Crawler.py, and we need a few imports. 1:33

So from scrapy.linkextractors, 1:36

import LinkExtractor. 1:42

And from scrapy.spiders, 1:46

we want to import CrawlSpider and Rule. 1:51

Next, we define our class, this time inheriting from CrawlSpider. 1:55

We'll name this one after another famous horse, Whirlaway. 2:05

Perhaps not quite the same as Ike from Charlotte's Web, but 2:09

a winner in his own right. 2:13

When using the CrawlSpider class, we can set a few parameters for it to follow. 2:19

Let's start with an allowed_domains limit, 2:23

to prevent our spider from getting too far out of control. 2:26

So we do allowed_domains, and we can pass in a list. For 2:29

ours, we'll just use treehouse-projects.github.io. 2:34

Next, we define a place to start. 2:44

So we do start_urls, and 2:46

we want treehouse-projects.github.io/horse-land. 2:53

Now we can define our rules. 3:04

We'll use the LinkExtractor class and 3:06

pass in a regular expression of links to follow or ignore. 3:09

So our rules will be a Rule with a 3:13

LinkExtractor and our regular expression. 3:16
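As a rough illustration of filtering links with regular expressions, here is a standard-library sketch. It is a simplified stand-in, not Scrapy's implementation, and the patterns, URLs, and the `should_follow` helper are all hypothetical.

```python
import re

# Hypothetical patterns, chosen only for illustration.
ALLOW = [re.compile(r"/horse-land/")]  # links we want to follow
DENY = [re.compile(r"\.pdf$")]         # links we want to ignore


def should_follow(url, allow=ALLOW, deny=DENY):
    """Follow a URL if it matches an allow pattern and no deny pattern."""
    if allow and not any(p.search(url) for p in allow):
        return False
    return not any(p.search(url) for p in deny)


print(should_follow("https://example.com/horse-land/mustang.html"))  # True
print(should_follow("https://example.com/horse-land/guide.pdf"))     # False
print(should_follow("https://example.com/about.html"))               # False
```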

Then we tell our rule how to parse the information by assigning the callback 3:24

parameter to the method name. 3:28

Let's use parse_horses, so callback='parse_horses'. 3:30

Then we tell the rule if it's okay to follow the links. 3:39

follow=True. 3:45

And let's clean this up a little bit. 3:45

Drop these down onto new lines. Now we can define our parsing method. 3:50

parse_horses will take self and the response. We'll grab the page URL 3:59

and the page title. 4:11

We can use CSS to select specific page elements. 4:12

The result of running response.css('title') 4:21

is a list-like object called SelectorList, which represents 4:24

a list of Selector objects that wrap around XML or HTML elements 4:29

and allow you to run further queries to refine the selection or 4:35

extract the data. 4:39

For this example, let's just print out the URL and title. 4:40

We'll print the page URL using 4:43

format(url), and we'll print the page title. 4:50

Go to the terminal, and we'll ask Scrapy to crawl our site: scrapy crawl Whirlaway. 5:02

We need to be in the right directory. 5:15

scrapy crawl Whirlaway. 5:30

And there's our information. 5:34

Scroll up here. 5:35

So again, we see that we got a 404 when it was looking for the robots.txt. 5:42

Page URL, page title. It's kinda messy. 5:49

We can clean that up a little bit. 5:53

We only want to extract the text elements directly inside the title element, so 5:55

let's change that up here. 5:59

So for title, we want ::text, and we want to extract it. 6:01

And we'll run it again. 6:10

Come up here, there's our page title, that's much better. 6:15

Also note here in the output that Scrapy found those external links but 6:18

filtered them out. 6:23

Thanks, Scrapy. 6:24
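The idea behind that filtering can be sketched with the standard library. This is a simplified illustration of a domain check, not Scrapy's actual code. The `is_allowed` helper and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["treehouse-projects.github.io"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical URLs, chosen only for illustration.
print(is_allowed("https://treehouse-projects.github.io/horse-land/mustang.html"))  # True
print(is_allowed("https://en.wikipedia.org/wiki/Horse"))                           # False
```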

Well done, you've written two different spiders now. 6:28

One that follows links that we provide, and 6:31

one that extracts links from a site and follows them based on rules we set. 6:34

These are both very powerful tools for scraping data from the web. 6:40

Being able to get the information is a major task, and 6:44

we've seen how easy Scrapy makes it. 6:47

In the next stage, let's take a look at how to handle some other common tasks, 6:50

such as handling forms and interacting with APIs. 6:55
