Everyone Loves Charlotte (6:57) with Ken Alger
We've seen how to scrape data from a single page. Now let's see how we can capture links on one page and follow them to process additional pages.
[MUSIC] 0:00 We've seen how we can scrape data from a single page and 0:04 isolate all the links on that page. 0:07 We can utilize that and start moving off a single page and 0:10 onto multiple pages, or crawling the web. 0:13 The internet consists of over 4.5 billion pages connected together with hyperlinks. 0:16 Web crawling is, for our purposes, the practice of moving between these connected 0:23 web pages and crawling along the paths of hyperlinks. 0:28 This is where the power of automation comes into play. 0:31 We can write our application to look at a page, scrape the data, 0:34 then follow the links, if any, on that page and scrape the next page, and so on. 0:38 Most webpages have both internal and external links on them. 0:44 Before we saddle up again and 0:48 get going in our code, let's think about web crawling at a high level. 0:50 We need to scrape a given page and generate a list of links to follow. 0:54 It's often a good idea to determine if a link is internal or external and 0:58 keep track of them separately. 1:03 We'll go through the list of links and separate them into internal and 1:04 external lists. 1:08 We'll check to see if we already have the link recorded, and if so, 1:10 it will be ignored. 1:13 If we don't have a record of seeing a particular link, we'll add it to our list. 1:15 We'll also look at how to leverage the power of regular expressions to match 1:19 patterns in things like URLs. 1:23 If you need a refresher on regular expressions in Python, 1:25 and I know I occasionally do, check the Teacher's Notes. 1:28 When we last looked at scraper.py, 1:33 we were getting all of the links from our Horse Land main page. 1:35 Let's see how we can round up these links and put them to use. 1:39 Looking at the output from our previous run of scraper.py, we're getting this 1:43 internal link here for mustang.html and then all of these external links. 1:48 We can separate those out and follow them.
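The high-level plan above can be sketched in a few lines of plain Python. This is a minimal, standard-library-only illustration of the bookkeeping described: collect the hrefs on a page, split them into internal and external lists, and skip any link we've already recorded. The sample HTML here is invented for illustration; the real course pages live at treehouse-projects.github.io/horse-land.

```python
from html.parser import HTMLParser

# Invented sample page for illustration only.
SAMPLE_HTML = """
<a href="mustang.html">Mustang</a>
<a href="https://en.wikipedia.org/wiki/Horse">Horses</a>
<a href="index.html">Home</a>
<a href="mustang.html">Mustang again</a>
"""

class LinkCollector(HTMLParser):
    """Collects every href found in an anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def sort_links(links):
    """Split links into internal and external lists, ignoring duplicates."""
    internal, external = [], []
    for link in links:
        # A full URL to another site is external; a relative path is internal.
        bucket = external if link.startswith("http") else internal
        if link not in bucket:   # already recorded? ignore it
            bucket.append(link)
    return internal, external

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
internal, external = sort_links(collector.links)
print(internal)   # ['mustang.html', 'index.html']
print(external)   # ['https://en.wikipedia.org/wiki/Horse']
```

The duplicate mustang.html link is recorded only once, which is exactly the check that will keep the crawler from looping later on.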
1:54 First, let's make a new file. 1:56 A new Python file; let's call it soup_follow_scraper. 2:02 I told you I'm bad at naming things. 2:08 We can minimize this. 2:10 And we'll bring in our imports: from urllib.request, 2:13 we want urlopen; from bs4, import BeautifulSoup. 2:20 And we'll be using regular expressions, 2:28 so let's import re to take care of that. 2:30 Let's make an internal links function that will take a link URL, internal_links. 2:36 We'll need to open our URL, so we define html = urlopen(). 2:47 Inside here, we'll pass in the start of the URL and 2:53 format it with the internal URL we scraped from the page. 2:56 Our URL, in our case, 3:01 is treehouse-projects.github.io/horse-land, plus 3:04 our string formatter. 3:12 And we'll format it with linkURL. 3:19 Next, we create our Beautiful Soup object. 3:23 Soup is BeautifulSoup; pass in our html, and 3:27 we'll use the same html parser we've been using, html.parser. 3:32 And we want to return the links from the soup object, 3:37 soup.find, and we want the anchor links. 3:39 We'll look for the anchor tags and use the href attribute of the find method 3:44 with a regular expression to get just the links that, in our case, end in .html. 3:50 It's inside here, re.compile; 3:56 our pattern is .html. 4:02 Let's put it to use. 4:06 So if dunder name equals dunder main, 4:09 we want our urls to be internal_links. 4:16 And we'll pass in our starting URL to the internal_links method; 4:20 in our case, it's index.html. 4:24 And then we'll do a while loop, 4:28 while the length of our urls is greater than 0. 4:31 We want to capture the URL href. 4:36 Now we could do a lot of processing here, but for 4:44 now let's just print out the page information we get, print(page). 4:46 And then we'll add a little bit of formatting, 4:50 a couple of new lines in there, and 4:57 then we'll call our internal links method again for the next link, 5:00 internal_links(page). Let's run it and see it in action.
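The heart of the internal_links() function built in the video is BeautifulSoup's soup.find("a", href=re.compile(".html")). As a rough stand-in that runs without bs4 installed or a network connection, this sketch pulls the first .html href out of a page's markup with the same kind of regular expression; the HTML snippet is invented for illustration. Note one detail worth knowing: in a regex, "." matches any character, so a stricter pattern escapes it as "\.html".

```python
import re

# Invented snippet standing in for a fetched page.
PAGE_HTML = '<a href="mustang.html">Mustang</a> <a href="https://example.com">Off-site</a>'

def first_internal_link(html):
    """Return the first href ending in .html, mimicking the soup.find() call."""
    match = re.search(r'href="([^"]*\.html)"', html)
    return match.group(1) if match else None

print(first_internal_link(PAGE_HTML))  # mustang.html
```

The external link to example.com is skipped because its href doesn't end in .html, which is the same filtering the video's re.compile pattern performs inside BeautifulSoup.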
5:09 Well, there we have it. 5:22 It's doing what we asked, but it's in an infinite loop. 5:24 index.html is finding the link to mustang.html, which is finding 5:29 the link back to index.html, which is, well, you get the point. 5:34 Let's add in a list to keep track of our pages. Call it site_links. 5:40 And then we'll adjust our while loop. 5:52 So if page not in site_links. 5:56 And then we'll add the pages to our list, 6:00 site_links.append(page). 6:04 We can indent all that. 6:08 Give us some more space. 6:12 So otherwise, we'll just break. 6:16 And let's run it again. 6:19 Page is not defined. 6:21 Let's pull that out. 6:25 I started my if statement too soon. 6:30 There we go, and we get the links that we were expecting. 6:34 External links are handled in a similar fashion: you find the base URL path, 6:37 and then, with regex, define the pattern you're looking for and follow the links. 6:42 I'll saddle you with the responsibility to give it a try and 6:48 post your solution in the community. 6:52 Don't worry, I'm sure you can rein it in. 6:54
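The fix above can be demonstrated without touching the network. In this sketch the "site" is an in-memory dict mapping each page to the internal links found on it, mirroring the index.html and mustang.html cycle from the video. The site_links list records visited pages; checking it before following a link is what breaks the infinite loop. (One small variation: where the video breaks out of the loop on a repeated page, this version just skips it, which generalizes to sites with more pages.)

```python
# Stand-in for the Horse Land site: page name -> internal links on that page.
SITE = {
    "index.html": ["mustang.html"],
    "mustang.html": ["index.html"],   # links straight back: the cycle
}

def crawl(start):
    site_links = []        # pages we've already visited
    to_visit = [start]
    while len(to_visit) > 0:
        page = to_visit.pop(0)
        if page not in site_links:
            site_links.append(page)              # record the page...
            to_visit.extend(SITE.get(page, []))  # ...and queue its links
    return site_links

print(crawl("index.html"))  # ['index.html', 'mustang.html']
```

Without the `if page not in site_links` check, index.html and mustang.html would keep re-queuing each other forever; with it, the crawl visits each page exactly once and terminates.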