Everyone Loves Charlotte (6:57) with Ken Alger
We've seen how to scrape data from a single page. Now let's see how we can capture links on one page and follow them to process additional pages.
[MUSIC] 0:00 We've seen how we can scrape data from a single page and 0:04 isolate all the links on that page. 0:07 We can utilize that and start moving off a single page and 0:10 onto multiple pages, or crawling the web. 0:13 The internet consists of over 4.5 billion pages connected together with hyperlinks. 0:16 Web crawling is, for our purposes, the practice of moving between these connected 0:23 web pages and crawling along the paths of hyperlinks. 0:28 This is where the power of automation comes into play. 0:31 We can write our application to look at a page, scrape the data, 0:34 then follow the links, if any, on that page and scrape the next page, and so on. 0:38 Most webpages have both internal and external links on them. 0:44 Before we saddle up again and 0:48 get going in our code, let's think about web crawling at a high level. 0:50 We need to scrape a given page and generate a list of links to follow. 0:54 It's often a good idea to determine if a link is internal or external and 0:58 keep track of them separately. 1:03 We'll go through the list of links and separate them into internal and 1:04 external lists. 1:08 We'll check to see if we already have the link recorded, and if so, 1:10 it will be ignored. 1:13 If we don't have a record of seeing a particular link, we'll add it to our list. 1:15 We'll also look at how to leverage the power of regular expressions to match 1:19 patterns in things like URLs. 1:23 If you need a refresher on regular expressions in Python, 1:25 and I know I occasionally do, check the Teacher's Notes. 1:28 When we last looked at scraper.py, 1:33 we were getting all of the links from our Horse Land main page. 1:35 Let's see how we can round up these links and put them to use. 1:39 Looking at the output from our previous run of scraper.py, we're getting this 1:43 internal link here for mustang.html and then all of these external links. 1:48 We can separate those out and follow them.
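The high-level plan above can be sketched in a few lines of plain Python. This is a minimal, standard-library-only illustration of the bookkeeping described: collect the hrefs on a page, split them into internal and external lists, and skip any link we've already recorded. The sample HTML here is invented for illustration; the real course pages live at treehouse-projects.github.io/horse-land.

```python
from html.parser import HTMLParser

# Invented sample page for illustration only.
SAMPLE_HTML = """
<a href="mustang.html">Mustang</a>
<a href="https://en.wikipedia.org/wiki/Horse">Horses</a>
<a href="index.html">Home</a>
<a href="mustang.html">Mustang again</a>
"""

class LinkCollector(HTMLParser):
    """Collects every href found in an anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def sort_links(links):
    """Split links into internal and external lists, ignoring duplicates."""
    internal, external = [], []
    for link in links:
        # A full URL to another site is external; a relative path is internal.
        bucket = external if link.startswith("http") else internal
        if link not in bucket:   # already recorded? ignore it
            bucket.append(link)
    return internal, external

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
internal, external = sort_links(collector.links)
print(internal)   # ['mustang.html', 'index.html']
print(external)   # ['https://en.wikipedia.org/wiki/Horse']
```

The duplicate mustang.html link is recorded only once, which is exactly the check that will keep the crawler from looping later on.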
1:54 First, let's make a new file. 1:56 A new Python file; let's call it soup_follow_scraper. 2:02 I told you I'm bad at naming things. 2:08 We can minimize this. 2:10 And we'll bring in our imports: from urllib.request, 2:13 we want urlopen; from bs4, import BeautifulSoup. 2:20 And we'll be using regular expressions, 2:28 so let's import re to take care of that. 2:30 Let's make an internal links function that will take a link URL, internal_links. 2:36 We'll need to open our URL, so we define html = urlopen(). 2:47 Inside here, we'll pass in the start of the URL and 2:53 format it with the internal URL we scraped from the page. 2:56 Our URL, in our case, 3:01 is treehouse-projects.github.io/horse-land, plus 3:04 our string formatter. 3:12 And we'll format it with linkURL. 3:19 Next, we create our Beautiful Soup object. 3:23 Soup is BeautifulSoup; pass in our html, and 3:27 we'll use the same html parser we've been using, html.parser. 3:32 And we want to return the links from the soup object, 3:37 soup.find, and we want the anchor links. 3:39 We'll look for the anchor tags and use the href attribute of the find method 3:44 with a regular expression to get just the links that, in our case, end in .html. 3:50 It's inside here, re.compile; 3:56 our pattern is .html. 4:02 Let's put it to use. 4:06 So if dunder name equals dunder main, 4:09 we want our urls to be internal_links. 4:16 And we'll pass in our starting URL to the internal_links method; 4:20 in our case, it's index.html. 4:24 And then we'll do a while loop, 4:28 while the length of our urls is greater than 0. 4:31 We want to capture the URL href. 4:36 Now we could do a lot of processing here, but for 4:44 now let's just print out the page information we get, print(page). 4:46 And then we'll add a little bit of formatting, 4:50 a couple of new lines in there, and 4:57 then we'll call our internal links method again for the next link, 5:00 internal_links(page). Let's run it and see it in action.
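The heart of the internal_links() function built in the video is BeautifulSoup's soup.find("a", href=re.compile(".html")). As a rough stand-in that runs without bs4 installed or a network connection, this sketch pulls the first .html href out of a page's markup with the same kind of regular expression; the HTML snippet is invented for illustration. Note one detail worth knowing: in a regex, "." matches any character, so a stricter pattern escapes it as "\.html".

```python
import re

# Invented snippet standing in for a fetched page.
PAGE_HTML = '<a href="mustang.html">Mustang</a> <a href="https://example.com">Off-site</a>'

def first_internal_link(html):
    """Return the first href ending in .html, mimicking the soup.find() call."""
    match = re.search(r'href="([^"]*\.html)"', html)
    return match.group(1) if match else None

print(first_internal_link(PAGE_HTML))  # mustang.html
```

The external link to example.com is skipped because its href doesn't end in .html, which is the same filtering the video's re.compile pattern performs inside BeautifulSoup.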
5:09 Well, there we have it. 5:22 It's doing what we asked, but it's in an infinite loop. 5:24 index.html is finding the link to mustang.html, which is finding 5:29 the link back to index.html, which is, well, you get the point. 5:34 Let's add in a list to keep track of our pages. Call it site_links. 5:40 And then we'll adjust our while loop. 5:52 So if page not in site_links. 5:56 And then we'll add the pages to our list, 6:00 site_links.append(page). 6:04 We can indent all that. 6:08 Give us some more space. 6:12 So otherwise, we'll just break. 6:16 And let's run it again. 6:19 Page is not defined. 6:21 Let's pull that out. 6:25 I started my if statement too soon. 6:30 There we go, and we get the links that we were expecting. 6:34 External links are handled in a similar fashion: you find the base URL path, 6:37 and then, with regex, define the pattern you're looking for and follow the links. 6:42 I'll saddle you with the responsibility to give it a try and 6:48 post your solution in the community. 6:52 Don't worry, I'm sure you can rein it in. 6:54
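The fix above can be demonstrated without touching the network. In this sketch the "site" is an in-memory dict mapping each page to the internal links found on it, mirroring the index.html and mustang.html cycle from the video. The site_links list records visited pages; checking it before following a link is what breaks the infinite loop. (One small variation: where the video breaks out of the loop on a repeated page, this version just skips it, which generalizes to sites with more pages.)

```python
# Stand-in for the Horse Land site: page name -> internal links on that page.
SITE = {
    "index.html": ["mustang.html"],
    "mustang.html": ["index.html"],   # links straight back: the cycle
}

def crawl(start):
    site_links = []        # pages we've already visited
    to_visit = [start]
    while len(to_visit) > 0:
        page = to_visit.pop(0)
        if page not in site_links:
            site_links.append(page)              # record the page...
            to_visit.extend(SITE.get(page, []))  # ...and queue its links
    return site_links

print(crawl("index.html"))  # ['index.html', 'mustang.html']
```

Without the `if page not in site_links` check, index.html and mustang.html would keep re-queuing each other forever; with it, the crawl visits each page exactly once and terminates.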