We've seen how we can scrape data from a single page and isolate all the links on that page. We can build on that and start moving off a single page and onto multiple pages, or crawling the web. The internet comprises over 4.5 billion pages connected together with hyperlinks. Web crawling is, for our purposes, the practice of moving between these connected web pages, crawling along the paths of hyperlinks. This is where the power of automation comes into play. We can write our application to look at a page, scrape the data, then follow the links, if any, on that page and scrape the next page, and so on.

Most web pages have both internal and external links on them. Before we saddle up again and get going in our code, let's think about web crawling at a high level. We need to scrape a given page and generate a list of links to follow. It's often a good idea to determine whether a link is internal or external and keep track of them separately. We'll go through the list of links and separate them into internal and external lists. We'll check to see if we already have a link recorded, and if so, it will be ignored. If we don't have a record of seeing a particular link, we'll add it to our list.

We'll also look at how to leverage the power of regular expressions to account for things like URL patterns. If you need a refresher on regular expressions in Python, and I know I occasionally do, check the teacher's notes.

When we last looked at scraper.py, we were getting all of the links from our Horse Land main page. Let's see how we can round up those links and put them to use. Looking at the output from our previous run of scraper.py, we're getting this internal link here for mustang.html, and then all of these external links. We can separate those out and follow them.

First, let's make a new file. A new Python file; let's call it soup_follow_scraper. I told you I'm bad at naming things. We can minimize this. And we'll bring in our imports: from urllib.request, we want urlopen; from bs4, import BeautifulSoup. And we'll be using regular expressions, so let's import re to take care of that.
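For reference, the top of soup_follow_scraper.py now reads as follows; this is a straightforward transcription of the imports just described, with the comments added here for context:

    # soup_follow_scraper.py
    from urllib.request import urlopen  # fetches a page over HTTP(S)
    from bs4 import BeautifulSoup       # parses the fetched HTML
    import re                           # regular expressions for matching link patterns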
Let's make an internal links function that will take a link URL: internal_links. We'll need to open our URL, so we'll define our html with urlopen. Inside here, we'll pass in the start of the URL and format it with the internal URL we scraped from the page. The URL in our case is treehouse-projects.github.io/horse-land, plus our string formatter, and we'll format it with the linkURL.

Next, we create our Beautiful Soup object: soup is BeautifulSoup, passing in our html, and we'll use the same parser we've been using, html.parser. And we want to return the link from the soup object with soup.find. We'll look for the anchor tags, and use the href argument of the find method with a regular expression to get just the links that, in our case, end in .html. So inside here, re.compile, and our pattern is .html.

Let's put it to use. So, if dunder name equals dunder main, we want our urls to come from internal_links, and we'll pass our starting URL to the internal_links function; in our case, it's index.html. And then we'll do a while loop: while the length of our urls is greater than 0, we want to capture the URL's href. Now, we could do a lot of processing here, but for now let's just print out the page information we get: print(page). And then we'll add a little bit of formatting, a couple of new lines in there, and then we'll call our internal_links function again for the next link: internal_links(page). Let's run it and see it in action.

Well, there we have it. It's doing what we asked, but it's in an infinite loop. index.html is finding the link to mustang.html, which is finding the link back to index.html, which is, well, you get the point. Let's add a list to keep track of our pages; call it site_links. And then we'll adjust our while loop: if page not in site_links, we'll add the page to our list with site_links.append(page). We can indent all that. Give us some more space. So otherwise, we'll just break. And let's run it again. Page is not defined; let's pull that out.
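Assembled from the steps above, the finished script looks something like this. It's a reconstruction from the narration, so small details, such as anchoring the regex with $ to match links that end in .html and the exact separator printed between pages, are best-effort guesses rather than the video's code verbatim:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    # Pages we've already visited, so we don't crawl in circles
    site_links = []

    def internal_links(linkURL):
        # Join the site root with the relative link scraped from the page
        html = urlopen(
            'https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
        soup = BeautifulSoup(html, 'html.parser')
        # Return the first anchor whose href ends in .html
        # (the . is a regex wildcard here, which is close enough for this site)
        return soup.find('a', href=re.compile('.html$'))

    if __name__ == '__main__':
        urls = internal_links('index.html')
        # Assumes every page on the demo site has at least one matching link
        while len(urls) > 0:
            page = urls.attrs['href']
            if page not in site_links:
                site_links.append(page)
                print(page)
                print('\n')
                # Follow the link we just found and keep crawling
                urls = internal_links(page)
            else:
                # We've seen this page before; stop instead of looping forever
                break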
Started my if statement too soon. There we go, and we get the links that we were expecting. External links are handled in a similar fashion: you find the base URL path, and then, with regex, define the pattern you're looking for and follow the links. I'll saddle you with the responsibility of giving it a try and posting your solution in the community. Don't worry, I'm sure you can rein it in.
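If you'd like a starting point for that exercise, here is one possible shape for it, a minimal sketch rather than the course's solution; the function name external_links and the lookahead-based pattern are illustrative choices, not code from the video:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    BASE = 'treehouse-projects.github.io'

    def external_links(page_url):
        # Open and parse the page, just like internal_links does
        html = urlopen(page_url)
        soup = BeautifulSoup(html, 'html.parser')
        # Match absolute links (http:// or https://) that don't point back at our own site
        pattern = re.compile(r'^https?://(?!{})'.format(re.escape(BASE)))
        return soup.find_all('a', href=pattern)

Each returned tag's href can then be checked against a separate external list before you follow it, mirroring the site_links bookkeeping above.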