1 00:00:00,400 --> 00:00:04,350 Beautiful soup, so rich and green, waiting in a hot tureen. 2 00:00:04,350 --> 00:00:06,710 Who for such dainties would not stoop? 3 00:00:06,710 --> 00:00:09,000 Soup of the evening, beautiful soup. 4 00:00:09,000 --> 00:00:10,890 Soup of the evening, beautiful soup. 5 00:00:12,360 --> 00:00:16,680 This is the start of a song the Mock Turtle sings in Lewis Carroll's book, 6 00:00:16,680 --> 00:00:19,160 Alice's Adventures in Wonderland. 7 00:00:19,160 --> 00:00:22,190 I couldn't beat Gene Wilder's singing of it, so I didn't even try. 8 00:00:23,230 --> 00:00:28,500 It's also where the HTML parsing package Beautiful Soup gets its name. 9 00:00:28,500 --> 00:00:31,550 It's designed to scrape web pages, and provides a toolkit for 10 00:00:31,550 --> 00:00:34,300 dissecting a document and extracting what you need. 11 00:00:36,030 --> 00:00:39,900 One of the features that Beautiful Soup provides is the ability to use 12 00:00:39,900 --> 00:00:44,020 different parsers to create the Python object version of the page. 13 00:00:44,020 --> 00:00:46,040 Some are faster than others, and 14 00:00:46,040 --> 00:00:51,070 some are better at handling the messy HTML pages we looked at earlier. 15 00:00:51,070 --> 00:00:54,380 We'll be using a good middle-of-the-road parser right now, but 16 00:00:54,380 --> 00:00:56,670 check the teacher's notes for other popular options. 17 00:00:58,180 --> 00:01:01,690 Let's see how to get Beautiful Soup set up and use it to parse a web page. 18 00:01:03,540 --> 00:01:06,290 Let's head into our IDE and get started. 19 00:01:06,290 --> 00:01:08,390 I'm using PyCharm. 20 00:01:08,390 --> 00:01:10,750 So first, we'll need to install Beautiful Soup. 21 00:01:12,500 --> 00:01:13,826 Go to Preferences. 22 00:01:15,682 --> 00:01:16,920 We want to install a package. 23 00:01:18,320 --> 00:01:24,560 We want to look for beautifulsoup4, and install the package.
24 00:01:28,680 --> 00:01:34,820 If you're in a different IDE, you can also use tools such as pip or Pipenv. 25 00:01:34,820 --> 00:01:38,640 If you aren't familiar with Pipenv, it's similar to pip but 26 00:01:38,640 --> 00:01:40,490 offers some additional features. 27 00:01:40,490 --> 00:01:42,210 Check the teacher's notes for more information. 28 00:01:42,210 --> 00:01:45,550 And now we want to put it to use in a new file. 29 00:01:45,550 --> 00:01:47,945 Let's call it scraper.py. 30 00:01:50,733 --> 00:01:53,558 Scraper.py, and we'll do our imports. 31 00:01:53,558 --> 00:01:59,240 So from urllib.request, we'll import urlopen. 32 00:02:00,280 --> 00:02:03,450 This will handle the server request to our URL. 33 00:02:06,505 --> 00:02:10,460 From bs4, we want to import BeautifulSoup. 34 00:02:13,307 --> 00:02:19,547 Next, we pass the URL we want to scrape into the urlopen function. 35 00:02:21,771 --> 00:02:27,232 We'll assign it to a variable called html 36 00:02:27,232 --> 00:02:32,859 = urlopen, and our site's URL, which 37 00:02:32,859 --> 00:02:40,139 is https://treehouse-projects.github.io 38 00:02:40,139 --> 00:02:44,800 /horse-land/index.html. 39 00:02:44,800 --> 00:02:47,350 Then we create our Beautiful Soup object. 40 00:02:49,240 --> 00:02:52,645 Call it soup = BeautifulSoup. 41 00:02:52,645 --> 00:02:58,630 In here, we pass in the HTML, and call read to read it. 42 00:02:58,630 --> 00:03:02,550 So html.read(), then we pass in our parser. 43 00:03:03,560 --> 00:03:09,565 In our case, we're using the html.parser that's included with Python 3: 44 00:03:09,565 --> 00:03:13,180 html.parser. 45 00:03:13,180 --> 00:03:15,930 And giddy up, we're set to read our page. 46 00:03:15,930 --> 00:03:17,440 Let's print it out to see what we get. 47 00:03:19,220 --> 00:03:21,640 We'll print(soup) and run our script. 48 00:03:26,648 --> 00:03:32,110 Great, that works, but nothing is indented like it's supposed to be in HTML.
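Assembled in one place, the script built up so far looks like the sketch below. The inline HTML snippet is an assumption added here so the sketch runs without network access; against the live page you would use the commented-out urlopen lines instead, exactly as described in the video.

```python
from urllib.request import urlopen  # handles the server request to our URL
from bs4 import BeautifulSoup

URL = "https://treehouse-projects.github.io/horse-land/index.html"

# Live version, as in the video (needs network access):
#   html = urlopen(URL)
#   soup = BeautifulSoup(html.read(), "html.parser")

# Offline stand-in: a hypothetical snippet with the same shape of markup.
page = """<html><head><title>Horse Land</title></head>
<body>
  <div class="featured">Horse of the Month</div>
  <ul id="imageGallery"></ul>
</body></html>"""

# html.parser is the parser that ships with Python 3.
soup = BeautifulSoup(page, "html.parser")
print(soup)
```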
49 00:03:32,110 --> 00:03:35,050 We can do better with the prettify method. 50 00:03:35,050 --> 00:03:35,710 We do soup.prettify, 51 00:03:40,691 --> 00:03:41,390 and rerun it. 52 00:03:45,247 --> 00:03:48,975 That's much better and easier to read, which is what prettify does. 53 00:03:48,975 --> 00:03:51,520 It simply makes things easier to read. 54 00:03:53,050 --> 00:03:54,970 Now, did you notice something in here? 55 00:03:54,970 --> 00:03:57,040 Our image gallery isn't being displayed. 56 00:04:01,259 --> 00:04:04,953 We have our unordered list with our ID of image gallery, 57 00:04:04,953 --> 00:04:10,440 which from a previous video we know contains all of the images of our horses. 58 00:04:10,440 --> 00:04:14,180 Here in the HTML, though, we're not seeing the list items; 59 00:04:14,180 --> 00:04:17,940 our images are being populated using some JavaScript. 60 00:04:17,940 --> 00:04:23,030 Beautiful Soup doesn't wait for JavaScript to run before it scrapes a page. 61 00:04:23,030 --> 00:04:26,280 We'll see how to handle these situations in a little bit. 62 00:04:26,280 --> 00:04:30,450 For now, let's look at additional Beautiful Soup features. 63 00:04:30,450 --> 00:04:34,146 We can drill down to get specific pieces of the site, like the page title. 64 00:04:36,343 --> 00:04:42,110 We'll do soup.title. There it is. 65 00:04:42,110 --> 00:04:45,016 How about a page element, like a div? 66 00:04:45,016 --> 00:04:45,792 soup.div. 67 00:04:48,417 --> 00:04:52,530 Well, shucks, that only gets us the first div on the page. 68 00:04:52,530 --> 00:04:55,360 Let's get them all and loop through them to print them out. 69 00:04:55,360 --> 00:04:59,550 There is a find_all method that allows us to easily do that. 70 00:04:59,550 --> 00:05:00,345 So come up here. 71 00:05:00,345 --> 00:05:05,225 Say divs = soup.find_all, and 72 00:05:05,225 --> 00:05:10,783 we want div elements, then for div in divs, 73 00:05:15,602 --> 00:05:16,770 we want to print our div.
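The steps above, from prettify through the find_all loop, can be sketched as follows. The inline snippet is a hypothetical stand-in for the live page, assumed here so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the live page.
page = """<html><head><title>Horse Land</title></head>
<body>
  <div class="featured">Horse of the Month</div>
  <div class="primary">Welcome</div>
</body></html>"""

soup = BeautifulSoup(page, "html.parser")

print(soup.prettify())  # same markup, but indented and easier to read
print(soup.title)       # drills down to the page title
print(soup.div)         # dot access only gets the FIRST div on the page

# find_all returns every matching element, so we can loop over them all.
divs = soup.find_all("div")
for div in divs:
    print(div)
```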
74 00:05:22,541 --> 00:05:26,440 And run it again, there's our divs. 75 00:05:26,440 --> 00:05:30,510 We can filter some of these out by passing in class values. 76 00:05:30,510 --> 00:05:34,320 Let's just get the one that has this featured class name. 77 00:05:34,320 --> 00:05:37,470 We come back here to our website and open the developer tools. 78 00:05:38,650 --> 00:05:41,840 The featured section here is the one with the horse of the month. 79 00:05:41,840 --> 00:05:48,660 So here, we'll pass in class, and we want featured. 80 00:05:53,689 --> 00:05:58,720 There, now we only have the divs with featured in their class name. 81 00:05:58,720 --> 00:06:01,850 This narrows down a specific area for us to scrape. 82 00:06:01,850 --> 00:06:06,600 We'll explore more of this find_all and its related method, find, in the next video.
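In keyword form, Beautiful Soup spells that class filter class_ (with a trailing underscore, because class is a reserved word in Python). A minimal sketch, again using an inline snippet as a stand-in for the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the live page.
page = """<body>
  <div class="featured">Horse of the Month</div>
  <div class="primary">Welcome</div>
</body>"""

soup = BeautifulSoup(page, "html.parser")

# class is a Python keyword, so Beautiful Soup uses class_ for this filter.
featured = soup.find_all("div", class_="featured")
for div in featured:
    print(div)

# find is the related method: it returns just the first match (or None).
first = soup.find("div", class_="featured")
print(first)
```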