Beautiful Soup

6:07 with Ken Alger

Introducing the Python web scraping package, Beautiful Soup.

Teacher's Notes

Additional Resources

pipenv Workshop
Beautiful Soup web site
Alice's Adventures in Wonderland by Lewis Carroll

Parsers

Beautiful soup, so rich and green, waiting in a hot tureen. 0:00

Who for such dainties would not stoop? 0:04

Soup of the evening, beautiful soup. 0:06

Soup of the evening, beautiful soup. 0:09

This is the start of a song the mock turtle sings in Lewis Carroll's book, 0:12

Alice's Adventures in Wonderland. 0:16

I couldn't beat Gene Wilder's singing of it, so I didn't even try. 0:19

It's also where the HTML parsing package Beautiful Soup gets its name. 0:23

It's designed to scrape web pages, and provides a tool kit for 0:28

dissecting a document, and extracting what you need. 0:31

One of the features that Beautiful Soup provides is the ability to utilize 0:36

different parsers to create the Python object version of the page. 0:39

There are some that are faster than others, and 0:44

some that are better at transforming the messy HTML pages we looked at earlier. 0:46

We'll be using a good middle of the road parser right now, but 0:51

check the teacher's notes for other popular options. 0:54
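As a rough illustration of the parser choice described above, the sketch below feeds some deliberately messy markup to Beautiful Soup. The `messy_html` string is made up for this example, and the commented-out parser names only work if the matching packages are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical messy markup: the <p> and <b> tags are never closed.
messy_html = "<html><body><p>Horses are <b>great</body></html>"

# The second argument names the parser. "html.parser" ships with Python,
# so it needs no extra install.
soup = BeautifulSoup(messy_html, "html.parser")
print(soup.p.get_text())

# Other parser names Beautiful Soup accepts, if the package is installed:
#   BeautifulSoup(messy_html, "lxml")      # requires the lxml package
#   BeautifulSoup(messy_html, "html5lib")  # requires the html5lib package
```

Different parsers may repair the same broken markup in different ways, so it is worth sticking with one parser throughout a project.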

Let's see how to get Beautiful Soup set up and use it to parse a webpage. 0:58

Let's head into our IDE and get started. 1:03

I'm using PyCharm. 1:06

So first, we'll need to install Beautiful Soup. 1:08

Go to Preferences. 1:12

We want to install a package. 1:15

We want to look for beautifulsoup4, and install the package. 1:18

If you're in a different IDE, you can also use tools such as pip or Pipenv. 1:28

If you aren't familiar with Pipenv, it's similar to pip but 1:34

offers some additional features. 1:38

Check the teacher's notes for more information. 1:40

And now we want to put it to use in a new file. 1:42

Let's call it scraper.py. 1:45

scraper.py, and we'll do our imports. 1:50

So from urllib.request, we'll import urlopen. 1:53

This will handle the server request to our URL. 2:00

From bs4, we want to import BeautifulSoup. 2:06

Next, we pass the URL of the page we want to scrape into the urlopen function. 2:13

We'll assign it to a variable called html 2:21

= urlopen, and our site's URL, which 2:27

is https://treehouse-projects.github.io/horse-land/index.html. 2:32

Then we create our Beautiful Soup object. 2:44

Call it soup = BeautifulSoup. 2:49

In here, we pass in the HTML, and call read to read it. 2:52

So html.read, then we pass in our parser. 2:58

In our case, we're using the html.parser that's included with Python 3, 3:03

html.parser. 3:09

And giddy up, we're set to read our page. 3:13

Let's print it out to see what we get. 3:15

We'll print(soup) and run our script. 3:19
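Put together, the steps so far might look something like the sketch below. Wrapping the request in a `fetch_soup` function is an assumption made for this example (the video writes the lines at the top level of scraper.py), and the call itself is left commented out because it needs network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "https://treehouse-projects.github.io/horse-land/index.html"


def fetch_soup(url):
    """Request the page and parse the response body with Python's built-in parser."""
    html = urlopen(url)  # handles the server request
    return BeautifulSoup(html.read(), "html.parser")


# Needs network access, so it is not run here:
# soup = fetch_soup(URL)
# print(soup)
```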

Great, that works, but nothing is indented like it's supposed to be in HTML. 3:26

We can do better with the prettify method. 3:32

We do soup.prettify(), 3:35

and rerun it. 3:40

That's much better and easier to read, which is what prettify does. 3:45

It simply makes things easier to read. 3:48
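To see the difference without fetching the page, here is a small sketch that runs prettify on an inline string. The HTML, including the "Horse Land" title, is invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document standing in for the real page.
html = "<html><head><title>Horse Land</title></head><body><div><p>Hello</p></div></body></html>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with one tag per line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```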

Now, did you notice something in here? 3:53

Our image gallery isn't being displayed. 3:54

We have our unordered list with our ID of image gallery, 4:01

which from a previous video we know contains all of the images of our horses. 4:04

Here in the HTML though, we're not seeing the list items, 4:10

because our images are being populated using some JavaScript. 4:14

Beautiful Soup doesn't wait for JavaScript to run before it scrapes a page. 4:17

We'll see how to handle these situations in a little bit. 4:23
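The sketch below imitates that situation with a made-up snippet: an empty list plus a script that would fill it in a browser. The `imageGallery` id and the script body are assumptions for illustration, not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <ul> is empty until a browser runs the script.
html = """
<ul id="imageGallery"></ul>
<script>
  const item = document.createElement("li");
  document.getElementById("imageGallery").appendChild(item);
</script>
"""

soup = BeautifulSoup(html, "html.parser")
gallery = soup.find("ul", id="imageGallery")

# Beautiful Soup only parses the text it is given; the script never executes,
# so the list has no <li> children.
items = gallery.find_all("li")
print(len(items))
```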

For now, let's look at additional Beautiful Soup features. 4:26

We can drill down to get specific pieces of the site, like the page title. 4:30

We'll do soup.title. There it is. 4:36

How about an element on the page, like a div? 4:42

soup.div. 4:45

Well, shucks, that only gets us the first div on the page. 4:48
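Here is a minimal sketch of that attribute-style access, using an invented two-div document rather than the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a title and two divs.
html = """<html><head><title>Horse Land</title></head>
<body><div id="first">One</div><div id="second">Two</div></body></html>"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # just the text inside it
print(soup.div)           # only the first <div> in the document
```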

Let's get them all and loop through them to print them out. 4:52

There is a find_all method that allows us to easily do that. 4:55

So come up here. 4:59

Say divs = soup.find_all, and 5:00

we want div elements, for div in divs. 5:05

We want to print our div. 5:15

And run it again, there's our divs. 5:22
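The loop described above can be sketched like this, again on made-up markup so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical page body with three divs.
html = """
<div>First</div>
<div>Second</div>
<div>Third</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, not just the first one.
divs = soup.find_all("div")
for div in divs:
    print(div)
```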

We can filter some of these out by passing in class values. 5:26

Let's just get the one that has this featured class name. 5:30

We come back here to our website and do the developer tools. 5:34

Featured section here is the one with a horse of the month. 5:38

So here, we'll pass in class, and we want featured. 5:41

There, now we only have divs that have featured in the class name. 5:53

This narrows down a specific area for us to scrape. 5:58
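Filtering by class might look something like the sketch below. The transcript does not show the exact call, so both forms here are assumptions about how it could be written, and the markup is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two of the three divs carry the "featured" class.
html = """
<div class="featured"><h2>Horse of the Month</h2></div>
<div class="card">Other content</div>
<div class="card featured">Also featured</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pass the class as an attribute dictionary...
featured = soup.find_all("div", {"class": "featured"})

# ...or use the class_ keyword ("class" alone is reserved in Python).
same = soup.find_all("div", class_="featured")

for div in featured:
    print(div)
```

Either form matches any div whose class list includes featured, even when other classes are present.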

We'll explore more of find_all and its related method, find, in the next video. 6:01
