Introducing the Python web scraping package, Beautiful Soup.
Additional Resources
- pipenv Workshop
- Beautiful Soup web site
- Alice's Adventures in Wonderland by Lewis Carroll
Parsers
Beautiful soup, so rich and
green, waiting in a hot tureen.
0:00
Who for such dainties would not stoop?
0:04
Soup of the evening, beautiful soup.
0:06
Soup of the evening, beautiful soup.
0:09
This is the start of a song the mock
turtle sings in Lewis Carroll's book,
0:12
Alice's Adventures in Wonderland.
0:16
I couldn't beat Gene Wilder's
singing of it, so I didn't even try.
0:19
It's also where the HTML parsing
package Beautiful Soup gets its name.
0:23
It's designed to scrape web pages,
and provides a tool kit for
0:28
dissecting a document, and
extracting what you need.
0:31
One of the features that Beautiful Soup
provides is the ability to utilize
0:36
different parsers to create the Python
object version of the page.
0:39
There are some that
are faster than others, and
0:44
some that are better at transforming
the messy HTML pages we looked at earlier.
0:46
We'll be using a good middle of
the road parser right now, but
0:51
check the teacher's notes for
other popular options.
0:54
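A minimal sketch of what choosing a parser looks like in code. The parser name is the second argument to BeautifulSoup. Here lxml is assumed as the preferred third-party option, and the helper name make_soup and the sample markup are made up for illustration. If lxml isn't installed, Beautiful Soup raises FeatureNotFound, so the sketch falls back to the html.parser that ships with Python.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the one bundled with Python.
        return BeautifulSoup(markup, "html.parser")


# Hypothetical sample page, not the course site.
sample = "<html><head><title>Horse Land</title></head><body><p>Hello</p></body></html>"
soup = make_soup(sample)
print(soup.title.string)
```

Either parser gives the same result for clean markup like this; the differences tend to show up in speed and in how messy HTML gets repaired.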
Let's see how to get Beautiful Soup
set up and use it to parse a webpage.
0:58
Let's head into our IDE and get started.
1:03
I'm using PyCharm.
1:06
So first,
we'll need to install Beautiful Soup.
1:08
Go to Preferences.
1:12
We want to install a package.
1:15
We want to look for beautifulsoup4,
and install the package.
1:18
If you're in a different IDE, you can
also use tools such as pip or pipenv.
1:28
If you aren't familiar with pipenv,
it's similar to pip but
1:34
offers some additional features.
1:38
Check the teacher's notes for
more information.
1:40
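One thing that trips people up: the package is installed as beautifulsoup4 (for example with python -m pip install beautifulsoup4), but it's imported as bs4. A quick way to confirm the install worked, sketched with a hypothetical helper named is_installed:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# The import name is bs4, not beautifulsoup4.
print(is_installed("bs4"))
```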
And now we want to put
it to use in a new file.
1:42
Let's call it scraper.py.
1:45
scraper.py, and we'll do our imports.
1:50
So from urllib.request,
we'll import urlopen.
1:53
This will handle the server
request to our URL.
2:00
From bs4, we want to import BeautifulSoup.
2:06
Next, we pass the URL of the page we
want to scrape into the urlopen function.
2:13
We'll assign it a variable called html
2:21
= urlopen, and our site's URL, which
2:27
is
https://treehouse-projects.github.io/horse-land/index.html.
2:32
2:40
Then we create our Beautiful Soup object.
2:44
Call it soup = BeautifulSoup.
2:49
In here, we pass in the HTML,
and call read to read it.
2:52
So html.read, then we pass in our parser.
2:58
In our case, we're using the html.parser
that's included with Python 3,
3:03
html.parser.
3:09
And giddy up, we're set to read our page.
3:13
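Put together, the steps so far might look roughly like this in scraper.py. The function names fetch_soup and parse_page are made up for this sketch; the video writes the same calls at the top level of the file. So that the sketch runs without a network connection, it parses a small in-memory page (also an assumption) and leaves the real request as a comment.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "https://treehouse-projects.github.io/horse-land/index.html"


def parse_page(raw_html):
    """Build a BeautifulSoup object using Python's built-in html.parser."""
    return BeautifulSoup(raw_html, "html.parser")


def fetch_soup(url):
    """Request the page, read the response body, and parse it."""
    with urlopen(url) as html:
        return parse_page(html.read())


# In scraper.py this would be: soup = fetch_soup(URL)
# Here a tiny stand-in page is parsed instead, so no request is made.
sample = b"<html><head><title>Sample</title></head><body><h1>Hi</h1></body></html>"
soup = parse_page(sample)
print(soup)
```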
Let's print it out to see what we get.
3:15
We'll print(soup) and run our script.
3:19
Great, that works, but nothing is indented
like it's supposed to be in HTML.
3:26
We can do better with the prettify method.
3:32
We do soup.prettify(),
3:35
and rerun it.
3:40
That's much better and easier to read,
which is what prettify does.
3:45
It simply makes things easier to read.
3:48
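Here's a small sketch of the difference, using a made-up one-line page rather than the course site. prettify returns a string with each tag on its own line, indented by nesting depth.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written on one line on purpose.
html = "<html><body><div><p>Horses</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Printing the soup directly keeps the layout of the source markup.
print(soup)

# prettify() returns a new string; the soup itself is not changed.
pretty = soup.prettify()
print(pretty)
```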
Now, did you notice something in here?
3:53
Our image gallery isn't being displayed.
3:54
We have our unordered list
with our ID of image gallery,
4:01
which from a previous video we know
contains all of the images of our horses.
4:04
Here in the HTML though,
we're not seeing the list items,
4:10
our images are being populated
using some JavaScript.
4:14
Beautiful Soup doesn't wait for JavaScript
to run before it scrapes a page.
4:17
We'll see how to handle these
situations in a little bit.
4:23
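To see why the list comes back empty, here's a sketch with a stand-in page. The id value imageGallery and the script are assumptions for illustration, not copied from the course site. Beautiful Soup only parses the markup it's given; the script is kept as text and never executed, so the list items it would add in a browser don't exist.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the list is empty in the HTML and filled in by a script.
html = """
<ul id="imageGallery"></ul>
<script>
  // In a browser this would add an item to the list.
  document.getElementById("imageGallery").innerHTML = "<li>horse</li>";
</script>
"""

soup = BeautifulSoup(html, "html.parser")
gallery = soup.find(id="imageGallery")
items = gallery.find_all("li")

# The <ul> is found, but it has no <li> children because no JavaScript ran.
print(gallery)
print(len(items))
```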
For now, let's look at additional
Beautiful Soup features.
4:26
We can drill down to get specific pieces
of the site, like the page title.
4:30
We'll do soup.title. There it is.
4:36
How about an element on the page, like a div?
4:42
soup.div.
4:45
Well, shucks, that only gets
us the first div on the page.
4:48
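A short sketch of that attribute-style access, again on made-up markup. Using a tag name as an attribute returns the first matching tag, which is why soup.div gives back only one div.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a title and two divs.
html = """<html><head><title>Horse Land</title></head>
<body><div class="featured">First</div><div>Second</div></body></html>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)         # the whole <title> tag
print(soup.title.string)  # just the text inside it
print(soup.div)           # only the first <div> on the page
```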
Let's get them all and
loop through them to print them out.
4:52
There is a find_all method that
allows us to easily do that.
4:55
So come up here.
4:59
Say divs = soup.find_all, and
5:00
we want div elements, for div in divs.
5:05
We want to print our div.
5:15
And run it again, there are our divs.
5:22
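The loop just described can be sketched like this, using a stand-in page with three divs (the markup is an assumption). find_all returns a list-like collection of every matching tag, in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three divs.
html = """<body>
<div class="featured">Horse of the month</div>
<div class="primary">About</div>
<div class="primary">Contact</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")
for div in divs:
    print(div)
```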
We can filter some of these out
by passing in class values.
5:26
Let's just get the one that
has this featured class name.
5:30
We come back here to our website and
do the developer tools.
5:34
Featured section here is the one
with a horse of the month.
5:38
So here, we'll pass in,
class, and we want featured.
5:41
There, now we only have the divs
that have the featured class.
5:53
This narrows down a specific area for
us to scrape.
5:58
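A sketch of filtering by class, with made-up markup. Because class is a reserved word in Python, Beautiful Soup accepts it either as the keyword argument class_ or as a key in an attributes dictionary; which form the video uses isn't clear from the transcript, so both are shown. A tag matches if featured is any one of its classes.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two divs carry the featured class, one does not.
html = """<body>
<div class="featured">Horse of the month</div>
<div class="featured large">Runner-up</div>
<div class="primary">About</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: note the trailing underscore.
featured = soup.find_all("div", class_="featured")

# Dictionary form: equivalent filter.
also_featured = soup.find_all("div", {"class": "featured"})

for div in featured:
    print(div)
```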
We'll explore more of find_all and
its related method, find, in the next video.
6:01