Web Page Anatomy

Name: Web Page Anatomy
Uploaded: 2018-08-22T07:35:45-04:00
Duration: 227 s
Description: Let's take a brief look at how an HTML page is structured so we can better understand how to navigate a page for web scraping.

3:47 with Ken Alger

Teacher's Notes

Horse Land web site
Horse Land site source code

Before we jump into Python and start wrangling data from a web page, 0:00

I think it will be helpful to revisit what a web page looks like in code. 0:03

How is a web page structured, or more specifically, 0:07

how should a web page be structured? 0:10

In your journey with web scraping, you'll likely come across a site or 0:13

two where you ask, hold your horses, why aren't any of the tags closed? 0:17

Or, seriously, there are five h1 tags on this page? 0:21

If we look at how an HTML page should be structured, it starts and 0:26

ends with an opening and closing html tag. 0:30

Inside the html tag, we have a head section which has tags for 0:33

metadata about the page and other essential information for the document. 0:38

The page title will be found in here as well. 0:41

Next, we have the body section where the content of the page is found. 0:44

Inside here is where we'll do the majority of our scraping. 0:48

Things like heading tags, div, paragraph, anchor, and 0:51

form elements will reside inside here. 0:56
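
To make the head and body layout concrete, here is a minimal sketch that walks a small page and records which tags appear in each section. The sample page is hypothetical (it is not the Horse Land source), and the sketch uses the standard library's html.parser rather than Beautiful Soup, so that it runs without installing anything.

```python
from html.parser import HTMLParser

# Hypothetical minimal page, written only for this illustration.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Sample Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <div>
      <p>Some text with a <a href="/more">link</a>.</p>
    </div>
  </body>
</html>"""


class SectionOutliner(HTMLParser):
    """Record which tags appear inside <head> and inside <body>."""

    def __init__(self):
        super().__init__()
        self.current_section = None  # "head", "body", or None
        self.sections = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.sections:
            # Entering <head> or <body>.
            self.current_section = tag
        elif self.current_section is not None:
            self.sections[self.current_section].append(tag)

    def handle_endtag(self, tag):
        if tag in self.sections:
            # Leaving </head> or </body>.
            self.current_section = None


outliner = SectionOutliner()
outliner.feed(SAMPLE_PAGE)
print("head:", outliner.sections["head"])
print("body:", outliner.sections["body"])
```

The head list holds the metadata and title tags, while the body list holds the heading, div, paragraph, and anchor tags where most scraping happens.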

I mentioned that structure is how a page should look, 0:59

but sometimes reality is different. 1:02

Let's take a look at how leniently HTML can be written and 1:04

still look good in the browser. 1:07

This will point out some of the challenges and 1:09

benefits that we can run into when attempting to scrape a site. 1:11

Let's take a look at a sample website that the amazing design team 1:16

here at Treehouse put together. 1:19

It's hosted on GitHub Pages, which is great 1:21

because it allows us to view the site and easily see the HTML code. 1:24

Check the teacher's notes for the link. 1:29

I'm using the Chrome browser, and 1:32

if we open up the developer tools with Option+Cmd+I on a Mac, or 1:33

Ctrl+Shift+I on Windows, we can examine the structure of our page. 1:37

Here at the top, we see the head section, and 1:43

can expand that to see that it contains a few things. 1:45

There's some metadata, there are links to our style sheet and fonts, and 1:48

there it is, our page title. 1:52

We'll see how to scrape that information in code here shortly. 1:54
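
As a rough preview of pulling information out of the head section, the sketch below collects the page title and the stylesheet links. The markup, file names, and font URL are placeholders invented for the example, and it again relies on the standard library's html.parser instead of the Beautiful Soup approach the course covers later.

```python
from html.parser import HTMLParser

# Hypothetical head section; the paths and URL are placeholders.
SAMPLE_HEAD = """<html>
<head>
  <meta charset="utf-8">
  <link rel="stylesheet" href="css/main.css">
  <link rel="stylesheet" href="https://fonts.example.com/css?family=Sample">
  <title>Sample Horse Page</title>
</head>
<body></body>
</html>"""


class HeadScraper(HTMLParser):
    """Collect the <title> text and the href of each stylesheet <link>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "link":
            attr_map = dict(attrs)
            if attr_map.get("rel") == "stylesheet" and "href" in attr_map:
                self.stylesheets.append(attr_map["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in more than one chunk, so accumulate it.
        if self.in_title:
            self.title += data


scraper = HeadScraper()
scraper.feed(SAMPLE_HEAD)
print(scraper.title)
print(scraper.stylesheets)
```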

The body section is where, as I mentioned, 1:59

we'll find most of the interesting items we'll want to scrape. 2:01

We see that we have a few different div elements that separate the page 2:04

into different logical components. 2:08

Such as the graphical header, there's our featured image, 2:11

and then down here, there's the links at the bottom of the page. 2:15

The main portion of this particular webpage is the list of horses with 2:17

the images. 2:22

We see here in the HTML that they all reside here in this unordered list 2:23

section, with the imageGallery ID and card-wrap class. 2:28

If we expand this section, we see a bunch of list items. 2:33

These look like potential scraping targets, and 2:37

we'll explore them more specifically, later in the course. 2:39
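
To show why an ID makes a good scraping target, here is a small sketch that counts the list items inside the unordered list with the imageGallery ID and ignores other lists on the page. Only the imageGallery ID and card-wrap class come from the page described above. The list items, the card class, the image paths, and the alt text are all assumptions made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the id and class on the first <ul> match
# what the video describes; everything else is invented.
SAMPLE_GALLERY = """<body>
<ul id="imageGallery" class="card-wrap">
  <li class="card"><img src="img/a.jpg" alt="First horse"></li>
  <li class="card"><img src="img/b.jpg" alt="Second horse"></li>
</ul>
<ul class="footer-links">
  <li><a href="#">Elsewhere</a></li>
</ul>
</body>"""


class GalleryScraper(HTMLParser):
    """Count <li> items and collect image alt text inside one <ul>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.ul_depth = 0  # greater than 0 while inside the target list
        self.item_count = 0
        self.image_alts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "ul":
            # Start tracking at the target list; also track nested lists.
            if self.ul_depth > 0 or attr_map.get("id") == self.target_id:
                self.ul_depth += 1
        elif tag == "li" and self.ul_depth == 1:
            self.item_count += 1
        elif tag == "img" and self.ul_depth > 0:
            self.image_alts.append(attr_map.get("alt", ""))

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1


gallery = GalleryScraper("imageGallery")
gallery.feed(SAMPLE_GALLERY)
print(gallery.item_count)
print(gallery.image_alts)
```

The footer list is skipped because its ul tag does not carry the target ID, so only the gallery items are counted.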

One thing I do want to mention here is that modern web browsers can hide a lot of 2:43

HTML errors for us. 2:48

Inline elements such as span, and some block level elements such as paragraph 2:49

tags may not be closed in the actual HTML, but the browser closes them for us. 2:55

If we take a look here, we see this paragraph here at the bottom. 3:01

We see that it has a class of credits, and there's an opening and closing p tag. 3:07

However, if we look at the source code for this file on GitHub, 3:11

that's down here under index.html. 3:15

So in here, we scroll down to the bottom of the page. 3:19

We see the opening p tag on line 43, but 3:23

there isn't a closing tag when this paragraph ends on line 46. 3:25

In this case, the browser helps us out for web scraping tasks. 3:30

Fortunately, HTML doesn't have to be perfect. 3:34
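
The unclosed paragraph can be reproduced in a small sketch. The fragment below is hypothetical, not the actual index.html, and it is fed to the standard library's html.parser, which reports tags exactly as written and does not insert the missing closing tag the way a browser does. The paragraph text is still recoverable, but a scraper should not assume every start tag has a matching end tag.

```python
from html.parser import HTMLParser

# Hypothetical fragment: the paragraph is opened but never closed.
UNCLOSED = """<div>
<p class="credits">Photos courtesy of a placeholder source.
</div>"""


class TagLogger(HTMLParser):
    """Log start tags, end tags, and non-empty text as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


logger = TagLogger()
logger.feed(UNCLOSED)
logger.close()
print("opened:", logger.opened)
print("closed:", logger.closed)
print("text:", logger.text)
```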

With some web page anatomy under our belts, let's take a quick pit stop before 3:37

we get started with some scraping tasks with the Python package, Beautiful Soup. 3:42
