Web Page Anatomy (3:47) with Ken Alger
Let's take a brief look at how an HTML page is structured so we can better understand how to navigate a page for web scraping.
Before we jump into Python and start wrangling data from a web page, I think it will be helpful to revisit what a web page looks like in code. How is a web page structured, or more specifically, how should a web page be structured? In your journey with web scraping, you'll likely come across a site or two where you ask, hold your horses, why aren't any of the tags closed? Or, seriously, there are five h1 tags on this page?

If we look at how an HTML page should be structured, it starts and ends with an opening and closing html tag. Inside the html tag, we have a head section which has tags for metadata about the page and other essential information for the document. The page title will be found in here as well. Next, we have the body section where the content of the page is found. Inside here is where we'll do the majority of our scraping. Things like heading tags, div, paragraph, anchor, and form elements will reside inside here.

I mentioned that this structure is how a page should look; sometimes reality is different. Let's take a look at how leniently HTML can be written and still look good in the browser. This will point out some of the challenges and benefits that we can run into when attempting to scrape a site.

Let's take a look at a sample website that the amazing design team here at Treehouse put together. It's hosted on GitHub Pages, which is great because it allows us to view the site and easily see the HTML code. Check the teacher's notes for the link.

I'm using the Chrome browser, and if we open up the developer tools with Option+Cmd+I on a Mac, or Ctrl+Shift+I on Windows, we can examine the structure of our page. Here at the top, we see the head section, and can expand that to see that it contains a few things.
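The html, head, and body layout described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library's html.parser module rather than Beautiful Soup, which the course introduces later. The SAMPLE_PAGE markup and the AnatomyParser class are made up for this sketch and are not the course's sample site.

```python
from html.parser import HTMLParser

# Hypothetical page written for this sketch; not the course's sample site.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Sample Page</title>
  </head>
  <body>
    <h1>Welcome</h1>
    <div>
      <p>Some content.</p>
      <a href="/about">About</a>
    </div>
  </body>
</html>"""


class AnatomyParser(HTMLParser):
    """Records the page title and which tags open inside head and body."""

    def __init__(self):
        super().__init__()
        self.section = None          # "head", "body", or None
        self.in_title = False
        self.title = ""
        self.tags = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "body"):
            self.section = tag
        elif self.section is not None:
            # Remember every tag that opens inside the current section.
            self.tags[self.section].append(tag)
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag in ("head", "body"):
            self.section = None

    def handle_data(self, data):
        # Only text inside <title> is collected here.
        if self.in_title:
            self.title += data


parser = AnatomyParser()
parser.feed(SAMPLE_PAGE)
parser.close()

print(parser.title)
print(parser.tags["head"])
print(parser.tags["body"])
```

The parser reports the title from the head section and lists the heading, div, paragraph, and anchor tags it found in the body, which mirrors what the developer tools show when each section is expanded.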
There's some metadata, there are links to our style sheet and fonts, and there it is, our page title. We'll see how to scrape that information in code here shortly.

The body section is where, as I mentioned, we'll find most of the interesting items we'll want to scrape. We see that we have a few different div elements that separate the page into different logical components, such as the graphical header, there's our featured image, and then down here, there are the links at the bottom of the page. The main portion of this particular webpage is the list of horses with the images. We see here in the HTML that they all reside here in this unordered list section, with the imageGallery ID and card-wrap class. If we expand this section, we see a bunch of list items. These look like potential scraping targets, and we'll explore them more specifically later in the course.

One thing I do want to mention here is that modern web browsers can hide a lot of HTML errors for us. Inline elements such as span, and some block-level elements such as paragraph tags, may not be closed in the actual HTML, but the browser closes them for us. If we take a look here, we see this paragraph here at the bottom. We see that it has a class of credits, and there's an opening and closing p tag. However, if we look at the source code for this file on GitHub, that's down here under index.html. So in here, we scroll down to the bottom of the page. We see the opening p tag on line 43, but there isn't a closing tag when this paragraph ends on line 46. In this case, the browser helps us out for web scraping tasks. Fortunately, HTML doesn't have to be perfect.

With some web page anatomy under our belts, let's take a quick pit stop before we get started with some scraping tasks with the Python package, Beautiful Soup.
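To make the missing closing tag concrete, here is a rough sketch that reads list items and an unclosed credits paragraph with Python's standard html.parser module. The imageGallery ID and the card-wrap and credits classes come from the page described above; the list item text, the link, the LenientParser class, and the rule for when an open paragraph ends are all assumptions made for this example, and the rule is much simpler than what a real browser does.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modeled on the page described above.
# The final paragraph deliberately has no closing </p> tag.
SAMPLE_PAGE = """<html>
<body>
  <ul id="imageGallery" class="card-wrap">
    <li>First horse</li>
    <li>Second horse</li>
    <li>Third horse</li>
  </ul>
  <p class="credits">Photos from
    <a href="#">a hypothetical source</a>.
</body>
</html>"""

# Assumption for this sketch: these end tags finish any open paragraph.
CLOSES_PARAGRAPH = {"p", "div", "body", "html"}


class LenientParser(HTMLParser):
    """Collects gallery list items and the text of the credits paragraph."""

    def __init__(self):
        super().__init__()
        self.in_gallery = False
        self.in_item = False
        self.items = []
        self.in_credits = False
        self.credit_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and attrs.get("id") == "imageGallery":
            self.in_gallery = True
        elif tag == "li" and self.in_gallery:
            self.in_item = True
            self.items.append("")
        elif tag == "p":
            # A new <p> also ends any paragraph that was left open.
            classes = (attrs.get("class") or "").split()
            self.in_credits = "credits" in classes

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_gallery = False
        elif tag == "li":
            self.in_item = False
        if tag in CLOSES_PARAGRAPH:
            self.in_credits = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data
        elif self.in_credits:
            self.credit_parts.append(data)

    @property
    def credits(self):
        # Collapse the whitespace left over from the source indentation.
        return " ".join("".join(self.credit_parts).split())


parser = LenientParser()
parser.feed(SAMPLE_PAGE)
parser.close()

print(parser.items)
print(parser.credits)
```

Even though the paragraph is never closed, the parser does not raise an error, and the credits text is still recovered. Beautiful Soup goes further by building a full tree from markup like this, but the same tolerance is what makes scraping imperfect pages practical.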