More Soup in the Tureen (6:18) with Ken Alger
Let's look at two Beautiful Soup methods, `find()` and `find_all()`, in greater detail.
Welcome back. We just saw how to use the `find_all()` method to find all of a particular item on the page. We can use the `find()` method to find the first instance of an item. We can change this to `find`, get rid of our for loop here, and run it. I should probably have changed that name to just div, ah well. Naming things is always a challenge for me. We're getting all of the info back for that particular div element, the featured one on the page here, in this case.

What if we just want the header text in here? Since it's a child element of that div, we can chain elements together. Let's comment this out and close this down. We want `featured_header = soup.find`, we want the div with the class featured, and we just want the h2 element. And we'll print the featured header. Nice.

But we still have our tag elements in there. From a data cleanliness standpoint, it would be great if we could get rid of those, right? Well, there's a convenient method for that called `get_text`. Yippee, we got some text from our site. We scraped it out.

There's a bit of a gotcha to watch out for with the `get_text` method, though. It strips away the tags from whatever we're working with, leaving just a block of text. Let's take away this h2 element from our text value to see what I mean. While this is perhaps more readable for us, it makes it much more challenging to process going forward. If we wanted to select mustangs, or the text about them, at this point, it would be more of a challenge. The thing to remember about `get_text` is to use it as the last step in the scraping process.

We've seen that the `find` method returns the first occurrence of an item in a Beautiful Soup object. It is basically the `find_all` method with the limit of results set to one. Let's look at the parameters these methods take.
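The chaining and `get_text` steps above can be sketched like this; the markup here is a small stand-in for the course page, not the actual site:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's featured section
html = """
<div class="featured">
  <h2>Mustang</h2>
  <p>A hardy, free-roaming horse.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match; it is find_all() with a limit of one
featured = soup.find("div", {"class": "featured"})

# Chain into the child h2 of that div
featured_header = featured.h2
print(featured_header)  # prints the h2 tag, markup included

# get_text() strips the tags away, leaving just the text,
# so save it for the last step of the scrape
print(featured_header.get_text())
```

Note that `featured.h2` is shorthand for `featured.find("h2")`; either form works for the chaining shown in the video.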
- `name`, which looks for tags with certain names, such as title or div.
- `attrs`, which allows searching for a specific CSS class. We'll take a look at this here shortly.
- `recursive`: by default, `find` and `find_all` examine all descendants of a tag. If we set `recursive` to `False`, they will only look at the direct children of the tag.
- `string` (or `text`), which allows searching for strings instead of tags.
- `kwargs`, which allows searching on other items, such as a CSS ID.
- `limit`: the `find_all` method also accepts a limit argument to cap the number of results returned. As I mentioned, `find` is a `find_all` with a limit set to one.

We can pass in a string, a list, a regular expression, the value `True`, or even a function to the `name`, `string`, or `kwargs` arguments to further enhance the searching capabilities.

Let's take a look at the `attrs` argument to search for a CSS class and print out all references to this button--primary class, which is this button down here. Coming back over to our code, let's comment this out. So for button in `soup.find_all`, we're going to look for a class, and that class was button button--primary. And we'll just print the buttons out. And there it is.

Since class is a reserved word in Python, and searching for items on a page based on class is a frequent task, Beautiful Soup provides a shortcut for that. We can change our code to use a special keyword argument, `class_` (class underscore). So we can take all this out, remove our closing curly bracket, and we get the same result with a bit less typing.

Another very common task, which will be useful when we want to move from one page to another, is to get all of the hyperlinks on a page. We can navigate into a specific tag and use the `get` method to extract specific information. Minimize that. Again, we'll comment this out, just for clarity.
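Both class searches above can be sketched as follows; the two anchor tags are assumed markup standing in for the page's buttons:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page's buttons
html = """
<a class="button button--primary" href="/signup">Sign Up</a>
<a class="button" href="/courses">Browse</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching on the class attribute with the attrs dictionary
for button in soup.find_all(attrs={"class": "button button--primary"}):
    print(button)

# The class_ keyword argument gives the same result with less typing,
# since class itself is a reserved word in Python
for button in soup.find_all(class_="button button--primary"):
    print(button)
```

Only the first anchor matches: searching on the full string `"button button--primary"` requires the complete class attribute to match, so the plain `"button"` link is skipped.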
So for link in `soup.find_all`, we'll look for all the anchor elements, and then we'll print out all of the href attributes. So link, and we'll get the hrefs. We can look at these patterns to determine internal and external links. Definitely a handy thing to do.

Beautiful Soup is a very powerful tool, and we've just scratched the surface of its power. But we've seen how we can use Python to read a webpage and get very specific data from the HTML. It can take a bit of work to decipher the page structure, but that is time well spent for data collection. Before we get too much further into collecting data from websites, we should talk about some other things to think about to be good data wranglers. I'll see you all back here in a bit.
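The link-gathering loop can be sketched like this; the two links are assumed sample markup, with one relative (internal) and one absolute (external) href to show the patterns mentioned above:

```python
from bs4 import BeautifulSoup

# Hypothetical page with one internal and one external link
html = """
<a href="/teachers">Teachers</a>
<a href="https://example.com/blog">Blog</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every anchor element and pull out its href attribute;
# relative paths stay on-site, absolute URLs point elsewhere
for link in soup.find_all("a"):
    print(link.get("href"))
```

Using `link.get("href")` rather than `link["href"]` returns `None` instead of raising a `KeyError` when an anchor has no href, which is handy on messy real-world pages.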