Welcome back. We just saw how to utilize the find_all method to find all of a particular item on the page. We can use the find method to find the first instance of an item.

We can change this to find, get rid of our for loop here, and run it. I should probably have changed that name to just div, ah well. Naming things is always a challenge for me. We're getting all of the info back for that particular div element, the featured one on the page here, in this case.

What if we just want the header text in here? Since it's a child element of that div, we can chain elements together. Let's comment this out and close this down. We want featured_header = soup.find, and we want the div with the class featured. We just want the h2 element, and we'll print the featured header.

Nice. But we still have our tag elements in there. From a data cleanliness standpoint, it would be great if we could get rid of those, right? Well, there's a convenient method for that called get_text. Yippee, we got some text from our site. We scraped it out. There's a bit of a gotcha to watch out for with this get_text method, though.
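The steps above can be sketched in code. The page markup here is a hypothetical stand-in based on what the transcript describes (a div with a featured class containing an h2 heading); the real course page will differ.

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the page used in the video; the "featured"
# class and the h2 heading are assumptions based on the transcript.
html = """
<div class="featured">
  <h2>Wild Mustangs</h2>
  <p>Learn about the herd.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, so no for loop is needed.
featured_div = soup.find("div", class_="featured")

# Chaining: search within that div for its child h2 element.
featured_header = featured_div.find("h2")
print(featured_header)             # <h2>Wild Mustangs</h2>

# get_text() strips away the tags, leaving just the text.
print(featured_header.get_text())  # Wild Mustangs
```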
It strips away the tags from whatever we're working with, leaving just a block of text. Let's take away this h2 element from our text value to see what I mean. While this is perhaps more readable for us, it makes it much more challenging to process going forward. If we wanted to select the mustangs, or the text about them, at this point it would be more of a challenge. The thing to remember about get_text is to use it as the last step in the scraping process.

We've seen that the find method returns the first occurrence of an item in a Beautiful Soup object. It is basically a find_all method with the limit of results set to one. Let's look at the parameters these methods take. Name, which looks for tags with certain names, such as title or div. Attrs, which allows for searching for a specific CSS class; we'll take a look at this shortly. Recursive: by default, find and find_all examine all descendants of a tag, but if we set recursive to False, they will only look at the direct children of the tag. String, or text, allows for searching for strings instead of tags. And kwargs, which allows searching on other items, such as a CSS ID.
And limit: the find_all method also accepts a limit argument to limit the results that return. As I mentioned, find is a find_all with a limit set to one. We can pass in a string, a list, a regular expression, the value True, or even a function to the name, string, or kwargs arguments to further enhance the searching capabilities.

Let's take a look at the attrs argument to search by CSS class and print out all references to this primary button class, which is this button down here. Come back over here to our code, and let's comment this out. So, for button in soup.find_all, we're going to look for a class, and that class was button button--primary. And we'll just print the buttons out. And there it is.

Since class is a reserved word in Python, and searching for items on a page based on class is a frequent task, Beautiful Soup provides a shortcut for that. We can change our code to use a special keyword argument, class underscore. So we can take all this out, remove our closing curly bracket, and we get the same result with a bit less typing.

Another very common task, which will be useful when we want to move from one page to another, is to get all of the hyperlinks on a page.
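Here is a sketch of both versions of that search. The button markup is hypothetical, modeled on the "button button--primary" class mentioned in the video; note that Beautiful Soup treats class as a multi-valued attribute, so searching for the single class name button--primary matches a tag that also carries other classes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the page's primary button.
html = """
<a class="button button--primary" href="/signup">Sign up</a>
<a class="button" href="/docs">Docs</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching by CSS class with the attrs dictionary.
for button in soup.find_all(attrs={"class": "button--primary"}):
    print(button)

# The same search with the class_ keyword -- class is a reserved
# word in Python, so Beautiful Soup appends the underscore.
primary = soup.find_all(class_="button--primary")
print(primary)
```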
We can navigate into a specific tag and use the get method to extract specific information. Minimize that. Again, we'll comment this out, just for clarity. So, for link in soup.find_all, we'll look for all the anchor elements, and then we'll print out all of the href attributes. So, link, and we'll get the hrefs.

We can look at these patterns to determine internal and external links. Definitely a handy thing to do.

Beautiful Soup is a very powerful tool, and we've just scratched the surface of its power. But we've seen how we can use Python to read a webpage and get very specific data from the HTML. It can take a bit of work to decipher the page structure, but that is time well spent for data collection.

Before we get too much further into collecting data from websites, we should talk about some other things to think about to be good data wranglers. I'll see you all back here in a bit. Have a look.
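That link-gathering step, plus a simple pattern check for internal versus external links, can be sketched like this. The links are hypothetical; a real page will have many more, and the internal/external rule here is just one reasonable heuristic.

```python
from bs4 import BeautifulSoup

# Hypothetical links standing in for a real page's anchors.
html = """
<a href="/about">About</a>
<a href="https://example.com/horses">Horses</a>
<a href="contact.html">Contact</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute from every anchor element.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)

# A simple pattern check: anything starting with http(s)://
# points off-site; everything else we treat as internal.
external = [h for h in hrefs if h.startswith(("http://", "https://"))]
internal = [h for h in hrefs if h not in external]
print(external)
print(internal)
```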