1 00:00:00,280 --> 00:00:03,110 Back in the good old days of the Internet, if we wanted data, 2 00:00:03,110 --> 00:00:05,227 we had to view it on Web pages. 3 00:00:05,227 --> 00:00:09,980 Now, however, many sites provide a Web API that shares their data. 4 00:00:11,000 --> 00:00:15,580 Sometimes, we can use these APIs to directly access information, 5 00:00:15,580 --> 00:00:16,920 without having to scrape the data. 6 00:00:18,080 --> 00:00:21,740 I'd recommend looking to see if the site you are wanting to scrape 7 00:00:21,740 --> 00:00:24,850 offers an API for the information you need. 8 00:00:24,850 --> 00:00:26,140 It can be a big time saver. 9 00:00:27,360 --> 00:00:32,551 Let's take a look at how we can get data from the World Bank, using their API. 10 00:00:32,551 --> 00:00:36,390 There are many instances when using an API is great. 11 00:00:36,390 --> 00:00:41,350 Sometimes, though, scraping results from an API is useful as well, 12 00:00:41,350 --> 00:00:45,640 especially if the API documentation isn't super helpful. 13 00:00:45,640 --> 00:00:49,020 Let's take a brief look at one technique we can use to get and 14 00:00:49,020 --> 00:00:51,380 process data from an API. 15 00:00:51,380 --> 00:00:54,770 In this case, we'll look at The World Bank API. 16 00:00:54,770 --> 00:00:59,260 It's actually very well documented, which provides us with some extra knowledge 17 00:00:59,260 --> 00:01:01,640 as we go about trying to scrape things. 18 00:01:01,640 --> 00:01:04,910 If we look here, at the Developer Information overview page, 19 00:01:04,910 --> 00:01:09,710 it provides information about how to get started, and what the API provides. 20 00:01:09,710 --> 00:01:10,750 Let's look here, 21 00:01:10,750 --> 00:01:15,460 at the Country Queries section, to see what information we might explore there. 
22 00:01:15,460 --> 00:01:18,240 It looks like we could use this information to get some generic 23 00:01:18,240 --> 00:01:21,100 information about the countries of the world. 24 00:01:21,100 --> 00:01:25,700 For example, if we wanted to do some high-level data exploration about 25 00:01:25,700 --> 00:01:31,310 income level in regions of the world, let's use this request format here, 26 00:01:31,310 --> 00:01:35,650 look through some ISO codes, and get some information that we could explore. 27 00:01:35,650 --> 00:01:40,220 We won't be doing any actual exploration of data in this course, but 28 00:01:40,220 --> 00:01:42,750 check the teachers' notes for more information. 29 00:01:42,750 --> 00:01:46,470 Let's take a look at the information we get from a country with a lot of horses, 30 00:01:46,470 --> 00:01:47,760 like Ethiopia. 31 00:01:47,760 --> 00:01:53,545 I know their ISO code is ETH, so let's put that into the request format. 32 00:01:53,545 --> 00:01:58,510 So we can copy this. Let's create 33 00:01:58,510 --> 00:02:02,546 a new tab, and we'll do ETH. 34 00:02:02,546 --> 00:02:06,705 It looks like we're getting back the same information as the documentation stated, 35 00:02:06,705 --> 00:02:08,370 and it's in XML format. 36 00:02:08,370 --> 00:02:13,870 That's great, we can handle that, we'll use Beautiful Soup to parse this XML, 37 00:02:13,870 --> 00:02:16,850 and get the name, region, and income level. 38 00:02:16,850 --> 00:02:18,200 This could be used, for 39 00:02:18,200 --> 00:02:23,530 example, to generate a histogram chart of regions of the world and income levels. 40 00:02:23,530 --> 00:02:26,330 Lots of options for data visualization, here. 41 00:02:26,330 --> 00:02:29,940 Let's go back to our code, and create a new world_bank.py file. 42 00:02:31,430 --> 00:02:32,845 We don't need it inside the spider. 43 00:02:38,715 --> 00:02:42,673 world_bank.py, and we'll start with our imports. 
44 00:02:42,673 --> 00:02:47,850 So, from urllib.request import urlopen. 45 00:02:47,850 --> 00:02:55,020 We're going back to Beautiful Soup, so bs4 import BeautifulSoup, and 46 00:02:55,020 --> 00:02:59,820 we'll be using a csv file of ISO codes, so we'll want to import csv as well. 47 00:03:02,060 --> 00:03:08,367 Let's define a function to get the country information, get_country, 48 00:03:08,367 --> 00:03:13,082 and we'll pass in our country code, and 49 00:03:13,082 --> 00:03:17,060 just like we've done with Beautiful Soup in the past, we define our HTML string. 50 00:03:19,937 --> 00:03:23,792 It's urlopen, and 51 00:03:23,792 --> 00:03:28,906 it's that request format string that we saw just a moment ago, 52 00:03:28,906 --> 00:03:35,294 worldbank.org/v2/countries/, and we'll use the string formatter, 53 00:03:39,364 --> 00:03:41,790 country_code, and let's bring this down to a new line. 54 00:03:44,713 --> 00:03:48,801 Next, we define our soup object. 55 00:03:48,801 --> 00:03:52,031 So, we pass in our HTML, and for our parser, 56 00:03:52,031 --> 00:03:56,115 since we're dealing with XML, we can use an XML parser. 57 00:03:58,672 --> 00:04:02,920 Scraping XML is pretty straightforward with Beautiful Soup. 58 00:04:02,920 --> 00:04:07,355 If we look at the results we got for Ethiopia, we want to get three fields, 59 00:04:07,355 --> 00:04:13,890 wb:name, wb:region, and the wb:incomeLevel. 60 00:04:13,890 --> 00:04:15,160 Let's go ahead and define those. 61 00:04:16,480 --> 00:04:23,442 Country_name is soup.find( 'wb:name' ), 62 00:04:26,091 --> 00:04:31,903 Region, ( 'wb:region' ), 63 00:04:31,903 --> 00:04:36,955 and income_level, soup.find( 64 00:04:36,955 --> 00:04:41,890 'wb:incomelevel' ), and it was all lowercase. 65 00:04:43,300 --> 00:04:45,270 Now, let's print that information out. 66 00:04:45,270 --> 00:04:49,144 Here's a good example of a time when we can use the get_text method. 
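The field lookup described above can be sketched without calling the API. This is a minimal sketch under stated assumptions, not the code written in the video: it swaps Beautiful Soup for Python's standard-library xml.etree.ElementTree so it runs with no extra packages, and it parses a made-up, abbreviated response. The sample XML, its namespace URI, the region and income values, and the helper names extract_country_fields and has_element are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated response. The namespace URI and the
# region and income values are placeholders, not real World Bank data.
SAMPLE_XML = """<wb:countries xmlns:wb="http://www.worldbank.org">
  <wb:country id="ETH">
    <wb:name>Ethiopia</wb:name>
    <wb:region>Example Region</wb:region>
    <wb:incomeLevel>Example income level</wb:incomeLevel>
  </wb:country>
</wb:countries>"""

# Map the wb: prefix to the namespace URI declared in the sample above.
NAMESPACES = {"wb": "http://www.worldbank.org"}


def extract_country_fields(xml_text):
    """Return (name, region, income level) text from a country response."""
    root = ET.fromstring(xml_text)
    fields = []
    for tag in ("wb:name", "wb:region", "wb:incomeLevel"):
        element = root.find(f".//{tag}", NAMESPACES)
        # find() returns None when no element matches the tag exactly.
        fields.append(element.text if element is not None else None)
    return tuple(fields)


def has_element(xml_text, tag):
    """Report whether an element with exactly this tag name exists."""
    root = ET.fromstring(xml_text)
    return root.find(f".//{tag}", NAMESPACES) is not None


print(extract_country_fields(SAMPLE_XML))
```

XML element names are case-sensitive, so in this sketch has_element(SAMPLE_XML, "wb:incomelevel") is False while has_element(SAMPLE_XML, "wb:incomeLevel") is True. Checking the tag spelling against the raw response is a cheap first step when a lookup comes back empty.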
67 00:04:52,080 --> 00:04:55,662 get_text, and we'll print the region, 68 00:04:57,792 --> 00:05:02,357 get_text, and the income_level. 69 00:05:06,568 --> 00:05:12,371 Now, we can loop through the ISO codes, and pass them to our get_country method. 70 00:05:12,371 --> 00:05:16,718 So, if __name__ == '__main__':, 71 00:05:19,750 --> 00:05:22,976 Let's bring that up on the screen a little bit, 72 00:05:22,976 --> 00:05:27,186 I've included a file of ISO codes that we can open up and read. 73 00:05:27,186 --> 00:05:33,399 So, file, country_code, 74 00:05:33,399 --> 00:05:38,721 oop, country_iso_codes.csv, 75 00:05:38,721 --> 00:05:41,470 want to read that. 76 00:05:43,387 --> 00:05:51,883 Now, iso_codes, then, will be our reader, file, and our delimiter is ",". 77 00:05:53,990 --> 00:05:56,650 Now, we can loop through our file, and get our information. 78 00:05:58,870 --> 00:06:03,707 for code in iso_codes, and we want to pass in our code into our get_country 79 00:06:03,707 --> 00:06:09,198 method, and we want the first one from the list. 80 00:06:12,010 --> 00:06:13,678 Now, we can run world_bank. 81 00:06:18,215 --> 00:06:21,362 And it looks like I made a mistake back up here, 82 00:06:21,362 --> 00:06:25,360 it wasn't all lowercase, it's actually incomeLevel. 83 00:06:25,360 --> 00:06:30,340 Let's try it again, and we get all of our expected data. 84 00:06:30,340 --> 00:06:32,560 Again, we could do something else here, 85 00:06:32,560 --> 00:06:36,290 like saving the information to a csv file or database. 86 00:06:36,290 --> 00:06:38,630 Check the teachers' notes for more resources on that.
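The loop over the ISO codes file can be sketched like this. It is a rough illustration under assumptions, not the exact code from the video: the layout of country_iso_codes.csv is not shown in the transcript, so the sketch assumes the code sits in the first column with no header row, and it reads from an in-memory stand-in instead of the real file. The helper name read_iso_codes and the sample rows are hypothetical.

```python
import csv
import io


def read_iso_codes(csv_file):
    """Return the first column of every non-empty row in an open CSV file."""
    reader = csv.reader(csv_file, delimiter=",")
    # csv.reader yields an empty list for a blank line, so skip those rows.
    return [row[0] for row in reader if row]


# Hypothetical stand-in for country_iso_codes.csv (assumed layout:
# ISO code in the first column, no header row).
sample_file = io.StringIO("ETH,Ethiopia\nBRA,Brazil\n")

if __name__ == "__main__":
    for code in read_iso_codes(sample_file):
        # In world_bank.py, this is where get_country(code) would be called.
        print(code)
```

To use it against the real file, open it with `open("country_iso_codes.csv", newline="")` in a `with` block and pass the file object in. Keeping the reading step in its own function means the loop can be tried without making any network requests.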