1 00:00:00,400 --> 00:00:01,900 Welcome back. 2 00:00:01,900 --> 00:00:04,060 We've seen that we can pass in series data, 3 00:00:04,060 --> 00:00:08,488 such as lists, into our figure object, to ultimately generate our plot. 4 00:00:08,488 --> 00:00:12,650 We can pass in other sources of data as well, such as dicts, numpy arrays, or 5 00:00:12,650 --> 00:00:14,200 pandas data frames. 6 00:00:14,200 --> 00:00:17,380 However, Bokeh provides its own implementation 7 00:00:17,380 --> 00:00:19,980 of series data that we can use, the column data source. 8 00:00:21,120 --> 00:00:26,010 The column data source is a table-like data object which maps column names 9 00:00:26,010 --> 00:00:27,690 to sequences or arrays of data. 10 00:00:28,710 --> 00:00:32,130 It takes in a data series such as a Python dictionary and 11 00:00:32,130 --> 00:00:34,190 maps each key to its sequence of values. 12 00:00:35,270 --> 00:00:38,990 In fact, behind the scenes, when we pass in x equals the list one, 13 00:00:38,990 --> 00:00:43,750 two, and three, Bokeh is making a column data source with a column named x 14 00:00:43,750 --> 00:00:46,050 that maps to the values one, two, and three. 15 00:00:47,200 --> 00:00:49,260 It also allows for accessing the values 16 00:00:49,260 --> 00:00:53,160 inside the data, much like one would with a Python dictionary, and 17 00:00:53,160 --> 00:00:55,630 easily accepts pandas data frames as its data. 18 00:00:56,890 --> 00:01:00,510 In upcoming videos we'll also look at some additional features of having our data 19 00:01:00,510 --> 00:01:03,910 in a column data source that make using it in our Bokeh projects 20 00:01:03,910 --> 00:01:07,960 more useful than sticking with pandas data frames or numpy arrays. 21 00:01:07,960 --> 00:01:11,630 We can add in data to the data source for hover tooltips and 22 00:01:11,630 --> 00:01:14,908 even use the column data source for linked visualizations. 
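The mapping of column names to sequences described above can be sketched in a few lines. This is a minimal illustration rather than code from the video: the x and y values are made-up sample data, and it assumes a Bokeh install where ColumnDataSource is importable from bokeh.plotting, as the video does later.

```python
from bokeh.plotting import ColumnDataSource

# Made-up sample data: each dictionary key becomes a column name,
# and each value is that column's sequence of data.
source = ColumnDataSource(data={"x": [1, 2, 3], "y": [4, 6, 5]})

# The .data attribute can be read much like a plain Python dictionary.
xs = list(source.data["x"])
columns = sorted(source.column_names)

print(columns)
print(xs)
```

Both columns here hold three values, so the source is built from equal-length sequences.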
23 00:01:14,908 --> 00:01:17,835 Again, we'll be taking a look at this feature soon. 24 00:01:17,835 --> 00:01:21,370 There's something to note about the column data source. 25 00:01:21,370 --> 00:01:24,890 Like in the previous video, the columns must all be the same length. 26 00:01:24,890 --> 00:01:28,400 Missing or ragged data is not permitted within a single data source. 27 00:01:29,460 --> 00:01:32,630 Let's have a look at how we can utilize the column data source to handle some data, 28 00:01:32,630 --> 00:01:35,680 and start to visualize some actual figures. 29 00:01:35,680 --> 00:01:38,660 To get started, you'll want to download the course files and 30 00:01:38,660 --> 00:01:40,230 get them set up in your favorite editor. 31 00:01:41,660 --> 00:01:43,661 I'll head back into PyCharm. 32 00:01:43,661 --> 00:01:47,669 The starter files include the dataset we'll be using, country-pops.csv, 33 00:01:47,669 --> 00:01:52,470 along with the requirements.txt file which we already installed in a previous video. 34 00:01:52,470 --> 00:01:58,250 Let's go ahead and create a new file called stage1-3.py. 35 00:01:58,250 --> 00:02:01,620 Here, we'll want to get our CSV file imported and 36 00:02:01,620 --> 00:02:03,870 take a look at the format of our data. 37 00:02:03,870 --> 00:02:07,710 Since this file contains information for all the countries of the world, 38 00:02:07,710 --> 00:02:12,600 let's limit ourselves while we're exploring the data to just five rows and 39 00:02:12,600 --> 00:02:14,000 display the header information as well. 40 00:02:15,110 --> 00:02:18,800 We'll need to import numpy as np, 41 00:02:20,810 --> 00:02:27,310 import pandas as pd. Again, that's a pretty standard naming convention for those. 42 00:02:30,810 --> 00:02:35,846 We're gonna be bringing in our country-pops.csv file. 43 00:02:41,383 --> 00:02:45,718 And we'll use the pandas.read_csv method 44 00:02:47,860 --> 00:02:48,729 to bring this in. 
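One way to load only the first five rows with pandas.read_csv is sketched below. To keep the sketch runnable on its own, it reads from an inline string instead of country-pops.csv; the rows and the column names (Country, Population, Life_expectancy) are illustrative assumptions, not the real contents of the course file.

```python
import io

import pandas as pd

# Inline stand-in for country-pops.csv; the values and column names
# are assumptions made for illustration only.
sample_csv = io.StringIO(
    "Country,Population,Life_expectancy\n"
    "A,1000,70.1\n"
    "B,2000,71.2\n"
    "C,3000,72.3\n"
    "D,4000,73.4\n"
    "E,5000,74.5\n"
    "F,6000,75.6\n"
    "G,7000,76.7\n"
)

# nrows=5 stops reading after five data rows; the header row is still
# used for the column names.
countries = pd.read_csv(sample_csv, nrows=5)

print(list(countries.columns))
print(countries)
```

With the real file, the path 'country-pops.csv' would take the place of sample_csv.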
45 00:02:51,866 --> 00:02:57,332 And then for data exploration, we'll create an np.array, 46 00:03:01,339 --> 00:03:04,203 with a header, and 47 00:03:04,203 --> 00:03:09,516 we need to limit our rows here, nrows=5. 48 00:03:14,541 --> 00:03:15,963 And print(countries_array). 49 00:03:20,510 --> 00:03:25,393 And when we run this, we can see then that in our data, 50 00:03:25,393 --> 00:03:28,550 we have quite a bit of useful information: 51 00:03:28,550 --> 00:03:34,340 the English and German names of the country, country code, population, etc. 52 00:03:34,340 --> 00:03:36,760 As is often the case with datasets, 53 00:03:36,760 --> 00:03:41,450 there's information included that you might not necessarily be interested in for 54 00:03:41,450 --> 00:03:44,740 the current project, such as birth and death rates. 55 00:03:44,740 --> 00:03:49,030 But it's great to have that information available for our future explorations. 56 00:03:49,030 --> 00:03:52,550 Now that we have a table of data, we should put it to use. 57 00:03:52,550 --> 00:03:56,600 Imagine that you have this information in your favorite spreadsheet application. 58 00:03:56,600 --> 00:03:59,750 Let's start with a chart that people will often generate in Excel, 59 00:03:59,750 --> 00:04:01,520 a bar chart for example. 60 00:04:01,520 --> 00:04:03,624 We'll need to bring in our imports and 61 00:04:03,624 --> 00:04:06,657 in this case import bokeh.charts to plot a bar chart. 62 00:04:19,320 --> 00:04:23,280 And from bokeh.io, we'll import our output_file and show methods. 63 00:04:25,850 --> 00:04:27,430 Then we define our output file. 64 00:04:27,430 --> 00:04:30,043 Let's call it population.html. 65 00:04:36,742 --> 00:04:40,932 Next, we need to build out our bar chart by passing in our countries pandas 66 00:04:40,932 --> 00:04:42,560 data frame. 67 00:04:42,560 --> 00:04:45,610 We can do that here at the bottom and tell it which information to use. 
68 00:04:47,940 --> 00:04:49,114 We'll just call it bar_chart. 69 00:05:21,158 --> 00:05:26,315 Next, we need to pass our bar chart object into the show method. 70 00:05:28,970 --> 00:05:33,595 show(bar_chart) and then run our script. 71 00:05:38,285 --> 00:05:42,725 Since our nrows is still set to five, we see the five countries displayed on a bar 72 00:05:42,725 --> 00:05:46,110 chart showing their respective populations. 73 00:05:46,110 --> 00:05:47,260 Pretty cool. 74 00:05:47,260 --> 00:05:50,320 You'll notice that we are setting our legend to false so 75 00:05:50,320 --> 00:05:51,900 that it doesn't display. 76 00:05:51,900 --> 00:05:55,860 In a later video, we will take a look at how to customize our chart legends. 77 00:05:55,860 --> 00:05:59,080 There are a variety of other charts that can be generated as well, and 78 00:05:59,080 --> 00:06:01,600 I've included links to them in the teacher's notes. 79 00:06:01,600 --> 00:06:04,470 While pandas data frames work great for many applications, 80 00:06:04,470 --> 00:06:08,480 Bokeh provides another option that allows for some cleaner code 81 00:06:08,480 --> 00:06:12,810 and, as we'll find out in a later video, some great additional features. 82 00:06:12,810 --> 00:06:16,440 To take advantage of this, we utilize Bokeh's column data source. 83 00:06:17,470 --> 00:06:20,910 This will map the column names to the sequences of data. 84 00:06:20,910 --> 00:06:24,730 Let's go a bit beyond bar charts or pie graphs with our data, 85 00:06:24,730 --> 00:06:28,110 since we're attempting to provide some data analysis in our reporting, and 86 00:06:28,110 --> 00:06:32,030 see what, if any, correlation there is between a country's population and 87 00:06:32,030 --> 00:06:33,620 life expectancy. 88 00:06:33,620 --> 00:06:38,051 Before we can use it, we need to import ColumnDataSource from bokeh.plotting and 89 00:06:38,051 --> 00:06:39,630 pass in our data. 
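Passing a pandas data frame into a column data source might look like the sketch below. The small data frame is made-up sample data standing in for the countries data frame loaded from the course CSV, and its column names are assumptions for illustration.

```python
import pandas as pd
from bokeh.plotting import ColumnDataSource

# Made-up stand-in for the countries data frame read from the CSV file.
countries = pd.DataFrame(
    {
        "Country": ["A", "B", "C"],
        "Population": [1000, 2000, 3000],
        "Life_expectancy": [70.1, 71.2, 72.3],
    }
)

# Each data frame column becomes a named column in the data source.
country_data = ColumnDataSource(countries)

populations = list(country_data.data["Population"])
print(country_data.column_names)
print(populations)
```

The column names carry over with their exact spelling and capitalization, which matters once glyphs start referring to them by name.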
90 00:06:39,630 --> 00:06:44,787 We'll need to do our imports from Bokeh, including the column data 91 00:06:44,787 --> 00:06:50,906 source import. From bokeh.plotting, we want to import ColumnDataSource 92 00:06:54,763 --> 00:06:56,080 and figure. 93 00:06:56,080 --> 00:06:58,360 Since countries is a pandas data frame, 94 00:06:58,360 --> 00:07:01,790 we can pass that into ColumnDataSource as our data source. 95 00:07:01,790 --> 00:07:05,521 And since we don't need the countries array or print function any longer, 96 00:07:05,521 --> 00:07:07,980 we can clean up our code and delete those lines. 97 00:07:09,820 --> 00:07:13,800 And, while we're at it, let's rename our output file to something more meaningful. 98 00:07:14,880 --> 00:07:21,320 Since we're examining population versus life expectancy, 99 00:07:21,320 --> 00:07:26,158 let's use pop-life.html as our file name. 100 00:07:26,158 --> 00:07:28,150 And we can get rid of the numpy import as well. 101 00:07:29,320 --> 00:07:32,027 And we won't be needing Bar any longer. 102 00:07:38,573 --> 00:07:39,073 Great. 103 00:07:40,210 --> 00:07:45,100 We wanna set up our country data. 104 00:07:45,100 --> 00:07:49,640 We just take our ColumnDataSource and pass in our countries. 105 00:07:53,630 --> 00:07:56,456 Now, we just need to build out our figure plot, 106 00:07:56,456 --> 00:07:58,867 similar to what we have done in the past. 107 00:07:58,867 --> 00:08:01,587 But we'll use circle glyphs here and 108 00:08:01,587 --> 00:08:05,270 not pass in any tools parameter, to accept the defaults. 109 00:08:06,730 --> 00:08:07,494 Our plot. 110 00:08:13,595 --> 00:08:16,010 And we'll label our x axis as Population. 111 00:08:21,150 --> 00:08:26,090 Our y axis is Life Expectancy. 112 00:08:31,140 --> 00:08:35,933 Great, we'll do a plot.circle, 113 00:08:35,933 --> 00:08:40,423 pop in population for our x value, 114 00:08:45,409 --> 00:08:46,549 Life_expectancy 115 00:08:48,803 --> 00:08:49,762 for our y value. 
116 00:08:52,130 --> 00:08:55,115 Our source data is our country data. 117 00:08:58,319 --> 00:08:59,940 And we'll make our glyphs 15 points. 118 00:09:01,250 --> 00:09:05,620 That should look roughly familiar, but what is going on there with our x and 119 00:09:05,620 --> 00:09:06,480 y values? 120 00:09:06,480 --> 00:09:09,990 And what does that source equals country_data bit mean? 121 00:09:09,990 --> 00:09:13,320 Well, we are telling our plot to use our source variable, 122 00:09:13,320 --> 00:09:18,240 our ColumnDataSource information, as the source of our data. 123 00:09:18,240 --> 00:09:22,010 Our x values are the population values inside that data set, and 124 00:09:22,010 --> 00:09:27,570 our y values are the values of the matching Life Expectancy column of data. 125 00:09:27,570 --> 00:09:32,020 Notice that capitalization matters here, and that we need to match our x and 126 00:09:32,020 --> 00:09:35,600 y values to the correct names of our data set column headers. 127 00:09:36,630 --> 00:09:40,600 Now, we just need to tell Bokeh to show us our plot and we'll be all set. 128 00:09:42,990 --> 00:09:45,340 Show plot, and run our script. 129 00:09:51,160 --> 00:09:55,320 Awesome, we have a plot showing us life expectancy versus population for 130 00:09:55,320 --> 00:09:56,950 our five countries. 131 00:09:56,950 --> 00:10:00,050 Great work, we have seen several things here: 132 00:10:00,050 --> 00:10:04,416 how to utilize pandas data frames as a source of data, get them to work with 133 00:10:04,416 --> 00:10:09,157 Bokeh's column data source, and plot that data based on specific column names. 134 00:10:09,157 --> 00:10:12,995 It's not very helpful right now though in terms of data exploration because we can't 135 00:10:12,995 --> 00:10:14,230 tell which glyph is which. 136 00:10:15,870 --> 00:10:19,110 When we come back, we'll look at adding specific colors to our glyphs and 137 00:10:19,110 --> 00:10:22,210 adding legends to our graph to help us better understand our data.
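For reference, the population versus life expectancy plot built in this video can be sketched roughly as follows. The inline data frame stands in for the five rows read from country-pops.csv, and the column names Population and Life_expectancy are assumed from the narration, so check them against the real header. The sketch calls scatter, which draws circle markers by default, in place of the circle call heard in the video, since as far as I know newer Bokeh releases prefer that spelling.

```python
import pandas as pd
from bokeh.plotting import ColumnDataSource, figure

# Made-up stand-in for the first five rows of country-pops.csv.
countries = pd.DataFrame(
    {
        "Population": [1000, 2000, 3000, 4000, 5000],
        "Life_expectancy": [70.1, 71.2, 72.3, 73.4, 74.5],
    }
)
country_data = ColumnDataSource(countries)

# No tools argument is passed, so the default toolbar is used.
plot = figure(x_axis_label="Population", y_axis_label="Life Expectancy")

# x and y are column names looked up in the source; they must match the
# column headers exactly, including capitalization.
renderer = plot.scatter(
    x="Population", y="Life_expectancy", source=country_data, size=15
)

# To write and open the page as in the video, one would then call:
#   from bokeh.io import output_file, show
#   output_file("pop-life.html")
#   show(plot)
```

The show call is left as a comment so the sketch can run without opening a browser.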