1 00:00:00,210 --> 00:00:04,020 As I've mentioned, histograms are used to show distributions of data. 2 00:00:04,020 --> 00:00:07,240 This can be very useful to see how closely grouped together or 3 00:00:07,240 --> 00:00:09,240 spread out a variable is. 4 00:00:09,240 --> 00:00:12,170 The area of the rectangles in a histogram is proportional 5 00:00:12,170 --> 00:00:14,540 to the frequency of the variable. 6 00:00:14,540 --> 00:00:15,160 This allows for 7 00:00:15,160 --> 00:00:18,590 the rough assessment of the probable distribution of a given variable. 8 00:00:19,620 --> 00:00:20,650 The rectangles, or 9 00:00:20,650 --> 00:00:25,710 bins, in a histogram are important to consider when doing data visualization. 10 00:00:25,710 --> 00:00:27,710 Both the overall number of bins and 11 00:00:27,710 --> 00:00:31,120 the bin width can have an impact on the overall presentation of data. 12 00:00:32,360 --> 00:00:36,290 From our iris data set, let's generate a histogram chart to see the distribution 13 00:00:36,290 --> 00:00:36,830 of petal length. 14 00:00:38,340 --> 00:00:41,850 Let's examine the petal lengths of the Iris-virginica class and 15 00:00:41,850 --> 00:00:43,710 visualize the distribution of that data. 16 00:00:43,710 --> 00:00:47,230 Here's where we start off in a new notebook, iris histogram, 17 00:00:47,230 --> 00:00:51,100 by bringing in our data and getting it stored in a list called irises. 18 00:00:51,100 --> 00:00:54,830 Let's process this list to obtain just the petal length 19 00:00:54,830 --> 00:00:56,390 of the Iris-virginica species. 20 00:00:57,540 --> 00:00:59,307 Create a list to hold our data. 21 00:01:08,570 --> 00:01:11,114 Let's also create a variable for our number of bins, 22 00:01:11,114 --> 00:01:14,441 so we can see how changing the number of bins impacts our visualization. 23 00:01:18,950 --> 00:01:21,201 Now let's loop through our data to get our petal lengths. 24 00:01:23,108 --> 00:01:28,424 For petal in range of our iris data.
25 00:01:50,043 --> 00:01:53,855 So if the species is Iris-virginica, 26 00:01:53,855 --> 00:02:00,561 we'll add the petal length to our virginica_petal_length list. 27 00:02:12,232 --> 00:02:13,937 And we'll get that from our iris data. 28 00:02:19,108 --> 00:02:22,390 Now we can pass our data into our plt.hist method. 29 00:02:22,390 --> 00:02:24,110 This method takes several parameters, 30 00:02:24,110 --> 00:02:27,220 including the number of bins we'd like to have, 31 00:02:27,220 --> 00:02:30,246 the color we'd like to set, and an alpha value. 32 00:02:30,246 --> 00:02:36,412 To plt.hist, we pass in our virginica_petal_length. 33 00:02:40,960 --> 00:02:42,209 Our number of bins. 34 00:02:45,660 --> 00:02:47,948 The color of our plot will be red. 35 00:02:50,625 --> 00:02:55,260 And we give it an alpha value to make it slightly transparent. 36 00:02:55,260 --> 00:02:58,400 As I've mentioned, it's always important to add labels to your charts. 37 00:02:59,620 --> 00:03:01,119 For the chart title. 38 00:03:05,403 --> 00:03:07,483 Iris-virginica Petal Length. 39 00:03:13,500 --> 00:03:16,334 We'll give that a font size of 12. 40 00:03:16,334 --> 00:03:20,476 For our x-axis, for xlabel, 41 00:03:20,476 --> 00:03:23,978 we'll give it what it is, 42 00:03:23,978 --> 00:03:28,133 Petal length in centimeters. 43 00:03:30,367 --> 00:03:34,857 Font size of 10. 44 00:03:34,857 --> 00:03:36,710 And for our ylabel, 45 00:03:36,710 --> 00:03:38,622 we'll just call it Probability. 46 00:03:46,160 --> 00:03:48,562 And again, we'll give that a font size of 10. 47 00:03:54,077 --> 00:03:57,715 Cool, and then we call our show method and run our cell. 48 00:04:02,392 --> 00:04:05,500 We are shown a histogram chart with red rectangles. 49 00:04:05,500 --> 00:04:08,180 However, the rectangles are all clumped together and 50 00:04:08,180 --> 00:04:10,270 can be a challenge to differentiate.
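The steps narrated so far could look roughly like the sketch below. It is not the notebook from the video: the row layout of irises (species in the last column, petal length in the third), the handful of sample rows, the alpha value of 0.5, and the plot_histogram helper are all assumptions made for illustration. The matplotlib import sits inside the helper so the filtering part runs on its own.

```python
# Illustrative sample rows only; the notebook in the video loads the full data set.
# Assumed row layout: [sepal_length, sepal_width, petal_length, petal_width, species]
irises = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
    [6.3, 3.3, 6.0, 2.5, "Iris-virginica"],
    [5.8, 2.7, 5.1, 1.9, "Iris-virginica"],
    [7.1, 3.0, 5.9, 2.1, "Iris-virginica"],
]

# List to hold the petal lengths, plus a variable for the number of bins.
virginica_petal_length = []
number_of_bins = 10

# Loop through the data by index and keep only Iris-virginica petal lengths.
for petal in range(len(irises)):
    if irises[petal][4] == "Iris-virginica":
        virginica_petal_length.append(float(irises[petal][2]))


def plot_histogram(values, bins):
    """Hypothetical helper: draw the histogram with the labels from the video."""
    import matplotlib.pyplot as plt  # imported here so the filtering above has no dependency

    # alpha=0.5 is an assumed value; the video only says "slightly transparent".
    plt.hist(values, bins=bins, color="red", alpha=0.5)
    plt.title("Iris-virginica Petal Length", fontsize=12)
    plt.xlabel("Petal length (cm)", fontsize=10)
    plt.ylabel("Probability", fontsize=10)
    plt.show()
```

In a notebook cell, calling plot_histogram(virginica_petal_length, number_of_bins) draws the chart. Note that plt.hist plots raw counts by default; passing density=True is one way to make the y-axis a normalized density instead, which may be closer to what the "Probability" label suggests.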
51 00:04:10,270 --> 00:04:16,050 Matplotlib includes some chart styling options which can help out. 52 00:04:16,050 --> 00:04:19,940 Let's apply matplotlib's classic style to our chart and 53 00:04:19,940 --> 00:04:21,060 see if it helps clear things up. 54 00:04:22,850 --> 00:04:28,725 We'll go back up here, under where we assign our figure size. 55 00:04:32,360 --> 00:04:37,965 We'll ask it to use the classic style and then we can run our cell. 56 00:04:40,452 --> 00:04:41,850 That's much better. 57 00:04:41,850 --> 00:04:44,010 Now, we are setting our number of bins to ten, 58 00:04:44,010 --> 00:04:47,700 which is also the matplotlib default for histograms. 59 00:04:47,700 --> 00:04:52,902 Let's change that to 15 and then to 5 to see how that impacts our visualization. 60 00:04:58,345 --> 00:05:02,610 Notice here that at 15 bins we have some empty bins. 61 00:05:02,610 --> 00:05:07,150 While we get more detail about the data set, it also spreads the data into 62 00:05:07,150 --> 00:05:11,430 a broken comb look that doesn't provide as clear a picture of the distribution. 63 00:05:11,430 --> 00:05:14,125 And if we go back and set it to 5 bins. 64 00:05:19,668 --> 00:05:23,390 At 5 bins, the data isn't portrayed very well either. 65 00:05:23,390 --> 00:05:27,460 There are a variety of formulas and considerations for the number of bins and 66 00:05:27,460 --> 00:05:28,560 their widths to use. 67 00:05:28,560 --> 00:05:33,100 I've included links to some resources for these in the teacher's notes. 68 00:05:33,100 --> 00:05:37,200 It is not uncommon in practice to produce multiple histograms 69 00:05:37,200 --> 00:05:41,220 with different numbers of bins before settling on the best communication tool. 70 00:05:42,540 --> 00:05:46,140 Histograms are great for exploring the distribution of data, but 71 00:05:46,140 --> 00:05:49,100 our data set has many more ways that it can be explored.
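To make the bin discussion above concrete, here is a rough sketch of how the same values fall into 5, 10, or 15 equal-width bins, and how the classic style could be applied before plotting. The bin_counts and plot_with_bins helpers and the sample values are hypothetical, not taken from the video, and bin_counts is only a simplified approximation of equal-width binning, not matplotlib's own implementation.

```python
def bin_counts(values, bins):
    """Hypothetical helper: count values in `bins` equal-width bins from min to max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are identical
    counts = [0] * bins
    for v in values:
        # The largest value would land one past the end, so clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def plot_with_bins(values, bin_options=(5, 10, 15)):
    """Hypothetical helper: draw one classic-style histogram per bin count."""
    import matplotlib.pyplot as plt  # imported here so bin_counts has no dependency

    plt.style.use("classic")
    for bins in bin_options:
        plt.figure(figsize=(6, 4))  # assumed figure size
        plt.hist(values, bins=bins, color="red", alpha=0.5)
        plt.title(f"Iris-virginica Petal Length ({bins} bins)", fontsize=12)
        plt.xlabel("Petal length (cm)", fontsize=10)
        plt.show()


# Made-up petal lengths for illustration only.
sample_lengths = [4.5, 4.8, 5.0, 5.1, 5.1, 5.5, 5.6, 5.8, 6.0, 6.1, 6.9]

for bins in (5, 10, 15):
    counts = bin_counts(sample_lengths, bins)
    print(bins, "bins:", counts, "| empty bins:", counts.count(0))
```

With few values and many bins, some bins are likely to come out empty, which is the "broken comb" effect described above. Calling plot_with_bins(sample_lengths) in a notebook shows the three charts for comparison.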
72 00:05:49,100 --> 00:05:50,600 Sepal length, and sepal and 73 00:05:50,600 --> 00:05:54,610 petal width can all be explored across all the different species. 74 00:05:54,610 --> 00:05:56,020 Before the next video, 75 00:05:56,020 --> 00:05:59,450 practice creating some other histograms of this data on your own. 76 00:05:59,450 --> 00:06:01,510 Next, we'll look at box plots.