1 00:00:00,450 --> 00:00:04,100 We left off with our data in a list, just waiting to be used. 2 00:00:04,100 --> 00:00:07,960 Let's use a scatter plot to see what correlations, if any, 3 00:00:07,960 --> 00:00:11,400 there are between the sepal length and width, based on the variety of iris. 4 00:00:12,620 --> 00:00:16,994 [SOUND] Recall that scatter plots are used to show how much one variable is affected 5 00:00:16,994 --> 00:00:18,875 by another, that is, their correlation. 6 00:00:18,875 --> 00:00:22,360 We use scatter plots to show relationships between values. 7 00:00:22,360 --> 00:00:24,782 In this case, our sepal length and width. 8 00:00:24,782 --> 00:00:28,850 The scatter plot allows us to quickly visualize the distribution of the data and 9 00:00:28,850 --> 00:00:30,860 notice any outliers. 10 00:00:30,860 --> 00:00:33,620 We can see if there’s a positive, negative, or 11 00:00:33,620 --> 00:00:37,210 nonexistent correlation in our data based on the scatter plot results. 12 00:00:38,340 --> 00:00:40,910 Let's jump back into our Python code and develop our chart. 13 00:00:42,060 --> 00:00:46,835 Let's get started from the previous video's code and 14 00:00:46,835 --> 00:00:50,686 rename this notebook iris_scatter. 15 00:00:58,973 --> 00:01:02,207 Since we'll be starting our work with plots now, 16 00:01:02,207 --> 00:01:06,597 we'll need to import matplotlib.pyplot into our project. 17 00:01:06,597 --> 00:01:12,470 import matplotlib.pyplot as plt. 18 00:01:12,470 --> 00:01:13,810 Let's create a dict for 19 00:01:13,810 --> 00:01:17,580 our marker colors that we can use as we loop through our list information. 20 00:01:18,660 --> 00:01:22,720 The colors allow us to see the different iris classes more easily, and 21 00:01:22,720 --> 00:01:24,890 visualize a third variable in our data set. 22 00:01:25,980 --> 00:01:27,932 I'll paste that dict in here. 23 00:01:27,932 --> 00:01:30,371 I have included a copy of it in the teacher's notes.
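The import and the marker-color dict described here might look roughly like the sketch below. The exact color values live in the teacher's notes and are not shown in this transcript, so the hex code and color names used here are assumptions chosen only to match "blue", "green (short code)", and "purple". The "Iris-..." key spelling is assumed from the class names used later in the video.

```python
import matplotlib.pyplot as plt

# Hypothetical color values: the real dict is in the teacher's notes.
# Keys are assumed to match the class names stored in the data list.
colors = {
    "Iris-setosa": "#2B5B84",   # a blue hue (assumed hex value)
    "Iris-versicolor": "g",     # matplotlib's one-letter short code for green
    "Iris-virginica": "purple", # a named matplotlib color
}
```

Keying the dict by class name lets the plotting loop look up a color directly from the species string found in each row.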
24 00:01:38,282 --> 00:01:42,760 Our colors then are the blue hue for setosa. 25 00:01:42,760 --> 00:01:45,950 Green, using its short code, for versicolor. 26 00:01:45,950 --> 00:01:48,100 And purple for virginica. 27 00:01:48,100 --> 00:01:52,502 Our list of iris data also includes an extra item that we don't need at the end, 28 00:01:52,502 --> 00:01:53,845 so let's pop that off. 29 00:01:57,943 --> 00:02:01,219 Now we'll want to loop through our array, and assign our x and 30 00:02:01,219 --> 00:02:04,270 y-values to the sepal length and width. 31 00:02:04,270 --> 00:02:08,470 These are located in the first and second columns of our array, respectively. 32 00:02:09,630 --> 00:02:14,570 We can use a function in the itertools library called groupby that allows us to 33 00:02:14,570 --> 00:02:15,950 easily do that. 34 00:02:15,950 --> 00:02:18,516 Let's add that import first and I'll show you the code. 35 00:02:23,111 --> 00:02:27,075 from itertools import groupby. 36 00:02:30,521 --> 00:02:34,317 If you haven't used itertools, it is a module that provides functions for 37 00:02:34,317 --> 00:02:36,070 efficient looping. 38 00:02:36,070 --> 00:02:38,920 Check the teacher's notes for additional information. 39 00:02:38,920 --> 00:02:45,896 So to start this, species and 40 00:02:45,896 --> 00:02:50,365 group in groupby, 41 00:02:54,802 --> 00:02:58,531 group is an iterator, so you can only go over it one time. 42 00:03:09,535 --> 00:03:13,195 And then we'll get our sepal length. 43 00:03:16,822 --> 00:03:18,889 It's gonna be the float value. 44 00:03:32,381 --> 00:03:34,736 Sepal widths, similar. 45 00:03:51,757 --> 00:03:53,950 Then we pass those to plt.scatter. 46 00:03:55,390 --> 00:04:01,769 sepal_lengths, sepal_widths for our y-value there. 47 00:04:05,004 --> 00:04:07,345 We'll assign it a marker size of 10. 48 00:04:10,257 --> 00:04:14,583 c for the colors will come from our colors dict, and grab the species.
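The grouping step described so far can be sketched without any plotting, which makes it easier to see what groupby hands back. The rows below are a small hypothetical stand-in for the list built in the previous video (string values, with the class name assumed to sit in the fifth column), and the names `iris_data`, `rows`, and `grouped` are illustrative only.

```python
from itertools import groupby

# Hypothetical stand-in for the list from the previous video:
# each row holds string values, with the class name in column index 4.
iris_data = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["4.9", "3.0", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
    ["6.4", "3.2", "4.5", "1.5", "Iris-versicolor"],
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
    ["5.8", "2.7", "5.1", "1.9", "Iris-virginica"],
]

grouped = {}
for species, group in groupby(iris_data, key=lambda row: row[4]):
    # The group can be consumed only once, so copy it into a list
    # before reading from it twice.
    rows = list(group)
    sepal_lengths = [float(row[0]) for row in rows]  # first column
    sepal_widths = [float(row[1]) for row in rows]   # second column
    grouped[species] = (sepal_lengths, sepal_widths)
```

Note that groupby only groups consecutive rows with the same key, so this relies on the rows already being ordered by class; unsorted data would need to be sorted by the same key first.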
49 00:04:17,515 --> 00:04:19,840 And we'll label based on species as well. 50 00:04:22,580 --> 00:04:27,920 Now, before we call plt.show, let's add a plot title, axis labels, 51 00:04:27,920 --> 00:04:31,660 and legend to our chart here to add context to our data. 52 00:04:31,660 --> 00:04:34,220 This is an important thing to remember. 53 00:04:34,220 --> 00:04:38,040 Always label your axes, legends, and charts. 54 00:04:38,040 --> 00:04:45,768 plt.title, Fisher's Iris Data Set. 55 00:04:47,562 --> 00:04:52,298 We'll give that a fontsize of 12. 56 00:04:54,716 --> 00:04:56,024 Bring that up a little bit. 57 00:04:58,032 --> 00:04:58,930 Our xlabel. 58 00:05:01,600 --> 00:05:05,380 These are our sepal lengths in centimeters. 59 00:05:07,150 --> 00:05:09,332 We’ll assign that a fontsize of 10. 60 00:05:11,233 --> 00:05:14,930 For our ylabel, these are our sepal widths. 61 00:05:16,740 --> 00:05:21,412 Again, in centimeters, and we'll give that a fontsize of 10 as well. 62 00:05:27,016 --> 00:05:31,054 We'll call plt.legend, and 63 00:05:31,054 --> 00:05:35,103 we'll give this a location in the upper right. 64 00:05:37,091 --> 00:05:40,412 Here we are setting the legend location to be displayed in the upper right 65 00:05:40,412 --> 00:05:41,660 of the chart. 66 00:05:41,660 --> 00:05:46,380 But we could display it in the upper left, upper center, lower left, etc. 67 00:05:46,380 --> 00:05:50,810 Since there aren't any data points being displayed in the upper right, 68 00:05:50,810 --> 00:05:52,970 that seems like a good position. 69 00:05:52,970 --> 00:05:54,965 Now we just call plt.show. 70 00:05:58,431 --> 00:05:59,680 And run our cell. 71 00:06:05,012 --> 00:06:07,356 We can see some patterns here in our sepal data. 72 00:06:07,356 --> 00:06:12,580 Iris-setosa forms a pretty good grouping in the upper left quadrant of our chart. 73 00:06:12,580 --> 00:06:14,610 There are some outliers though.
74 00:06:14,610 --> 00:06:17,770 The other two varieties seem to be clumped together and 75 00:06:17,770 --> 00:06:20,400 intermixed with some even greater outliers. 76 00:06:20,400 --> 00:06:22,650 Our plot looks a bit small here, though. 77 00:06:22,650 --> 00:06:26,660 Let's assign a size to our figure to make it a bit easier to see. 78 00:06:26,660 --> 00:06:30,218 To do that, go up here, 79 00:06:30,218 --> 00:06:34,196 right under input_file, that's kind of a standard spot for it. 80 00:06:36,833 --> 00:06:40,258 We set something on the figure object, figsize. 81 00:06:44,647 --> 00:06:50,741 7.5, 4.25 seems to work pretty well, and we can run our cell again. 82 00:06:54,861 --> 00:06:57,110 There, that's better. 83 00:06:57,110 --> 00:07:01,620 From an analysis standpoint, we could draw some conclusions based on this chart. 84 00:07:01,620 --> 00:07:05,400 It appears that all three iris varieties have a positive correlation 85 00:07:05,400 --> 00:07:06,815 between sepal length and width. 86 00:07:06,815 --> 00:07:13,290 Iris-setosa has a better-defined positive correlation than the other varieties. 87 00:07:13,290 --> 00:07:16,630 Scatter plots are, of course, only one way to explore our data.
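Putting the steps from this video together, the finished cell might look something like the sketch below. It is a reconstruction from the narration, not the instructor's exact code: the sample rows, the color values, and the exact axis label strings are assumptions, and the Agg backend line is only there so the sketch runs without a display (leave it out in a notebook).

```python
from itertools import groupby

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for this sketch only
import matplotlib.pyplot as plt

# Hypothetical color values and sample rows (see the teacher's notes
# and the previous video for the real dict and the full data list).
colors = {
    "Iris-setosa": "#2B5B84",
    "Iris-versicolor": "g",
    "Iris-virginica": "purple",
}
iris_data = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["4.9", "3.0", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
    ["6.4", "3.2", "4.5", "1.5", "Iris-versicolor"],
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
    ["5.8", "2.7", "5.1", "1.9", "Iris-virginica"],
]

# Size the figure (width, height in inches) before plotting into it.
fig = plt.figure(figsize=(7.5, 4.25))

for species, group in groupby(iris_data, key=lambda row: row[4]):
    rows = list(group)  # the group can only be iterated once
    sepal_lengths = [float(row[0]) for row in rows]
    sepal_widths = [float(row[1]) for row in rows]
    # One scatter call per class: marker size 10, color and legend
    # label both taken from the species name.
    plt.scatter(sepal_lengths, sepal_widths, s=10,
                c=colors[species], label=species)

plt.title("Fisher's Iris Data Set", fontsize=12)
plt.xlabel("Sepal length (cm)", fontsize=10)  # label wording assumed
plt.ylabel("Sepal width (cm)", fontsize=10)   # label wording assumed
plt.legend(loc="upper right")
plt.show()
```

Calling plt.scatter once per class, rather than once for the whole list, is what lets each class get its own color and its own legend entry.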