Analyzing Data Spread (3:53) with Ben Deitch
Data isn't always distributed the way you want. In this video we'll talk about a few of the different ways we can measure the spread of our data.
We've got the extremes of our data and we've got the middle. But how is our data distributed? One common way to describe the spread of our data is to use the standard deviation, which is commonly represented as the Greek letter sigma. The standard deviation aims to tell us how far away our data is from the average.

To calculate it, we start by taking the difference between each value and the average. Then we square each of those values, add them up, and divide by the total number of values. This gives us the standard deviation squared, which is also called the variance. So to get the standard deviation, we just take the square root, and there we go.

We've got a standard deviation of 64.29, so if we were to put this on a graph, we'd put the average in the middle and then go 64.29 above and below the average. Then we can say that any data in this range is within one standard deviation of the average. So that's a pretty big range.

Let's see what happens if, instead of a perfect game, our first bowler bowls a 135. Now, instead of an average of 134.5, we've got an average of about 114, and our standard deviation is all the way down to just 17. So if we make a plot of this new standard deviation, we can see that this data is much more clustered together than when it included a perfect game.

Let's calculate the standard deviation for the finishing times. First, let's add a new label for Standard Deviation in row nine. And let's make it bold and then double-click right here to automatically set the width of the column. Then, in the cell next to it, let's type =STDEV and hit Enter to select a function. Then let's paste in the range and hit Enter again, and it looks like we've got a Standard Deviation of about 42 minutes.
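The calculation described above (take each value's difference from the average, square it, add the squares up, divide by the number of values, then take the square root) can be sketched in Python. The scores here are made up for illustration and are not the bowling scores from the video. One detail worth keeping in mind: dividing by the total number of values gives the population standard deviation, while the spreadsheet STDEV function computes the sample standard deviation, which divides by one less than the number of values, so the two results can differ slightly.

```python
import math
import statistics


def population_std_dev(values):
    """Standard deviation computed by dividing by the number of values."""
    average = sum(values) / len(values)
    # Difference between each value and the average, squared
    squared_diffs = [(v - average) ** 2 for v in values]
    # Add them up and divide by the total number of values: the variance
    variance = sum(squared_diffs) / len(values)
    # The standard deviation is the square root of the variance
    return math.sqrt(variance)


# Hypothetical scores, chosen only for illustration
scores = [100, 110, 120, 130, 140]

manual = population_std_dev(scores)
print(manual)
print(statistics.pstdev(scores))  # same population formula from the standard library
print(statistics.stdev(scores))   # sample version (divides by n - 1), like STDEV
```

For a small data set like this one, the sample version comes out noticeably larger than the population version; the gap shrinks as the number of values grows.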
Also, if you're not seeing 42 minutes here, you can come over here and change the data type to Duration, and that should fix your issue. So most racers finished within 42 minutes of the average finish time.

But standard deviation doesn't tell the whole story; it only tells us how compact or spread out our data is. To get the rest of the picture, we need to talk about skew. Skew is when your data seems to favor one side over the other. Most of the data is either to the right or left of the middle. And depending on which side has the long tail, you would say that this data is either skewed negatively or positively. An easy way to remember skew directions is to start at the peak and draw an arrow towards the long tail. The direction that arrow points is how the data is skewed. So this data has a negative skew.

On the other hand, if your data has no skew and its mean, median, and mode are all right in the middle, then your data is said to have a normal distribution, which is frequently referred to as a bell curve. Normal distributions have many convenient properties and they occur fairly frequently in real life. People's heights, test scores, and even blood pressures are all approximately normally distributed.

One property of normal distributions is how many values occur within a given number of standard deviations of the mean. About 68% of the data should be contained within 1 standard deviation, and about 95% should be contained within 2. And if you go out to 3 standard deviations, at 99.7%, that should be pretty much all of the data. Let's see if our data is normally distributed by seeing how close we come to these numbers in the next video.
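As a rough sketch of where the 68%, 95%, and 99.7% figures come from, and of how you might compare a data set against them, the Python below uses the standard library's NormalDist. The helper names and the finishing times are hypothetical and chosen only for illustration; they are not the race data from the video.

```python
from statistics import NormalDist, mean, pstdev


def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def observed_share_within(values, k):
    """Share of the given values within k standard deviations of their own mean."""
    average = mean(values)
    spread = pstdev(values)
    inside = [v for v in values if abs(v - average) <= k * spread]
    return len(inside) / len(values)


# Theoretical shares for 1, 2, and 3 standard deviations
for k in (1, 2, 3):
    print(k, round(normal_share_within(k), 4))

# Hypothetical finishing times in minutes, for illustration only
times = [95, 110, 118, 120, 125, 131, 140, 152, 160, 210]
for k in (1, 2, 3):
    print(k, observed_share_within(times, k))
```

If the observed shares land far from the theoretical ones, that is a hint the data may not be normally distributed, though a small sample like this one can stray from those figures by chance.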