1
00:00:00,025 --> 00:00:03,460
We've got the extremes of our data and
we've got the middle.
2
00:00:03,460 --> 00:00:05,570
But how is our data distributed?
3
00:00:05,570 --> 00:00:10,240
One common way to describe the spread of
our data is to use the standard deviation,
4
00:00:10,240 --> 00:00:13,490
which is commonly represented
as the Greek letter sigma.
5
00:00:13,490 --> 00:00:17,890
The standard deviation aims to tell us how
far away our data is from the average.
6
00:00:18,890 --> 00:00:19,730
To calculate it,
7
00:00:19,730 --> 00:00:23,730
we start by taking the difference
between each value and the average.
8
00:00:23,730 --> 00:00:26,950
Then we square each of those values,
add them up, and
9
00:00:26,950 --> 00:00:29,580
divide by the total number of values.
10
00:00:29,580 --> 00:00:34,550
This gives us the standard deviation
squared, which is also called the variance.
11
00:00:34,550 --> 00:00:39,110
So to get the standard deviation, we just
take the square root and there we go.
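As a rough sketch of the steps just described, here is the same calculation in Python. The list `values` is made up for illustration and is not the data shown in the video. Note that this divides by the total number of values, as the narration does, which gives the population standard deviation.

```python
import math

# Hypothetical values for illustration only; not the data from the video.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: the average (mean).
mean = sum(values) / len(values)

# Step 2: the difference between each value and the average, squared.
squared_diffs = [(v - mean) ** 2 for v in values]

# Step 3: add them up and divide by the number of values (the variance).
variance = sum(squared_diffs) / len(values)

# Step 4: the square root of the variance is the standard deviation.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # prints: 5.0 4.0 2.0
```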
12
00:00:39,110 --> 00:00:44,880
We've got a standard deviation of 64.29,
so if we were to put this on a graph,
13
00:00:44,880 --> 00:00:50,940
we'd put the average in the middle and
then go 64.29 above and below the average.
14
00:00:50,940 --> 00:00:55,210
Then we can say that any data in this
range is within one standard deviation
15
00:00:55,210 --> 00:00:56,630
of the average.
16
00:00:56,630 --> 00:00:58,741
So that's a pretty big range.
17
00:00:58,741 --> 00:01:03,453
Let's see what happens if instead of
a perfect game, our first bowler
18
00:01:03,453 --> 00:01:04,543
bowls a 135.
19
00:01:04,543 --> 00:01:10,726
Now, instead of an average of 134.5,
we've got an average of about 114 and
20
00:01:10,726 --> 00:01:14,830
our standard deviation is
all the way down to just 17.
21
00:01:14,830 --> 00:01:19,100
So if we make a plot of this new standard
deviation, we can see that this data
22
00:01:19,100 --> 00:01:22,461
is much more clustered together than
when it included a perfect game.
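To see how much a single extreme value can stretch the standard deviation, here is a small sketch. The scores are hypothetical (the video's individual scores are not listed in this transcript), so the numbers it prints will not match the ones quoted above.

```python
import statistics

# Hypothetical bowling scores for illustration; not the scores from the video.
with_perfect_game = [300, 120, 110, 105, 125, 115, 100, 130]

# The same scores, but the perfect game (300) is replaced by a 135.
without_perfect_game = [135] + with_perfect_game[1:]

# pstdev divides by the number of values, matching the steps in the narration.
spread_with = statistics.pstdev(with_perfect_game)
spread_without = statistics.pstdev(without_perfect_game)

print(round(spread_with, 2), round(spread_without, 2))
```

Only one score changed, but the spread drops sharply because squaring the differences gives values far from the average a lot of weight.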
23
00:01:23,660 --> 00:01:27,390
Let's calculate the standard deviation for
the finishing times.
24
00:01:27,390 --> 00:01:31,724
First, let's add a new label for
Standard Deviation in row nine.
25
00:01:36,784 --> 00:01:38,547
And let's make it bold and
26
00:01:38,547 --> 00:01:43,770
then double-click right here to
automatically set the width of the column.
27
00:01:45,420 --> 00:01:52,680
Then, in the cell next to it, let's type
=STDEV and hit Enter to select a function.
28
00:01:54,020 --> 00:01:58,220
Then let's paste in the range and
hit Enter again and
29
00:01:58,220 --> 00:02:02,610
it looks like we've got
a Standard Deviation of about 42 minutes.
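One thing to be aware of: in spreadsheets such as Excel and Google Sheets, STDEV is documented as a sample standard deviation, which divides by one less than the number of values rather than by the number of values. The sketch below compares the two in Python, assuming a made-up list of finishing times in minutes (not the race data from the video).

```python
import statistics

# Hypothetical finishing times in minutes; not the race data from the video.
finish_minutes = [180, 195, 210, 225, 240, 255, 270, 300]
n = len(finish_minutes)

# Population standard deviation: divides by n (the steps described earlier).
population_sd = statistics.pstdev(finish_minutes)

# Sample standard deviation: divides by n - 1 (what STDEV typically computes).
sample_sd = statistics.stdev(finish_minutes)

print(round(population_sd, 2), round(sample_sd, 2))
```

The sample version is always a little larger, and the gap shrinks as the number of values grows.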
30
00:02:02,610 --> 00:02:07,550
Also, if you're not seeing 42 minutes
here, you can come over here and
31
00:02:07,550 --> 00:02:11,180
change the data type to Duration and
that should fix your issue.
32
00:02:12,240 --> 00:02:17,700
So most racers finished within 42
minutes of the average finish time.
33
00:02:17,700 --> 00:02:21,000
But standard deviation
doesn't tell the whole story;
34
00:02:21,000 --> 00:02:25,670
it only tells us how compact or
spread out our data is.
35
00:02:25,670 --> 00:02:28,560
To get the rest of the picture,
we need to talk about skew.
36
00:02:29,730 --> 00:02:34,050
Skew is when your data seems to
favor one side over the other.
37
00:02:34,050 --> 00:02:37,520
Most of the data is either to the right or
left of the middle.
38
00:02:37,520 --> 00:02:40,340
And depending on which
side has the long tail,
39
00:02:40,340 --> 00:02:45,950
you would say that this data is either
skewed negatively or positively.
40
00:02:45,950 --> 00:02:49,690
An easy way to remember skew
directions is to start at the peak and
41
00:02:49,690 --> 00:02:52,340
draw an arrow towards the long tail.
42
00:02:52,340 --> 00:02:56,500
The direction that arrow points
is how the data is skewed.
43
00:02:56,500 --> 00:02:59,370
So this data has a negative skew.
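The arrow rule can also be checked with a number. A minimal sketch, assuming we measure skew as the third standardized moment (one common definition among several); the function name `skewness` and both data sets are made up for illustration.

```python
def skewness(values):
    """Third standardized moment: negative for a long left tail,
    positive for a long right tail, zero for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets for illustration only.
long_left_tail = [1, 7, 8, 8, 9, 9, 9, 10]   # peak on the right, tail to the left
long_right_tail = [1, 2, 2, 2, 3, 3, 4, 10]  # peak on the left, tail to the right

print(skewness(long_left_tail) < 0)   # negative skew
print(skewness(long_right_tail) > 0)  # positive skew
```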
44
00:02:59,370 --> 00:03:04,130
On the other hand, if your data has
no skew and its mean, median, and
45
00:03:04,130 --> 00:03:08,660
mode are all right in the middle,
then your data is said to have
46
00:03:08,660 --> 00:03:12,730
a normal distribution, which is
frequently referred to as a bell curve.
47
00:03:13,740 --> 00:03:16,730
Normal distributions have many
convenient properties and
48
00:03:16,730 --> 00:03:19,580
they occur fairly frequently in real life.
49
00:03:19,580 --> 00:03:21,610
People's heights, test scores, and
50
00:03:21,610 --> 00:03:25,200
even blood pressures are all
approximately normally distributed.
51
00:03:25,200 --> 00:03:29,350
One property of normal distributions
is how many values occur within a given
52
00:03:29,350 --> 00:03:30,925
standard deviation of the mean.
53
00:03:30,925 --> 00:03:35,832
68% of the data should be contained
within 1 standard deviation,
54
00:03:35,832 --> 00:03:39,080
95% should be contained within 2.
55
00:03:39,080 --> 00:03:44,020
And if you go out to 3
standard deviations at 99.7%,
56
00:03:44,020 --> 00:03:46,700
that should be pretty
much all of the data.
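These percentages can be reproduced, and compared against real data, with a short sketch. It assumes the standard result that the share of a normal distribution within k standard deviations of the mean is erf(k / sqrt(2)). The function names and the sample list are made up for illustration.

```python
import math

def normal_fraction_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def observed_fraction_within(values, k):
    """Share of the given values within k standard deviations of their mean,
    using the population standard deviation (dividing by the number of values)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(1 for v in values if abs(v - mean) <= k * sd) / n

# Theoretical percentages for 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(normal_fraction_within(k) * 100, 1))

# Hypothetical values for illustration only; not the data from the video.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(observed_fraction_within(sample, 1))
```

The closer the observed shares are to roughly 68%, 95%, and 99.7%, the more plausible it is that the data is normally distributed, although a match on these three numbers alone does not prove it.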
57
00:03:46,700 --> 00:03:49,920
Let's see if our data is normally
distributed by seeing how
58
00:03:49,920 --> 00:03:52,800
close we come to these
numbers in the next video.