1 00:00:00,620 --> 00:00:02,703 When you're examining a set of data, 2 00:00:02,703 --> 00:00:08,500 sometimes you're unsure whether there's a correlation between two variables or not. 3 00:00:08,500 --> 00:00:12,338 When I say correlation, I mean how strong a relationship or 4 00:00:12,338 --> 00:00:15,500 connection there is between these two factors. 5 00:00:16,770 --> 00:00:21,826 In a scatter plot, we place various points on a graph based on two different 6 00:00:21,826 --> 00:00:26,730 variable values, then study the correlation between those variables. 7 00:00:28,080 --> 00:00:32,398 Here's an example of a scatter plot that compares the temperature 8 00:00:32,398 --> 00:00:36,188 in Celsius to a store's daily ice cream sales in dollars. 9 00:00:36,188 --> 00:00:40,381 As you might expect, when it's colder than 5 degrees Celsius, or 10 00:00:40,381 --> 00:00:46,120 under about 40 degrees Fahrenheit, very few people want to eat ice cream. 11 00:00:46,120 --> 00:00:50,926 Sales don't really pick up until the thermometer reaches around 25 degrees 12 00:00:50,926 --> 00:00:54,731 Celsius, which in Fahrenheit is about the mid to high 70s. 13 00:00:54,731 --> 00:00:57,960 And from there ice cream sales really accelerate. 14 00:00:57,960 --> 00:01:03,100 35 degrees Celsius equals 95 degrees Fahrenheit. 15 00:01:03,100 --> 00:01:07,070 And on a hot day like that, you can bet people want ice cream. 16 00:01:07,070 --> 00:01:09,710 So how do we judge a chart like this? 17 00:01:09,710 --> 00:01:13,868 I'm looking at the anatomy of a scatter plot from a website called the Data 18 00:01:13,868 --> 00:01:16,170 Visualisation Catalogue. 19 00:01:16,170 --> 00:01:20,930 The words we would use to describe the relationship between temperature and 20 00:01:20,930 --> 00:01:24,660 ice cream sales are positive, exponential, and strong. 
21 00:01:25,810 --> 00:01:29,063 Positive, because the data points trend from the lower left 22 00:01:29,063 --> 00:01:31,400 to the upper right of our chart. 23 00:01:31,400 --> 00:01:35,390 In other words, as temperature increases, so do ice cream sales. 24 00:01:36,400 --> 00:01:40,506 Exponential, because the shape of our data forms a curve rather than 25 00:01:40,506 --> 00:01:42,410 a straight line. 26 00:01:42,410 --> 00:01:46,919 We don't see a steady increase in ice cream sales as temperatures rise above 27 00:01:46,919 --> 00:01:48,390 freezing. 28 00:01:48,390 --> 00:01:53,021 Rather, we see a small increase in sales during moderate temperatures and 29 00:01:53,021 --> 00:01:56,500 a massive leap in sales once it gets hot out. 30 00:01:56,500 --> 00:01:59,983 Finally, we say the correlation between temperature and 31 00:01:59,983 --> 00:02:04,760 ice cream sales is strong because there aren't a lot of outliers. 32 00:02:04,760 --> 00:02:09,190 Outliers are data points that don't fit the general pattern. 33 00:02:09,190 --> 00:02:12,757 If for some reason there was a spike in ice cream sales when 34 00:02:12,757 --> 00:02:16,103 the temperature was exactly 9 degrees Celsius, and 35 00:02:16,103 --> 00:02:21,630 sales dipped again at 10 degrees Celsius, that would be pretty unexpected. 36 00:02:21,630 --> 00:02:26,020 And the data point would be found pretty far from our general curve. 37 00:02:26,020 --> 00:02:28,239 The more outliers in your data, 38 00:02:28,239 --> 00:02:32,189 the weaker the correlation between your two variables. 39 00:02:32,189 --> 00:02:34,644 To look at a basketball example, 40 00:02:34,644 --> 00:02:40,710 here's a scatter plot comparing NBA win percentage versus point differential. 41 00:02:40,710 --> 00:02:42,954 When I say point differential, 42 00:02:42,954 --> 00:02:47,533 I mean the difference between points scored and points allowed. 
43 00:02:47,533 --> 00:02:51,974 If your favorite team scores 110 points a game and allows only 100, 44 00:02:51,974 --> 00:02:56,880 that's a positive 10 point differential, which is really good. 45 00:02:56,880 --> 00:03:01,500 According to this chart, a team with a positive 10 point differential would 46 00:03:01,500 --> 00:03:04,670 be expected to win 80% of their games. 47 00:03:04,670 --> 00:03:07,524 The correlation between win percentage and 48 00:03:07,524 --> 00:03:11,233 point differential is positive, linear, and strong. 49 00:03:11,233 --> 00:03:16,948 Teams that outscore their opponents win games, while teams that don't, lose. 50 00:03:18,242 --> 00:03:23,000 On the other hand, here's another chart comparing win percentage to 51 00:03:23,000 --> 00:03:27,285 points scored. There might be a positive correlation here, but 52 00:03:27,285 --> 00:03:32,030 it's an extremely weak one with lots of outliers. The reason? 53 00:03:32,030 --> 00:03:37,110 A team needs a strong defense to win, not just players who can score. 54 00:03:37,110 --> 00:03:41,001 If your favorite team scores 110 points a game but 55 00:03:41,001 --> 00:03:44,550 allows 115, expect them to lose quite a bit. 56 00:03:46,080 --> 00:03:50,881 A phrase you'll hear often when studying scatter plots is that correlation 57 00:03:50,881 --> 00:03:53,510 doesn't equal causation. 58 00:03:53,510 --> 00:03:58,342 For example, if we were to plot ice cream sales versus swimsuit sales, 59 00:03:58,342 --> 00:04:03,960 we'd probably find a positive, exponential, and strong correlation. 60 00:04:03,960 --> 00:04:07,506 When ice cream sales are slow, so are swimsuit sales. 61 00:04:07,506 --> 00:04:12,740 And when the swimsuit sales pick up, the same is true for ice cream. 62 00:04:12,740 --> 00:04:16,959 But does that mean the increase in ice cream sales caused the increase in 63 00:04:16,959 --> 00:04:18,340 swimsuit sales? 64 00:04:18,340 --> 00:04:19,580 Not really. 
65 00:04:19,580 --> 00:04:22,863 People rush to buy both items because of a third variable, 66 00:04:22,863 --> 00:04:24,540 an increase in temperature.
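
To make the ice cream and swimsuit point concrete, here is a minimal Python sketch. Every number in it is made up for illustration and does not come from the charts in the video, and the names `pearson`, `temps_c`, `ice_cream_sales`, and `swimsuit_sales` are hypothetical. It computes the Pearson correlation coefficient, one common way to put a number on how strong a relationship is. Note that Pearson measures how close the points are to a straight line, so a curved relationship such as temperature versus ice cream sales can score below 1 even when the pattern is tight.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily figures, invented for this sketch.
temps_c = [0, 5, 10, 15, 20, 25, 30, 35]                # the third variable
ice_cream_sales = [20, 25, 40, 70, 120, 210, 380, 650]  # dollars
swimsuit_sales = [5, 8, 15, 30, 55, 100, 190, 330]      # dollars

# The two sales series move together closely...
r_sales = pearson(ice_cream_sales, swimsuit_sales)
# ...because each one rises with temperature.
r_temp_ice = pearson(temps_c, ice_cream_sales)

print(f"ice cream vs swimsuits:   r = {r_sales:.2f}")
print(f"temperature vs ice cream: r = {r_temp_ice:.2f}")
```

A coefficient near 1 for the two sales series only says that they rise and fall together. It cannot say that one causes the other; in this invented data both are driven by temperature.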