Analyzing the Data6:34 with Ben Deitch
It's not always obvious how you should analyze your data. In this video we'll analyze whether some ages are over-represented at the Boston Marathon.
All right, we've got our lines in. 0:00 Now it's time to find out if any of our data crosses the line. 0:02 Let's start by creating a table of ages and 0:06 how many runners of each age ran the race. 0:08 Let's leave a space below Max Difference and then in row 6 let's add two labels, 0:12 Age and Count. 0:18 And let's go ahead and bold our labels. 0:22 Then, below our age label, let's add 18, and then 19. 0:25 Then let's highlight both of those cells, and drag down until we have an entry for 0:31 each age between the min at 18 and the max at 84. 0:38 Next, under the count label, 0:51 we need to count how many runners were the age reported in column A. 0:54 To do this, we'll use the Count if function, 1:00 which counts all the values in a range, but only if they meet certain criteria. 1:03 So, under our Count label, let's type =CountIf and hit Enter. 1:07 And then, we need to specify the range we are interested in, 1:14 which in this case is the ages of all the runners. 1:17 We already have that range and the formulas up here. 1:21 So I'll take a second to copy that. 1:24 And then down here, looks like I lost my CountIf, 1:32 but we can bring it back, and paste in the range. 1:36 Then, for the second parameter, 1:40 we need to check that the value is equal to the number in column A. 1:42 And remember, this needs to be represented as a string. 1:46 So let's type, in quotation marks, an equal sign, and then 1:50 let's concatenate that equal sign with the cell in column A by typing an ampersand. 1:55 And then clicking over here to select the cell and 2:02 hit Enter to finish our account statement. 2:05 Finally let's the drag list formula down to page 84 to give us the counts. 2:07 And notice that if we click on this last count, 2:25 the range of ages is locked in as C2 to C26,411. 2:30 You can get some pretty hard to find errors if you forget to lock 2:36 in these rangers by using the dollar signs. 2:39 So, if you want to be using the exact same range in a bunch of different cells, 2:42 make sure you're specifying your range with those dollar signs. 2:47 Now that we've got all our accounts, lets go back to the top and 2:52 add another column to keep track of the difference between the two ages. 2:55 Let's call it difference and 3:00 then quickly bold it, and then let's skip row 7 and 3:04 then row 8, let's type =B8-B7. 3:10 Then let's drag from there all the way to the bottom to give us all of our 3:14 differences. 3:18 We can already tell that some of these values are over 200. 3:21 But let's try and make them standout by using some conditional formatting. 3:26 Conditional formatting is formatting that only happens 3:30 once a certain condition is met. 3:33 For example, if you wanted to highlight all the positive cells in green and 3:35 all the negative cells in red, you could use conditional formatting. 3:40 To add conditional formatting to a Difference column, 3:43 let's select all of the differences And 3:46 then choose Format, Conditional formatting. 3:52 On the right we have options for our formatting. 3:57 Let's change the condition from if Cell is not empty to if Greater than 4:00 And add a value of 200. 4:10 Finally, let's change the formatting style to have a red background. 4:13 And there we go. 4:18 It looks like we've got one, two, three, 4:19 four instances where the difference is more than 200. 4:24 However, we really shouldn't be hard coding this 200 value. 4:30 It should be set to the value of our max difference. 4:33 So, instead of 200, let's type =, and this is cell B4 and 4:38 we're not allowed click it, which means we have to type it in. 4:44 Well that doesn't look quite right. 4:51 What's going on here? 4:53 Well, it turns out that when we type B4 like this, it starts with C8 and 4:54 it compares it to B4, but then it moves on to C9 and it compares it to B5 and 5:01 C10 with B6, and so on. 5:07 If we want this to compare each cell with B4, 5:12 we need to use dollar signs to make this an absolute reference. 5:15 That's better. 5:23 Then let's click done, and 5:24 by our criteria, we can now answer the question we set out to solve. 5:27 Some ages do have an easier time qualifying than others. 5:32 Though before we move on, 5:36 we should briefly talk about statistical significance. 5:38 Imagine you just created a new drug. 5:41 Well, before you'd be able to sell that drug, you'd need to make sure it was safe. 5:44 You'd want to test it on a small subset of people 5:49 before releasing it to the whole population. 5:51 But how many people would you need to test it on? 5:54 If you tried it on five people and they all turned out fine, 5:57 would you say that it's safe for the whole population? 6:00 No way, but then, how many people do you need? 6:03 1,000? 6:07 10,000? 6:08 Either way, at some point your sample is big enough, 6:10 that you can start to drag conclusions about the whole population. 6:12 This is what's statistical significance is all about. 6:17 It tells us if our findings can be safely extrapolated to the whole population. 6:20 If you'd like to read more about statistical significance, 6:26 check out the teacher's notes below. 6:28 Coming up we'll look at how we can present our findings to management. 6:30
You need to sign up for Treehouse in order to download course files.Sign up