Defining Terms4:38 with Ben Deitch
Before you start an analysis, you'll first want to define the question you're trying to answer.
Management typically starts the data analysis process 0:00 when they need to know something. 0:03 Let's work through this process by pretending the boss of an athletic 0:05 association has received complaints that some ages have an easier time qualifying 0:08 than others and they've tasked us with getting to the bottom of it. 0:12 The first we'll need to do is define the question. 0:16 Let's go with, do some ages have an easier time qualifying for the Boston Marathon? 0:19 Awesome. 0:25 Next, we need to turn that question into something concrete, 0:26 something we'll be able to answer with our data. 0:29 One way to find out if some ages have an easier time qualifying 0:32 is to compare the number of participants for each age. 0:36 If we find a big difference between two consecutive ages, 0:39 then we'll know something is up. 0:43 But this provides us with another issue. 0:44 How do we define a big difference? 0:47 Remember, we're trying to answer a yes or no question. 0:50 So at some point, we need to draw a line in the sand and 0:53 say, this difference is too much. 0:56 Analyzing data is all about asking questions. 0:58 You wanna approach each step and decision along the way with an inquisitive mind. 1:02 Always asking if things need clarification, or 1:06 could be better in anyway. 1:09 So, in this example, we're asking what's an appropriate difference. 1:11 To figure out the answer, let's go back to the spreadsheet. 1:15 And let's start off by creating a new tab at the bottom and naming it Age Breakdown. 1:19 Then, to figure out where we should draw that line, 1:29 let's first find out how many ages took part in the race. 1:32 In Column A, let's add labels for youngest and oldest. 1:35 Then, in Column B1, let's set it equal to men and 1:42 then let's head over to the 2017 tab and select all the age data. 1:47 By clicking in cell C2 and using control, shift, down or 1:59 command, shift, down, then hit enter. 2:03 And there we go. 2:07 Now let's clean up that formula by using F4 to make those references absolute. 2:09 Then we can drag that down to oldest and replace min with max. 2:19 Perfect. 2:28 Next, to give us some idea of how big is too big, let's figure out 2:30 how many runners of each age there would be if ages were uniformly distributed. 2:35 [SOUND] So, if each age have the same number of runners, 2:40 how many runners would that be? 2:43 Now this is almost certainly not the case, 2:46 but it's easy to calculate and gives us a good jumping off point. 2:48 Below oldest, let's add a new label called runners per age. 2:52 And let's make this column just a little bit wider. 2:58 And let's say that equal to the total number 3:03 of runners from the summary tab divided by and 3:08 parenthesis oldest which I'll just write in B2 minus youngest 3:14 End the parentheses and hit Enter, great. 3:24 So in a uniform distribution, each age would account for about 400 runners. 3:27 From here, we just need to use this figure to decide how much of a difference 3:33 is acceptable between two consecutive ages. 3:37 I think 400 is probably too high and 100 is probably too low. 3:40 But between those two, it's difficult to say where we should end up. 3:48 We can only do so much in trying to figure things out. 3:51 At some point we just have to pick something. 3:55 So, let's go with 200, or about half of our runners per age. 3:57 Let's add a new label below runners per age called max difference. 4:02 And let's set it equal to runners per age divided by two. 4:11 Let's also format these two cells to have less decimal points by clicking on this 4:18 button up here. 4:23 And let's also bold our labels to make them easier to read. 4:28 In the next video, we'll dive into the aged data and see what we find. 4:34
You need to sign up for Treehouse in order to download course files.Sign up