Join the Treehouse affiliate program and earn 25% recurring commission!

✨ Earn college credits in Cybersecurity, JS, HTML, CSS and Python

Well done!

You have completed Data Analysis Basics!

Sign up for Treehouse Back to Library

Preview

Sign up for Treehouse Continue

Defining Terms

4:38 with Ben Deitch

Before you start an analysis, you'll first want to define the question you're trying to answer.

Teacher's Notes
Questions?
Video Transcript
Downloads
Workspaces

This video doesn't have any notes.

Related Discussions

Have questions about this video? Start a discussion with the community and Treehouse staff.

Sign up

Related Discussions

Have questions about this video? Start a discussion with the community and Treehouse staff.

Sign up

Management typically starts the data analysis process 0:00

when they need to know something. 0:03

Let's work through this process by pretending the boss of an athletic 0:05

association has received complaints that some ages have an easier time qualifying 0:08

than others and they've tasked us with getting to the bottom of it. 0:12

The first we'll need to do is define the question. 0:16

Let's go with, do some ages have an easier time qualifying for the Boston Marathon? 0:19

Awesome. 0:25

Next, we need to turn that question into something concrete, 0:26

something we'll be able to answer with our data. 0:29

One way to find out if some ages have an easier time qualifying 0:32

is to compare the number of participants for each age. 0:36

If we find a big difference between two consecutive ages, 0:39

then we'll know something is up. 0:43

But this provides us with another issue. 0:44

How do we define a big difference? 0:47

Remember, we're trying to answer a yes or no question. 0:50

So at some point, we need to draw a line in the sand and 0:53

say, this difference is too much. 0:56

Analyzing data is all about asking questions. 0:58

You wanna approach each step and decision along the way with an inquisitive mind. 1:02

Always asking if things need clarification, or 1:06

could be better in anyway. 1:09

So, in this example, we're asking what's an appropriate difference. 1:11

To figure out the answer, let's go back to the spreadsheet. 1:15

And let's start off by creating a new tab at the bottom and naming it Age Breakdown. 1:19

Then, to figure out where we should draw that line, 1:29

let's first find out how many ages took part in the race. 1:32

In Column A, let's add labels for youngest and oldest. 1:35

Then, in Column B1, let's set it equal to men and 1:42

then let's head over to the 2017 tab and select all the age data. 1:47

By clicking in cell C2 and using control, shift, down or 1:59

command, shift, down, then hit enter. 2:03

And there we go. 2:07

Now let's clean up that formula by using F4 to make those references absolute. 2:09

Then we can drag that down to oldest and replace min with max. 2:19

Perfect. 2:28

Next, to give us some idea of how big is too big, let's figure out 2:30

how many runners of each age there would be if ages were uniformly distributed. 2:35

[SOUND] So, if each age have the same number of runners, 2:40

how many runners would that be? 2:43

Now this is almost certainly not the case, 2:46

but it's easy to calculate and gives us a good jumping off point. 2:48

Below oldest, let's add a new label called runners per age. 2:52

And let's make this column just a little bit wider. 2:58

And let's say that equal to the total number 3:03

of runners from the summary tab divided by and 3:08

parenthesis oldest which I'll just write in B2 minus youngest 3:14

End the parentheses and hit Enter, great. 3:24

So in a uniform distribution, each age would account for about 400 runners. 3:27

From here, we just need to use this figure to decide how much of a difference 3:33

is acceptable between two consecutive ages. 3:37

I think 400 is probably too high and 100 is probably too low. 3:40

But between those two, it's difficult to say where we should end up. 3:48

We can only do so much in trying to figure things out. 3:51

At some point we just have to pick something. 3:55

So, let's go with 200, or about half of our runners per age. 3:57

Let's add a new label below runners per age called max difference. 4:02

And let's set it equal to runners per age divided by two. 4:11

Let's also format these two cells to have less decimal points by clicking on this 4:18

button up here. 4:23

And let's also bold our labels to make them easier to read. 4:28

In the next video, we'll dive into the aged data and see what we find. 4:34

You need to sign up for Treehouse in order to download course files.

Sign up

You need to sign up for Treehouse in order to set up Workspace

Sign up