Before you start an analysis, you'll first want to define the question you're trying to answer.

Management typically starts
the data analysis process
0:00

when they need to know something.
0:03

Let's work through this process by
pretending the boss of an athletic
0:05

association has received complaints that
some ages have an easier time qualifying
0:08

than others and they've tasked us
with getting to the bottom of it.
0:12

The first thing we'll need to do
is define the question.
0:16

Let's go with, do some ages have an easier
time qualifying for the Boston Marathon?
0:19

Awesome.
0:25

Next, we need to turn that
question into something concrete,
0:26

something we'll be able
to answer with our data.
0:29

One way to find out if some ages
have an easier time qualifying
0:32

is to compare the number of
participants for each age.
0:36

If we find a big difference
between two consecutive ages,
0:39

then we'll know something is up.
0:43
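
As a rough sketch of this comparison outside the spreadsheet, the Python below counts the runners at each age and takes the difference between consecutive ages. The sample ages are made up, and `consecutive_differences` is a hypothetical helper name, not something from the course.

```python
from collections import Counter


def consecutive_differences(ages):
    """Map each pair of adjacent ages (a, a + 1) to count(a + 1) - count(a)."""
    counts = Counter(ages)
    diffs = {}
    for age in sorted(counts):
        # Only compare ages that are exactly one year apart and both present.
        if age + 1 in counts:
            diffs[(age, age + 1)] = counts[age + 1] - counts[age]
    return diffs


# Made-up sample data, not the real race results.
sample_ages = [18, 18, 19, 19, 19, 19, 20, 20]
print(consecutive_differences(sample_ages))
```

A large positive or negative value for a pair would be the kind of jump worth a closer look.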

But this provides us with another issue.
0:44

How do we define a big difference?
0:47

Remember, we're trying to answer a yes or
no question.
0:50

So at some point,
we need to draw a line in the sand and
0:53

say, this difference is too much.
0:56

Analyzing data is all
about asking questions.
0:58

You wanna approach each step and decision
along the way with an inquisitive mind.
1:02

Always asking if things
need clarification, or
1:06

could be better in any way.
1:09

So, in this example, we're asking
what's an appropriate difference.
1:11

To figure out the answer,
let's go back to the spreadsheet.
1:15

And let's start off by creating a new tab
at the bottom and naming it Age Breakdown.
1:19

Then, to figure out where
we should draw that line,
1:29

let's first find out how many
ages took part in the race.
1:32

In Column A, let's add labels for
youngest and oldest.
1:35

Then, in cell B1,
let's set it equal to min and
1:42

then let's head over to the 2017 tab and
select all the age data
1:47

by clicking in cell C2 and
using Control+Shift+Down or
1:59

Command+Shift+Down, then hit Enter.
2:03

And there we go.
2:07

Now let's clean up that formula by using
F4 to make those references absolute.
2:09

Then we can drag that down to oldest and
replace min with max.
2:19

Perfect.
2:28
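
For readers following along in code rather than a spreadsheet, a minimal sketch of the same two cells might look like this. The `ages` list is a hypothetical stand-in for the age column on the 2017 tab.

```python
# Hypothetical stand-in for the age column on the 2017 tab.
ages = [34, 52, 18, 45, 61, 29]

youngest = min(ages)  # plays the role of the min formula in B1
oldest = max(ages)    # plays the role of the max formula in B2

print(youngest, oldest)
```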

Next, to give us some idea of how
big is too big, let's figure out
2:30

how many runners of each age there would
be if ages were uniformly distributed.
2:35

So, if each age had
the same number of runners,
2:40

how many runners would that be?
2:43

Now this is almost certainly not the case,
2:46

but it's easy to calculate and
gives us a good jumping off point.
2:48

Below oldest, let's add a new
label called runners per age.
2:52

And let's make this column
just a little bit wider.
2:58

And let's set that equal
to the total number
3:03

of runners from the summary
tab divided by, in
3:08

parentheses, oldest, which I'll
just write as B2, minus youngest.
3:14

Close the parentheses and hit Enter. Great.
3:24

So in a uniform distribution, each age
would account for about 400 runners.
3:27

From here, we just need to use this
figure to decide how much of a difference
3:33

is acceptable between
two consecutive ages.
3:37

I think 400 is probably too high and
100 is probably too low.
3:40

But between those two, it's difficult
to say where we should end up.
3:48

We can only do so
much in trying to figure things out.
3:51

At some point we just
have to pick something.
3:55

So, let's go with 200, or
about half of our runners per age.
3:57

Let's add a new label below runners
per age called max difference.
4:02

And let's set it equal to
runners per age divided by two.
4:11

Let's also format these two cells to have
fewer decimal places by clicking on this
4:18

button up here.
4:23

And let's also bold our labels
to make them easier to read.
4:28

In the next video, we'll dive into
the age data and see what we find.
4:34
