Sometimes data is not obviously wrong or missing, but contains a more subtle error that can be found with knowledge of our data and how it was collected.
Data Files:
Data Source Information:
- National Health and Nutrition Examination Survey (NHANES) main site
- NHANES 1999-2000 information
- Demographics Documentation/Codebook
- Body Measurements Documentation/Codebook
- Occupation Documentation/Codebook
Python and Pandas Resources:
Sometimes data isn't obviously wrong or missing; it has a more subtle error. Many of these errors can only be found by using our knowledge of the data and how it was collected. We'll be continuing to fix our example data set. Hopefully you saved the changes you made in the previous lessons, but if not, the updated versions are available in the teacher's notes. There's also a link to the code I'll be using, if you'd like to download that. Go ahead and pause this video while you download any files you need and set up your programming environment. Ready? Great.
We're going to start off by talking about nonsensical or impossible data. These are data entries that don't make sense, or are impossible, based on the way the data set was created or the nature of what was recorded. First, we want to load in the libraries we'll need, numpy and pandas. We'll also be working with all three files again, so I'll load those as well.
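If you're coding along, the setup looks something like this. The file names below are placeholders for whatever versions you saved in the previous lessons, so substitute your own names.

```python
import numpy as np
import pandas as pd

# Placeholder file names; use the files you saved in the previous lessons
demographics = pd.read_csv('demographics.csv')
body_measurements = pd.read_csv('body_measurements.csv')
occupation = pd.read_csv('occupation.csv')
```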
Now, let's find some impossible data. Weight is a good example of a column that can contain impossible data: we know a person can never have a negative weight. So let's print a description of our weight column.
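In the NHANES body measurements codebook, weight is recorded in kilograms under the variable BMXWT; if your copy of the data uses a different column name, adjust this line accordingly.

```python
# Summary statistics for the weight column (BMXWT, in kilograms)
print(body_measurements['BMXWT'].describe())
```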
According to the summary of the weight column, the lowest recorded weight is negative 149 kilograms. We have no way of knowing what those weights were supposed to be, so we can just remove all values less than zero. First, we find all rows with a weight less than zero. Then we replace all those values with not a number. Now, we can view the description of our data again.
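A minimal sketch of that replacement, again assuming the column is named BMXWT:

```python
# Find all rows with a negative weight, then replace those values with NaN
negative_weight = body_measurements['BMXWT'] < 0
body_measurements.loc[negative_weight, 'BMXWT'] = np.nan

# View the updated summary
print(body_measurements['BMXWT'].describe())
```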
Now our lowest weight is just over three kilograms, a feasible value for a baby. This is a good sanity check for weight, but it's very simplistic. After all, we have all ages in our data set, and a feasible weight range for a two-year-old would be drastically different than one for a 50-year-old. If we wanted to be more careful, we could design more complicated checks that take a participant's age into account. Those checks are outside the scope of this course, but they're something to keep in mind as you're cleaning your own data.
Categorical columns can also have impossible values. Let's take a look at the possible entries for weight comment. The weight comment column has four possible values, coded as one through four. Any other entry in that column is an error. Let's take a look at the unique entries in the weight comment column.
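In the NHANES body measurements file, the weight comment variable is coded as BMIWT; as before, swap in your own column name if it differs.

```python
# List every distinct value recorded in the weight comment column
print(body_measurements['BMIWT'].unique())
```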
It looks like the numbers 7 and 11 have also been entered. Since they aren't valid entries, we need to remove them from our data set. Again, we'll find all the invalid entries and replace them with not a number. Then let's see what our data entries are now.
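One way to sketch that, using the same assumed column name:

```python
# Valid codes are 1 through 4; treat anything else as an error
invalid_comment = ~body_measurements['BMIWT'].isin([1, 2, 3, 4])
# NaN entries also match this mask, but replacing NaN with NaN is harmless
body_measurements.loc[invalid_comment, 'BMIWT'] = np.nan

print(body_measurements['BMIWT'].unique())
```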
There we have it: only entries of one through four, or not a number. Often data entries are problematic because they have extreme values. According to Guinness World Records, the heaviest person ever recorded weighed 635 kilograms at their heaviest. Let's take another look at our weight column.
Our data set has a maximum weight of over 12,000 kilograms. That's much higher than the world record, so something must be wrong with those entries. Since they don't make sense, let's remove all weights heavier than 635 kilograms. We can do the same thing as before: find all the rows with a weight above 635, then replace those values with not a number. And then finally we can look at our data again.
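Here's what that looks like as code, following the same pattern as the negative-weight fix:

```python
# The heaviest person on record weighed 635 kg; anything above that is an error
too_heavy = body_measurements['BMXWT'] > 635
body_measurements.loc[too_heavy, 'BMXWT'] = np.nan

print(body_measurements['BMXWT'].describe())
```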
Now the maximum weight is a much more reasonable 193 kilograms.
Some analyses will also remove outliers, values that are too far from the average. Values too far from normal could bias the data set, or there may be something wrong with the data that produced those values. On the other hand, if we don't have a good reason to suspect a problem with them, these outliers may be valid, although unusual, data entries. In that case, removing them eliminates important information from the data set. Whether you remove outliers or not will depend on your particular circumstances. One way to find outliers is to determine how many standard deviations a value is from the average for that group. This number is sometimes called the z-score.
To find the z-scores of the maximum and minimum weight values, we first calculate the mean and standard deviation of the entire column. Note that we need to use nanmean and nanstd so that the functions deal with the presence of not-a-number values correctly. Let's find the z-scores for the maximum and minimum weights. As with the mean and standard deviation, we use nanmin and nanmax to find the minimum and maximum values while ignoring not-a-number entries. To get the z-score for the lowest weight, we subtract the mean from our minimum value, then divide by the standard deviation. We do the same thing to get the z-score for the maximum weight. And then let's print out the z-scores we just calculated.
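Putting those steps together, a sketch using numpy's NaN-aware functions:

```python
# Mean and standard deviation, ignoring NaN entries
mean_weight = np.nanmean(body_measurements['BMXWT'])
std_weight = np.nanstd(body_measurements['BMXWT'])

# Minimum and maximum, also ignoring NaN entries
min_weight = np.nanmin(body_measurements['BMXWT'])
max_weight = np.nanmax(body_measurements['BMXWT'])

# z-score: how many standard deviations a value lies from the mean
min_z = (min_weight - mean_weight) / std_weight
max_z = (max_weight - mean_weight) / std_weight

print('Lowest weight z-score:', min_z)
print('Highest weight z-score:', max_z)
```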
And I misspelled high as height. There we go. Our maximum weight is four standard deviations from the mean, and our minimum weight is one standard deviation below the mean.
Finally, let's look for saturated data. According to the codebook, all ages above 85 were recorded as 85. This means that the age value saturates at 85; anything above that value is a mistake. Also, we don't know the true age of anyone recorded as 85, only that they are at least 85. If a person's exact age is important, we may want to remove all saturated entries, or run a separate analysis for people 85 and older that does not include age as a term. In other cases, grouping everyone 85 and older into a single age value may not affect the analysis.
Let's see what our maximum age is. We have at least one entry above the saturation limit, with a maximum listed age of 109. If we feel confident that someone just forgot to cap ages at a maximum of 85, we can replace all higher values with 85. If we're concerned that some other error caused the problem, it may be better to remove those values entirely. Here, let's replace all values above 85 with the value 85.
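In the NHANES demographics file, age in years is recorded under the variable RIDAGEYR; here's a sketch of the check and the cap, assuming that name:

```python
# Check the maximum recorded age
print(demographics['RIDAGEYR'].max())

# Cap every age above the saturation limit at 85
over_limit = demographics['RIDAGEYR'] > 85
demographics.loc[over_limit, 'RIDAGEYR'] = 85
```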
Remember to save all the changes we've made to the data files.
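Saving might look like this, again with placeholder file names:

```python
# Write the cleaned data back out; use your own file names
demographics.to_csv('demographics.csv', index=False)
body_measurements.to_csv('body_measurements.csv', index=False)
occupation.to_csv('occupation.csv', index=False)
```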
That's it for the video portion of this lesson. Now it's time for you to try out what you've learned on your own. The teacher's notes have a link to the practice notebook for this lesson. There are two practice exercises for you this time. You'll need to check some of the columns in the occupation file for nonsensical values. Then you'll find the z-scores of the maximum and minimum values of the OCQ180 column in the occupation file. Make sure you don't count the values 7777 or 9999, since those are special code values and not actual numbers of hours worked; one way to exclude them is sketched below.
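As a starting point, you could treat those codes as missing values before computing anything (a sketch, not the full solution):

```python
# Treat the special codes 7777 and 9999 as missing values
hours = occupation['OCQ180'].replace([7777, 9999], np.nan)
```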
Remember to save your data again when you're finished. Afterwards, it's time for another quiz. When you've completed the quiz, we'll talk about ways to help automate the data cleaning process.