Sometimes data is not obviously wrong or missing, but contains a more subtle error that can be found with knowledge of our data and how it was collected.
Data Files:
Data Source Information:

National Health and Nutrition Examination Survey (NHANES) main site

NHANES 1999-2000 information

Demographics Documentation/Codebook

Body Measurements Documentation/Codebook

Occupation Documentation/Codebook
Python and Pandas Resources:

0:00
Sometimes data isn't obviously wrong or missing, it has a more subtle error.

0:05
Many of these errors can only be found by using our knowledge of the data and

0:08
how it was collected.

0:10
We'll be continuing to fix our example data set.

0:13
Hopefully you saved the changes you made in the previous lessons, but

0:16
if not, the updated versions are available in the teacher's notes.

0:20
There's also a link to the code I'll be using, if you'd like to download that.

0:24
Go ahead and pause this video while you download any files you need and

0:28
set up your programming environment.

0:30
Ready? Great.

0:32
We're going to start off by talking about nonsensical or impossible data.

0:36
These are data entries that are nonsensical or impossible, based on

0:40
the way the data set was created or the nature of what was recorded.

0:44
First, we want to load in the libraries we'll need, numpy and pandas.

0:50
We'll also be working with all three files again, so I'll load those as well.

1:07
Now, let's find some impossible data.

1:10
Weight is a good example for impossible data.

1:12
We know a person can never have a negative weight.

1:15
So let's print a description of our weight column.

1:21
According to the summary of the weight column,

1:23
the lowest recorded weight is negative 149 kilograms.

1:27
We're not sure what the weight was supposed to have been, so

1:30
we can just remove all values less than zero.

1:33
First, we find all rows with the weight less than zero.

1:39
Then we replace all those values with not a number.
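In pandas, those two steps might look like the sketch below. The tiny DataFrame and the column name `weight` are stand-ins for illustration, not the actual NHANES file or column name:

```python
import numpy as np
import pandas as pd

# A small stand-in for the body-measurements data; the column name
# "weight" is an assumption -- substitute your file's actual column.
bmx = pd.DataFrame({"weight": [3.1, 72.5, -149.0, 88.0, -12.3]})

# Find all rows where the weight is less than zero...
negative = bmx["weight"] < 0

# ...and replace those values with not a number (NaN).
bmx.loc[negative, "weight"] = np.nan

print(bmx["weight"].describe())
```

Boolean indexing with `.loc` changes only the flagged rows, leaving every valid weight untouched.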

1:47
Now, we can view the description of our data again.

1:53
Now our lowest weight is just over three kilograms, a feasible value for a baby.

1:58
This is a good sanity check for weight, but it's very simplistic.

2:02
After all, we have all ages in our data set.

2:05
And a feasible weight range for

2:06
a two year old would be drastically different than one for a 50 year old.

2:10
If we wanted to be more careful, we could design more complicated checks and

2:14
take a participant's age into account.

2:16
More complicated checks are outside the scope of this course.

2:19
But something to keep in mind as you're cleaning your own data.

2:23
Categorical values can also have impossible values.

2:26
Let's take a look at the possible entries for weight comment.

2:29
The column for

2:30
weight comment has four possible values, coded as one through four.

2:36
Any other entry in that column is an error.

2:39
Let's take a look at the unique entries in the weight comment column.

2:47
It looks like the numbers 7 and 11 have also been entered.

2:50
Since they aren't valid entries, we need to remove them from our data set.

2:54
Again, we'll find all the invalid entries and replace them with not a number.
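The same find-and-replace pattern works for a categorical column. Here `weight_comment` and its values are illustrative stand-ins; check your codebook for the real column name and valid codes:

```python
import numpy as np
import pandas as pd

# Stand-in for the weight-comment column; the codebook says
# only the codes 1 through 4 are valid.
bmx = pd.DataFrame({"weight_comment": [1, 2, 7, 3, 11, 4]})

# Flag every entry outside the documented codes...
invalid = ~bmx["weight_comment"].isin([1, 2, 3, 4])

# ...and replace those entries with not a number (NaN).
bmx.loc[invalid, "weight_comment"] = np.nan

print(bmx["weight_comment"].unique())
```

`Series.isin` checks membership in the list of valid codes, and `~` negates the result so the mask marks the invalid rows.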

3:07
Let's see what our new data entries are now.

3:16
There we have it, only entries of one through four, or not a number.

3:21
Often data entries are problematic because they have extreme values.

3:24
According to Guinness World Records,

3:26
the heaviest person in the world weighed 635 kilograms at their heaviest.

3:31
Let's take another look at our weight column.

3:37
Our data set has a maximum weight of over 12,000 kilograms.

3:41
That's much higher than the world record.

3:43
Something must be wrong with those entries.

3:44
Since they don't make sense, let's remove all weights heavier than 635.

3:50
We can do the same thing as before.

3:53
Find all the rows with a weight above 635.

3:58
Then replace those rows with not a number.
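One compact way to do this is `Series.mask`, which replaces values with NaN wherever a condition is true. The data and column name below are again illustrative stand-ins:

```python
import numpy as np
import pandas as pd

bmx = pd.DataFrame({"weight": [70.0, 12000.0, 193.0, 635.0, 9000.0]})

# Replace every weight above the 635 kg world record with NaN.
# mask() keeps a value where the condition is False and
# substitutes NaN where it is True.
bmx["weight"] = bmx["weight"].mask(bmx["weight"] > 635)

print(bmx["weight"].describe())
```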

4:05
And then finally we can look at our data again.

4:10
Now the maximum weight is a much more reasonable 193 kilograms.

4:15
Some analyses will also remove outliers, or values too far from the average.

4:20
Values too far from normal could bias the data set, or

4:23
there may be something wrong with the data that caused those values.

4:27
On the other hand, if we don't have a good reason to suspect a problem with them

4:31
these outliers may be valid, although unusual, data entries.

4:35
In that case, removing them eliminates important information from the data set.

4:40
Whether you remove outliers or not will depend on your particular circumstances.

4:44
One way to find outliers is to determine how many standard deviations a value

4:48
is from the average for that group.

4:50
This is sometimes called the z-score.

4:52
To find the z-score of the maximum and minimum weight values,

4:56
we first calculate the mean and standard deviation in the entire column.

5:00
Note that we need to use nanmean and nanstd so that the functions can deal with

5:05
the presence of not-a-number values correctly.

5:09
Let's find the z-score for these maximum and minimum weights.

5:13
Like with the mean and standard deviation, we use nanmin and nanmax to find

5:17
the maximum and minimum values, while ignoring not a number entries.

5:29
To get the z-score for the lowest weight,

5:31
we subtract the mean from our minimum value.

5:37
Then we divide by the standard deviation.

5:40
We do the same thing to get the z-score for the maximum weight.

5:48
And then let's print out the z-scores we just calculated.
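The whole z-score calculation can be sketched with plain numpy. The weights array here is a made-up example with a NaN mixed in to show why the nan-aware functions matter:

```python
import numpy as np

# Example weights with a NaN mixed in; nanmean, nanstd, nanmin, and
# nanmax all ignore not-a-number entries instead of propagating them.
weights = np.array([3.2, 55.0, 70.5, 88.0, 193.0, np.nan])

mean = np.nanmean(weights)
std = np.nanstd(weights)

# z-score: how many standard deviations a value sits from the mean.
z_min = (np.nanmin(weights) - mean) / std
z_max = (np.nanmax(weights) - mean) / std

print(z_min, z_max)
```

Had we used plain `mean` and `std` here, the single NaN would have made every result NaN.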

6:14
And I misspelled high as height.

6:17
There we go. Our maximum weight is four standard

6:20
deviations from the mean.

6:21
And our minimum weight is one standard deviation below the mean.

6:25
Finally, let's look for saturated data.

6:28
According to the code book, all ages above 85 were recorded as 85.

6:33
This means that the age value saturates at 85,

6:36
anything above that value is a mistake.

6:38
Also, we don't know the true age of anyone recorded as 85,

6:42
only that they are at least 85.

6:45
If a person's exact age is important,

6:47
we may want to remove all saturated entries or run a separate analysis for

6:52
people 85 and older that does not include age as a term.

6:56
In other cases,

6:57
grouping everyone 85 and older into a single age value may not affect the analysis.

7:02
Let's see what our maximum age is.

7:11
We have at least one entry above the saturation limit,

7:14
with a maximum listed at 109.

7:16
If we feel confident that someone simply forgot to cap ages at the maximum of 85,

7:21
we can replace all higher values with 85.

7:24
If we're concerned that some other error caused the problem,

7:27
it may be better to remove the values entirely.

7:30
Let's replace all values above 85 with the value 85.
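One convenient way to do that replacement is `Series.clip`, which caps values at an upper bound. The small DataFrame and the column name `age` are stand-ins for the real demographics file:

```python
import pandas as pd

# Ages saturate at 85 in the codebook; anything above is an error.
demo = pd.DataFrame({"age": [2, 50, 85, 109, 91]})

# clip(upper=85) replaces every value above 85 with 85 and
# leaves everything at or below 85 unchanged.
demo["age"] = demo["age"].clip(upper=85)

print(demo["age"].max())
```

The same boolean-mask-and-`.loc` approach used for the weights would work too; `clip` just says it in one call.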

7:44
Remember to save all the changes we've made to the data files.
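Saving is a single `to_csv` call per file. This sketch writes to a temporary path so it runs anywhere; in practice you would write back to your own data filenames:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"age": [2, 50, 85]})

# index=False keeps pandas from writing its row index as an
# extra column. The path here is a temporary stand-in for your
# own data file.
path = os.path.join(tempfile.gettempdir(), "demographics_cleaned.csv")
demo.to_csv(path, index=False)
```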

8:11
That's it for the video portion of this lesson.

8:13
Now it's time for you to try out what you've learned on your own.

8:16
The teacher's notes have a link to the practice notebook for this lesson.

8:19
There are two practice exercises for you this time.

8:22
You'll need to check some of the columns in the occupation file for

8:25
nonsensical values.

8:26
Then you'll find the zscore of the maximum and

8:29
minimum values of column OCQ180 in occupation.

8:33
Make sure you don't count the values 7777 or 9999, since those

8:36
are extra code values and not actually the number of hours worked.

8:42
Remember to save your data again when you're finished.

8:44
Afterwards, it's time for another quiz.

8:46
When you've completed the quiz,

8:48
we'll talk about ways to help automate the data cleaning process.