Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Preview
Start a free Courses trial
to watch this video
In this video, we will use the provided dataset to find and fix some common problems in the dataset.
New Terms:
- Jupyter Notebook -- Web application for creating documents containing live code and explanatory text.
Data Files:
Scroll to the bottom of the Download File to access the "Individual Practice" and the three tasks associated.
Data Source Information:
-
National Health and Nutrition Examination Survey (NHANES) main site
-
NHANES 1999-2000 information
-
Demographics Documentation/Codebook
-
Body Measurements Documentation/Codebook
-
Occupation Documentation/Codebook
Python and Pandas Resources:
[MUSIC]
0:00
Now that we know a little bit about our
dataset and the data cleaning process,
0:04
let's take a closer look at some common
issues using our example dataset.
0:08
We'll be getting hands-on
practice with our dataset.
0:13
Starting with some simple yet
0:16
important issues you may find
during the data cleaning process.
0:17
We'll be using the same data sets
we downloaded in the last stage.
0:21
You can also find the link to download
them again in the teacher's notes.
0:25
There are also links to the code books for
the original data sets
0:29
as well as a Jupiter notebook with
the code I'll be using in this video.
0:32
Feel free to follow along using whatever
programming environment you like best.
0:36
Take a moment to pause this video and
make sure that you have the data files,
0:40
open the dataset codebooks, and set up
your environment the way you want it.
0:45
Remember, you can slow down or
pause the video at any point
0:49
if you need more time to go over
some code or finish typing.
0:53
First, let's load in the pandas library,
and call it pd for short.
0:57
Now, we can use pandas to
load in our data files.
1:03
Make sure you are working in
the same directory as your files.
1:06
Once we have our data, let's take a look
at the type of data stored in each column.
1:11
There are a lot of columns
in our demo data frame.
1:15
So we'll just look at a few of them for
now, sequence, age,
1:18
gender, and military status.
1:25
We can find the data type of each
row using panda's dtype property.
1:32
We can see that the sequence
column is stored as an integer.
1:37
Age is stored as a float or
decimal number.
1:40
The remaining columns are all objects,
which is what pandas call strings.
1:43
We can used panda's unique method to look
at all the unique entries in a column.
1:48
Let's print the length of the unique
elements of the column for
1:53
military status coded as DMQMILIT.
1:57
From our output, we can see that there
are 38 unique entries in this column.
2:05
That sounds like a lot
of different options.
2:10
So let's see what they are.
2:12
This time,
let's print all of the unique values.
2:13
We can see that many of the different
entries should actually be the same value.
2:21
There are just extra
spaces at the beginning or
2:25
end of the entry that make
them appear different.
2:27
For example, here we have the word
no with two spaces in front of it.
2:30
And here we have No with a single space
in front of it, both as separate entries.
2:34
Both values mean the same thing, so we
want to combine them into the same value.
2:39
We can use this using the data
frame's strip method.
2:44
This will remove all that extra
white space from the entire column.
2:47
Now that we've gotten rid of the white
space, let's print out the number and
3:03
values of the unique entries again.
3:06
We've gotten rid of many of the extra
entries, but it's still not quite right.
3:17
Sometimes yes and no are written out,
3:21
while other times only
the first letter is used.
3:23
Let's replace all entries using
a single letter to use the entire word.
3:28
Pandas also has a handy replace method
that we can pass a set of nested
3:32
dictionaries, or
a dictionary within a dictionary.
3:36
Telling it what we want to replace.
3:39
The outer dictionary's keys tells
pandas which column to look at.
3:41
Here we pasted the code for
the military status column.
3:45
The value for
that key is another dictionary.
3:53
This interdictionary has the data
entry values to be replaced as keys.
3:56
The value for each key is what
we want to replace the key with.
4:01
So we use Y for the key and yes for
4:05
the value in order to replace all
of the entries of Y with yes.
4:07
Note that we only specify
values that we want to change.
4:12
Next, we pass this dictionary
into the replace method.
4:16
We also set the inplace
argument to true so
4:21
that it will modify our demo data frame.
4:23
If we didn't specify that,
4:26
it would return a new data frame instead
of changing the one we already have.
4:28
Now let's print the number and contents of
all unique entries for that column again.
4:32
After running this code, we can see that
we are down to the five expected values.
4:43
Not a number, for missing data,
yes, no, don't know, and refused.
4:48
Let's do this again for
the citizenship status column.
4:53
According to the code book,
there should be five unique values,
5:12
citizen by birth or naturalization,
Not a citizen of the US.
5:16
Refused, don't no, or missing,
which is listed as not a number.
5:19
This one has 31 different entries.
5:24
Unlike the last column, this one has
a lot of white space issues too.
5:27
We can use the same method to remove
the extra white space for this column.
5:31
And then see how many
unique values are left.
5:47
We still have eight values left, and
5:56
it looks like there are some
typos with the Don't know entry.
5:58
One or more are missing the apostrophe,
6:01
and there's also at least one
that capitalize the K in Know.
6:03
Someone also used the word unknown instead
of don't know when entering the data.
6:08
Let's fix these typos
with a replace method.
6:13
First, we need to create a new dictionary
telling pandas to replace all those typos
6:15
with the words don't know
in the citizenship column.
6:20
Then we pass this new dictionary
into the replace method.
6:43
And look at our unique entries for
that column.
6:54
Success.
We now only have the five entries we're
7:04
looking for.
7:06
We could also have fixed both columns
at once, by passing the replace method
7:07
a larger dictionary,
with information for both columns.
7:10
Finally, remember to save your modified
data set using the to csv method,
7:34
so you can use it in future lessons.
7:39
Now it's your turn.
7:46
We've fixed white space issues and
typos in two of our columns.
7:48
But those aren't the only
ones in our data set.
7:52
There's a link to a practice
notebook in the teacher's notes.
7:54
This notebook has instructions for
three practice exercises.
7:57
First, find all of the columns
in the demographics file
8:01
that contain string data.
8:04
Second, get rid of all the excess
white space in these columns, and
8:06
finally, get rid of any typos and
inconsistencies, so
8:09
that all the string entries match
the ones listed in the code book.
8:13
Once you finish,
remember to save your data again.
8:16
You can see what the text entry should
be for each column by looking for
8:19
the column key, then checking the entry
keys under value description.
8:23
That is it for this video.
8:27
Once you finish these practice exercises,
you will be ready for the next lesson.
8:28
We will look at more ways to
use the replacement method
8:32
to format our data the way we want it.
8:35
You need to sign up for Treehouse in order to download course files.
Sign up