Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Preview
Start a free Courses trial
to watch this video
Now we will move beyond the simple data issues and look at instances of missing data.
Key Terms:
- Permutation - Every possible ordering of elements in a list, without repeats.
Data Files:
Data Source Information:
-
National Health and Nutrition Examination Survey (NHANES) main site
-
NHANES 1999-2000 information
-
Demographics Documentation/Codebook
-
Body Measurements Documentation/Codebook
-
Occupation Documentation/Codebook
Python and Pandas Resources:
In our last lesson, we talked about
fixing or removing data from our dataset.
0:00
Now we're going to use
our practice dataset for
0:05
some real examples of finding and
dealing with missing data.
0:07
We'll be using the version of the files
you saved at the end of the first video.
0:11
But you can also download the version
I'll be using from the links in
0:15
the teacher's notes.
0:18
This also links the code I'll be using.
0:20
Go ahead and pause this video while
you download any files you need and
0:22
set up your programming environment.
0:26
Ready?
0:28
Great, let's look at our missing data.
0:29
Some data may already be missing while
others maybe be removed later due
0:31
issues you uncover during
the data cleaning process.
0:36
In this lesson, we use both the pandas and
numpy libraries.
0:39
And now we load in our data.
0:48
In this lesson, we use both
the Demographics, and BodyMeasures data.
0:50
Now, let's take a look
at our missing data.
1:05
Some of the first questions you might ask,
1:08
are how many elements
are missing from each column?
1:10
And what percentage of the data
is missing from each column?
1:13
We can get the number of
good values in each column
1:16
using the count method on our data frame.
1:18
We can get the total number of rows
by finding the length of the index.
1:22
Subtracting those two numbers will get us
the total number of missing elements in
1:30
each row.
1:34
There are a lot of columns, but
1:37
we can use the head method
to show only the first few.
1:39
We can display the amount of missing data
as a percentage by dividing by the total
1:42
number of rows, then multiplying by 100.
1:47
We can also look at how much
information is missing from each row.
1:58
Using the DataFrames' isnull method,
2:02
we can add up the number of
missing entries in each row.
2:04
Now we'll use the len function
to get the number of columns.
2:15
Dividing the number of missing elements
in each row by the number of columns then
2:25
multiplying by 100 gives us the percentage
of data missing from each of the rows.
2:29
As we mentioned in the previous lesson,
any rows or columns with
2:43
a large number of missing information will
probably need to be fixed or removed.
2:46
But remember, these are only
the entrance mark as not a number.
2:51
There may be other entries
we also want to remove.
2:54
Let's take a closer look at the column for
students datas.
2:58
According to the code book
valid answers are, in school.
3:02
On vacation from school,
between grades, neither in school.
3:06
Or on vacation from school,
between grades.
3:10
Refused, and don't know.
3:13
The last two responses, refused and
don't Know, are not missing entries.
3:16
But in most cases,
3:21
they aren't meaningfully
different from truly missing data.
3:22
We can remove those
values from our dataset
3:25
if we want all of our missing
data represented by not a number.
3:27
Going back to our code, let's use
the same method we used before to get
3:31
the percentage of missing data
in the student status column.
3:35
About 67% of the data is missing.
4:05
We can get the number of refused answers
by taking a sum of the number of entries
4:08
equal to 7.
4:12
And for don't know responses,
we add up the number of entries equal to 9
4:19
Now let's print out our results.
4:30
There are no refused answers, but
2 who responded that they don't know.
4:54
Let's remove these entries so
4:58
that all missing information in this
column is marked as not a number.
5:00
First, find all entries in that column
greater than 3 since 3 is the largest code
5:04
that contains meaningful information.
5:08
And we set all values in that column
at those indices to be NumPy dot not
5:16
a number.
5:21
And we look at all the unique
values in that column,
5:33
we should see not a number 1, 2, or 3.
5:36
And there we have it.
5:45
No entries for 7 or 9.
5:46
Let's re-run our code that shows
how much information is missing.
5:49
We don't have to change anything so
we can just copy and paste it from before.
5:52
Those two entries didn't increase
percentage of missing data by much.
6:20
We can verify again that there are no
more entries of 7 or 9 in that column.
6:24
We also need to remove incorrect entries
where the wrong type of data has
6:29
been entered.
6:33
Let's look at the data type for the first
few demographics called demo, and
6:34
body measures called BMX.
6:39
Before printing out the first few columns,
6:41
let's have Python print the name of the
data file so our output is easier to read.
6:43
Then we can look at the d-types properly
to see the type of each column.
6:47
We're only interested in
the first few columns so
6:51
we use the head method
to limit the display.
6:54
And we do the same for
our body measures data.
6:58
The sequence column should exist in
all files using the same number for
7:10
each participant across the files.
7:14
But in demographics it's
stored as integer data.
7:16
While in body measures it's stored
as an object or string type.
7:19
Pandas has a to numeric function that will
convert a data frame to numeric values.
7:22
Let's use it to convert the sequence
column in body measures to integer data.
7:27
We use the downcast argument to specify,
7:37
that we want our data
converted to integers.
7:39
But when we run it, we get an error.
7:45
It looks like someone
entered the person's name,
7:50
Jane Doe, instead of their ID number.
7:52
We had an error because Pandas doesn't
know how to convert a name into a number.
7:55
The two numeric function has
another argument called errors
8:00
that tells it how to handle any errors
that runs into while converting the data.
8:03
If we tell it to coarse the error,
8:07
it will insert not a number
anywhere that the conversion fails.
8:09
No errors this time.
8:18
Let's see what the type is now.
8:20
Here's an example of why it's always good
to double check that your code is doing
8:27
what you expect.
8:30
The type still isn't correct.
8:32
It's not longer an object type, but
8:34
now it's floating point instead
of the integer values we wanted.
8:36
This is because the not a number entries
are considered floats and not integers.
8:39
If this were only data file, we could
probably generate a new ID number for
8:44
each of the participants.
8:49
As long as the other data in
those rows were still okay.
8:50
But here,
8:53
the values in the sequence column
need to match across different files.
8:54
And we have no way of knowing which
ID number they should have had.
8:58
We can remove all rows
with invalid IDs and
9:01
then we'll be able to convert
the column to integer values.
9:04
First, we use NumPy's as NaN function to
get a Boolean series telling us whether
9:07
each element in the column
is not a number.
9:11
We can drop those rows by selecting
the inverse of the Boolean values.
9:18
And now, let's try converting
the column to integers again.
9:29
There we have it,
the column is integer data now.
10:00
We could also remove any data containing
sensitive information, like full names,
10:03
social security numbers,
or credit information.
10:08
In this server, we have marital status for
everyone aged 14 and above.
10:11
But we may want to remove that information
for everyone considered a minor, which for
10:15
our purposes means anyone under 18.
10:19
So first, lets locate all
rows with the age under 18.
10:22
Using the count method, we can see
how many entries we have for minors.
10:35
It looks like there are 1,244 entries for
marital status for minors.
10:52
To get rid of these data,
replace all these entries with
10:57
not a number then print out
the new count for those rows.
11:00
Now there are no entries in the marital
status column for anyone under age 18.
11:25
Finally, remember to save your modified
data set using the to_csv method, so
11:30
you can use it in future lessons.
11:35
Great work, we've just learned a lot
about missing data in our dataset.
11:49
Before moving on,
I have more practice exercises for you.
11:53
The teacher's notes have a link to
the practice notebook for this lesson.
11:56
This notebook has instructions for
two practice exercises.
12:00
First, identify the column in demographics
12:03
with the highest percentage
of missing information.
12:06
Second, remove all the pregnancy status
information for minors like we did for
12:09
their marital status.
12:13
Don't forget to save your data
again when you're finished.
12:14
When you're finished
take a moment to relax.
12:17
We have just covered a lot of information.
12:19
There's a lot that goes into
dealing with missing data.
12:22
In the next video, we'll be looking at
what to do when we have the same data
12:24
recorded in our data set multiple times.
12:28
You need to sign up for Treehouse in order to download course files.
Sign up