Sometimes a dataset may have the same information recorded multiple times. This duplicated data can cause problems with our analysis.
Key Terms
- Outlier - An example that is highly different from the majority of examples.
Data Files:
Data Source Information:
- National Health and Nutrition Examination Survey (NHANES) main site
- NHANES 1999-2000 information
- Demographics Documentation/Codebook
- Body Measurements Documentation/Codebook
- Occupation Documentation/Codebook
Python and Pandas Resources:
[0:00] In our last lesson, we talked about missing data in our dataset. Now we're going to take a look at the opposite problem: a dataset with duplicate information. We want to make sure each example is only counted once. Otherwise, our analysis could place too much importance on the related examples since, from the computer's perspective, those sets of values occur several times in our dataset.
[0:24] We'll be looking for three types of duplicates. Perfect duplicates, where the rows or columns are completely identical. Split duplicates, where information is split between the duplicated rows or columns. And conflicting duplicates, where there are discrepancies between the data contained in different rows or columns.
[0:42] When dealing with duplicates, it's important to keep in mind whether it's possible to have perfectly duplicated entries. For example, we could have two people named Sam Smith who are 54 years old, but it's less likely to have two people with the same name, address, and birthday.
[0:58] We'll be using the data files you saved after the last lesson, but you can also download the exact versions I'll be using from the links in the teacher's notes. There's also a link to the code I'll be using. Go ahead and pause this video while you download any files you need and set up your programming environment. Once you're ready, let's get started.
[1:19] Let's load in the pandas and NumPy libraries. And now we load in our data. Let's load in all three data files this time.
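The loading step can be sketched as follows. The file names are assumptions (the real ones come from the teacher's notes); to keep the snippet self-contained, it reads a tiny inline sample instead of a file on disk:

```python
import pandas as pd
from io import StringIO

# In the lesson this would be something like:
#   demo = pd.read_csv("demographics.csv")
#   bmx = pd.read_csv("body_measures.csv")
#   ocq = pd.read_csv("occupation.csv")
# (hypothetical file names — substitute the files you saved earlier).
# Here we read a small inline sample so the snippet runs on its own.
demo = pd.read_csv(StringIO("SEQN,RIDAGEYR\n1,30\n2,41\n"))
print(demo.shape)  # (2, 2)
```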
[1:46] Let's look for any duplicated rows in the demographics dataframe. We know the sequence column should have a unique number for each person, so let's see if it has any duplicates. First, we sort the dataframe so that the numbers in the sequence column are in ascending order. This isn't necessary, but it does make the data easier to look at. Now we use the pandas duplicated method to find all rows where the sequence column is duplicated. Setting the argument keep to False tells it to mark all duplicated rows. We'll store these in the variable ind, for index.
[2:22] Now let's print out the number of duplicated rows. Let's also print the first few rows of all the columns. We have 1,213 duplicated rows in our dataset. We can see the first five here, but there's far too much data to go through everything by hand.
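The steps above can be sketched with a toy stand-in for the demographics dataframe (the small values here are made up; the column names follow the NHANES codebook):

```python
import pandas as pd

# Toy stand-in for the demographics dataframe.
demo = pd.DataFrame({
    "SEQN":     [3, 1, 2, 1, 3],
    "RIDAGEYR": [54, 30, 41, 30, 54],
})

demo = demo.sort_values("SEQN")            # optional, but easier to inspect
ind = demo.duplicated("SEQN", keep=False)  # keep=False marks every repeated row
print(ind.sum())                           # 4 — SEQN 1 and 3 each appear twice
print(demo[ind].head())
```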
[2:58] Let's start with the simplest case: getting rid of any pure duplicates. Pandas has a drop_duplicates method for dataframes that will drop all rows where all entries are identical. In our case, we know each entry should have a unique value in the sequence column, so it's safe to drop all identical duplicates.
[3:21] Now let's take another look at how many repeated entries we have by copying the code we used earlier. Now we're down to 814 duplicates.
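A minimal sketch of this step, again on made-up data. Note that drop_duplicates only removes rows that are identical in every column, so a conflicting pair survives:

```python
import pandas as pd

demo = pd.DataFrame({
    "SEQN":     [1, 1, 2, 2],
    "RIDAGEYR": [30, 30, 41, 35],
})

# The perfect duplicate of SEQN 1 is dropped;
# the conflicting pair for SEQN 2 is left alone.
demo = demo.drop_duplicates()
print(len(demo))  # 3
```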
[3:44] Next, we'll look for examples where an entry is split across multiple rows. One example of how this could happen is if data is collected on multiple days and the data entry process starts a new entry every time new data is entered. We want to use the other duplicated rows to try and fill in all the missing values. If this creates a perfect match across all the rows, we can then drop all but one of them. To do this, we use the itertools library for its useful product function, which we'll need in a moment.
[4:13] Now we need a unique list of all IDs that are duplicated. We can get this by selecting only the duplicated rows in the SEQN column and then using the unique method to show each ID only once. We want to iterate over all duplicate IDs, so we use a for loop. We need to get the location of all rows that match our current ID number. And here is where that product function comes in. We pass in our duplicated rows and specify the repeat parameter to be 2. It will return all possible pairs of these rows. This is almost identical to two nested for loops, where each loops over every possible row; we just skip the cases where both rows are the same. Now, for each pair of rows, we want to fill in the missing values of one row with the values of the other. We can do this with the dataframe's fillna method. It will fill in all not-a-number entries from the first row with the data from the row passed as an argument.
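The whole fill-in loop can be sketched like this, on a toy frame where one entry's information is split across two rows (the exact variable names in the lesson's notebook may differ):

```python
import pandas as pd
import numpy as np
from itertools import product

# Toy frame: SEQN 1's information is split across two rows.
demo = pd.DataFrame({
    "SEQN":     [1, 1, 2],
    "RIDAGEYR": [30.0, np.nan, 41.0],
    "RIDEXMON": [np.nan, 2.0, 1.0],
})

# Unique list of duplicated IDs.
dup_ids = demo.loc[demo.duplicated("SEQN", keep=False), "SEQN"].unique()

for seqn in dup_ids:
    rows = demo.index[demo["SEQN"] == seqn]     # locations of the matching rows
    for i, j in product(rows, repeat=2):
        if i == j:          # product includes identical pairs; skip them
            continue
        # Fill row i's NaN entries with the values from row j.
        demo.loc[i] = demo.loc[i].fillna(demo.loc[j])

demo = demo.drop_duplicates()
print(len(demo))  # 2 — the split rows now match and collapse to one
```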
[5:28] Now let's drop all our duplicates again and see how many are left. We're down to 416 duplicated rows. All remaining duplicates should belong to our third case: they have conflicting information.
[6:07] We can see an example of this in the duplications for ID 60, where the values for the column RIDEXMON don't match. We have no way of knowing which value is correct, so it's safest to drop all rows with conflicting information. So first, we need to get all duplicated rows. Then we remove them from our dataset by selecting only those that aren't duplicated. And after printing our duplicated rows, we can see that we've gotten rid of the last of them.
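A minimal sketch of dropping the conflicting rows, using a toy version of the ID 60 situation described above:

```python
import pandas as pd

# SEQN 60 appears twice with conflicting RIDEXMON values.
demo = pd.DataFrame({
    "SEQN":     [60, 60, 61],
    "RIDEXMON": [1, 2, 1],
})

ind = demo.duplicated("SEQN", keep=False)  # mark every conflicting row
demo = demo[~ind]                          # keep only the non-duplicated rows
print(demo["SEQN"].tolist())  # [61]
```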
[7:08] It's also possible to have conflicting columns. Here, we want to make sure that the only duplicated column between our demographics and body measures dataframes is the sequence column. If we cast the list of columns in demo to a set, we can find the intersection of it and the columns of the bmx dataframe. An intersection tells us which values are present in both sets. It looks like our sequence column is in both dataframes, but so is the column for age.
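The column check can be sketched like this; the column lists are trimmed-down assumptions, with BMXWT standing in for a body-measures column:

```python
import pandas as pd

demo = pd.DataFrame(columns=["SEQN", "RIDAGEYR", "RIDEXMON"])
bmx = pd.DataFrame(columns=["SEQN", "RIDAGEYR", "BMXWT"])

# set.intersection accepts the column Index directly.
shared = set(demo.columns).intersection(bmx.columns)
print(shared)  # SEQN and RIDAGEYR appear in both frames
```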
[7:34] The codebooks say that age should only be recorded in the demographics file, so let's drop it from the body measures dataframe. Great, now the only column in both dataframes is the sequence column. Now we can merge these two datasets into a single dataframe using the merge function.
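The drop-and-merge can be sketched as follows, again on toy frames with assumed column names:

```python
import pandas as pd

demo = pd.DataFrame({"SEQN": [1, 2], "RIDAGEYR": [30, 41]})
bmx = pd.DataFrame({"SEQN": [1, 2], "RIDAGEYR": [30, 41], "BMXWT": [70.5, 82.1]})

bmx = bmx.drop(columns="RIDAGEYR")       # age belongs only in demographics
merged = pd.merge(demo, bmx, on="SEQN")  # join on the shared sequence column
print(merged.columns.tolist())  # ['SEQN', 'RIDAGEYR', 'BMXWT']
```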
[8:07] And as before, we want to save any changes we've made to our dataset. Before we move on, there are some more practice exercises for you. The teacher's notes have a link to the practice notebook for this lesson. You'll be fixing any duplicate rows or columns in the occupation file, and then you'll be merging all three files into a single dataset. Don't forget to save your files again when you finish. Once you've completed these exercises, you'll be ready for the next lesson. In our next lesson, we'll cover a couple more data issues: infeasible data and extreme data.