Duplicated Data (9:02) with Alyssa Batula
Sometimes a dataset may have the same information recorded multiple times. This duplicated data can cause problems with our analysis.
- Outlier - An example that is highly different from the majority of examples.
Data Source Information:
National Health and Nutrition Examination Survey (NHANES) main site
NHANES 1999-2000 information
Body Measurements Documentation/Codebook
Python and Pandas Resources:
In our last lesson, we talked about missing data in our dataset. Now we're going to take a look at the opposite problem: a dataset with duplicate information. We want to make sure each example is only counted once. Otherwise, our analysis could place too much importance on the related examples since, from the computer's perspective, those sets of values occur several times in our dataset.

We'll be looking for three types of duplicates. Perfect duplicates, where the rows or columns are completely identical. Split duplicates, where information is split between the duplicated rows or columns. And conflicting duplicates, where there are discrepancies between the data contained in different rows or columns. When dealing with duplicates, it's important to keep in mind whether it's possible to have perfectly duplicated entries. For example, we could have two people named Sam Smith who are 54 years old, but it's less likely to have two people with the same name, address, and birthday.

We'll be using the data files you saved after the last lesson, but you can also download the exact version I'll be using from the links in the teacher's notes. There's also a link to the code I'll be using. Go ahead and pause this video while you download any files you need and set up your programming environment. Once you're ready, let's get started.

Let's load in the Pandas and NumPy libraries, and now we load in our data. Let's load in all three data files this time. Let's look for any duplicated rows in the demographics dataframe. We know the sequence column should have a unique number for each person, so let's see if it has any duplicates. First, we sort the dataframe so that the numbers in the sequence column are in ascending order. This isn't necessary, but it does make the data easier to look at.
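The sort-and-mark step described above can be sketched as follows. This is a minimal example using a tiny made-up dataframe in place of the real NHANES demographics file (the SEQN and RIDAGEYR column names follow the NHANES codebook, but the values here are invented for illustration):

```python
import pandas as pd

# Tiny synthetic stand-in for the NHANES demographics dataframe.
demo = pd.DataFrame({
    "SEQN":     [1, 2, 2, 3, 4, 4],   # sequence number: should be unique per person
    "RIDAGEYR": [54, 30, 30, 41, 25, 25],
})

# Sort by sequence number so duplicates sit next to each other
# (not required, but easier to read).
demo = demo.sort_values("SEQN")

# keep=False marks *every* row whose SEQN appears more than once,
# not just the second and later occurrences.
ind = demo["SEQN"].duplicated(keep=False)

print(ind.sum())          # how many duplicated rows there are
print(demo[ind].head())   # first few duplicated rows
```

With `keep=False`, both halves of each duplicate pair are flagged, which is what we want when we intend to inspect and repair them rather than simply discard the extras.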
Now we use the Pandas duplicated method to find all rows where the sequence column is duplicated. Setting the argument keep to False tells it to mark all duplicated rows. We'll store these in the variable ind, for index. Now let's print out the number of duplicated rows, and let's also print the first few rows of all the columns. We have 1,213 duplicated rows in our dataset. We can see the first five here, but there's far too much data to go through everything by hand.

Let's start with the simplest case: getting rid of any pure duplicates. Pandas has a drop duplicates method for dataframes that will drop all rows where all entries are identical. In our case, we know each entry should have a unique value in the sequence column, so it's safe to drop all identical duplicates. Now let's take another look at how many repeated entries we have by copying the code we used earlier. Now we're down to 814 duplicates.

Next, we'll look for examples where an entry is split across multiple rows. One example of how this could happen is if data is collected on multiple days and the data entry process starts a new entry every time new data is entered. We want to use the other duplicated rows to try and fill in all the missing values. If this creates a perfect match across all the rows, we can then drop all but one of them. To do this we use the itertools library for its useful product function, which we'll need in a moment. Now we need a unique list of all IDs that are duplicated. We can get this by selecting only the duplicated rows in the SEQN column, and then using the unique method to show each ID only once. We want to iterate over all duplicate IDs, so we use a for loop. We need to get the location of all rows that match our current ID number. And here is where that product function comes in.
We pass in our duplicated rows and specify the repeat parameter to be 2. It will return all possible pairs of these rows. This is almost identical to two nested for loops, where each loops over every possible row; we just skip the cases where both rows are the same. Now, for each pair of rows, we want to fill in the missing values of one row with the values of the other. We can do this with the dataframe's fillna method. It will fill in all not-a-number entries from the first row with the data from the row passed as an argument. Now let's drop all our duplicates again and see how many are left. We're down to 416 duplicated rows.

All remaining duplicates should belong to our third case: they have conflicting information. We can see an example of this in the duplications for ID 60, where the values for column RIDEXMON don't match. We have no way of knowing which value is correct, so it's safest to drop all rows with conflicting information. First, we need to get all duplicated rows. Then we remove them from our dataset by selecting only those that aren't duplicated. And after printing our duplicated rows, we can see that we've gotten rid of the last of them.

It's also possible to have conflicting columns. Here, we want to make sure that the only duplicated column between our demographics and body measures dataframes is the sequence column. If we cast the list of columns in demo to a set, we can find the intersection of it and the bmx dataframe's columns. An intersection tells us which values are present in both sets. It looks like our sequence column is in both dataframes, but so is the column for age. The codebooks say that age should only be recorded in the demographics file, so let's drop it from the body measures dataframe. Great, now the only column in both dataframes is the sequence column.
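The split-duplicate repair and the conflicting-row cleanup can be sketched together. This is a minimal version with invented values (RIDEXMON is a real NHANES column name, but the data is made up): SEQN 10 is split across two half-filled rows that should merge into one, while SEQN 20 has a genuine conflict and must be dropped entirely. Note that `itertools.product` yields every ordered pair including a row paired with itself, so we skip those pairs explicitly:

```python
import itertools

import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "SEQN":     [10,     10,     20,  20],
    "RIDAGEYR": [33,     np.nan, 50,  50],
    "RIDEXMON": [np.nan, 1.0,    1.0, 2.0],   # SEQN 20's values conflict
})

# Unique IDs that occur more than once.
dup_ids = demo.loc[demo["SEQN"].duplicated(keep=False), "SEQN"].unique()

for dup_id in dup_ids:
    # Positional indices of every row carrying this ID.
    rows = np.where(demo["SEQN"] == dup_id)[0]
    # Every ordered pair of those rows, like two nested for loops.
    for i, j in itertools.product(rows, repeat=2):
        if i == j:
            continue  # skip a row paired with itself
        # Fill row i's missing values from row j.
        demo.iloc[i] = demo.iloc[i].fillna(demo.iloc[j])

# Split duplicates are now perfect duplicates, so they collapse to one row...
demo = demo.drop_duplicates()

# ...and anything still duplicated must be conflicting, so drop all of it.
ind = demo["SEQN"].duplicated(keep=False)
demo = demo[~ind]
print(demo)
```

After the loop, both SEQN 10 rows hold the same fully filled values and merge into one, while the conflicting SEQN 20 rows remain distinct and are removed by the final filter.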
Now we can merge these two datasets into a single dataframe using the merge function. And as before, we want to save any changes we've made to our dataset.

Before we move on, there are some more practice exercises for you. The teacher's notes have a link to the practice notebook for this lesson. You'll be fixing any duplicate rows or columns in the occupation file, and then you'll be merging all three files into a single dataset. Don't forget to save your files again when you finish. Once you've completed these exercises, you'll be ready for the next lesson. In our next lesson, we'll cover a couple more data issues: infeasible data and extreme data.
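The column-overlap check and the final merge can be sketched together. Again, these are tiny invented dataframes standing in for the demographics and body measures files (BMXWT follows the NHANES naming style, but the values are made up):

```python
import pandas as pd

demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [54, 30, 41]})
bmx = pd.DataFrame({
    "SEQN":     [1, 2, 3],
    "BMXWT":    [80.1, 62.5, 70.0],
    "RIDAGEYR": [54, 30, 41],        # age snuck into the body measures file
})

# Columns shared by both frames; ideally only SEQN should overlap.
shared = set(demo.columns) & set(bmx.columns)
print(sorted(shared))

# The codebook says age belongs in demographics only, so drop it from bmx.
bmx = bmx.drop(columns=["RIDAGEYR"])

# Merge into a single dataframe, one row per person, joined on SEQN.
merged = pd.merge(demo, bmx, on="SEQN")
print(merged)

# As in the lesson, save the cleaned result (filename is illustrative).
merged.to_csv("nhanes_merged.csv", index=False)
```

Dropping the redundant column before merging matters: if both frames still carried RIDAGEYR, pandas would keep both copies under suffixed names like `RIDAGEYR_x` and `RIDAGEYR_y` instead of raising an error, quietly reintroducing the duplication we just cleaned up.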