Missing Data12:31 with Alyssa Batula
Now we will move beyond the simple data issues and look at instances of missing data.
- Permutation - Every possible ordering of elements in a list, without repeats.
Data Source Information:
National Health and Nutrition Examination Survey (NHANES) main site
NHANES 1999-2000 information
Body Measurements Documentation/Codebook
Python and Pandas Resources:
In our last lesson, we talked about fixing or removing data from our dataset. 0:00 Now we're going to use our practice dataset for 0:05 some real examples of finding and dealing with missing data. 0:07 We'll be using the version of the files you saved at the end of the first video. 0:11 But you can also download the version I'll be using from the links in 0:15 the teacher's notes. 0:18 This also links the code I'll be using. 0:20 Go ahead and pause this video while you download any files you need and 0:22 set up your programming environment. 0:26 Ready? 0:28 Great, let's look at our missing data. 0:29 Some data may already be missing while others maybe be removed later due 0:31 issues you uncover during the data cleaning process. 0:36 In this lesson, we use both the pandas and numpy libraries. 0:39 And now we load in our data. 0:48 In this lesson, we use both the Demographics, and BodyMeasures data. 0:50 Now, let's take a look at our missing data. 1:05 Some of the first questions you might ask, 1:08 are how many elements are missing from each column? 1:10 And what percentage of the data is missing from each column? 1:13 We can get the number of good values in each column 1:16 using the count method on our data frame. 1:18 We can get the total number of rows by finding the length of the index. 1:22 Subtracting those two numbers will get us the total number of missing elements in 1:30 each row. 1:34 There are a lot of columns, but 1:37 we can use the head method to show only the first few. 1:39 We can display the amount of missing data as a percentage by dividing by the total 1:42 number of rows, then multiplying by 100. 1:47 We can also look at how much information is missing from each row. 1:58 Using the DataFrames' isnull method, 2:02 we can add up the number of missing entries in each row. 2:04 Now we'll use the len function to get the number of columns. 2:15 Dividing the number of missing elements in each row by the number of columns then 2:25 multiplying by 100 gives us the percentage of data missing from each of the rows. 2:29 As we mentioned in the previous lesson, any rows or columns with 2:43 a large number of missing information will probably need to be fixed or removed. 2:46 But remember, these are only the entrance mark as not a number. 2:51 There may be other entries we also want to remove. 2:54 Let's take a closer look at the column for students datas. 2:58 According to the code book valid answers are, in school. 3:02 On vacation from school, between grades, neither in school. 3:06 Or on vacation from school, between grades. 3:10 Refused, and don't know. 3:13 The last two responses, refused and don't Know, are not missing entries. 3:16 But in most cases, 3:21 they aren't meaningfully different from truly missing data. 3:22 We can remove those values from our dataset 3:25 if we want all of our missing data represented by not a number. 3:27 Going back to our code, let's use the same method we used before to get 3:31 the percentage of missing data in the student status column. 3:35 About 67% of the data is missing. 4:05 We can get the number of refused answers by taking a sum of the number of entries 4:08 equal to 7. 4:12 And for don't know responses, we add up the number of entries equal to 9 4:19 Now let's print out our results. 4:30 There are no refused answers, but 2 who responded that they don't know. 4:54 Let's remove these entries so 4:58 that all missing information in this column is marked as not a number. 5:00 First, find all entries in that column greater than 3 since 3 is the largest code 5:04 that contains meaningful information. 5:08 And we set all values in that column at those indices to be NumPy dot not 5:16 a number. 5:21 And we look at all the unique values in that column, 5:33 we should see not a number 1, 2, or 3. 5:36 And there we have it. 5:45 No entries for 7 or 9. 5:46 Let's re-run our code that shows how much information is missing. 5:49 We don't have to change anything so we can just copy and paste it from before. 5:52 Those two entries didn't increase percentage of missing data by much. 6:20 We can verify again that there are no more entries of 7 or 9 in that column. 6:24 We also need to remove incorrect entries where the wrong type of data has 6:29 been entered. 6:33 Let's look at the data type for the first few demographics called demo, and 6:34 body measures called BMX. 6:39 Before printing out the first few columns, 6:41 let's have Python print the name of the data file so our output is easier to read. 6:43 Then we can look at the d-types properly to see the type of each column. 6:47 We're only interested in the first few columns so 6:51 we use the head method to limit the display. 6:54 And we do the same for our body measures data. 6:58 The sequence column should exist in all files using the same number for 7:10 each participant across the files. 7:14 But in demographics it's stored as integer data. 7:16 While in body measures it's stored as an object or string type. 7:19 Pandas has a to numeric function that will convert a data frame to numeric values. 7:22 Let's use it to convert the sequence column in body measures to integer data. 7:27 We use the downcast argument to specify, 7:37 that we want our data converted to integers. 7:39 But when we run it, we get an error. 7:45 It looks like someone entered the person's name, 7:50 Jane Doe, instead of their ID number. 7:52 We had an error because Pandas doesn't know how to convert a name into a number. 7:55 The two numeric function has another argument called errors 8:00 that tells it how to handle any errors that runs into while converting the data. 8:03 If we tell it to coarse the error, 8:07 it will insert not a number anywhere that the conversion fails. 8:09 No errors this time. 8:18 Let's see what the type is now. 8:20 Here's an example of why it's always good to double check that your code is doing 8:27 what you expect. 8:30 The type still isn't correct. 8:32 It's not longer an object type, but 8:34 now it's floating point instead of the integer values we wanted. 8:36 This is because the not a number entries are considered floats and not integers. 8:39 If this were only data file, we could probably generate a new ID number for 8:44 each of the participants. 8:49 As long as the other data in those rows were still okay. 8:50 But here, 8:53 the values in the sequence column need to match across different files. 8:54 And we have no way of knowing which ID number they should have had. 8:58 We can remove all rows with invalid IDs and 9:01 then we'll be able to convert the column to integer values. 9:04 First, we use NumPy's as NaN function to get a Boolean series telling us whether 9:07 each element in the column is not a number. 9:11 We can drop those rows by selecting the inverse of the Boolean values. 9:18 And now, let's try converting the column to integers again. 9:29 There we have it, the column is integer data now. 10:00 We could also remove any data containing sensitive information, like full names, 10:03 social security numbers, or credit information. 10:08 In this server, we have marital status for everyone aged 14 and above. 10:11 But we may want to remove that information for everyone considered a minor, which for 10:15 our purposes means anyone under 18. 10:19 So first, lets locate all rows with the age under 18. 10:22 Using the count method, we can see how many entries we have for minors. 10:35 It looks like there are 1,244 entries for marital status for minors. 10:52 To get rid of these data, replace all these entries with 10:57 not a number then print out the new count for those rows. 11:00 Now there are no entries in the marital status column for anyone under age 18. 11:25 Finally, remember to save your modified data set using the to_csv method, so 11:30 you can use it in future lessons. 11:35 Great work, we've just learned a lot about missing data in our dataset. 11:49 Before moving on, I have more practice exercises for you. 11:53 The teacher's notes have a link to the practice notebook for this lesson. 11:56 This notebook has instructions for two practice exercises. 12:00 First, identify the column in demographics 12:03 with the highest percentage of missing information. 12:06 Second, remove all the pregnancy status information for minors like we did for 12:09 their marital status. 12:13 Don't forget to save your data again when you're finished. 12:14 When you're finished take a moment to relax. 12:17 We have just covered a lot of information. 12:19 There's a lot that goes into dealing with missing data. 12:22 In the next video, we'll be looking at what to do when we have the same data 12:24 recorded in our data set multiple times. 12:28
You need to sign up for Treehouse in order to download course files.Sign up