Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7day trial. Sign In Enroll
Start a free Courses trial
to watch this video
Sometimes data is not obviously wrong or missing, but contains a more subtle error that can be found with knowledge of our data and how it was collected.
Data Files:
Data Source Information:

National Health and Nutrition Examination Survey (NHANES) main site

NHANES 19992000 information

Demographics Documentation/Codebook

Body Measurements Documentation/Codebook

Occupation Documentation/Codebook
Python and Pandas Resources:
Sometimes data isn't obviously wrong or missing, it has a more subtle error. 0:00 Many of these errors can only be found by using our knowledge of the data and 0:05 how it was collected. 0:08 We'll be continuing to fix our example data set. 0:10 Hopefully you saved the changes you made in the previous lessons, but 0:13 if not, the updated versions are available in the teacher's notes. 0:16 There's also link to the code I'll be using, if you'd like to download that. 0:20 Go ahead and pause this video while you download any files you need and 0:24 setup your programming environment. 0:28 Ready, great. 0:30 We're going to start off by talking about nonsensical or impossible data. 0:32 These are data entries that are nonsensical or impossible, based on 0:36 the way that the data site was created or the nature of what was recorded. 0:40 First, we want to load in the libraries we'll need, numpy and pandas. 0:44 We'll also be working with all three files again, so I'll load those as well. 0:50 Now, let's find some impossible data. 1:07 Weight is a good example for impossible data. 1:10 We know a person can never have a negative weight. 1:12 So let's print a description of our weight column. 1:15 According to the summary of the weight column, 1:21 the lowest recorded weight is negative 149 kilograms. 1:23 We are not sure with the way we are suppose to have been, so 1:27 we can just remove all values less than zero. 1:30 First we find all rows with the way less than zero. 1:33 Then we replace all those values with not a number. 1:39 Now, we can view the description of our data again. 1:47 Now our lowest weight is just over three kilograms, a feasible value for a baby. 1:53 This is a good sanity check for weight, but it's very simplistic. 1:58 After all, we have all ages in our data set. 2:02 And a feasible weight range for 2:05 a two year old would be drastically different than one for a 50 year old. 2:06 If we wanted to be more careful, we could design more complicated checks and 2:10 take a participant's age into account. 2:14 More complicated checks are outside the scope of this course. 2:16 But something to keep in mind as you're cleaning your own data. 2:19 Categorical values can also have impossible values. 2:23 Let's take a look at the possible entries for weight comment. 2:26 The column for 2:29 weight comment as four possible values, coded as one through four. 2:30 Any other entry in that column is an error. 2:36 Let's take a look at the unique entries in the weight comment column. 2:39 It looks like the numbers 7 and 11 have also been entered. 2:47 Since they aren't valid entries, we need to remove them from our data set. 2:50 Again, we'll find all the invalid entries and replace them with not a number. 2:54 Let's see what our new data entries are now. 3:07 There we have it, only entries of one through four, or not a number. 3:16 Often data entries are problematic because they have extreme values. 3:21 According to Guinness World Records, 3:24 the heaviest person in the world weighed 635 kilograms at their heaviest. 3:26 Let's take another look at our weight column. 3:31 Our data set has a maximum weight of over 12,000 kilograms. 3:37 That's much higher than the world record. 3:41 Something must be wrong with those entries. 3:43 Since they don't make sense, let's remove all weights heavier than 635. 3:44 We can do the same thing as before. 3:50 Find all the rows with a weight above 635. 3:53 Then replace those rows with not a number. 3:58 And then finally we can look at our data again. 4:05 Now the maximum weight is a much more reasonable 193 kilograms. 4:10 Some analyses will also remove outliers, or values too far from the average. 4:15 Values too far from normal could bias the data set, or 4:20 there may be something wrong with this data to cause those values. 4:23 On the other hand, if we don't have a good reason to suspect a problem with them 4:27 these outliers may be valid, although unusual, data entries. 4:31 In that case, removing them eliminates important information from the data set. 4:35 Whether you remove outliers or not, will depend on your particular circumstances. 4:40 One way to find outliers is to determine how many standard deviations a value 4:44 is from the average for that group. 4:48 This is sometimes called the zscore. 4:50 To find the zscore of the maximum and minimum weight value, 4:52 we first calculate the mean and standard deviation in the entire column. 4:56 Note that we need to use nanmean and nanstd so that the functions can deal with 5:00 the present of not a number of values correctly. 5:05 Let's find the zscore for these maximum and minimum weights. 5:09 Like with the mean and standard deviation, we use nanmin and nanmax to find 5:13 the maximum and minimum values, while ignoring not a number entries. 5:17 To get the zscore for the lowest weight, 5:29 we subtract the mean from our minimum value. 5:31 Then we divide by the standard deviation. 5:37 We do the same thing to get the zscore for the maximum weight. 5:40 And then let's print out the zscores we just calculated. 5:48 And I misspelled high as height. 6:14 There we go. Our maximum weight is four standard 6:17 deviations from the mean. 6:20 And our minimum weight is one standard deviation below the mean. 6:21 Finally, let's look for saturated data. 6:25 According to the code book, all ages above 85 were recorded as 85. 6:28 This means that the age value saturates at 85, 6:33 anything above that value is a mistake. 6:36 Also, we don't know the true age of anyone recorded as 85, 6:38 only that they are at least 85. 6:42 If a person's exact age is important, 6:45 we may want to remove all saturated entries or run a separate analysis for 6:47 people 85 and older, that does not include age as a term. 6:52 In other cases, 6:56 grouping all those over 84 into a single age may not affect the analysis. 6:57 Let's see what our maximum age is. 7:02 We have at least one entry above the saturation limit, 7:11 with a maximum listed at 109. 7:14 If we feel confident that someone just forgot to list ages at a maximum of 85, 7:16 we can replace all higher values with 85. 7:21 If we're concerned that some other error caused the problem, 7:24 it may be better to remove the values entirely. 7:27 Let's replace all values above 85 with a value 85. 7:30 Remember to save all the changes we've made to the data files. 7:44 That's it for the video portion of this lesson. 8:11 Now it's time for you to try out what you've learned on your own. 8:13 The teacher's notes have a link to the practice notebook for this lesson. 8:16 There are two practice exercises for you this time. 8:19 You'll need to check some of the columns in the occupation file for 8:22 nonsensible values. 8:25 Then you'll find the zscore of the maximum and 8:26 minimum values of column OCQ180 in occupation. 8:29 Make sure you don't count values 7777, or 9999, since those 8:33 are extra code values and not actually the number of hours worked. 8:36 Remember to save your data again when you're finished. 8:42 Afterwards, it's time for another quiz. 8:44 When you've completed the quiz, 8:46 we'll talk about ways to help automate the data cleaning process. 8:48
You need to sign up for Treehouse in order to download course files.
Sign up