Automating Data Preparation with Alyssa Batula
Some aspects of data clean-up and data-checking can be automated to save us time and effort.
- Software Testing - Verifying that the operation of a portion of computer code is correct.
Data Source Information:
- National Health and Nutrition Examination Survey (NHANES) main site
- NHANES 1999-2000 information
- Body Measurements Documentation/Codebook
Python and Pandas Resources:
Welcome to the last lesson in this stage of the course. We're going to be talking about automating the data cleaning process. Data cleaning involves a lot of human knowledge and decision making, but we can save ourselves some time by automating whatever processes we can. Many of these are types of sanity checks: making sure your data is in the ranges you'd expect and that your code is doing what you want it to do. In most cases, you'll have to use your knowledge of your data set to decide what steps you can automate.

You'll need the data files you saved from the last lesson, or you can download the exact files I'll be using from the teacher's notes. The code I'll be using is also available. Go ahead and pause this video while you download any files you need and set up your programming environment. Everything ready? Here we go.

Let's load in our libraries. We want both pandas and numpy for this lesson. And then we'll load all three of our data files.

NumPy supports testing with a set of methods that assert that something is true. If the assertion is true, the code will continue as normal. But if it isn't, it raises an AssertionError to alert you to the problem. We can use them to make sure that our data frames are the right size. To test a data frame's size, use np.testing.assert_array_equal. Pass in the shape as one argument and a 2-tuple of the expected shape as the other argument. The function will check to make sure the two arguments passed are identical. These numbers are correct, so when it runs nothing happens. If I change one of the expected values, we get an AssertionError letting us know that there's something wrong with our data.

This type of testing is especially helpful when we join our changed data frames. For example, when we join our three files we want to check that the number of rows and columns is correct. First, let's merge demo and bmx into a single data frame using an inner join on the SEQN column. Then we can join that dataset with the OCQ data frame, again with an inner join on SEQN. The only duplicated column should be the SEQN column. If we add the number of columns in our three datasets, then subtract two for the repeated and joined SEQN column, that should be the total number of columns in our new dataset. Let's get our expected total number of columns. We can check that this matches the actual number of columns with the assert_equal function.

We can also check that one value is less than another. For example, we can check that the new dataset has fewer rows than our original demographics dataset. We know there should be fewer rows because using an inner join will drop all participants who didn't also complete the physical exam or occupation survey. To do this we use the assert_array_less function. We pass two arguments, with the one we expect to be smaller first. A sketch of these checks appears below.
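Here's a minimal sketch of those checks, assuming the three files were saved as demo.csv, bmx.csv, and ocq.csv in the previous lesson (the file names and the expected shape are placeholders; substitute your own):

```python
import numpy as np
import pandas as pd

# Load the three NHANES data files saved in the last lesson.
# These file names are placeholders; use whatever names you saved.
demo = pd.read_csv('demo.csv')  # demographics
bmx = pd.read_csv('bmx.csv')    # body measurements
ocq = pd.read_csv('ocq.csv')    # occupation survey

# Shape check: nothing happens if the shape matches the expected
# 2-tuple; an AssertionError is raised if it doesn't.
expected_shape = (9965, 145)  # placeholder values
np.testing.assert_array_equal(demo.shape, expected_shape)

# Merge all three files with inner joins on the SEQN column.
merged = demo.merge(bmx, how='inner', on='SEQN')
merged = merged.merge(ocq, how='inner', on='SEQN')

# The merged frame should contain every column from the three files,
# minus two for the duplicated SEQN columns.
expected_cols = len(demo.columns) + len(bmx.columns) + len(ocq.columns) - 2
np.testing.assert_equal(len(merged.columns), expected_cols)

# The inner joins drop participants who are missing from any file,
# so the merged frame has fewer rows than demographics alone.
# assert_array_less takes the value we expect to be smaller first.
np.testing.assert_array_less(merged.shape[0], demo.shape[0])
```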
We can also remove all white space at once, rather than specifying individual columns. We can loop through all the columns in our data frame and call the strip method on each of them. We wrap that code in a try/except block, because calling that method on a numeric column will raise an error. First, loop through all the columns in demographics. Then start our try clause and strip the white space from the current column. We finish the try/except block by catching AttributeErrors; if we encounter one, we just pass and continue.

We can also speed up checks for valid data through automation. We still need to choose the acceptable values ourselves, but then we can automate the checking and replacing. There are many ways this can be set up, but one option is to create a dictionary with column names as keys and the acceptable entries as the values. For categorical data, we can use the data frame's isin method to find all values that are in the list passed as an argument. Let's start with discrete values. We'll make a dictionary with the valid codes for two columns: DMDBORN has values 1, 2, 3, 7, and 9, and DMDCITZN has values 1, 2, 7, and 9. Next, we iterate over each key in our dictionary. We find all rows where the value for that column is in our specified values. Then we can set all rows that don't match to be not a number.

For continuous values, we can use a similar method, but specify a valid range instead of individual values. Let's make a new dictionary specifying valid ranges for weight and height. Weight can range from 0 to 635, and height can range from 81.8 to 201.3. Now we loop through the specified columns again. We check whether the values in the column are within the specified range, and set any that aren't to be not a number.

Finally, we can automatically find any rows or columns with too much missing data. Here we set our maximum percentage of missing data to be 30. We can get a list of all columns with more than 30% of the data missing, to help us decide which columns need to be removed or fixed. First, get the number of valid entries for each column and the total number of rows. Subtract the number of valid entries from the total rows, divide by the total number of rows, and multiply by 100. Now we can get the columns whose missing-data percentage exceeds our maximum, and print them out to see which ones we found. A sketch of these steps is below.
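A minimal sketch of these cleaning steps, continuing from the data frames loaded above. The weight and height column names (BMXWT, BMXHT) are assumptions based on the NHANES body measurements codebook, and which frame each check runs on is an illustrative choice; adjust both to your own data:

```python
# Strip white space from every string column. Numeric columns raise
# an AttributeError when the .str accessor is used, so we skip them.
for col in demo.columns:
    try:
        demo[col] = demo[col].str.strip()
    except AttributeError:
        pass

# Categorical check: replace any value that isn't in the accepted
# list with NaN, using isin to find the matching rows.
valid_codes = {
    'DMDBORN': [1, 2, 3, 7, 9],
    'DMDCITZN': [1, 2, 7, 9],
}
for col, codes in valid_codes.items():
    matches = demo[col].isin(codes)
    demo.loc[~matches, col] = np.nan

# Continuous check: replace values outside a (low, high) range with
# NaN. Column names here are assumptions; adjust to your files.
valid_ranges = {
    'BMXWT': (0, 635),       # weight in kg
    'BMXHT': (81.8, 201.3),  # height in cm
}
for col, (low, high) in valid_ranges.items():
    in_range = bmx[col].between(low, high)
    bmx.loc[~in_range, col] = np.nan

# Find columns with more than 30% of their data missing.
max_missing_pct = 30
valid_counts = merged.count()  # non-missing entries per column
total_rows = len(merged)
pct_missing = (total_rows - valid_counts) / total_rows * 100
too_sparse = pct_missing[pct_missing > max_missing_pct].index.tolist()
print(too_sparse)
```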
Automating the simple parts of data cleaning can really speed up the process, help us make better decisions, and even catch some mistakes. There's just one more quiz between you and the end of this stage. Congratulations! In the next stage, we'll discuss ways to select only the most important data to include in our dataset.