Simple Data Issues8:37 with Alyssa Batula
In this video, we will use the provided dataset to find and fix some common problems in the dataset.
- Jupyter Notebook -- Web application for creating documents containing live code and explanatory text.
Scroll to the bottom of the Download File to access the "Individual Practice" and the three tasks associated.
Data Source Information:
National Health and Nutrition Examination Survey (NHANES) main site
NHANES 1999-2000 information
Body Measurements Documentation/Codebook
Python and Pandas Resources:
[MUSIC] 0:00 Now that we know a little bit about our dataset and the data cleaning process, 0:04 let's take a closer look at some common issues using our example dataset. 0:08 We'll be getting hands-on practice with our dataset. 0:13 Starting with some simple yet 0:16 important issues you may find during the data cleaning process. 0:17 We'll be using the same data sets we downloaded in the last stage. 0:21 You can also find the link to download them again in the teacher's notes. 0:25 There are also links to the code books for the original data sets 0:29 as well as a Jupiter notebook with the code I'll be using in this video. 0:32 Feel free to follow along using whatever programming environment you like best. 0:36 Take a moment to pause this video and make sure that you have the data files, 0:40 open the dataset codebooks, and set up your environment the way you want it. 0:45 Remember, you can slow down or pause the video at any point 0:49 if you need more time to go over some code or finish typing. 0:53 First, let's load in the pandas library, and call it pd for short. 0:57 Now, we can use pandas to load in our data files. 1:03 Make sure you are working in the same directory as your files. 1:06 Once we have our data, let's take a look at the type of data stored in each column. 1:11 There are a lot of columns in our demo data frame. 1:15 So we'll just look at a few of them for now, sequence, age, 1:18 gender, and military status. 1:25 We can find the data type of each row using panda's dtype property. 1:32 We can see that the sequence column is stored as an integer. 1:37 Age is stored as a float or decimal number. 1:40 The remaining columns are all objects, which is what pandas call strings. 1:43 We can used panda's unique method to look at all the unique entries in a column. 1:48 Let's print the length of the unique elements of the column for 1:53 military status coded as DMQMILIT. 1:57 From our output, we can see that there are 38 unique entries in this column. 2:05 That sounds like a lot of different options. 2:10 So let's see what they are. 2:12 This time, let's print all of the unique values. 2:13 We can see that many of the different entries should actually be the same value. 2:21 There are just extra spaces at the beginning or 2:25 end of the entry that make them appear different. 2:27 For example, here we have the word no with two spaces in front of it. 2:30 And here we have No with a single space in front of it, both as separate entries. 2:34 Both values mean the same thing, so we want to combine them into the same value. 2:39 We can use this using the data frame's strip method. 2:44 This will remove all that extra white space from the entire column. 2:47 Now that we've gotten rid of the white space, let's print out the number and 3:03 values of the unique entries again. 3:06 We've gotten rid of many of the extra entries, but it's still not quite right. 3:17 Sometimes yes and no are written out, 3:21 while other times only the first letter is used. 3:23 Let's replace all entries using a single letter to use the entire word. 3:28 Pandas also has a handy replace method that we can pass a set of nested 3:32 dictionaries, or a dictionary within a dictionary. 3:36 Telling it what we want to replace. 3:39 The outer dictionary's keys tells pandas which column to look at. 3:41 Here we pasted the code for the military status column. 3:45 The value for that key is another dictionary. 3:53 This interdictionary has the data entry values to be replaced as keys. 3:56 The value for each key is what we want to replace the key with. 4:01 So we use Y for the key and yes for 4:05 the value in order to replace all of the entries of Y with yes. 4:07 Note that we only specify values that we want to change. 4:12 Next, we pass this dictionary into the replace method. 4:16 We also set the inplace argument to true so 4:21 that it will modify our demo data frame. 4:23 If we didn't specify that, 4:26 it would return a new data frame instead of changing the one we already have. 4:28 Now let's print the number and contents of all unique entries for that column again. 4:32 After running this code, we can see that we are down to the five expected values. 4:43 Not a number, for missing data, yes, no, don't know, and refused. 4:48 Let's do this again for the citizenship status column. 4:53 According to the code book, there should be five unique values, 5:12 citizen by birth or naturalization, Not a citizen of the US. 5:16 Refused, don't no, or missing, which is listed as not a number. 5:19 This one has 31 different entries. 5:24 Unlike the last column, this one has a lot of white space issues too. 5:27 We can use the same method to remove the extra white space for this column. 5:31 And then see how many unique values are left. 5:47 We still have eight values left, and 5:56 it looks like there are some typos with the Don't know entry. 5:58 One or more are missing the apostrophe, 6:01 and there's also at least one that capitalize the K in Know. 6:03 Someone also used the word unknown instead of don't know when entering the data. 6:08 Let's fix these typos with a replace method. 6:13 First, we need to create a new dictionary telling pandas to replace all those typos 6:15 with the words don't know in the citizenship column. 6:20 Then we pass this new dictionary into the replace method. 6:43 And look at our unique entries for that column. 6:54 Success. We now only have the five entries we're 7:04 looking for. 7:06 We could also have fixed both columns at once, by passing the replace method 7:07 a larger dictionary, with information for both columns. 7:10 Finally, remember to save your modified data set using the to csv method, 7:34 so you can use it in future lessons. 7:39 Now it's your turn. 7:46 We've fixed white space issues and typos in two of our columns. 7:48 But those aren't the only ones in our data set. 7:52 There's a link to a practice notebook in the teacher's notes. 7:54 This notebook has instructions for three practice exercises. 7:57 First, find all of the columns in the demographics file 8:01 that contain string data. 8:04 Second, get rid of all the excess white space in these columns, and 8:06 finally, get rid of any typos and inconsistencies, so 8:09 that all the string entries match the ones listed in the code book. 8:13 Once you finish, remember to save your data again. 8:16 You can see what the text entry should be for each column by looking for 8:19 the column key, then checking the entry keys under value description. 8:23 That is it for this video. 8:27 Once you finish these practice exercises, you will be ready for the next lesson. 8:28 We will look at more ways to use the replacement method 8:32 to format our data the way we want it. 8:35
You need to sign up for Treehouse in order to download course files.Sign up