Sensible Column Names and Values (6:12) with Alyssa Batula
In this video, we will explore more simple data fixes, including using sensible column names and values. We will stress the importance of meaningful column headers and replace string entries with numeric codes.
Data Source Information:
National Health and Nutrition Examination Survey (NHANES) main site
NHANES 1999-2000 information
Body Measurements Documentation/Codebook
Python and Pandas Resources:
In the last lesson, we saw how much trouble white space and typos can cause in a data set. A simple typo or extra space in an entry can make our computers think two entries are different when they should really be the same. One way of getting around this is to use code values for entries instead of text strings. Using a numerical code can eliminate most spelling and white space issues. So let's take a look at our data set and replace all of our string entries with numeric codes.

If you don't have your data set from the previous lessons, you can download the one I'll be using through the link in the teacher's notes. The links to the codebooks for the original data sets are also available in the teacher's notes, as well as a Jupyter notebook with all of the code I'll be using during this video. Take a moment to pause this video and make sure that you have the data files. Open the data set codebooks and set up your environment the way you want it. Remember, you can slow down or pause the video at any point if you need more time to go over some code or finish typing.

First, let's load in the pandas library again, and then we can use it to load in our data file. Now, let's replace the text entries with numeric codes for the military status and citizenship status columns. These are the same two columns we worked with in the last video to fix the white space issues. To find the codes to use, go to the Demographics codebook. If you don't already have it open, there's a link to it in the teacher's notes. To the right of the text entries, we can see a column called Code or Value. We're going to be replacing the strings with these values.

First, we'll create a set of nested dictionaries like last time. The first key is the coded name of the military status column.
Its value is a second dictionary where each key is the text string and the value is the numeric code. So we start with the key Yes and the value 1. Then, key No and value 2. Refused has code 7 and Don't know has code 9. Now we can close the dictionary for military status and do the same thing for citizenship status. Citizen by birth or naturalization has a code of 1. Not a citizen of the US has a code of 2. Refused is 7 and Don't know is 9. And it's probably best if we remember all of our commas. There we go.

We pass this dictionary into the replace method like we did in the last video. Also, like the last video, we set the inplace argument to True. And now let's print out the unique entries for these two columns using the unique method. As we hoped, all of the string entries have been replaced with numbers. Now we won't have to worry about typos or white space issues in these columns.

We've done quite a bit of replacing data entries with the replace method. We can also change the names of columns with a similar rename method. While coded entries are great for computers, they're harder to read and understand. What if we wanted to rename these two columns to a human readable string? For this method, we use a single level dictionary. Each key is the current column name and the value is the text you want to replace it with. Let's replace DMQMILIT with Veteran/Military Status, and DMDCITZN with Citizenship Status. To make the replacement, we pass this dictionary into the rename method. And don't forget to set inplace to True again.

And we can view the columns to see that our changes were made. We don't need to look at all the columns, but we can see the ones we changed if we look at columns 10 through 15. And there we are. Nice, human readable columns.
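The replace and rename steps described above can be sketched as follows. This is a minimal illustration and not the notebook from the teacher's notes: the small DataFrame is a made-up stand-in for the lesson's data file, and it assumes the NHANES column names DMQMILIT and DMDCITZN. The dictionary is passed to rename through the `columns` keyword, because a dictionary passed positionally is applied to the row index instead.

```python
import pandas as pd

# Hypothetical sample standing in for the lesson's data file.
df = pd.DataFrame(
    {
        "DMQMILIT": ["Yes", "No", "Refused", "Don't know", "No"],
        "DMDCITZN": [
            "Citizen by birth or naturalization",
            "Not a citizen of the US",
            "Refused",
            "Don't know",
            "Citizen by birth or naturalization",
        ],
    },
    dtype=object,  # plain Python strings, kept as object columns
)

# Outer keys are column names; each inner dictionary maps text to its code.
codes = {
    "DMQMILIT": {"Yes": 1, "No": 2, "Refused": 7, "Don't know": 9},
    "DMDCITZN": {
        "Citizen by birth or naturalization": 1,
        "Not a citizen of the US": 2,
        "Refused": 7,
        "Don't know": 9,
    },
}
df.replace(codes, inplace=True)

# Check that only numeric codes remain in each column.
print(df["DMQMILIT"].unique())
print(df["DMDCITZN"].unique())

# Single-level dictionary: current column name -> new column name.
new_names = {
    "DMQMILIT": "Veteran/Military Status",
    "DMDCITZN": "Citizenship Status",
}
df.rename(columns=new_names, inplace=True)
print(df.columns)
```

In the full data set, a slice such as `df.columns[10:15]` would show only the renamed columns' neighborhood; the sample here has just two columns, so printing all of them is enough.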
We can use the replace and rename methods to go back and forth between computer friendly coded values and human readable string entries depending on which version we need for an analysis.

Now that we've changed some columns to coded values, let's take another look at the column types. Remember, we can use the dtypes attribute of our data frame to see what type each column contains. We can see that the columns for Veteran/Military Status and Citizenship Status are now floats instead of objects as before. This means we can now perform numeric operations on this data, such as adding and subtracting values. Whether or not mathematical operations are meaningful for categorical data like this is a topic for another course.

Finally, remember to save your modified data set using the to_csv method so you can use it in future lessons.

And now it's time for some practice on your own. There's a link to a practice notebook in the teacher's notes, which has some exercises for you to complete. This notebook has instructions for two practice exercises. First, convert all columns with string data to use the coded values listed in the demographics codebook. Then, change the two column names we modified back to the coded values listed in the notebook. Don't forget to save your data again when you're done.

Once you've completed these exercises, you'll be ready for the next lesson. We'll be discussing when you can try to fix your data and when it's best to take it out of the data set entirely.
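As a rough sketch of the type check, the round trip back to readable text, and the save step from this lesson: the DataFrame and the file name `demographics_coded.csv` below are hypothetical, not the course files. The sample starts with float columns and one missing entry to mirror what the video shows. In your own session the type reported after replace may differ by pandas version; if a column is still listed as object, `pd.to_numeric` is one way to make the conversion explicit.

```python
import os
import tempfile

import pandas as pd

# Hypothetical coded columns, including one missing entry.
df = pd.DataFrame(
    {
        "Veteran/Military Status": [1.0, 2.0, None, 9.0],
        "Citizenship Status": [1.0, 1.0, 2.0, 7.0],
    }
)

# dtypes lists the type stored in each column.
print(df.dtypes)

# Invert a text-to-code mapping to get readable entries back.
military_codes = {"Yes": 1, "No": 2, "Refused": 7, "Don't know": 9}
military_labels = {code: text for text, code in military_codes.items()}
readable = df["Veteran/Military Status"].replace(military_labels)
print(readable)

# Save the coded data set, then reload it to confirm it was written.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demographics_coded.csv")  # hypothetical name
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```

Passing `index=False` keeps the row index out of the file, so reloading does not add an extra unnamed column.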