Exploring Your Dataset (7:15) with Alyssa Batula
In this video, we will take our first look at the dataset we will be using as an example throughout this course.
- Jupyter Notebook -- Web application for creating documents containing live code and explanatory text.
Data Source Information:
National Health and Nutrition Examination Survey (NHANES) main site
NHANES 1999-2000 information
Body Measurements Documentation/Codebook
Python and Pandas Resources:
Let's take our first look at the dataset we'll be using as an example in this course. Our dataset is a modified version of the 1999-2000 National Health and Nutrition Examination Survey. The original data comes from a series of studies used to determine the health and nutrition of both adults and children in the United States. The data is split into multiple files and contains information from both interviews and physical examinations of a sample of the population. The collected data includes demographic, socioeconomic, dietary, medical, and physiological information from over 5,000 people. This should give us plenty of data to explore. The researchers who conducted the study have already done some cleaning of the dataset. For this course, I've modified the data so that it contains examples of the different types of problematic data we discussed previously.

Now let's take our first look at the data. You can download the data files from the links provided in the teacher's notes. I'll be using Jupyter Notebooks for this course, but feel free to use whichever environment is most comfortable for you as you follow along. You can also download the Jupyter Notebook containing all the code that I'll be using from the teacher's notes. Take a moment to pause this video and download those files. Remember, you can slow down or pause the video whenever you'd like more time to work with the code or data files.

First, we need to load our data and extra libraries into Python. We'll use the pandas data analysis library for Python, which provides some convenient tools and data structures. If you'd like to learn more about pandas, there are learning resources available in the teacher's notes for this video. Start up your Python environment in the same folder as the data files.
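As a runnable sketch of the loading pattern used in this course (the filename in the comment is a placeholder, not the actual NHANES file name; the demo reads from an in-memory CSV so it works without the downloads):

```python
import io
import pandas as pd

# With the downloaded course files you would load each table directly, e.g.:
# demographics = pd.read_csv("demographics.csv")  # placeholder filename

# Self-contained demo: read_csv accepts any file-like object,
# so we can feed it a small in-memory CSV string instead.
csv_text = """SEQN,RIAGENDR
1,Male
2,Female
3,Male
"""
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.shape)  # (3, 2): three rows, two columns
```

The same `pd.read_csv(...)` call works for each of the three course files, producing one DataFrame per file.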
Once our Python environment is set up, we can import the pandas library and load in the data using the pandas read_csv function. This loads the data into a table-like DataFrame object. So first we'll import the pandas library, and then we load in our three data files: first Demographics, then BodyMeasures, and then Occupation.

You can use the describe method to get a summary of the data in each numeric column. This summary will tell us how many values are present, the mean and standard deviation, the maximum and minimum values, and the 25th, 50th, and 75th percentiles for each column. Keep in mind that it will skip all columns that aren't numeric, such as text entries. The head method will show you the first few rows of all columns in the DataFrame.

These are large tables, and you may not be able to see all of the columns when displaying the table. You can look at a smaller number of rows and columns by specifying the column names using the loc indexer. Rows are specified first, with a single colon meaning select all of the rows. Columns are specified after a comma with either a colon, indicating all columns, or a list of the column names that we want to select.

We're going to pick five columns, and all of them have some strange names. The codebook for the demographics dataset will tell you what each of them means. For example, we can see that SEQN is the column for the respondent sequence number. This is a fancy way of saying that each person they interviewed was given a unique ID number. The codebook also has other useful information about the data, like who it was collected from and what values are possible. There's a link to each of the codebooks in the teacher's notes. So we use the loc indexer and use the colon to tell it to select all the rows. Then we give it a list of our five columns:
SEQN, the respondent sequence number; SDDSRVYR, the data release number; RIDSTATR, the interview and examination status; RIDEXMON, the six-month time period in which the data was collected; and RIAGENDR, the person's gender. Then we use the head method so that only the first few rows are shown.

You can also specify a section of the table by row and column number using the iloc indexer. Remember that indexing starts at 0, so the first column is number 0. As with loc indexing, rows are specified first and columns are specified after a comma. So let's use iloc to select the first four rows and the first five columns.

According to the codebook, the sequence column uses a distinct number for each participant, which is the same across all the data files. We can combine multiple DataFrames using the merge function. The first two arguments are the two DataFrames to be merged. The on keyword tells the function which column values should be matched between the two DataFrames. The how argument specifies an inner, outer, left, or right join, similar to how database tables are joined. Using inner, we'll only keep entries for people who have data in both tables, while an outer join would keep all rows, even if a person is present in one DataFrame and not the other. Let's merge the demographics and body measures DataFrames. We'll join them on the sequence column and use an inner join to keep only those examples present in both DataFrames.

If you want to save your merged DataFrame to a CSV file, you can use the to_csv method. Using index=False tells pandas not to add an additional column containing an index for each row.

Take some time to continue exploring the dataset. Make sure you also take a look at the codebooks for the files we're using.
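The selection steps described above can be sketched on a small stand-in DataFrame (the values below are invented; only the column names come from the NHANES codebook):

```python
import pandas as pd

# Tiny stand-in for the demographics table (values are made up).
df = pd.DataFrame({
    "SEQN": [1, 2, 3, 4, 5],
    "SDDSRVYR": [1, 1, 1, 1, 1],
    "RIDSTATR": [2, 2, 1, 2, 2],
    "RIDEXMON": [1, 2, 1, 2, 1],
    "RIAGENDR": ["Male", "Female", "Female", "Male", "Male"],
})

# describe() summarizes the numeric columns: count, mean, std,
# min/max, and the 25th/50th/75th percentiles. Text columns
# like RIAGENDR are skipped by default.
print(df.describe())

# loc: all rows (the colon), then an explicit list of column names;
# head() limits the display to the first few rows.
cols = ["SEQN", "SDDSRVYR", "RIDSTATR", "RIDEXMON", "RIAGENDR"]
subset = df.loc[:, cols].head()

# iloc: select by position instead of name (0-based) --
# here, the first four rows and the first five columns.
block = df.iloc[:4, :5]
print(block.shape)  # (4, 5)
```

The same `loc`/`iloc` calls work unchanged on the full course files; only the DataFrame is bigger.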
The codebooks will tell you what the column names mean, what information they contain, and what values we should expect. Keep in mind that we aren't using the original data files. In particular, I changed many columns in the demographics file from numeric to text entries to give you practice working with text data.

So far, we've only glanced at the demographics file. Try doing the same things with the body measures and occupation files. Practice selecting different rows and columns from the dataset. Try merging the three files into a single dataset, and compare the size of the dataset when using inner and outer joins to merge the DataFrames. If you want to be able to easily rerun your code at a later date, you can put it in a script file or Jupyter Notebook.

There is one more quiz after this video, and then you'll have completed the first stage of this course. Congratulations! In the next stage, we'll continue working with this dataset to practice finding and cleaning dirty data.
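The merge comparison suggested above can be sketched like this (using small made-up tables rather than the real NHANES files; the BMXWT column name and all values are placeholders):

```python
import pandas as pd

# Two made-up tables sharing the SEQN key column.
demographics = pd.DataFrame({"SEQN": [1, 2, 3],
                             "RIAGENDR": ["Male", "Female", "Male"]})
body_measures = pd.DataFrame({"SEQN": [2, 3, 4],
                              "BMXWT": [70.5, 62.0, 81.3]})

# Inner join: keep only SEQN values present in BOTH tables.
inner = pd.merge(demographics, body_measures, on="SEQN", how="inner")

# Outer join: keep every SEQN from EITHER table; missing cells become NaN.
outer = pd.merge(demographics, body_measures, on="SEQN", how="outer")

print(len(inner))  # 2: only SEQN 2 and 3 appear in both tables
print(len(outer))  # 4: SEQN 1-4, the union of both tables

# index=False keeps pandas from writing the row index as an extra column.
inner.to_csv("merged_demo.csv", index=False)
```

Comparing `len(inner)` and `len(outer)` on the real files shows how many participants are missing from one table or the other.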