In this video, we will take our first look at the dataset we will be using as an example throughout this course.
New Terms:
- Jupyter Notebook -- Web application for creating documents containing live code and explanatory text.
Data Files:
Data Source Information:
- National Health and Nutrition Examination Survey (NHANES) main site
- NHANES 1999-2000 information
- Demographics Documentation/Codebook
- Body Measurements Documentation/Codebook
- Occupation Documentation/Codebook
Python and Pandas Resources:
Let's take our first look at the dataset we'll be using as an example in this course. Our dataset is a modified version of the 1999-2000 National Health and Nutrition Examination Survey (NHANES).
The original data comes from a series of studies used to determine the health and nutrition of both adults and children in the United States. The data is split into multiple files and contains information from both interviews and physical examinations of a sample of the population. The collected data includes demographic, socioeconomic, dietary, medical, and physiological information from over 5,000 people. This should give us plenty of data to explore.

The researchers who conducted the study have already done some cleaning of the dataset. For this course, I've modified the data so that it contains examples of the different types of problematic data we discussed previously.
Now let's take our first look at the data. You can download the data files from the links provided in the teacher's notes. I'll be using Jupyter Notebooks for this course, but feel free to use whichever environment is most comfortable for you as you follow along. You can also download the Jupyter Notebook containing all the code I'll be using from the teacher's notes. Take a moment to pause this video and download those files. Remember, you can slow down or pause the video whenever you would like more time to work with the code or data files.
First, we need to load our data and extra libraries into Python. We'll use the pandas data analysis library for Python, which provides some convenient tools and data structures. If you'd like to learn more about pandas, there are learning resources available in the teacher's notes for this video.

Start up your Python environment in the same folder as the data files. Once our Python environment is set up, we can import the pandas library and load in the data using the pandas read_csv function. This loads the data into a table-like DataFrame object.
So first we'll import the pandas library, and then we'll load in our three data files: first demographics, then body measures, and then occupation.
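As a sketch of this step: the real file names come from the course downloads, so the example below first writes a tiny placeholder CSV to keep itself self-contained, then loads it the same way you would load the real demographics file.

```python
import pandas as pd

# Placeholder file: the real demographics data comes from the course
# downloads; we write a tiny stand-in so this example runs on its own.
pd.DataFrame({"SEQN": [1, 2, 3],
              "RIAGENDR": ["Male", "Female", "Female"]}).to_csv(
    "demographics.csv", index=False)

# read_csv loads a CSV file into a table-like DataFrame object.
demographics = pd.read_csv("demographics.csv")
print(demographics.shape)  # (number of rows, number of columns)
```

Loading the body measures and occupation files works the same way, just with their file names.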
2:10
You can use the describe method to get a summary of the data in each numeric column. This summary will tell us how many values are present, the mean and standard deviation, the maximum and minimum values, and the 25th, 50th, and 75th percentiles for each column. Keep in mind that it will skip all columns that aren't numeric, such as text entries.
The head method will show you the first few rows of all columns in the DataFrame. These are large tables, and you may not be able to see all of the columns when displaying the table. You can look at a smaller number of rows and columns by specifying the column names using the loc indexer. Rows are specified first, with a single colon meaning select all of the rows. Columns are specified after a comma, with either a colon, indicating all columns, or a list of the column names that we want to select.
3:14
We're going to pick five columns, and all of them have some weird names. The codebook for the demographics dataset will tell you what each of them means. For example, we can see that SEQN is the column for the respondent sequence number. This is a fancy way of saying that each person they interviewed was given a unique ID number. The codebook also has other useful information about the data, like who it was collected for and what values are possible. There's a link to each of the codebooks in the teacher's notes.
3:43
So we use the loc indexer and use the colon to tell it to select all the rows. Then we give it a list of our five columns: SEQN, the sequence number; SDDSRVYR, the data release number; RIDSTATR, the interview and exam status; RIDEXMON, the six-month time period in which the data was collected; and RIAGENDR, the person's gender. And then we use the head method so that only the first few rows are shown.
You can also specify a section of the table by row and column number using the iloc indexer. Remember that indexing starts at 0, so the first column is number 0. As with loc indexing, rows are specified first and columns are specified after a comma. So let's use iloc to select the first four rows and the first five columns.
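That iloc selection looks like this on a small made-up table (the column names here are placeholders, not NHANES columns):

```python
import pandas as pd

# A made-up table with 10 rows and 7 columns named A through G.
df = pd.DataFrame({c: range(10) for c in "ABCDEFG"})

# Rows first, columns after the comma. Indexing starts at 0 and the
# stop value is excluded, so 0:4 is the first four rows and 0:5 the
# first five columns.
block = df.iloc[0:4, 0:5]
print(block.shape)
```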
4:47
According to the codebook, the sequence column uses a distinct number for each participant, which is the same across all the data files. We can combine multiple DataFrames using the merge function. The first two arguments are the two DataFrames to be merged. The on keyword tells the function which column values should be matched between the two DataFrames. The how argument specifies an inner, outer, left, or right join, similar to how database tables are joined. Using inner, we'll only keep entries for people who have data in both tables, while with an outer join, we'd keep all rows, even if a person is present in one DataFrame and not the other.

Let's merge the demographics and body measures DataFrames. We'll join them on the sequence column and use an inner join to keep only those examples present in both DataFrames.
If you want to save your merged DataFrame to a CSV file, you can use the to_csv method. Using index=False tells pandas not to add an additional column containing an index for each row.
Take some time to continue exploring the dataset. Make sure you also take a look at the codebooks for the files we're using. The codebooks will tell you what the column names mean, what information they contain, and what values we should expect. Keep in mind that we aren't using the original data files. In particular, I changed many columns in the demographics file from numeric to text entries to give you practice working with text data.

So far, we've only glanced at the demographics file. Try doing the same things with the body measures and occupation files. Practice selecting different rows and columns from the dataset. Try merging the three files into a single dataset. Compare the size of the dataset when using inner and outer joins to merge the DataFrames.
6:53
If you want to be able to easily rerun your code at a later date, you can put it in a script file or Jupyter Notebook. There is one more quiz after this video, and then you'll have completed the first stage of this course. Congratulations! In the next stage, we'll continue working with this dataset to practice finding and correcting dirty data.