Understanding the information in your dataset and how it was collected is essential for determining how to handle any dirty data.
Data Sources:
- data.world
- Kaggle competition datasets
- National Health and Nutrition Examination Survey
- The Star Wars API (SWAPI)
Further Reading:
Why the ‘Boring’ Part of Data Science is Actually the Most Interesting
Data Science: A Kaggle Walkthrough Pt 1: Introduction
Data Science: A Kaggle Walkthrough Pt 2: Understanding the Data
Data Science: A Kaggle Walkthrough Pt 3: Cleaning Data
Tidy Data
Now let's discuss the importance
of understanding a data set
0:00
before beginning to clean it.
0:03
Understanding the information in your
data set and how it was collected
0:05
is essential for deciding how to find and
handle any dirty data.
0:09
First, consider the questions you
want to answer with your data.
0:14
Knowing these questions will help you
decide which information is most relevant.
0:17
For example, information on pet
ownership would be highly relevant
0:22
if you want to learn about shopping
habits, but less relevant if you're
0:26
interested in the factors that cause
a person to grow to a certain height.
0:30
This may be a simplistic example,
but being able to identify important
0:34
information will help guide your approach
to fixing issues in your data set.
0:38
You also need to know and
understand the data in your data set.
0:43
If you don't know what
your data represents or
0:47
what kind of entries to expect, you may
not find some of the less obvious issues.
0:49
An excellent example of this
is finding nonsensical data.
0:54
If you didn't know that
a person's age starts at zero,
0:58
you wouldn't know that ages less
than zero don't make any sense.
1:01
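As a rough illustration of the age example, here is a minimal Python sketch that flags ages below zero. The sample records and the field name "age" are made up for this illustration and do not come from the course data set.

```python
# Hypothetical sample records; the "name" and "age" fields are assumptions.
records = [
    {"name": "A", "age": 34},
    {"name": "B", "age": -2},
    {"name": "C", "age": 0},
]

def find_negative_ages(rows):
    """Return the rows whose age is below zero, which cannot be a real age."""
    return [row for row in rows if row["age"] < 0]

nonsensical = find_negative_ages(records)
print(nonsensical)
```

The check only works because we already know that an age cannot be negative. Without that knowledge, -2 would look like any other number.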
As a more subtle example, imagine a data
set created by a weather station that
1:05
records the air temperature and pressure
every day at 7 AM, 12 PM, and 5 PM.
1:11
An entry at 3 PM would indicate the system
was not working as intended and
1:16
would be considered nonsensical,
1:20
even though there's nothing wrong
with 3 PM as a time entry in general.
1:22
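The weather station check could be sketched like this, assuming the readings arrive as Python datetime values. The sample timestamps and the names EXPECTED_TIMES and off_schedule are hypothetical and chosen only for this example.

```python
from datetime import datetime, time

# The schedule described above: 7 AM, 12 PM, and 5 PM.
EXPECTED_TIMES = {time(7, 0), time(12, 0), time(17, 0)}

# Hypothetical sample timestamps; the 3 PM entry is off schedule.
readings = [
    datetime(2024, 1, 1, 7, 0),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 15, 0),
    datetime(2024, 1, 1, 17, 0),
]

def off_schedule(timestamps, expected=EXPECTED_TIMES):
    """Return timestamps whose time of day is not one of the expected times."""
    return [ts for ts in timestamps if ts.time() not in expected]

print(off_schedule(readings))
```

Note that 3 PM is a perfectly valid time. It is only flagged because we know the schedule the station is supposed to follow.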
This is also true of saturated data.
1:27
If you don't know the limits
of measurement devices,
1:30
you won't be able to recognize
when an entry is saturated.
1:32
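A similar sketch for saturated data, assuming a temperature sensor that can only report values between -40 and 50 degrees. Those limits are invented for this example; the real limits would come from the device's documentation.

```python
# Assumed measurement limits for a hypothetical temperature sensor.
SENSOR_MIN = -40.0
SENSOR_MAX = 50.0

def saturated_indices(values, low=SENSOR_MIN, high=SENSOR_MAX):
    """Return the positions of readings that sit at or beyond a sensor limit."""
    return [i for i, v in enumerate(values) if v <= low or v >= high]

# Hypothetical readings; values at 50.0 or -40.0 may be saturated.
temps = [21.5, 50.0, 50.0, 48.9, -40.0]
print(saturated_indices(temps))
```

A reading at the limit is not always wrong, so a flag like this is a prompt to look closer rather than a reason to delete the entry.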
Find out as much as you can
about your data before and
1:36
during the data cleaning process.
1:39
Many data sets have documentation
on how the data was collected,
1:41
what it represents, and any processing
that has already been done.
1:45
This is often called a code book.
1:49
You can also ask questions of anyone
with relevant experience or expertise.
1:52
These could be people who were
involved in creating the data set or
1:57
people who work in a similar area and
use similar data.
2:00
The more you can find out about your data,
the better, faster,
2:04
and easier your data cleaning will go.
2:07
In the next video,
2:10
we'll take our first look at the data set
we'll be using throughout this course.
2:12