In this video, we will outline many of the potential problems a dataset can have.
Key Terms:
- Saturation -- When a measurement reaches or exceeds the maximum or minimum value that can be recorded.
- NaN -- Not a number, indicates an invalid number entry.
Types of bad data:
- Formatting errors (e.g. extra whitespace, misspellings)
- Incorrect data type (e.g. string entries in a numerical column)
- Nonsensical data entries (e.g. age < 0)
- Duplicate entries (duplicate rows or columns)
- Missing data (e.g. NaN)
- Saturated data (e.g. value beyond a measurement limit)
- Systematic and individual errors (error affects many entries or only one)
- Confidential information (e.g. personally identifying or private information)
Data Sources:
- data.world
- Kaggle competition datasets
- National Health and Nutrition Examination Survey
- The Star Wars API (SWAPI)
Further Reading:
- Why the ‘Boring’ Part of Data Science is Actually the Most Interesting
- Data Science: A Kaggle Walkthrough Pt 1: Introduction
- Data Science: A Kaggle Walkthrough Pt 2: Understanding the Data
- Data Science: A Kaggle Walkthrough Pt 3: Cleaning Data
- Tidy Data
Let's outline the main types of issues
you may run into when cleaning a dataset.
0:00
Later, we'll look at each of these in more
detail, and discuss ways to manage them.
0:05
The first type of data error is formatting
errors, such as misspellings, or
0:10
extra white space around text entries.
0:14
They make data appear
different to a computer,
0:16
when in reality they should be identical.
0:18
Formatting errors can be caused
by a typo during data entry or
0:20
by combining data sources
that use different formatting.
0:24
For example,
one dataset may spell out yes and
0:26
no while another may use
only the letters Y and N.
0:29
In this table we have formatting errors
in the Lightspeed Capable column.
0:33
Some entries spell out Yes and No while
others use only the first letter, Y and N.
0:36
Even though Y and Yes mean the same thing,
they appear different to the computer.
0:42
Also, we have a misspelled entry
where yess has an extra s.
0:47
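As a rough illustration of the formatting errors just described, here is a minimal Python sketch that normalizes entries like those in the Lightspeed Capable column. The function name, the mapping table, and the choice to return None for unrecognized text are assumptions made for this example; they are not shown in the video.

```python
# Hypothetical mapping from known spellings to one canonical label.
CANONICAL = {
    "yes": "Yes", "y": "Yes", "yess": "Yes",
    "no": "No", "n": "No",
}

def normalize_yes_no(entry):
    """Strip whitespace, ignore case, and map known variants to Yes/No."""
    key = entry.strip().lower()
    # Unrecognized entries come back as None so they can be reviewed by hand.
    return CANONICAL.get(key)

raw = ["Yes", "Y", " no ", "N", "yess"]
cleaned = [normalize_yes_no(value) for value in raw]
print(cleaned)
```

Listing misspellings such as yess explicitly is only one option; it keeps the cleanup predictable, at the cost of having to extend the mapping whenever a new variant turns up.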
Data entries can also have the wrong type.
0:51
For example, text or string entries in
a column that should be all numbers.
0:54
These can be caused by data that's
stored or loaded incorrectly,
0:58
such as storing a row of numbers as
text instead of numerical values.
1:02
It can also be caused
by a data entry error,
1:06
such as typing your name in
a field meant to hold your age.
1:08
This table has two problematic entries.
The crew size for
1:11
the X-wing is entered as the word
one instead of a number.
1:15
The Death Star also has an entry of 1
for Lightspeed Capable. While 1 and
1:18
0 are sometimes used in place of Yes and
No,
1:23
the rest of the column uses text entries.
1:26
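One way to spot entries with the wrong type is to try converting each one and record which rows fail. This is a sketch under assumptions: the column values and the helper name to_int_or_none are made up for illustration, and it only handles whole numbers.

```python
def to_int_or_none(entry):
    """Return the entry as an int, or None when it is not a whole number."""
    try:
        return int(str(entry).strip())
    except ValueError:
        return None

# Hypothetical crew column: numbers, numeric text, and the word "one".
crew_column = [1, "2", "one", " 4 "]

converted = [to_int_or_none(value) for value in crew_column]

# Positions that could not be converted and need a manual look.
bad_rows = [i for i, value in enumerate(converted) if value is None]
print(converted, bad_rows)
```

The sketch only flags the word one rather than guessing what number it stands for, so the decision about how to repair it stays with whoever is cleaning the data.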
Some data entries may follow proper
formatting, but be impossible or
1:29
incredibly unlikely.
1:33
For example, physical values such as age,
weight, and length can never be negative.
1:34
So any negative values for
those features are errors.
1:39
Here our X-wing has a negative length,
which is physically impossible.
1:42
It also has an entry of 11 for
1:47
Lightspeed Capable, which is
a meaningless value in this context.
1:48
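Nonsensical entries like these can often be caught with simple validity rules. The sketch below assumes a hypothetical table stored as a list of dictionaries; the field names and every number in it are invented for illustration and are not the values from the table in the video.

```python
def find_invalid(rows, field, is_valid):
    """Return the rows whose value in `field` fails the validity rule."""
    return [row for row in rows if not is_valid(row[field])]

# Invented example rows; only the kinds of errors mirror the transcript.
ships = [
    {"name": "X-wing", "length_m": -12.5, "lightspeed": 11},
    {"name": "Death Star", "length_m": 1000.0, "lightspeed": 1},
]

# A physical length must be positive.
negative_length = find_invalid(ships, "length_m", lambda v: v > 0)

# Assume the flag column is only allowed to hold 0 or 1.
bad_flag = find_invalid(ships, "lightspeed", lambda v: v in (0, 1))

print([row["name"] for row in negative_length])
print([row["name"] for row in bad_flag])
```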
Datasets can also have duplicated rows or
1:52
columns, where an example or
feature has been added more than once.
1:55
Duplications are redundant,
1:59
causing your dataset to take up more
space on your hard drive than it needs.
2:01
They can also bias your results
by putting too much emphasis
2:05
on their repeated entries.
2:08
Duplications can be caused
by a data entry error or
2:09
by incorrectly combining data from
multiple sources that have some overlap.
2:12
In our example,
we have two rows for the X-wing.
2:17
We also have a duplicated column.
2:20
Crew and Crew Capacity both
contain the same information:
2:22
how many crew members
can fit aboard the ship.
2:26
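A minimal sketch of how duplicated rows and columns might be detected, assuming the table is a list of dictionaries with hashable values. The helper names and the crew numbers are hypothetical.

```python
def drop_duplicate_rows(rows):
    """Keep the first copy of each row and preserve the original order."""
    seen = set()
    unique = []
    for row in rows:
        # Sorting the items makes the key independent of column order.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def columns_are_identical(rows, first, second):
    """True when two columns hold the same value in every row."""
    return all(row[first] == row[second] for row in rows)

# Invented values; the X-wing row appears twice on purpose.
table = [
    {"name": "X-wing", "crew": 1, "crew_capacity": 1},
    {"name": "Death Star", "crew": 500, "crew_capacity": 500},
    {"name": "X-wing", "crew": 1, "crew_capacity": 1},
]

unique_rows = drop_duplicate_rows(table)
same_column = columns_are_identical(unique_rows, "crew", "crew_capacity")
print(len(unique_rows), same_column)
```

Two columns that match on every row are only a candidate for removal; it is still worth confirming that they really describe the same feature before dropping one.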
Sometimes there are values that
are missing from the dataset
2:29
because the information is not available.
2:31
These are often represented by NaN, or
a not a number value,
2:34
but can also be represented
by another text or
2:39
numeric entry indicating that
the information is unknown.
2:41
The table lists the light speed
capability of the Death Star as unknown.
2:45
The X-wing also has a not a number (NaN)
value in place of the crew entry.
2:50
Both of these entries are missing
from our example dataset.
2:53
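Because missing values can show up as NaN, as an empty field, or as a text placeholder, a check usually has to cover several cases. The sketch below is one possible approach; the set of placeholder strings is an assumption and would need to match whatever markers a real dataset uses.

```python
import math

# Assumed text markers for "no information"; adjust per dataset.
MISSING_MARKERS = {"", "unknown", "n/a", "nan"}

def is_missing(entry):
    """True for None, float NaN, or a recognized placeholder string."""
    if entry is None:
        return True
    if isinstance(entry, float) and math.isnan(entry):
        return True
    if isinstance(entry, str) and entry.strip().lower() in MISSING_MARKERS:
        return True
    return False

entries = ["Yes", "unknown", float("nan"), 1, None]
flags = [is_missing(entry) for entry in entries]
print(flags)
```

Note that NaN never compares equal to itself, which is why the check uses math.isnan instead of an equality test.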
Data can also have saturation values or
2:58
values at the extreme
limits of measurement.
3:00
For example, a thermometer may be able to
read temperatures from negative 20 degrees
3:02
Fahrenheit to 120 degrees Fahrenheit,
3:07
or negative 29 degrees Celsius
to 49 degrees Celsius. Even if
3:09
the actual temperature were colder
than that range, for example,
3:13
negative 30 degrees Fahrenheit,
3:18
the thermometer would still only show
a temperature of negative 20 degrees.
3:20
We don't know the true value of any
data entries at the saturation limits.
3:24
Here, we have temperature measurements
taken with this thermometer
3:29
that can only read up to
120 degrees Fahrenheit.
3:32
The morning and
evening temperatures are fine.
3:35
But the measurements from
12 to 3 are saturated.
3:37
Since the thermometer can't go
any higher than 120 degrees,
3:41
we don't know what the actual
temperature was at those times.
3:44
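A small sketch of how saturated readings could be flagged, using the thermometer limits from the example (negative 20 and 120 degrees Fahrenheit). The clock times and the unsaturated readings are invented for illustration.

```python
LOW_LIMIT_F = -20   # lowest value the thermometer can show
HIGH_LIMIT_F = 120  # highest value the thermometer can show

def is_saturated(reading_f):
    """A reading at either limit may hide a more extreme true value."""
    return reading_f <= LOW_LIMIT_F or reading_f >= HIGH_LIMIT_F

def fahrenheit_to_celsius(temp_f):
    return (temp_f - 32) * 5 / 9

# Hypothetical readings over one day, in degrees Fahrenheit.
readings = {"9:00": 95, "12:00": 120, "15:00": 120, "18:00": 101}

saturated_times = [time for time, value in readings.items()
                   if is_saturated(value)]
print(saturated_times)
print(round(fahrenheit_to_celsius(LOW_LIMIT_F)),
      round(fahrenheit_to_celsius(HIGH_LIMIT_F)))
```

Flagging is as far as the sketch goes: a reading of exactly 120 might be a true 120 or something hotter, so the flagged values should be treated as lower bounds rather than measurements.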
As we discover problems in our dataset,
3:47
it's important to consider whether
they are individual or systematic.
3:49
An individual error only affects a single
value like a typo during data entry.
3:54
A systematic error is
one that affects all, or
3:58
a large portion,
of the data in a similar fashion.
4:01
An example of systematic error
would be a one foot ruler
4:05
that's actually only 11 inches long.
4:08
Every entry measured with that ruler would
be recorded as longer than it actually is,
4:11
causing a systematic error.
4:16
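The ruler example can be worked through in a few lines. This sketch assumes the ruler's markings are evenly compressed, so that every marked inch is really 11/12 of an inch; that assumption and the recorded values are illustrative only.

```python
def correct_length(recorded_inches):
    """Rescale a length read from a 12-inch ruler that is 11 inches long."""
    # Each marked inch covers only 11/12 of a true inch.
    return recorded_inches * 11 / 12

# Hypothetical measurements taken with the short ruler.
recorded = [12, 24, 6]
corrected = [correct_length(value) for value in recorded]
print(corrected)
```

Because the error is systematic, one correction applies to every affected entry, whereas an individual error such as a typo has to be found and fixed one value at a time.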
Finally, while it's not an error,
4:18
it's important to consider
confidential information in datasets.
4:20
You may want to remove personal
information from a dataset to
4:24
prevent individual people
from being identified.
4:27
In our example, we have credit card
numbers which are highly confidential and
4:30
should never be included in any
kind of data table like this.
4:33
We also have names and
addresses which we may want to remove for
4:37
anonymity reasons depending on
who has access to the dataset.
4:41
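One simple way to remove confidential fields is to drop them before the data is shared. The field names and the placeholder record below are assumptions for this sketch, and dropping columns may not be enough on its own to make people unidentifiable.

```python
# Assumed names of the columns that should not be shared.
SENSITIVE_FIELDS = {"name", "address", "credit_card"}

def remove_sensitive(row, sensitive=SENSITIVE_FIELDS):
    """Return a copy of the row without the sensitive fields."""
    return {key: value for key, value in row.items()
            if key not in sensitive}

# Placeholder record with obviously fake values.
record = {
    "name": "Example Person",
    "address": "1 Example Street",
    "credit_card": "0000-0000-0000-0000",
    "purchase_total": 42.5,
}

shared = remove_sensitive(record)
print(shared)
```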
To review, we've discussed eight
types of data errors or problems.
4:45
Formatting errors, incorrect data type,
nonsensical data entries,
4:49
duplicate entries,
missing data, saturated data,
4:54
systematic and individual errors,
and confidential information.
4:58
We'll discuss each of these problems
in more detail throughout this course.
5:04
After a short quiz, we'll continue
by talking about datasets and
5:08
understanding our data.
5:11
This will be important in later videos
when we decide how to deal with this
5:13
problematic data.
5:17