Sometimes a bad data entry can be fixed while other times it is best to remove the data rather than include wrong information.
Data Files:
Data Source Information:
- National Health and Nutrition Examination Survey (NHANES) main site
- NHANES 1999-2000 information
- Demographics Documentation/Codebook
- Body Measurements Documentation/Codebook
- Occupation Documentation/Codebook
Python and Pandas Resources:
Now let's talk about fixing and
excluding data from our data set.
0:00
Sometimes a bad data entry can be fixed.
0:04
While other times it's
best to remove the data
0:07
rather than include wrong information.
0:09
Usually it's better to fix your data
because that lets you keep your data set
0:12
as large as possible.
0:16
In the previous video, we covered some
simple issues that can usually be fixed,
0:17
like removing extra whitespace or
fixing misspellings.
0:21
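As a quick reminder of what those simple fixes can look like in pandas, here is a minimal sketch. The ship names and the misspelling are made up for illustration and are not taken from the course files.

```python
import pandas as pd

# Hypothetical ship names with stray whitespace and one misspelling.
ships = pd.DataFrame({"name": ["  X-wing", "TIE Fighter  ", "Sail Barg"]})

# Strip leading and trailing whitespace from every name.
ships["name"] = ships["name"].str.strip()

# Correct a known misspelling with an explicit mapping.
ships["name"] = ships["name"].replace({"Sail Barg": "Sail Barge"})
```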
Other times it's better to
remove the problem data.
0:25
This is usually because the data is
missing or known to be wrong, but
0:28
you have no way of fixing it
by finding the correct value.
0:32
In the first stage of this course,
we talked about the importance of
0:35
understanding the data
you're working with.
0:38
Knowing your data is essential when
deciding if a problem can be fixed, or
0:41
if it needs to be removed.
0:45
We'll discuss the specific cases of
removing data in upcoming videos.
0:47
But first let's use a simplified data set
0:51
to illustrate ways to
handle removing data.
0:54
Sometimes we can remove a single data
entry and still work with the data.
0:58
Here we can see that
the X-wing crew size and
1:02
the length of the sail
barge are both missing.
1:05
If we could find that data somewhere,
1:08
we could fix these entries by
restoring the real values.
1:10
But if we can't find that information,
1:13
we may be able to just
leave the data as missing.
1:16
Some analysis methods can handle
missing data entries like these,
1:19
while others can't.
1:22
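To illustrate that point, here is a small sketch comparing two ways of averaging a column with a missing entry. The crew values are placeholders, not the numbers shown in the video.

```python
import numpy as np
import pandas as pd

# Hypothetical crew sizes; the first entry is missing.
crew = pd.Series([np.nan, 2.0, 5.0], index=["X-wing", "Shuttle", "Freighter"])

# pandas skips missing values by default when computing the mean.
pandas_mean = crew.mean()

# NumPy's plain mean on the raw array does not skip them,
# so a single NaN makes the whole result NaN.
numpy_mean = np.mean(crew.to_numpy())
```

Here `pandas_mean` is 3.5, while `numpy_mean` is NaN, so the same missing entry is harmless for one method and breaks the other.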
Another option is to fill in any missing
values with the average value for
1:24
that column.
1:28
Or you could use a random sample from
a normal distribution with the mean and
1:29
standard deviation that matches
the data in that column.
1:33
The best option will depend on
the dataset in question, and
1:37
the analysis you plan to run.
1:40
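The two fill-in options just described could be sketched in pandas roughly as follows. The lengths are placeholder values, and the fixed seed is only there to make the example repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical ship lengths; one value is missing.
length = pd.Series([12.0, np.nan, 30.0, 18.0], name="length")
missing = length.isna()

# Option 1: fill missing entries with the column mean.
filled_mean = length.fillna(length.mean())

# Option 2: fill missing entries with draws from a normal distribution
# whose mean and standard deviation match the observed values.
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
draws = rng.normal(loc=length.mean(), scale=length.std(), size=int(missing.sum()))
filled_random = length.copy()
filled_random.loc[missing] = draws
```

Filling with the mean keeps the column average unchanged but shrinks its spread, while the random draws preserve the spread at the cost of adding noise. Which trade-off is acceptable depends on the analysis.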
Sometimes, we can't leave any
missing values in our data set, or
1:42
we can only have a small
percentage of missing entries.
1:46
In these cases,
we may need to remove an entire row.
1:50
This usually happens when there's
a problem with many of the features for
1:53
a particular entry.
1:56
Here we can see that the TIE Fighter has
no values listed in any of the columns.
1:58
And this doesn't add any useful
information to our dataset.
2:02
So if we can't replace those entries, we
would probably remove the entire entry for
2:05
the TIE Fighter.
2:09
The Death Star is also missing
two-thirds of its entries.
2:10
Although it does still have some
information, that's a lot of missing data,
2:14
and we may want to
remove that row as well.
2:17
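A minimal pandas sketch of removing rows like these, assuming a three-column table that mirrors the pattern of gaps described here. The numeric values and the threshold of two are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table: the values are placeholders, only the gaps matter.
ships = pd.DataFrame(
    {
        "length": [12.5, np.nan, 120000.0, np.nan],
        "crew": [np.nan, np.nan, np.nan, 26.0],
        "passengers": [0.0, np.nan, np.nan, 500.0],
    },
    index=["X-wing", "TIE Fighter", "Death Star", "Sail Barge"],
)

# Drop rows where every column is missing (the TIE Fighter).
no_empty_rows = ships.dropna(how="all")

# Keep only rows with at least two non-missing values,
# which also drops the Death Star.
mostly_complete = ships.dropna(thresh=2)
```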
Other times it makes more sense
to remove an entire column or
2:20
feature from the dataset.
2:23
Here we only have two entries for
the crew size of our ships.
2:25
If we need to remove all missing
entries from our data set,
2:29
removing all those rows would
eliminate most of our data.
2:32
Instead, we could get rid of the Crew
column, and keep all of our rows.
2:35
Whether it makes sense to remove rows or
2:39
columns in situations like this
depends on your specific use case.
2:41
You'll want to compare how important
a feature is to your analysis
2:45
with how many data entries you will
lose in order to keep that feature.
2:49
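One way to weigh that trade-off is to measure how much data each choice would cost. This is a sketch with an invented five-ship table where only two crew sizes are known.

```python
import numpy as np
import pandas as pd

# Hypothetical table: lengths are complete, crew is mostly missing.
ships = pd.DataFrame(
    {
        "length": [12.5, 9.0, 30.0, 20.0, 35.0],
        "crew": [np.nan, np.nan, 26.0, np.nan, 4.0],
    },
    index=["Ship A", "Ship B", "Ship C", "Ship D", "Ship E"],
)

# Share of missing values in each column.
missing_share = ships.isna().mean()

# Choice 1: remove every row with a missing value.
rows_removed = ships.dropna()

# Choice 2: remove the sparse column and keep every row.
column_removed = ships.drop(columns=["crew"])
```

In this made-up table, dropping rows leaves 2 of 5 ships, while dropping the `crew` column keeps all 5 ships but loses that feature entirely.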
Sometimes we can merge columns
into a single feature,
2:53
instead of removing
the problematic columns.
2:56
Here we have some passenger information,
but a lot of it is missing.
2:59
If we want to keep the information we
do have, we could replace the passenger
3:03
feature with a capacity feature
that contains a sum of the crew and
3:06
passenger columns,
treating all missing information as 0.
3:10
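The capacity idea could be sketched like this. The column names `crew`, `passengers`, and `capacity` and all of the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps in both columns.
ships = pd.DataFrame(
    {
        "crew": [1.0, 26.0, np.nan],
        "passengers": [np.nan, 500.0, 20.0],
    },
    index=["X-wing", "Sail Barge", "Shuttle"],
)

# Treat missing values as 0 and add the two columns together.
ships["capacity"] = ships["crew"].fillna(0) + ships["passengers"].fillna(0)

# Replace the sparse passengers column with the new feature.
ships = ships.drop(columns=["passengers"])
```

Keep in mind that treating missing values as 0 can understate the true capacity, so the new column is best read as a lower bound.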
Another option could be to run one
analysis on ships with passenger
3:14
information, and a separate analysis
on ships without that information.
3:18
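Splitting the data for two separate analyses might look like the following sketch, again using invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical table where passenger counts are only partly known.
ships = pd.DataFrame(
    {
        "length": [12.5, 30.0, 20.0, 9.0],
        "passengers": [np.nan, 500.0, 20.0, np.nan],
    },
    index=["Ship A", "Ship B", "Ship C", "Ship D"],
)

# Boolean mask marking ships that have passenger information.
has_passengers = ships["passengers"].notna()

# One subset for each analysis.
with_info = ships[has_passengers]
without_info = ships[~has_passengers].drop(columns=["passengers"])
```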
Again, the best choice will be different
for each dataset and analysis.
3:22
So far in this stage,
we've covered simple data fixes and
3:27
talked a little bit about excluding data.
3:31
After this next quiz,
3:33
we'll continue by covering some more
complicated data issues. Good luck!
3:35