Sometimes it seems that “the more, the merrier” applies to the amount of data you have, but including irrelevant data can waste computing resources or even mislead your analysis.
New Terms:
- Target -- The feature we want to predict or learn more about. Also called the dependent variable.
- Dependent Variable -- Variable or feature whose value depends on one or more other features.
- Independent Variable -- Variable or feature whose value does not depend on other features.
[MUSIC]
0:00
It may seem like more data is always
better, but we want to make sure
0:04
we're only looking at information
that's relevant to our problem.
0:08
Let's talk about some reasons you may
actually want to make your data set
0:12
smaller by removing some of the features.
0:15
When looking at data,
0:19
we're usually looking to find
relationships between our features.
0:20
Often, there's one feature we're
particularly interested in,
0:24
sometimes called the target,
or dependent variable.
0:27
This is the information we want
to predict, or learn more about.
0:30
The other features are treated
as independent variables.
0:34
We want to know what influence these
features have on our target variable.
0:37
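As a rough illustration of the target and independent variables described above, here is a minimal Python sketch. The housing records, the column names, and the `split_target` helper are all invented for this example and do not come from the video; it only shows one way to separate the feature we want to predict from the features we use to predict it.

```python
# Hypothetical housing records; the column names are assumptions for illustration.
rows = [
    {"square_feet": 850, "bedrooms": 2, "price": 150000},
    {"square_feet": 1200, "bedrooms": 3, "price": 210000},
    {"square_feet": 1500, "bedrooms": 3, "price": 255000},
]

# The feature we want to predict (the target, or dependent variable).
TARGET = "price"


def split_target(rows, target):
    """Separate each record into its independent variables and its target value."""
    # Every column except the target is treated as an independent variable.
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    # The target values are collected in the same order as the records.
    targets = [row[target] for row in rows]
    return features, targets


features, targets = split_target(rows, TARGET)
print(features[0])
print(targets)
```

Here `square_feet` and `bedrooms` play the role of independent variables, and `price` is the target whose relationship to them we would go on to explore.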
Including irrelevant data can
waste computing resources.
0:42
Storing data takes up space
on a computer hard drive.
0:45
So keeping data we don't
need wastes storage space.
0:49
It also takes longer to run
an analysis on a larger data set.
0:52
If some of that data isn't needed,
it makes your computer take extra time
0:56
to process data that isn't
helping your analysis.
1:00
It can also mislead an analysis,
1:04
causing it to try to find trends
that don't actually exist in the data.
1:06
Large amounts of data are difficult
to visualize in a meaningful way.
1:10
The more features we have,
1:14
the more difficult it is to make
a clear visualization of that data.
1:15
With two features,
1:20
we can plot the relationship as
a simple scatter plot or line.
1:21
As the number of features increases,
1:25
it becomes more difficult
to visualize the data.
1:27
We start to need 3D graphs or
complex pairings of 2D graphs.
1:30
These make it more
difficult to look at and
1:34
understand the relationships
within our data set.
1:36
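To get a feel for how quickly those "pairings of 2D graphs" pile up, the sketch below counts how many two-feature scatter plots would be needed to show every pair of features once. The feature names and the `feature_pairs` helper are hypothetical; the counting itself uses Python's standard `itertools.combinations`.

```python
from itertools import combinations


def feature_pairs(feature_names):
    """Return every unordered pair of features, one pair per 2D plot."""
    return list(combinations(feature_names, 2))


# Hypothetical feature names, chosen only for illustration.
names = ["age", "height", "weight", "income"]
pairs = feature_pairs(names)

print(len(pairs))  # 4 features need 6 pairwise plots
print(len(feature_pairs(range(10))))  # 10 features need 45
```

With n features there are n(n-1)/2 pairs, so the number of plots grows much faster than the number of features, which is one reason trimming features makes the data easier to look at.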
Getting rid of the unnecessary
features helps to focus on
1:39
only the important information.
1:42
And allows us to make a clearer graph.
1:44
To address this, we want to choose
the most relevant data to explore or
1:47
analyze further.
1:51
We don't want to get rid
of any important data, but
1:52
keeping only useful data
can improve our analysis.
1:55
We can make our data sets smaller
by removing data we don't want or
1:58
by grouping similar data
together into a single feature.
2:02
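The two approaches just mentioned, removing unwanted data and grouping similar data into a single feature, could look something like the following sketch. The student records, the column names, and both helper functions are assumptions made up for this example, and averaging is only one possible way to combine similar features.

```python
def drop_features(rows, unwanted):
    """Return copies of the records without the unwanted features."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


def combine_features(rows, sources, new_name):
    """Replace several similar features with a single feature holding their average."""
    combined = []
    for row in rows:
        # Keep everything except the features being grouped together.
        new_row = {k: v for k, v in row.items() if k not in sources}
        new_row[new_name] = sum(row[s] for s in sources) / len(sources)
        combined.append(new_row)
    return combined


# Hypothetical student records; the column names are assumptions for illustration.
students = [
    {"student_id": 1, "quiz_1": 80, "quiz_2": 90, "quiz_3": 70, "final_exam": 85},
    {"student_id": 2, "quiz_1": 60, "quiz_2": 75, "quiz_3": 75, "final_exam": 72},
]

# Remove an identifier that says nothing about performance.
smaller = drop_features(students, {"student_id"})
# Group the three quiz scores into one feature.
smaller = combine_features(smaller, ["quiz_1", "quiz_2", "quiz_3"], "quiz_average")

print(smaller[0])
```

Each record goes from five features to two, while still carrying the quiz and exam information we might care about.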
In the next few videos,
2:06
we'll talk about ways we can make sure
we're keeping only the best data.
2:07