There are algorithms that can help decide which features are most relevant, or even do the feature selection themselves, and this video introduces the basics of a few simple methods.
New Terms:
- Correlation -- Mutual relationship or interdependence between different features. A measure of how much features change together, where 1 indicates perfect positive correlation, -1 indicates perfect negative correlation, and 0 indicates no correlation.
Data Files:
Data Source Information:
- National Health and Nutrition Examination Survey (NHANES) main site
- NHANES 1999-2000 information
- Demographics Documentation/Codebook
- Body Measurements Documentation/Codebook
- Occupation Documentation/Codebook
Further Reading:
- Scikit-learn Documentation: Feature selection
- An Introduction to Feature Selection, by Jason Brownlee
- The Practical Importance of Feature Selection, by Matthew Mayo
- An Introduction to Variable and Feature Selection [PDF], by Isabelle Guyon and André Elisseeff, edited by Leslie Pack Kaelbling
- mlr Documentation: Feature Selection (R-language based)
So far, we've discussed how we can choose which features to keep. We've also shown an example where we combine two features into a single column. But there are also ways that a computer can help automatically select which features to keep. Having a computer help with these decisions can speed up our analysis and help us make more objective decisions about our data. One of the simplest ways to do this is to look at the correlations between features in our data set.
Correlation is an objective, numerical measure of how much two different features change together. For example, a perfect positive correlation has a value of one and means that the features increase together: if our first feature increases, our second feature also increases by a proportional amount. A perfect negative correlation has a value of negative one; in this case, whenever our first feature increases, our second feature decreases by a proportional amount. A value of zero means there is no correlation: an increase or decrease in one feature has no impact on the value of the other feature.
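To make those endpoints concrete, here's a minimal sketch using NumPy's corrcoef; the arrays are made-up illustrations, not course data:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# np.corrcoef returns a correlation matrix; [0, 1] is the
# correlation between the two inputs.
print(np.corrcoef(a, a * 2)[0, 1])            #  1.0: perfect positive
print(np.corrcoef(a, -a)[0, 1])               # -1.0: perfect negative
print(np.corrcoef(a, [1, 2, 1, 2, 1])[0, 1])  #  ~0.0: no correlation
```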
We'll be using our saved data set again. You can use your saved files from previous lessons, or you can download the exact files I'll be using from the teacher's notes. There's also a link where you can download the code that I'll be using. Go ahead and pause this video while you download any files you need and set up your programming environment.
Now that you've got everything set up, it's time to do some coding.
-
1:25
As usual, we need to import pandas and load in our data.
-
1:28
This time, we'll be working with a body measures data.
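A minimal sketch of that setup; the filename here is an assumption, so use whatever name you saved the body measures file under in earlier lessons:

```python
import pandas as pd

# Load the body measures data saved in a previous lesson.
body_measures = pd.read_csv('body_measures.csv')
```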
Now let's take a look at the correlation between three features: weight, height, and arm circumference. We can do this by selecting our three columns and using the corr method of our data frame.
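A sketch of that call. The column names below assume the raw NHANES codes (BMXWT is weight, BMXHT is standing height, BMXARMC is arm circumference); if you renamed columns in earlier lessons, use your names instead:

```python
# Correlation matrix for just three features: weight, height,
# and arm circumference.
body_measures[['BMXWT', 'BMXHT', 'BMXARMC']].corr()
```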
Our output is a table with each of our column names as both a row and a column label. We can see a diagonal through the table that has all values of one. These diagonal cells show the correlation between each feature and itself: the correlation between weight and weight, between height and height, and between arm circumference and arm circumference. It makes sense for these values to be one, a perfect correlation, because we're comparing the same data to itself.
The more interesting entries are the comparisons between different variables. There's a correlation between height and weight with a value of 0.79. The correlation between weight and arm circumference is very high, with a value of 0.96. The smallest correlation is between arm circumference and height. We can use these values on their own to explore and learn about our data set. They're an interesting metric for finding relationships between the different features.
But we can also use them to select the features in our data set. For example, if we want to find information that can predict a person's height, we might choose to keep only those features with a high correlation to height. There are 38 columns in the body measures set. Let's keep only the features with a correlation of at least 0.7 with height.
First, we need to get the correlation matrix and save it to a variable. Next, we select just the row with the correlations for height. Then we find all the columns where the value in that row is above 0.7. We store these values in a variable called index. We can use this index to display the names of the columns we're keeping.
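A sketch of those steps, keeping the same assumed NHANES column names:

```python
# Correlation matrix for the whole data set.
corr_matrix = body_measures.corr()

# Just the correlations with height, BMXHT (the matrix is
# symmetric, so this row and column are identical).
height_corr = corr_matrix['BMXHT']

# Boolean Series: True where the correlation with height is above 0.7.
index = height_corr > 0.7

# Names of the columns we're keeping.
print(height_corr.index[index])
```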
We have six remaining features, including height itself. The sequence column, SEQN, isn't included because there isn't a high correlation between our participant index and height. But we want to keep our index column anyway, so we can match up our participants across the separate data files. So let's set its value in our index to True. Then we create a new data frame containing all the rows of our original body measures data frame, but only the columns with a correlation to height above 0.7.
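Continuing the sketch (SEQN is NHANES's real respondent sequence number; the rest of the names are the same assumptions as above):

```python
# Flip SEQN to True so the participant identifier survives the cut;
# we need it to match rows across the separate data files.
index['SEQN'] = True

# Labels of the columns flagged True, then a new data frame with
# all rows but only those columns.
keep_columns = index[index].index
selected = body_measures[keep_columns]
```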
We can see a summary of our new data set with the describe method. Finally, don't forget to save your new data set.
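And the last two steps; the output filename is an assumption:

```python
# Statistical summary of the reduced data set.
print(selected.describe())

# Save the reduced data set for later lessons.
selected.to_csv('body_measures_selected.csv', index=False)
```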
That was just one way we can have the computer select our features for us. It is a relatively simple method, looking only at the correlation between a single feature and our target variable. There are many other methods available to help us with feature selection. As usual with data analysis, the method we choose will depend on which one works best for our analysis and our data set. We could select features using other statistical tests aside from correlation. We could also consider the relationships between multiple features at once instead of one at a time.
For example, looking at weight and arm circumference together could be more useful than either one of them alone. Other methods pair with advanced prediction algorithms to select the features that predict the target with the highest accuracy. These methods are outside the scope of this course, but there are links to more information in the teacher's notes if you want to learn more about them.
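As a small taste of the scikit-learn tools linked in Further Reading, here's a hedged sketch of univariate selection with SelectKBest, scoring each feature against height with an F-test. This is not the method used in the video, and it reuses the assumed column names from above:

```python
from sklearn.feature_selection import SelectKBest, f_regression

# SelectKBest can't handle missing values, so drop incomplete rows.
data = body_measures.dropna()

X = data.drop(columns=['BMXHT', 'SEQN'])  # candidate features
y = data['BMXHT']                         # target: height

# Keep the five features with the strongest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
print(X.columns[selector.get_support()])
```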
That's it for this lesson, and you're almost done with this course. We've learned a lot about data cleaning, and done a lot of practice with our data set. In our next video, we'll have a recap of what we've learned.