Automated Feature Selection (5:55) with Alyssa Batula
There are algorithms that can help decide which features are most relevant, or even do the feature selection themselves, and this video introduces the basics of a few simple methods.
- Correlation -- Mutual relationship or interdependence between different features. A measure of how much features change together, where 1 indicates perfect positive correlation, -1 indicates perfect negative correlation, and 0 indicates no correlation.
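The three reference values in the definition above can be checked directly with pandas. This is a minimal sketch using toy series (the numbers are illustrative, not from the lesson's data set):

```python
import pandas as pd

# A simple increasing series to compare against.
x = pd.Series([1, 2, 3, 4, 5])

# A linear transform of x moves perfectly with it: correlation 1.
print(round(x.corr(x * 2 + 1), 6))

# 5 - x decreases exactly as x increases: correlation -1.
print(round(x.corr(5 - x), 6))

# This series is constructed so its deviations are orthogonal to x's:
# correlation 0.
print(round(x.corr(pd.Series([1, 0, -2, 0, 1])), 6))
```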
Data Source Information:
- National Health and Nutrition Examination Survey (NHANES) main site
- NHANES 1999-2000 information
- Body Measurements Documentation/Codebook
Further Reading:
- Scikit Learn Documentation: Feature selection
- An Introduction to Feature Selection, by Jason Brownlee
- The Practical Importance of Feature Selection, by Matthew Mayo
- An Introduction to Variable and Feature Selection [PDF], by Isabelle Guyon and André Elisseeff, edited by Leslie Pack Kaelbling
- mlR Documentation: Feature Selection (R-language based)
So far, we've discussed how we can choose which features to keep. We've also shown an example where we combined two features into a single column. But there are also ways that a computer can help automatically select which features to keep. Having a computer help with these decisions can speed up our analysis and help us make more objective decisions about our data.

One of the simplest ways to do this is to look at the correlations between features in our data set. Correlation is an objective, numerical measure of how much two different features change together. For example, a perfect positive correlation has a value of one and means that the features increase together: when our first feature increases, our second feature also increases by a proportional amount. A perfect negative correlation has a value of negative one; in this case, whenever our first feature increases, our second feature decreases by a proportional amount. A value of zero means there is no correlation: an increase or decrease in one feature has no impact on the value of the other feature.

We'll be using our saved data set again. You can use your saved files from previous lessons, or you can download the exact files I'll be using from the teacher's notes. There's also a link where you can download the code that I'll be using. Go ahead and pause this video while you download any files you need and set up your programming environment.

Now that you've got everything set up, it's time to do some coding. As usual, we need to import pandas and load in our data. This time, we'll be working with the body measures data. Now let's take a look at the correlation between three features: weight, height, and arm circumference. We can do this by selecting our three columns and using the corr method of our data frame.
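The steps just described can be sketched as follows. Since the lesson's NHANES CSV isn't reproduced here, this uses a small synthetic stand-in data frame; the column names are illustrative assumptions (in the lesson, you'd load your saved file with pd.read_csv instead):

```python
import pandas as pd

# Synthetic stand-in for the saved body measures data; in the lesson this
# would be loaded from the saved CSV file instead.
body_measures = pd.DataFrame({
    "weight": [60.0, 72.5, 85.0, 54.0, 95.5],
    "height": [160.0, 172.0, 180.0, 155.0, 185.0],
    "arm_circumference": [27.0, 31.0, 35.0, 25.0, 38.0],
})

# Select the three columns and compute their pairwise correlations.
# corr() returns a table with one row and one column per feature.
correlations = body_measures[["weight", "height", "arm_circumference"]].corr()
print(correlations)
```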
Our output is a table with each of our column names as both a row and a column label. We can see a diagonal through the table with all values of one. These diagonal cells show the correlation between each feature and itself: the correlation between weight and weight, between height and height, and between arm circumference and arm circumference. It makes sense for these values to be one, a perfect correlation, because we're comparing the same data to itself.

The more interesting entries are the comparisons between different variables. The correlation between height and weight has a value of 0.79. The correlation between weight and arm circumference is very high, with a value of 0.96. The smallest correlation is between arm circumference and height.

We can use these values on their own to explore and learn about our data set; they're an interesting metric for finding relationships between the different features. But we can also use them to select the features in our data set. For example, if we want to find information that can predict a person's height, we might choose to keep only those features with a high correlation to height.

There are 38 columns in the body measures set. Let's keep only the features with a correlation of at least 0.7 with height. First, we need to get the correlation matrix and save it to a variable. Next, we select just the row with the correlations for height. Then we find all the columns where the value in that row is above 0.7, and we store these values in a variable called index. We can use this index to display the names of the columns we're keeping. We have six remaining features, including height itself. The sequence column, SEQN, isn't included because there isn't a high correlation between our participant index and height.
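The three selection steps above can be sketched like this. Again, a small synthetic data frame stands in for the 38-column NHANES file, so the column names (other than SEQN) are illustrative assumptions:

```python
import pandas as pd

# Illustrative stand-in for the body measures data. SEQN is the participant
# sequence number, which has no meaningful relationship to height.
body_measures = pd.DataFrame({
    "SEQN": [1, 2, 3, 4, 5],
    "height": [160.0, 172.0, 180.0, 155.0, 185.0],
    "weight": [60.0, 72.5, 85.0, 54.0, 95.5],
    "head_circumference": [55.0, 54.0, 56.0, 55.5, 54.5],  # hypothetical low-correlation column
})

# Step 1: get the correlation matrix and save it to a variable.
corr_matrix = body_measures.corr()

# Step 2: select just the row with the correlations for height.
height_corr = corr_matrix.loc["height"]

# Step 3: a boolean index marking the columns where that row is above 0.7.
index = height_corr > 0.7

# Display the names of the columns we're keeping. Height itself always
# passes, since its self-correlation is exactly 1.
print(body_measures.columns[index])
```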
But we want to keep our index column anyway, so we can match up our participants across the separate data files. So let's set its value in our index to True. Then we create a new data frame containing all the rows of our original body measures data frame, but only the columns with a correlation to height above 0.7. We can see a summary of our new data set with the describe method. Finally, don't forget to save your new data set.

That was just one way we can have the computer select our features for us. It's a relatively simple method, looking only at the correlation between a single feature and our target variable. There are many other methods available to help us with feature selection. As usual with data analysis, the method we choose will depend on which one works best for our analysis and our data set. We could select features using other statistical tests aside from correlation. We could also consider the relationships between multiple features at once instead of one at a time. For example, looking at weight and arm circumference together could be more useful than either one alone. Other methods pair with advanced prediction algorithms to select the features that predict the target with the highest accuracy. These methods are outside the scope of this course, but there are links to more information in the teacher's notes if you want to learn more about them.

That's it for this lesson, and you're almost done with this course. We've learned a lot about data cleaning and done a lot of practice with our data set. In our next video, we'll have a recap of what we've learned.
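A sketch of these final steps, continuing with the same illustrative stand-in data (the output file name is an assumption, not the lesson's):

```python
import pandas as pd

# Same illustrative stand-in data as before.
body_measures = pd.DataFrame({
    "SEQN": [1, 2, 3, 4, 5],
    "height": [160.0, 172.0, 180.0, 155.0, 185.0],
    "weight": [60.0, 72.5, 85.0, 54.0, 95.5],
})

# Boolean index of columns whose correlation with height is above 0.7.
index = body_measures.corr().loc["height"] > 0.7

# Keep SEQN regardless of its correlation, so participants can still be
# matched across the separate data files.
index["SEQN"] = True

# All the rows, but only the selected columns.
body_measures_small = body_measures.loc[:, index]

# Summarize the new data set.
print(body_measures_small.describe())

# Don't forget to save it (illustrative file name).
body_measures_small.to_csv("body_measures_small.csv", index=False)
```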