Choosing the Right Features (8:45) with Alyssa Batula
Selecting the right features requires you to consider both the data available in the dataset and the question(s) you want to answer with that data.
- Correlation -- Mutual relationship or interdependence between different features.
Data Source Information:
We've talked about why you might want to remove features from your data set. So now, let's talk about how to choose the right data. To pick the right data, you need to consider both the data available in the data set and the question you're trying to answer with your data.
Let's start with a thought exercise. What if we're interested in factors that are related to a person's height? What data might be relevant or related? Age is a good candidate. Since we know children grow as they age, we may want to explore the exact relationship between age and height. Whether a person is male or female can also affect their height, so we probably want to include that data too. Weight information could also be interesting to look at. There is probably some relationship between how tall a person is and how much they weigh, which could be worth exploring.
What about a person's shoe size? Is there a relationship between height and shoe size? There could be, so we can keep that data in the data set if we're interested in that relationship.
Next up is education. Could there be a relationship between a person's education and height? It seems unlikely, so we may not want to keep the education data, unless we're interested in exploring whether or not that is a factor.
Education brings up another interesting possibility. That data could actually be correlated, not because of a direct relationship, but because of an indirect relationship to a third feature. For example, you could find that height increases with education level. But does that really mean tall people are more educated? If height increases with age and people advance through education levels as they grow older, it would make sense that height would also increase with education level, even though there's no actual effect of education on height.
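
The height and education example can be checked with a small simulation. This is only a sketch on made-up data: the age range, growth rate, noise levels, and the column names `age`, `height_cm`, and `grade` are all assumptions chosen for illustration, not values from the course data set. Height and grade each depend on age and on nothing else, yet they still end up strongly correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic people aged 7 to 17 (assumed range, for illustration only).
age = rng.integers(7, 18, size=n)

# Height depends only on age plus random noise (made-up growth rate).
height_cm = 80 + 5 * age + rng.normal(0, 6, size=n)

# School grade also depends only on age: roughly age - 6, give or take a year.
grade = age - 6 + rng.integers(-1, 2, size=n)

people = pd.DataFrame({"age": age, "height_cm": height_cm, "grade": grade})

# Correlation across everyone: strong, because both features follow age.
overall = people["height_cm"].corr(people["grade"])

# Correlation among people of the same age, averaged over the age groups.
within = np.mean([
    group["height_cm"].corr(group["grade"])
    for _, group in people.groupby("age")
])

print(f"overall correlation:  {overall:.2f}")
print(f"within-age correlation: {within:.2f}")
```

Once age is held fixed, the relationship between height and grade should mostly disappear, which is what you would expect if age is the third feature driving both.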
There is a related phrase: correlation does not imply causation. This is another way of saying that just because two features appear to be related, that does not mean one causes the other.
We could also consider birth year in our data. This could be interesting when considered together with age, if we can get measurements of people taken over many years. For example, is there a height difference in 20 year olds born in the 1980s compared to 20 year olds born in the 1880s? This type of information could indicate changes in how tall people were in different decades or centuries.
Finally, what about the day of the month the person is born? This probably isn't useful data. There isn't anything inherently meaningful about the numbers 1 to 31, so we could safely remove this from our data set. That was one example of how to look at data to choose the best features.
We also want to consider how much data is missing from each column. Remember that missing data often needs to be removed from a data set. So here's another example: we have two education columns, one for youth and one for adults. Each column has many missing entries, and a lot of these entries are only missing because they're separated by age. What if we want to look at education levels for everyone together instead of in two separate columns? Could we make a single column with fewer missing data points than either of the original columns? This would get rid of some of our missing data, and would also reduce the number of features.
Let's do a hands-on example to see how we could combine the columns. We'll use the data files you saved from the last stage in this course. Or you can download the exact files I'll be using from the teacher's notes. Make sure your data files are in the same directory you're working in.
Otherwise, you'll need to specify the directory where it's stored. There's also a link where you can download the code I'll be using. Go ahead and pause this video while you download any files you need and set up your programming environment.
Everything ready? Then let's get started. To start, we need to import our libraries and our data set again. We'll only be using the demographics data set for this.
Now, let's take a look at the codebooks to see how we can combine the two columns. For youth, it breaks it down into individual grades or no school. There are also more general categories that only indicate less than fifth grade or less than ninth grade. Let's compare that to the information available for adults. There are far fewer categories for adults. Unfortunately, we're limited by the column with the least detailed information. We could never split these categories into individual years, so we have to compress the youth column to have fewer categories.
We need to come up with a new codebook. Here's one possible table. We've simplified the available codes into less than high school, high school diploma or equivalent, or more than high school. There are also two more codes for refused and unknown.
First, let's make a new education column. We'll give it the column title of Education, and to start off, we'll fill it with missing values. All the categories for grades 1 through 12 need to be replaced with a 1. We'll store the matching rows in a variable called index. We need to select all of the codes less than 13. We also need to select codes 55 and 66, which correspond to less than fifth grade and less than ninth grade. Once we have those selected from the youth column, we need to update their value to equal 1 in the new Education column. 13 and 14 are the codes for high school diploma and GED.
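
The selection step described so far might look roughly like the sketch below. The transcript does not show the actual column names or the code typed in the video, so the data frame `demo`, the column name `EducationYouth`, and the sample values are assumptions made for illustration; only the codes (less than 13, 55, 66) and the variable name `index` come from the narration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the demographics data; the column name is assumed.
demo = pd.DataFrame({"EducationYouth": [3, 12, 55, 66, 13, 15, 77, np.nan]})

# Start the new column filled with missing values.
demo["Education"] = np.nan

# Rows whose youth code means "less than high school":
# any code below 13, plus 55 (less than 5th grade) and 66 (less than 9th grade).
# Comparisons against a missing value are False, so those rows are not selected.
youth = demo["EducationYouth"]
index = (youth < 13) | (youth == 55) | (youth == 66)

# Write the new code 1 into the Education column for the matching rows.
demo.loc[index, "Education"] = 1

print(demo)
```

Rows that are not selected keep their missing value in `Education` until a later step assigns them a code.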
We use the same method to replace these with our new code of 2. And again, replace the code 15 with 3. Replace 77 with 7, for refused. And finally, we replace 99 with 9 for unknown.
Now we do the same thing, but for the adult education column. This time, only codes 1 and 2 are grouped into the less than high school code of 1. Diploma and GED are coded as 3. So we select those values and recode them to be 2. Regroup both 4 and 5 into category 3, the new category for more than high school. Refused and unknown are already coded as 7 and 9. So we select those and keep the same value in the new Education column.
Now we can drop the two original columns from our data set, using the drop method. We're using the in-place argument again to tell pandas to drop the columns from our current data frame, instead of returning a new one. And then we can take a look at our updated data set, using the describe method. We can see the new Education column. Now we have a single column with fewer missing entries than either of the other two columns. If you want to keep your new version, don't forget to save your changes.
That's it for this video. To recap, we talked about how you might choose which features to keep in a data analysis. Then we went over an example of how you could combine two features into a single feature, to compensate for missing data. After this next quiz, we'll talk about ways we can have the computer pick features for us.
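
Putting the whole walkthrough together, one possible version of the recoding is sketched below. The column names `EducationYouth` and `EducationAdult`, the data frame name `demo`, and the sample rows are hypothetical, since the transcript does not show them. The old-to-new code pairs follow the narration, with the assumption that youth codes below 13 start at 0. The loop over code groups is a compact stand-in for repeating the select-and-assign step by hand, not necessarily how the video writes it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows: each person has a value in only one of the columns.
demo = pd.DataFrame({
    "EducationYouth": [5, 66, 13, 15, 77, 99] + [np.nan] * 5,
    "EducationAdult": [np.nan] * 6 + [1, 3, 5, 7, np.nan],
})

# (old codes, new code) pairs for each original column.
youth_recode = [
    (list(range(0, 13)) + [55, 66], 1),  # less than high school
    ([13, 14], 2),                       # high school diploma or GED
    ([15], 3),                           # more than high school
    ([77], 7),                           # refused
    ([99], 9),                           # unknown
]
adult_recode = [([1, 2], 1), ([3], 2), ([4, 5], 3), ([7], 7), ([9], 9)]

# New column starts out entirely missing.
demo["Education"] = np.nan

for column, recode in [("EducationYouth", youth_recode),
                       ("EducationAdult", adult_recode)]:
    for old_codes, new_code in recode:
        # Select rows holding any of the old codes, then assign the new code.
        index = demo[column].isin(old_codes)
        demo.loc[index, "Education"] = new_code

# Drop the two original columns from the current data frame.
demo.drop(columns=["EducationYouth", "EducationAdult"], inplace=True)

print(demo["Education"].describe())
```

In this sample the combined column is missing only for the one row that had no value in either original column. To keep the result, the data frame could then be written back out, for example with `demo.to_csv(...)` and a file name of your choosing.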