1 00:00:00,510 --> 00:00:04,320 We've talked about why you might want to remove features from your data set. 2 00:00:04,320 --> 00:00:07,580 So now, let's talk about how to choose the right data. 3 00:00:07,580 --> 00:00:10,230 To pick the right data, you need to consider both the data 4 00:00:10,230 --> 00:00:14,020 available in the data set and the question you're trying to answer with your data. 5 00:00:15,190 --> 00:00:17,650 Let's start with a thought exercise. 6 00:00:17,650 --> 00:00:21,670 What if we're interested in factors that are related to a person's height? 7 00:00:21,670 --> 00:00:24,430 What data might be relevant or related? 8 00:00:24,430 --> 00:00:25,800 Age is a good candidate. 9 00:00:25,800 --> 00:00:28,510 Since we know children grow as they age, 10 00:00:28,510 --> 00:00:32,730 we may want to explore the exact relationship between age and height. 11 00:00:32,730 --> 00:00:36,480 Whether a person is male or female can also affect their height, so 12 00:00:36,480 --> 00:00:39,170 we probably want to include that data too. 13 00:00:39,170 --> 00:00:42,260 Weight information could also be interesting to look at. 14 00:00:42,260 --> 00:00:45,700 There is probably some relationship between how tall a person is and 15 00:00:45,700 --> 00:00:49,150 how much they weigh, which could be worth exploring. 16 00:00:49,150 --> 00:00:51,260 What about a person's shoe size? 17 00:00:51,260 --> 00:00:54,380 Is there a relationship between height and shoe size? 18 00:00:54,380 --> 00:00:55,620 There could be, so 19 00:00:55,620 --> 00:01:00,090 we can keep that data in the data set if we're interested in that relationship. 20 00:01:00,090 --> 00:01:02,090 Next up is education. 21 00:01:02,090 --> 00:01:05,440 Could there be a relationship between a person's education and height? 
22 00:01:05,440 --> 00:01:09,760 [SOUND] It seems unlikely, so we may not want to keep the education data, 23 00:01:09,760 --> 00:01:14,087 unless we're interested in exploring whether or not that is a factor. 24 00:01:14,087 --> 00:01:17,380 Education brings up another interesting possibility. 25 00:01:17,380 --> 00:01:21,640 That data could actually be correlated, not because of a direct relationship, but 26 00:01:21,640 --> 00:01:25,200 because of an indirect relationship to a third feature. 27 00:01:25,200 --> 00:01:29,230 For example, you could find that height increases with education level. 28 00:01:29,230 --> 00:01:32,399 But does that really mean tall people are more educated? 29 00:01:32,399 --> 00:01:36,537 If height increases with age and people advance through education levels as 30 00:01:36,537 --> 00:01:41,122 they grow older, it would make sense that height would also increase with education 31 00:01:41,122 --> 00:01:45,890 level, even though there's no actual effect of education on height. 32 00:01:45,890 --> 00:01:50,017 There is a related phrase, correlation does not imply causation. 33 00:01:50,017 --> 00:01:54,364 This is another way of saying that just because two features appear to be related, 34 00:01:54,364 --> 00:01:56,510 that does not mean one causes the other. 35 00:01:57,530 --> 00:02:00,350 We could also consider birth year in our data. 36 00:02:00,350 --> 00:02:03,500 This could be interesting when considered together with age, 37 00:02:03,500 --> 00:02:07,100 if we can get measurements of people taken over many years. 38 00:02:07,100 --> 00:02:11,820 For example, is there a height difference in 20-year-olds born in the 1980s 39 00:02:11,820 --> 00:02:15,240 compared to 20-year-olds born in the 1880s? 40 00:02:15,240 --> 00:02:19,190 This type of information could indicate changes in how tall people were 41 00:02:19,190 --> 00:02:21,730 in different decades or centuries.
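The indirect relationship described in the education example can be illustrated with a small sketch. Everything here is made up for illustration and is not from the course data: the column names `age`, `height_cm`, and `education_years` are hypothetical, and the numbers are constructed so that both height and schooling depend on age but not on each other.

```python
import pandas as pd

# Made-up data: for every age, each height offset is paired with each
# schooling offset, so height and schooling are unrelated within an age.
rows = []
for age in range(6, 18):
    for height_offset in (-4, 0, 4):
        for schooling_offset in (-1, 0, 1):
            rows.append({
                "age": age,
                "height_cm": 110 + 6 * (age - 6) + height_offset,
                "education_years": (age - 5) + schooling_offset,
            })
people = pd.DataFrame(rows)

# Correlation across everyone: age drives both columns, so this is strongly positive.
overall_corr = people["height_cm"].corr(people["education_years"])

# Correlation after removing each age group's average from both columns.
cols = ["height_cm", "education_years"]
age_means = people.groupby("age")[cols].transform("mean")
residuals = people[cols] - age_means
within_age_corr = residuals["height_cm"].corr(residuals["education_years"])

print("overall correlation:", round(overall_corr, 3))
print("correlation within age groups:", round(within_age_corr, 3))
```

In this constructed example the overall correlation is high while the within-age correlation vanishes, which is the pattern you would expect when a third feature (here, age) accounts for the apparent relationship.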
42 00:02:21,730 --> 00:02:25,110 Finally, what about the day of the month the person is born? 43 00:02:25,110 --> 00:02:27,480 This probably isn't useful data. 44 00:02:27,480 --> 00:02:30,900 There isn't anything inherently meaningful about the numbers 1 to 31, so 45 00:02:30,900 --> 00:02:34,760 we could safely remove this from our data set. 46 00:02:34,760 --> 00:02:39,100 That was one example of how to look at data to choose the best features. 47 00:02:39,100 --> 00:02:43,320 We also want to consider how much data is missing from each column. 48 00:02:43,320 --> 00:02:46,420 Remember that missing data often needs to be removed from a data set. 49 00:02:47,600 --> 00:02:52,300 So here's another example, we have two education columns, one for youth and 50 00:02:52,300 --> 00:02:53,820 one for adults. 51 00:02:53,820 --> 00:02:56,260 Each column has many missing entries, and 52 00:02:56,260 --> 00:02:59,600 a lot of these entries are only missing because they're separated by age. 53 00:03:01,020 --> 00:03:03,224 What if we want to look at education levels for 54 00:03:03,224 --> 00:03:06,590 everyone together instead of in two separate columns? 55 00:03:06,590 --> 00:03:09,500 Could we make a single column with fewer missing data points 56 00:03:09,500 --> 00:03:11,770 than either of the original columns? 57 00:03:11,770 --> 00:03:13,430 This would get rid of some of our missing data, 58 00:03:13,430 --> 00:03:16,620 and would also reduce the number of features. 59 00:03:16,620 --> 00:03:20,494 Let's do a hands-on example to see how we could combine the columns. 60 00:03:20,494 --> 00:03:24,260 We'll use the data files you saved from the last stage in this course. 61 00:03:24,260 --> 00:03:28,380 Or you can download the exact files I'll be using from the teacher's notes. 62 00:03:28,380 --> 00:03:31,890 Make sure your data files are in the same directory you're working in. 
63 00:03:31,890 --> 00:03:35,420 Otherwise, you'll need to specify the directory where they're stored. 64 00:03:35,420 --> 00:03:38,640 There's also a link where you can download the code I'll be using. 65 00:03:38,640 --> 00:03:42,220 Go ahead and pause this video while you download any files you need and 66 00:03:42,220 --> 00:03:44,460 set up your programming environment. 67 00:03:44,460 --> 00:03:46,350 Everything ready? Then let's get started. 68 00:03:47,990 --> 00:03:51,420 To start, we need to import our libraries and our data set again. 69 00:03:51,420 --> 00:03:53,864 We'll only be using the demographics data set for this. 70 00:03:59,308 --> 00:04:03,660 Now, let's take a look at the codebooks to see how we can combine the two columns. 71 00:04:03,660 --> 00:04:07,810 For youth, the codebook breaks education down into individual grades or no school. 72 00:04:07,810 --> 00:04:12,080 There are also more general categories that only indicate less than fifth grade 73 00:04:12,080 --> 00:04:13,530 or less than ninth grade. 74 00:04:13,530 --> 00:04:16,940 Let's compare that to the information available for adults. 75 00:04:16,940 --> 00:04:19,570 There are many fewer categories for adults. 76 00:04:19,570 --> 00:04:23,830 Unfortunately, we're limited by the column with the least detailed information. 77 00:04:23,830 --> 00:04:27,050 We could never split these categories into individual years, so 78 00:04:27,050 --> 00:04:30,160 we have to compress the youth column to have fewer categories. 79 00:04:30,160 --> 00:04:32,340 We need to come up with a new codebook. 80 00:04:32,340 --> 00:04:34,070 Here's one possible table. 81 00:04:34,070 --> 00:04:37,550 We've simplified the available codes into less than high school, 82 00:04:37,550 --> 00:04:41,050 high school diploma or equivalent, or more than high school. 83 00:04:41,050 --> 00:04:44,640 There are also two more codes for refused and unknown. 84 00:04:44,640 --> 00:04:47,090 First, let's make a new education column.
85 00:04:47,090 --> 00:04:50,290 We'll give it the column title of Education, and to start off, 86 00:04:50,290 --> 00:04:52,120 we'll fill it with missing values. 87 00:04:52,120 --> 00:04:56,240 All the categories for grades 1 through 12 need to be replaced with a 1. 88 00:04:56,240 --> 00:04:58,820 We'll store the matching rows in a variable called index. 89 00:05:02,268 --> 00:05:05,393 We need to select all of the codes less than 13. 90 00:05:14,322 --> 00:05:16,497 We also need to select codes 55 and 91 00:05:16,497 --> 00:05:21,152 66, which correspond to less than fifth grade and less than ninth grade. 92 00:05:25,538 --> 00:05:27,968 Once we have those selected from the youth column, 93 00:05:27,968 --> 00:05:31,257 we need to update their value to equal 1 in the new Education column. 94 00:05:33,458 --> 00:05:37,113 13 and 14 are the codes for high school diploma and GED. 95 00:05:52,872 --> 00:05:57,174 We use the same method to replace these with our new code of 2. 96 00:05:57,174 --> 00:06:00,379 And again, replace the code 15 with 3. 97 00:06:10,354 --> 00:06:13,109 Replace 77 with 7, for refused. 98 00:06:19,743 --> 00:06:22,935 And finally, we replace 99 with 9 for unknown. 99 00:06:26,631 --> 00:06:30,050 Now we do the same thing, but for the adult education column. 100 00:06:30,050 --> 00:06:34,557 This time, only codes 1 and 2 are grouped into the less than high school code of 1. 101 00:06:42,622 --> 00:06:45,520 Diploma and GED are coded as 3. 102 00:06:45,520 --> 00:06:48,011 So we select those values and recode them to be 2. 103 00:06:54,782 --> 00:06:59,448 Regroup both 4 and 5 into category 3, the new category for more than high school. 104 00:07:19,080 --> 00:07:22,300 Refused and unknown are already coded as 7 and 9. 105 00:07:22,300 --> 00:07:26,029 So we select those and keep the same value in the new Education column. 106 00:07:34,733 --> 00:07:38,840 Now we can drop the two original columns from our data set, using the drop method.
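The recoding steps narrated above can be sketched roughly as follows. This is an illustration, not the exact code typed in the video: the column names `education_youth` and `education_adult` are hypothetical stand-ins for the two original columns, and the rows are made up. The codes and their new values follow the narration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Made-up rows; column names are hypothetical stand-ins for the real ones.
demo = pd.DataFrame({
    "education_youth": [3, 12, 55, 66, 13, 14, 15, 77, 99] + [nan] * 8,
    "education_adult": [nan] * 9 + [1, 2, 3, 4, 5, 7, 9, nan],
})

# Start the combined column with missing values.
demo["Education"] = np.nan

# Youth column: codes below 13, plus 55 and 66, become 1 (less than high school).
youth = demo["education_youth"]
index = (youth < 13) | (youth == 55) | (youth == 66)
demo.loc[index, "Education"] = 1

# 13 and 14 (diploma, GED) become 2; 15 becomes 3 (more than high school).
demo.loc[youth.isin([13, 14]), "Education"] = 2
demo.loc[youth == 15, "Education"] = 3

# 77 (refused) becomes 7 and 99 (unknown) becomes 9.
demo.loc[youth == 77, "Education"] = 7
demo.loc[youth == 99, "Education"] = 9

# Adult column: 1 and 2 become 1, 3 becomes 2, 4 and 5 become 3.
adult = demo["education_adult"]
demo.loc[adult.isin([1, 2]), "Education"] = 1
demo.loc[adult == 3, "Education"] = 2
demo.loc[adult.isin([4, 5]), "Education"] = 3

# 7 and 9 already match the new codebook, so copy them over unchanged.
index = adult.isin([7, 9])
demo.loc[index, "Education"] = adult[index]

print(demo)
```

Comparisons against a missing value evaluate to False, so rows that are empty in one column are simply skipped when that column is recoded. The last made-up row is missing in both columns and therefore stays missing in `Education`.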
107 00:07:38,840 --> 00:07:42,292 We're using the in-place argument again to tell pandas to drop the columns from 108 00:07:42,292 --> 00:07:44,856 our current data frame, instead of returning a new one. 109 00:07:56,744 --> 00:08:00,640 And then we can take a look at our updated data set, using the describe method. 110 00:08:05,169 --> 00:08:07,420 We can see the new Education column. 111 00:08:07,420 --> 00:08:10,070 Now we have a single column with fewer missing entries 112 00:08:10,070 --> 00:08:12,200 than either of the other two columns. 113 00:08:12,200 --> 00:08:15,223 If you want to keep your new version, don't forget to save your changes. 114 00:08:27,033 --> 00:08:28,970 That's it for this video. 115 00:08:28,970 --> 00:08:32,320 To recap, we talked about how you might choose which features to keep 116 00:08:32,320 --> 00:08:33,760 in a data analysis. 117 00:08:33,760 --> 00:08:36,930 Then we went over an example of how you could combine two features 118 00:08:36,930 --> 00:08:40,320 into a single feature, to compensate for missing data. 119 00:08:40,320 --> 00:08:41,940 After this next quiz, 120 00:08:41,940 --> 00:08:44,990 we'll talk about ways we can have the computer pick features for us.
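As a rough sketch of the final steps in this video (dropping the two original columns, checking the result, and saving), something like the following could work. The column names, the tiny made-up frame, and the output file name are all assumptions for illustration. Passing the names through `columns=` is one way to call `drop`, and the video may have used a different form.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Made-up frame that already contains a combined Education column.
# Column names are hypothetical stand-ins for the real ones.
demo = pd.DataFrame({
    "education_youth": [10, 13, np.nan, np.nan, np.nan],
    "education_adult": [np.nan, np.nan, 3, 5, np.nan],
    "Education": [1, 2, 2, 3, np.nan],
})

# Count missing entries per column before dropping anything.
missing_before = demo.isna().sum()
print(missing_before)

# Drop the two original columns from the current data frame
# instead of returning a new one.
demo.drop(columns=["education_youth", "education_adult"], inplace=True)

# describe() reports the count of non-missing entries, among other statistics.
summary = demo.describe()
print(summary)

# Save the updated data set; the file name and location are placeholders.
out_path = Path(tempfile.mkdtemp()) / "demographics_combined.csv"
demo.to_csv(out_path, index=False)
```

Comparing `missing_before` for the three columns is a quick way to confirm that the combined column has fewer missing entries than either original.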