1
00:00:00,000 --> 00:00:04,789
[MUSIC]

2
00:00:04,789 --> 00:00:08,344
It may seem like more data is always better, but we want to make sure

3
00:00:08,344 --> 00:00:12,940
we're only looking at information that's relevant to our problem.

4
00:00:12,940 --> 00:00:15,900
Let's talk about some reasons you may actually want to make your data set

5
00:00:15,900 --> 00:00:19,000
smaller by removing some of the features.

6
00:00:19,000 --> 00:00:20,210
When looking at data,

7
00:00:20,210 --> 00:00:24,290
we're usually looking to find relationships between our features.

8
00:00:24,290 --> 00:00:27,520
Often, there's one feature we're particularly interested in,

9
00:00:27,520 --> 00:00:30,570
sometimes called the target, or dependent variable.

10
00:00:30,570 --> 00:00:34,380
This is the information we want to predict, or learn more about.

11
00:00:34,380 --> 00:00:37,970
The other features are treated as independent variables.

12
00:00:37,970 --> 00:00:42,480
We want to know what influence these features have on our target variable.

13
00:00:42,480 --> 00:00:45,990
Including irrelevant data can waste computing resources.

14
00:00:45,990 --> 00:00:49,110
Storing data takes up space on a computer hard drive.

15
00:00:49,110 --> 00:00:52,700
So keeping data we don't need wastes storage space.

16
00:00:52,700 --> 00:00:56,650
It also takes longer to run an analysis on a larger data set.

17
00:00:56,650 --> 00:01:00,630
If some of that data isn't needed, it makes your computer take extra time

18
00:01:00,630 --> 00:01:04,000
to process data that isn't helping your analysis.

19
00:01:04,000 --> 00:01:06,200
Irrelevant data can also mislead an analysis,

20
00:01:06,200 --> 00:01:10,082
causing it to try to find trends that don't actually exist in the data.

21
00:01:10,082 --> 00:01:14,510
Large amounts of data are difficult to visualize in a meaningful way.
22
00:01:14,510 --> 00:01:15,950
The more features we have,

23
00:01:15,950 --> 00:01:20,300
the more difficult it is to make a clear visualization of that data.

24
00:01:20,300 --> 00:01:21,370
With two features,

25
00:01:21,370 --> 00:01:25,480
we can plot the relationship as a simple scatter plot or line.

26
00:01:25,480 --> 00:01:27,190
As the number of features increases,

27
00:01:27,190 --> 00:01:30,230
it becomes more difficult to visualize the data.

28
00:01:30,230 --> 00:01:34,850
We start to need 3D graphs or complex pairings of 2D graphs.

29
00:01:34,850 --> 00:01:36,900
These make it more difficult to look at and

30
00:01:36,900 --> 00:01:39,900
understand the relationships within our data set.

31
00:01:39,900 --> 00:01:42,950
Getting rid of the unnecessary features helps us focus on

32
00:01:42,950 --> 00:01:44,850
only the important information

33
00:01:44,850 --> 00:01:47,310
and allows us to make a clearer graph.

34
00:01:47,310 --> 00:01:51,300
To address this, we want to choose the most relevant data to explore or

35
00:01:51,300 --> 00:01:52,920
analyze further.

36
00:01:52,920 --> 00:01:55,530
We don't want to get rid of any important data, but

37
00:01:55,530 --> 00:01:58,910
keeping only useful data can improve our analysis.

38
00:01:58,910 --> 00:02:02,660
We can make our data sets smaller by removing data we don't want or

39
00:02:02,660 --> 00:02:06,270
by grouping similar data together into a single feature.

40
00:02:06,270 --> 00:02:07,770
In the next few videos,

41
00:02:07,770 --> 00:02:11,100
we'll talk about ways we can make sure we're keeping only the best data.
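
The two approaches mentioned at the end of the video, removing a feature we don't want and grouping similar features into a single one, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy housing data, the feature names, and the helper functions `drop_features` and `combine_features` are all made up for this example and are not from the video.

```python
# Hypothetical toy data set: each row describes a house, and "price" is the target.
rows = [
    {"listing_id": 101, "length_m": 10.0, "width_m": 8.0, "price": 200_000},
    {"listing_id": 102, "length_m": 12.0, "width_m": 10.0, "price": 290_000},
    {"listing_id": 103, "length_m": 9.0, "width_m": 7.0, "price": 160_000},
]


def drop_features(rows, names):
    """Return new rows without the named features (the input is not modified)."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def combine_features(rows, sources, new_name, combine):
    """Return new rows where the source features are replaced by one computed feature."""
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in sources}
        new_row[new_name] = combine(*(row[s] for s in sources))
        result.append(new_row)
    return result


# 1. Remove a feature that we assume carries no information about the target.
reduced = drop_features(rows, {"listing_id"})

# 2. Group two related features into a single one (floor area, by assumption).
reduced = combine_features(
    reduced,
    ("length_m", "width_m"),
    "area_m2",
    lambda length, width: length * width,
)

for row in reduced:
    print(row)
```

Each row goes from four features to two, the target plus one independent variable, which is exactly the case that can be drawn as a simple scatter plot. Whether an identifier is truly irrelevant, or whether area is a fair summary of length and width, is a judgment about the data that the code cannot make for us.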