Making Your Dataset Smaller
2:11 with Alyssa Batula
Sometimes it seems that “the more, the merrier” applies to the amount of data you have, but including irrelevant data can waste computing resources or even mislead your analysis.
- Target -- The feature we want to predict or learn more about. Also called the dependent variable.
- Dependent Variable -- Variable or feature whose value depends on one or more other features.
- Independent Variable -- Variable or feature whose value does not depend on other features.
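As a rough illustration of these terms, the sketch below separates a target from the independent variables in a tiny dataset. The rows, the feature names, and the `split_target` helper are all made up for this example; they do not come from the course files.

```python
# Hypothetical dataset: every name and value here is invented for illustration.
rows = [
    {"hours_studied": 2, "hours_slept": 8, "exam_score": 65},
    {"hours_studied": 5, "hours_slept": 7, "exam_score": 80},
    {"hours_studied": 8, "hours_slept": 6, "exam_score": 90},
]

# Assumption: exam_score is the feature we want to predict (the target).
TARGET = "exam_score"


def split_target(rows, target):
    """Separate the target values from the independent variables."""
    features = [
        {name: value for name, value in row.items() if name != target}
        for row in rows
    ]
    targets = [row[target] for row in rows]
    return features, targets


features, targets = split_target(rows, TARGET)
print(features[0])  # the independent variables for the first row
print(targets)      # the target values, one per row
```

Keeping the target apart from the other features makes it easier to ask what influence each independent variable has on it.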
[MUSIC]

It may seem like more data is always better, but we want to make sure we're only looking at information that's relevant to our problem. Let's talk about some reasons you may actually want to make your data set smaller by removing some of the features.

When looking at data, we're usually looking to find relationships between our features. Often, there's one feature we're particularly interested in, sometimes called the target, or dependent variable. This is the information we want to predict, or learn more about. The other features are treated as independent variables. We want to know what influence these features have on our target variable.

Including irrelevant data can waste computing resources. Storing data takes up space on a computer hard drive, so keeping data we don't need wastes storage space. It also takes longer to run an analysis on a larger data set. If some of that data isn't needed, it makes your computer take extra time to process data that isn't helping your analysis. It can also mislead an analysis, causing it to try to find trends that don't actually exist in the data.

Large amounts of data are difficult to visualize in a meaningful way. The more features we have, the more difficult it is to make a clear visualization of that data. With two features, we can plot the relationship as a simple scatter plot or line. As the number of features increases, it becomes more difficult to visualize the data. We start to need 3D graphs or complex pairings of 2D graphs. These make it more difficult to look at and understand the relationships within our data set. Getting rid of the unnecessary features helps to focus on only the important information and allows us to make a clearer graph.

To address this, we want to choose the most relevant data to explore or analyze further.
We don't want to get rid of any important data, but keeping only useful data can improve our analysis. We can make our data sets smaller by removing data we don't want or by grouping similar data together into a single feature. In the next few videos, we'll talk about ways we can make sure we're keeping only the best data.
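The two approaches mentioned in the video, removing unwanted data and grouping similar data into a single feature, might look something like the sketch below. It uses plain Python and an invented housing dataset. The feature names and the `drop_features` and `combine_features` helpers are hypothetical, and treating `listing_id` as irrelevant is an assumption made only for this example.

```python
# Hypothetical dataset: each row is one house, and "price" is the target.
rows = [
    {"listing_id": 101, "sq_ft_upstairs": 800, "sq_ft_downstairs": 700, "price": 300000},
    {"listing_id": 102, "sq_ft_upstairs": 600, "sq_ft_downstairs": 900, "price": 310000},
    {"listing_id": 103, "sq_ft_upstairs": 0, "sq_ft_downstairs": 1200, "price": 250000},
]


def drop_features(rows, unwanted):
    """Return new rows without the named features (originals are untouched)."""
    return [
        {name: value for name, value in row.items() if name not in unwanted}
        for row in rows
    ]


def combine_features(rows, sources, new_name):
    """Return new rows where the source features are summed into one feature."""
    combined = []
    for row in rows:
        new_row = {name: value for name, value in row.items() if name not in sources}
        new_row[new_name] = sum(row[source] for source in sources)
        combined.append(new_row)
    return combined


# Remove a feature assumed to carry no useful information about price.
smaller = drop_features(rows, {"listing_id"})

# Group two similar features into a single total.
smaller = combine_features(
    smaller, ["sq_ft_upstairs", "sq_ft_downstairs"], "sq_ft_total"
)

print(smaller[0])
```

Each row now has two features instead of four, which is easier to plot as a simple scatter plot. Whether a sum is the right way to group features depends on the data; it is used here only because the two square-footage columns measure the same kind of quantity.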