Fixing or Excluding Data (3:39) with Alyssa Batula
Sometimes a bad data entry can be fixed while other times it is best to remove the data rather than include wrong information.
Data Source Information:
National Health and Nutrition Examination Survey (NHANES) main site
NHANES 1999-2000 information
Body Measurements Documentation/Codebook
Python and Pandas Resources:
Now let's talk about fixing and excluding data from our data set. Sometimes a bad data entry can be fixed, while other times it's best to remove the data rather than include wrong information. Usually it's better to fix your data, because that lets you keep your data set as large as possible. In the previous video, we covered some simple issues that can usually be fixed, like removing extra whitespace or fixing spellings.

Other times it's better to remove the problem data. This is usually because the data is missing or known to be wrong, and you have no way of finding the correct value. In the first stage of this course, we talked about the importance of understanding the data you're working with. Knowing your data is essential when deciding if a problem can be fixed or if it needs to be removed. We'll discuss the specific cases of removing data in upcoming videos. But first, let's use a simplified data set to illustrate ways to handle removing data.

Sometimes we can remove a single data entry and still work with the data. Here we can see that the X-wing crew size and the length of the sail barge are both missing. If we could find that data somewhere, we could fix these entries by restoring the real values. But if we can't find that information, we may be able to just leave the data as missing. Some analysis methods can handle missing data entries like these, while others can't.

Another option is to fill in any missing values with the average value for that column. Or you could use a random sample from a normal distribution with the mean and standard deviation that match the data in that column. The best option will depend on the data set in question and the analysis you plan to run.
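As a rough sketch, here is how those options for missing entries might look in pandas. The ship values below are made up for illustration and are not the numbers shown in the video, and `fill_from_normal` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not the table from the video.
ships = pd.DataFrame(
    {
        "name": ["X-wing", "Sail barge", "Shuttle", "Freighter"],
        "length_m": [12.5, np.nan, 20.0, 35.0],
        "crew": [np.nan, 26.0, 6.0, 4.0],
    }
)

# Option 1: leave the entries missing. Many pandas methods skip NaN by
# default, so the mean is computed from the three known crew sizes.
crew_mean = ships["crew"].mean()

# Option 2: fill each missing value with the average of its column.
filled_mean = ships.fillna(
    {"length_m": ships["length_m"].mean(), "crew": crew_mean}
)


def fill_from_normal(column, rng):
    """Fill NaNs with draws from a normal distribution that uses the
    mean and (sample) standard deviation of the column's known values."""
    missing = column.isna()
    draws = rng.normal(
        loc=column.mean(), scale=column.std(), size=int(missing.sum())
    )
    filled = column.copy()
    filled.loc[missing] = draws
    return filled


# Option 3: fill with random draws. A fixed seed keeps the result repeatable.
rng = np.random.default_rng(seed=0)
filled_normal = fill_from_normal(ships["crew"], rng)
```

Keep in mind that a normal draw can produce values that make no sense for the feature, such as a negative crew size, so it's worth checking the filled values before using them.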
Sometimes we can't leave any missing values in our data set, or we can only have a small percentage of missing entries. In these cases, we may need to remove an entire row. This usually happens when there's a problem with many of the features for a particular entry. Here we can see that the TIE Fighter has no values listed in any of the columns, so it doesn't add any useful information to our data set. If we can't replace those entries, we would probably remove the entire entry for the TIE Fighter. The Death Star is also missing two-thirds of its entries. Although it does still have some information, that's a lot of missing data, and we may want to remove that row as well.

Other times it makes more sense to remove an entire column, or feature, from the data set. Here we only have two entries for the crew size of our ships. If we need to remove all missing entries from our data set, removing all those rows would eliminate most of our data. Instead, we could get rid of the Crew column and keep all of our rows. Whether it makes sense to remove rows or columns in situations like this depends on your specific use case. You'll want to compare how important a feature is to your analysis with how many data entries you will lose in order to keep that feature.

Sometimes we can merge columns into a single feature instead of removing the problematic columns. Here we have some passenger information, but a lot of it is missing. If we want to keep the information we do have, we could replace the passenger feature with a capacity feature that contains the sum of the crew and passenger columns, treating all missing information as 0. Another option could be to run one analysis on ships with passenger information, and a separate analysis on ships without that information.
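The row, column, and merge options above can be sketched in pandas along these lines. Again, the table is a made-up stand-in for the one in the video, and the column names (`length_m`, `crew`, `passengers`, `capacity`) are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not the table from the video.
ships = pd.DataFrame(
    {
        "length_m": [12.5, 30.0, np.nan, 120000.0, 20.0],
        "crew": [np.nan, 26.0, np.nan, np.nan, 6.0],
        "passengers": [0.0, 500.0, np.nan, np.nan, 20.0],
    },
    index=["X-wing", "Sail barge", "TIE Fighter", "Death Star", "Shuttle"],
)

# Remove rows where every feature is missing (the TIE Fighter here).
no_empty_rows = ships.dropna(how="all")

# Remove rows with fewer than 2 known values, which also drops the
# Death Star (only its length is known).
mostly_complete = ships.dropna(thresh=2)

# Requiring complete rows keeps only 2 of the 5 ships...
complete_rows = ships.dropna()

# ...while dropping the sparse crew column first keeps 3 of them.
without_crew = ships.drop(columns="crew").dropna()

# Merge crew and passengers into one capacity feature, counting any
# missing value as 0.
merged = ships.drop(columns=["crew", "passengers"])
merged["capacity"] = ships["crew"].fillna(0) + ships["passengers"].fillna(0)

# Or split the data so each group can be analyzed separately.
has_passengers = ships[ships["passengers"].notna()]
no_passengers = ships[ships["passengers"].isna()]
```

Notice that counting missing values as 0 gives the Death Star a capacity of 0 in this toy table. That's a good reminder to check whether an assumption like this makes sense for your own data before relying on the merged feature.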
Again, the best choice will be different for each data set and analysis. So far in this stage, we've covered simple data fixes and talked a little bit about excluding data. After this next quiz, we'll continue by covering some more complicated data issues. Good luck!