Getting Good Data is Hard
4:28 with Ben Deitch
It's not always so easy to collect good data. In this video we'll look at a few common issues with data collection and talk about how we could handle them.
We've got our data and we're ready to start uncovering some hidden truths. But first, before we start analyzing anything, it's usually a good idea to think about where the data comes from. The data we receive isn't always guaranteed to be 100% accurate. When analyzing the accuracy of your data, you need to think about a few things, including the source of the information, the methods for collecting the data, and the way the data is measured.

Now, for this example, our data comes from a pretty good source. The Boston Athletic Association keeps detailed records, and they let you search through the results. However, they don't just let you download all the results. That would be way too easy. Instead, we rely on other data scientists to write code to repeatedly search through the results and collect them all into a CSV file. This is the first bit of messiness that we need to be aware of. Our data doesn't come directly from the source. We're relying on somebody else to collect the data without making any mistakes. So if we see anything strange, like somebody running the whole marathon in less than an hour, [SOUND] we'd want to double check that with the official results before continuing our analysis. Or, more likely, we'd want to find a new data source.

Another example of messy data would be survey results. Unlike the Boston Marathon, where the runners' times are automatically recorded, with surveys we have to deal with the possibility that respondents are lying to us. Or, more likely, they're being influenced by a cognitive bias. One such cognitive bias is the social desirability bias. When answering a survey, respondents have a tendency to respond in a way that will make them look good to others. For example, when the dentist asks how frequently you floss, do you tell the truth? Turns out, one in four of you don't.
Which leads to this awesome headline by NPR: "Are you flossing, or just lying about flossing?" But the social desirability bias doesn't only mean over-reporting good behaviors. It can also mean under-reporting bad behaviors. If a survey asks how frequently you use recreational drugs or how many sexual partners you've had, it's pretty unlikely that everyone's going to tell the truth. In fact, getting accurate data about recreational drug use is so hard that scientists have turned to testing wastewater to try and figure out which substances are being used in a community.

Luckily, we don't have to go quite that far to make sure we have usable data. Though just because our data is automatically recorded doesn't mean that it's accurate. Take, for example, the step counter on your phone or smartwatch. I don't know about you, but mine's not particularly accurate. Sometimes I get in the car to go to work, and by the time I've arrived, I've added another hundred steps. Now, for something like step counting, it's probably not a big deal to have a few extra. But what if we had a step competition with people using all kinds of different devices to record their steps? All of a sudden, those extra steps would matter a whole lot. What if somebody else had a much more accurate step counter? And on that same drive, instead of recording an extra hundred steps, theirs is perfectly accurate. It wouldn't really be fair to compare our steps directly. So instead, before any analysis, we would want to correct for any extra steps.

This process is known as cleaning or preparing your data. Before doing an analysis, you want to make sure that your data is valid. This can be as simple as combining several misspellings of a name into one category or as difficult as trying to figure out which responses are genuine in an online survey.
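As a rough illustration of the "simple" end of that range, here is a minimal Python sketch of combining several misspellings into one category. The country names and the `CANONICAL` lookup table are made up for the example and do not come from the marathon data; in practice you would build the table by looking through the distinct values in your own column.

```python
from collections import Counter

# Hypothetical lookup table: known variants (lowercased) -> one canonical spelling.
CANONICAL = {
    "untied states": "United States",
    "united sates": "United States",
    "usa": "United States",
}

def normalize(value: str) -> str:
    """Collapse extra whitespace, then map known variants to one spelling."""
    cleaned = " ".join(value.split())
    # Look up case-insensitively; unknown values pass through unchanged.
    return CANONICAL.get(cleaned.lower(), cleaned)

# Made-up raw values with inconsistent spelling, case, and spacing.
raw = ["United States", "united sates", " USA ", "Kenya", "Untied States"]

# Count how many entries fall into each category after cleaning.
counts = Counter(normalize(v) for v in raw)
print(counts)
```

Values that are not in the table are left alone rather than guessed at, so a new misspelling shows up as its own small category instead of being silently merged into the wrong one.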
Lucky for us, our data is already pretty clean. But we do have a few unused columns. The first column seems to be an unnecessary line number. Column J is just completely empty, and the Projected Time column is empty for all runners. We can safely delete each of these columns by right-clicking on the column header and selecting Delete column.

Data cleaning and preparing is the real unsung hero of data analytics. It takes a lot of work to turn raw information into good, valid data that we can analyze. And I can't stress enough how important it is to understand where your data comes from.

That's enough about dealing with messy data. Coming up, we'll finally start doing some analysis. And by the end of this course, you'll be ready to draw insights from all the data all around you.
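The video deletes the unused columns by hand in a spreadsheet. If you would rather script that cleanup, the following Python sketch shows one way it might look. It uses only the standard library, and the sample rows, column names, and the `drop_columns` helper are all hypothetical stand-ins, since the transcript doesn't show the real file's contents.

```python
import csv
import io

# Hypothetical sample standing in for the marathon CSV; the real file's
# column names and values are not shown in the transcript.
SAMPLE = """line,name,official_time,blank,projected_time
0,Runner A,2:09:37,,
1,Runner B,2:10:28,,
"""

def drop_columns(rows, unwanted=()):
    """Remove columns named in `unwanted` plus any column empty in every row."""
    if not rows:
        return []
    # A column counts as empty when every row holds only whitespace or nothing.
    empty = {
        col for col in rows[0]
        if all((row[col] or "").strip() == "" for row in rows)
    }
    remove = empty | set(unwanted)
    return [{k: v for k, v in row.items() if k not in remove} for row in rows]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# "line" is the unnecessary line-number column; the empty ones are found automatically.
cleaned = drop_columns(rows, unwanted=["line"])
print(list(cleaned[0].keys()))
```

Detecting the empty columns automatically, instead of listing them by name, means the same helper would still work if a different export had a different set of blank columns.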