1 00:00:00,380 --> 00:00:04,510 We've got our data and we're ready to start uncovering some hidden truths. 2 00:00:04,510 --> 00:00:07,480 But first, before we start analyzing anything, 3 00:00:07,480 --> 00:00:11,370 it's usually a good idea to think about where the data comes from. 4 00:00:11,370 --> 00:00:16,020 The data we receive isn't always guaranteed to be 100% accurate. 5 00:00:16,020 --> 00:00:20,450 When analyzing the accuracy of your data, you need to think about a few things, 6 00:00:20,450 --> 00:00:23,350 including the source of the information, the methods for 7 00:00:23,350 --> 00:00:26,890 collecting the data, and the way the data is measured. 8 00:00:26,890 --> 00:00:30,450 Now, for this example, our data comes from a pretty good source. 9 00:00:30,450 --> 00:00:33,900 The Boston Athletic Association keeps detailed records, and 10 00:00:33,900 --> 00:00:36,120 they let you search through the results. 11 00:00:36,120 --> 00:00:39,510 However, they don't just let you download all the results. 12 00:00:39,510 --> 00:00:40,750 That would be way too easy. 13 00:00:41,840 --> 00:00:45,530 Instead, we rely on other data scientists to write code to 14 00:00:45,530 --> 00:00:50,370 repeatedly search through the results and collect them all into a CSV file. 15 00:00:50,370 --> 00:00:53,530 This is the first bit of messiness that we need to be aware of. 16 00:00:53,530 --> 00:00:56,160 Our data doesn't come directly from the source. 17 00:00:56,160 --> 00:01:01,260 We're relying on somebody else to collect the data without making any mistakes. 18 00:01:01,260 --> 00:01:02,878 So if we see anything strange, 19 00:01:02,878 --> 00:01:07,173 like somebody running the whole marathon in less than an hour, [SOUND] we'd want to 20 00:01:07,173 --> 00:01:11,306 double-check that with the official results before continuing our analysis. 21 00:01:11,306 --> 00:01:13,700 Or, more likely, find a new data source.
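The sanity check just described, spotting a finish time under an hour, could be automated along these lines. This is only a rough sketch: the sample rows, the column names ("Name", "Official Time"), and the one-hour threshold are assumptions made up for illustration, not the actual layout of the scraped results file.

```python
import csv
import io

# Hypothetical sample standing in for the scraped results CSV. The column
# names and the rows are invented for this sketch.
SAMPLE_CSV = """Name,Official Time
Runner A,2:09:37
Runner B,0:52:10
Runner C,4:15:02
"""

# Assumed plausibility floor, taken from the "less than an hour" example.
MIN_PLAUSIBLE_SECONDS = 60 * 60


def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def suspicious_rows(csv_text: str) -> list:
    """Return the rows whose finish time is faster than the plausibility floor."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if to_seconds(row["Official Time"]) < MIN_PLAUSIBLE_SECONDS
    ]


flagged = suspicious_rows(SAMPLE_CSV)
for row in flagged:
    print(f"Check against the official results: {row['Name']} ({row['Official Time']})")
```

Flagged rows are not deleted here. They are only listed so they can be compared with the official results before deciding whether the data source can be trusted.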
22 00:01:14,930 --> 00:01:18,210 Another example of messy data would be survey results. 23 00:01:18,210 --> 00:01:22,540 Unlike the Boston Marathon, where the runners' times are automatically recorded, 24 00:01:22,540 --> 00:01:23,620 with surveys, 25 00:01:23,620 --> 00:01:27,420 we have to deal with the possibility that respondents are lying to us. 26 00:01:27,420 --> 00:01:31,550 Or, more likely, they're being influenced by a cognitive bias. 27 00:01:31,550 --> 00:01:35,400 One such cognitive bias is the social desirability bias. 28 00:01:35,400 --> 00:01:38,090 When answering a survey, respondents have a tendency to 29 00:01:38,090 --> 00:01:41,370 respond in a way that will make them look good to others. 30 00:01:41,370 --> 00:01:42,370 For example, 31 00:01:42,370 --> 00:01:46,350 when the dentist asks how frequently you floss, do you tell the truth? 32 00:01:47,410 --> 00:01:49,700 Turns out, one in four of you don't. 33 00:01:49,700 --> 00:01:52,740 Which leads to this awesome headline from NPR: 34 00:01:52,740 --> 00:01:56,120 "Are you flossing, or just lying about flossing?" 35 00:01:56,120 --> 00:02:00,990 But the social desirability bias doesn't only mean over-reporting good behaviors. 36 00:02:00,990 --> 00:02:04,280 It can also mean under-reporting bad behaviors. 37 00:02:04,280 --> 00:02:08,620 If a survey asks how frequently you use recreational drugs or how many sexual 38 00:02:08,620 --> 00:02:13,680 partners you've had, it's pretty unlikely that everyone's going to tell the truth. 39 00:02:13,680 --> 00:02:17,722 In fact, getting accurate data about recreational drug use is so 40 00:02:17,722 --> 00:02:21,763 hard that scientists have turned to testing wastewater to try to 41 00:02:21,763 --> 00:02:25,453 figure out which substances are being used in a community. 42 00:02:25,453 --> 00:02:30,178 Luckily, we don't have to go quite that far to make sure we have usable data.
43 00:02:30,178 --> 00:02:33,392 Though just because our data is automatically recorded 44 00:02:33,392 --> 00:02:35,440 doesn't mean that it's accurate. 45 00:02:35,440 --> 00:02:38,830 Take, for example, the step counter on your phone or smartwatch. 46 00:02:38,830 --> 00:02:42,370 I don't know about you, but mine's not particularly accurate. 47 00:02:42,370 --> 00:02:44,620 Sometimes I get in the car to go to work, and 48 00:02:44,620 --> 00:02:47,500 by the time I've arrived, I've added another hundred steps. 49 00:02:48,550 --> 00:02:50,240 Now, for something like step counting, 50 00:02:50,240 --> 00:02:53,510 it's probably not a big deal to have a few extra. 51 00:02:53,510 --> 00:02:56,790 But what if we had a step competition with people using 52 00:02:56,790 --> 00:02:59,970 all kinds of different devices to record their steps? 53 00:02:59,970 --> 00:03:04,170 All of a sudden, those extra steps would matter a whole lot. 54 00:03:04,170 --> 00:03:07,320 What if somebody else had a much more accurate step counter, 55 00:03:07,320 --> 00:03:08,970 and on that same drive, 56 00:03:08,970 --> 00:03:13,230 instead of recording an extra hundred steps, it was perfectly accurate? 57 00:03:13,230 --> 00:03:16,430 It wouldn't really be fair to compare our steps directly. 58 00:03:16,430 --> 00:03:22,030 So instead, before any analysis, we would want to correct for any extra steps. 59 00:03:22,030 --> 00:03:25,570 This process is known as cleaning or preparing your data. 60 00:03:25,570 --> 00:03:29,610 Before doing an analysis, you want to make sure that your data is valid. 61 00:03:29,610 --> 00:03:32,860 This can be as simple as combining several misspellings of a name 62 00:03:32,860 --> 00:03:34,490 into one category or 63 00:03:34,490 --> 00:03:38,860 as difficult as trying to figure out which responses are genuine in an online survey. 64 00:03:40,040 --> 00:03:43,250 Lucky for us, our data is already pretty clean.
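The simple end of that cleaning spectrum, combining several misspellings of a name into one category, might look something like the sketch below. The raw entries and the lookup table are invented for illustration. In practice the mapping would have to be built by inspecting the real data.

```python
from collections import Counter

# Hypothetical raw entries with inconsistent spelling, casing, and spacing.
raw_cities = ["Boston", "boston ", "Bostn", "Cambridge", "Cambrige", "Boston"]

# Assumed hand-built lookup from lowercased variants to one canonical spelling.
CANONICAL = {
    "boston": "Boston",
    "bostn": "Boston",
    "cambridge": "Cambridge",
    "cambrige": "Cambridge",
}


def clean_name(value: str) -> str:
    """Map a raw entry to its canonical spelling, or keep it (trimmed) if unknown."""
    key = value.strip().lower()
    return CANONICAL.get(key, value.strip())


cleaned = [clean_name(city) for city in raw_cities]
counts = Counter(cleaned)
print(counts)
```

Unknown values are passed through rather than guessed at, so anything the lookup does not cover still shows up as its own category and can be reviewed by hand.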
65 00:03:43,250 --> 00:03:46,110 But we do have a few unused columns. 66 00:03:46,110 --> 00:03:49,460 The first column seems to be an unnecessary line number. 67 00:03:49,460 --> 00:03:52,140 Column J is just completely empty, and 68 00:03:52,140 --> 00:03:55,470 the Projected Time column is empty for all runners. 69 00:03:55,470 --> 00:04:00,230 We can safely delete each of these columns by right-clicking on the column header and 70 00:04:00,230 --> 00:04:02,580 selecting Delete column. 71 00:04:02,580 --> 00:04:07,210 Data cleaning and preparation is the real unsung hero of data analytics. 72 00:04:07,210 --> 00:04:10,680 It takes a lot of work to turn raw information into good, 73 00:04:10,680 --> 00:04:12,910 valid data that we can analyze. 74 00:04:12,910 --> 00:04:15,660 And I can't stress enough how important it is to 75 00:04:15,660 --> 00:04:18,280 understand where your data comes from. 76 00:04:18,280 --> 00:04:20,670 That's enough about dealing with messy data. 77 00:04:20,670 --> 00:04:23,480 Coming up, we'll finally start doing some analysis. 78 00:04:23,480 --> 00:04:24,940 And by the end of this course, 79 00:04:24,940 --> 00:04:27,820 you'll be ready to draw insights from all the data all around you.
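The column cleanup from this segment is done by hand in the spreadsheet, but the same idea can be sketched in code for anyone working with the CSV file directly. The sample data and header names below ("Row", "Blank", "Proj Time") are assumptions, not the real file's headers. The sketch drops any column that is empty for every runner, plus any column named explicitly, such as a leftover line number.

```python
import csv
import io

# Hypothetical sample mimicking the three unused columns: a line-number
# column, a completely empty column, and an empty projected-time column.
SAMPLE_CSV = """Row,Bib,Name,Blank,Official Time,Proj Time
0,11,Runner A,,2:09:37,
1,12,Runner B,,2:10:28,
"""


def drop_unused_columns(csv_text: str, extra_drop: tuple = ()) -> list:
    """Remove columns that are empty in every data row or listed in extra_drop."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    keep = [
        index for index, name in enumerate(header)
        if name not in extra_drop and any(row[index].strip() for row in body)
    ]
    # Rebuild every row, header included, from the kept column positions.
    return [[row[index] for index in keep] for row in rows]


cleaned_rows = drop_unused_columns(SAMPLE_CSV, extra_drop=("Row",))
for row in cleaned_rows:
    print(row)
```

The line-number column has to be named explicitly because it is not empty. Only a person looking at the data can decide that it is unnecessary.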