Types of Bad Data
5:18 with Alyssa Batula
In this video, we will outline many of the potential problems a dataset can have.
- Saturation -- When a measurement reaches or exceeds the maximum or minimum value that can be recorded.
- NaN -- Not a number, indicates an invalid number entry.
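As a rough illustration of these two terms, here is a minimal Python sketch. It assumes a thermometer limited to the range -20 to 120 degrees Fahrenheit (the range used in the video's example). The function names and sample temperatures are hypothetical and chosen only for illustration.

```python
import math

# Assumed sensor limits, matching the thermometer example in the video.
SENSOR_MIN_F = -20.0
SENSOR_MAX_F = 120.0


def record_temperature(true_temp_f):
    """Return what a range-limited thermometer would record."""
    return max(SENSOR_MIN_F, min(SENSOR_MAX_F, true_temp_f))


def is_saturated(reading_f):
    """A reading at either limit may hide a more extreme true value."""
    return reading_f <= SENSOR_MIN_F or reading_f >= SENSOR_MAX_F


# Hypothetical true temperatures: one too cold, one in range, one too hot.
readings = [record_temperature(t) for t in (-30.0, 72.5, 131.0)]
saturated_flags = [is_saturated(r) for r in readings]

# NaN never compares equal to anything, including itself,
# so detect it with math.isnan() rather than ==.
missing = float("nan")
nan_equals_itself = missing == missing
nan_detected = math.isnan(missing)
```

A reading that sits exactly on a limit is not necessarily wrong, so flagging it for review is usually safer than deleting it.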
Types of bad data:
- Formatting errors (e.g. extra whitespace, misspellings)
- Incorrect data type (e.g. a string entry in a numerical column)
- Nonsensical data entries (e.g. age < 0)
- Duplicate entries (duplicate rows or columns)
- Missing data (e.g. NaN)
- Saturated data (e.g. value beyond a measurement limit)
- Systematic and individual errors (error affects many entries or only one)
- Confidential information (e.g. personally identifying or private information)
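Several of the problem types listed above can be detected with simple checks. The sketch below is one possible approach in plain Python, not the course's own method. The rows are made-up values loosely modeled on the starship table from the video, and the column names, the `"unknown"` placeholder, and the helper functions are all assumptions for illustration.

```python
import math

# Hypothetical rows; the specific values are invented for this example.
rows = [
    {"name": "X-wing", "crew": "one", "length_m": -12.5, "lightspeed": "Yes"},
    {"name": "X-wing", "crew": "one", "length_m": -12.5, "lightspeed": "Yes"},
    {"name": "Millennium Falcon", "crew": 4, "length_m": 34.4, "lightspeed": " yess "},
    {"name": "Death Star", "crew": math.nan, "length_m": 120000.0, "lightspeed": "unknown"},
    {"name": "TIE Fighter", "crew": 1, "length_m": 6.4, "lightspeed": "N"},
]

# Accepted spellings of yes/no, mapped to one canonical form.
YES_NO = {"yes": "Yes", "y": "Yes", "no": "No", "n": "No"}


def is_missing(value):
    """Treat None, NaN, and the placeholder text 'unknown' as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return value == "unknown"


def normalize_yes_no(text):
    """Map 'Y', ' yes ', 'N', etc. to 'Yes'/'No'; return None if unrecognized."""
    return YES_NO.get(text.strip().lower())


def find_problems(rows):
    """Return (row_index, column, problem) tuples for each issue found."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Duplicate rows: compare a hashable snapshot of the whole row.
        key = tuple(sorted((col, repr(val)) for col, val in row.items()))
        if key in seen:
            problems.append((i, "*", "duplicate row"))
            continue
        seen.add(key)

        # Missing data in any column.
        for col, val in row.items():
            if is_missing(val):
                problems.append((i, col, "missing"))

        # Incorrect data type: crew should be a whole number.
        crew = row["crew"]
        if not is_missing(crew) and not isinstance(crew, int):
            problems.append((i, "crew", "wrong type"))

        # Nonsensical entry: a physical length cannot be negative.
        length = row["length_m"]
        if isinstance(length, (int, float)) and length < 0:
            problems.append((i, "length_m", "nonsensical"))

        # Formatting error: entry is not a recognized yes/no spelling.
        flag = row["lightspeed"]
        if not is_missing(flag) and normalize_yes_no(flag) is None:
            problems.append((i, "lightspeed", "formatting"))
    return problems


problems = find_problems(rows)
```

The function only reports problems and leaves the rows unchanged, since the right fix (correcting, dropping, or keeping an entry) depends on the dataset.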
Kaggle competition datasets
National Health and Nutrition Examination Survey
The Star Wars API (SWAPI)
Why the ‘Boring’ Part of Data Science is Actually the Most Interesting
Data Science: A Kaggle Walkthrough Pt 1: Introduction
Data Science: A Kaggle Walkthrough Pt 2: Understanding the Data
Data Science: A Kaggle Walkthrough Pt 3: Cleaning Data
Let's outline the main types of issues you may run into when cleaning a dataset. Later, we'll look at each of these in more detail and discuss ways to manage them.
The first type of data error is formatting errors, such as misspellings or extra whitespace around text entries. They make data appear different to a computer when in reality the entries should be identical. Formatting errors can be caused by a typo during data entry or by combining data sources that use different formatting. For example, one dataset may spell out Yes and No while another may use only the letters Y and N. In this table, we have formatting errors in the Lightspeed Capable column. Some entries spell out Yes and No while others use only the first letter, Y and N. Even though Y and Yes mean the same thing, they appear different to the computer. We also have a misspelled entry, where yess has an extra s.
Data entries can also have the wrong type, for example, text or string entries in a column that should be all numbers. These can be caused by data that's stored or loaded incorrectly, such as storing a row of numbers as text instead of numerical values. They can also be caused by a data entry error, such as typing your name in a field meant to hold your age. This table has two problematic entries. The crew size for the X-wing is entered as the word one instead of a number. The Death Star also has an entry of 1 for Lightspeed Capable. While 1 and 0 are sometimes used in place of Yes and No, the rest of the column uses text entries.
Some data entries may follow proper formatting but be impossible or incredibly unlikely. For example, physical values such as age, weight, and length could never be negative, so any negative values for those features are errors. Here, our X-wing has a negative length, which is physically impossible.
It also has an entry of 11 for Lightspeed Capable, which is a meaningless value in this context.
Datasets can also have duplicated rows or columns, where an example or feature has been added more than once. Duplications are redundant, causing your dataset to take up more space on your hard drive than it needs. They can also bias your results by putting too much emphasis on the repeated entries. Duplications can be caused by a data entry error or by incorrectly combining data from multiple sources that have some overlap. In our example, we have two rows for the X-wing. We also have a duplicated column: Crew and Crew Capacity both contain the same information, namely how many crew members can fit aboard the ship.
Sometimes values are missing from the dataset because the information is not available. These are often represented by NaN, or not a number, values, but they can also be represented by another text or numeric entry indicating that the information is unknown. The table lists the light speed capability of the Death Star as unknown. The X-wing also has not a number in place of the crew entry. Both of these entries are missing from our example dataset.
Data can also have saturation values, or values at the extreme limits of measurement. For example, a thermometer may be able to read temperatures from negative 20 degrees Fahrenheit to 120 degrees Fahrenheit, or negative 29 degrees Celsius to 49 degrees Celsius. Even if the actual temperature were colder than that range, for example negative 30 degrees Fahrenheit, the thermometer would still only show a temperature of negative 20 degrees. We don't know the true value of any data entries at the saturation limits. Here, we have temperature measurements taken with this thermometer that can only read up to 120 degrees Fahrenheit.
The morning and evening temperatures are fine, but the measurements from 12 to 3 are saturated. Since the thermometer can't go any higher than 120 degrees, we don't know what the actual temperature was at those times.
As we discover problems in our dataset, it's important to consider whether they are individual or systematic. An individual error only affects a single value, like a typo during data entry. A systematic error is one that affects all, or a large portion, of the data in a similar fashion. An example of systematic error would be a one-foot ruler that's actually only 11 inches long. Every entry measured with that ruler would be recorded as longer than it actually is, causing a systematic error.
Finally, while it's not an error, it's important to consider confidential information in datasets. You may want to remove personal information from a dataset to prevent individual people from being identified. In our example, we have credit card numbers, which are highly confidential and should never be included in any kind of data table like this. We also have names and addresses, which we may want to remove for anonymity reasons depending on who has access to the dataset.
To review, we've discussed eight types of data errors or problems: formatting errors, incorrect data type, nonsensical data entries, duplicate entries, missing data, saturated data, systematic and individual errors, and confidential information. We'll discuss each of these problems in more detail throughout this course. After a short quiz, we'll continue by talking about datasets and understanding our data. This will be important in later videos when we decide how to deal with this problematic data.
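The ruler and confidentiality examples from the transcript can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ruler is taken to be marked as 12 inches while physically 11 inches long, and the field names and the sample record are hypothetical placeholders, not data from the course.

```python
# Assumption: the ruler is marked as 12 inches but is physically 11 inches long,
# so every recorded length is inflated by a factor of 12/11.
MARKED_INCHES = 12.0
ACTUAL_INCHES = 11.0


def correct_length(recorded_inches):
    """Undo the systematic error by rescaling a recorded length."""
    return recorded_inches * ACTUAL_INCHES / MARKED_INCHES


# Hypothetical field names treated as confidential in this sketch.
CONFIDENTIAL_FIELDS = {"name", "address", "credit_card"}


def strip_confidential(record):
    """Return a copy of the record without the confidential fields."""
    return {k: v for k, v in record.items() if k not in CONFIDENTIAL_FIELDS}


# Placeholder record; none of these values are real.
record = {
    "name": "Jane Doe",
    "address": "1 Example Street",
    "credit_card": "0000-0000-0000-0000",
    "height_in": 72.0,
}

cleaned = strip_confidential(record)
cleaned["height_in"] = correct_length(cleaned["height_in"])
```

Because a systematic error affects every entry in the same way, one correction can be applied to the whole column once the cause is known. An individual error, such as a single typo, has to be found and fixed entry by entry. Whether to drop names and addresses as well as credit card numbers depends on who has access to the dataset.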