This course will be retired on January 8, 2022.
Types of Bad Data (5:18) with Alyssa Batula
In this video, we will outline many of the potential problems a dataset can have.
- Saturation -- When a measurement reaches or exceeds the maximum or minimum value that can be recorded.
- NaN -- "Not a Number", a special floating-point value that marks an invalid or undefined numeric entry.
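As a rough illustration of these two definitions, here is a minimal Python sketch. The sensor limits (`SENSOR_MIN`, `SENSOR_MAX`), the helper names, and the sample readings are all assumptions made up for this example; they do not come from the video.

```python
import math

# Hypothetical recording limits, chosen only for illustration.
SENSOR_MIN = 0.0
SENSOR_MAX = 100.0

def is_nan(value):
    """Return True when value is a float NaN."""
    return isinstance(value, float) and math.isnan(value)

def is_saturated(value, lo=SENSOR_MIN, hi=SENSOR_MAX):
    """Return True when a reading sits at or beyond a recording limit."""
    return value <= lo or value >= hi

# Made-up readings: one at the maximum, one NaN, one at the minimum.
readings = [12.5, 100.0, float("nan"), 0.0, 57.3]

# Check for NaN first, because comparisons against NaN are always False.
saturated = [r for r in readings if not is_nan(r) and is_saturated(r)]
missing = [i for i, r in enumerate(readings) if is_nan(r)]
```

A reading at exactly the limit is flagged too, since the true value may have been larger (or smaller) than the instrument could record.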
Types of bad data:
- Formatting errors (e.g. extra whitespace, misspellings)
- Incorrect data type (e.g. a number stored as a string, or vice versa)
- Nonsensical data entries (e.g. age < 0)
- Duplicate entries (duplicate rows or columns)
- Missing data (e.g. NaN)
- Saturated data (e.g. value beyond a measurement limit)
- Systematic and individual errors (error affects many entries or only one)
- Confidential information (e.g. personally identifying or private information)
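Several of the categories above can be checked for mechanically. The sketch below uses only the Python standard library; the records, field names, and the `find_problems` helper are hypothetical and invented for illustration. It covers formatting, type, nonsensical, duplicate, and missing entries. Saturation, systematic errors, and confidential information usually need knowledge of how the data was collected.

```python
import math

# Hypothetical records; names and values are invented for illustration.
records = [
    {"name": "Ann", "age": 34},
    {"name": " Bob ", "age": 41},          # formatting: extra whitespace
    {"name": "Cara", "age": "29"},         # incorrect type: age is a string
    {"name": "Dev", "age": -5},            # nonsensical: negative age
    {"name": "Ann", "age": 34},            # duplicate of the first row
    {"name": "Eve", "age": float("nan")},  # missing value
]

def find_problems(rows):
    """Return a list of (row_index, problem) tuples."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        name, age = row["name"], row["age"]
        if name != name.strip():
            problems.append((i, "whitespace"))
        if isinstance(age, str):
            problems.append((i, "wrong type"))
        elif isinstance(age, float) and math.isnan(age):
            problems.append((i, "missing"))
        elif age < 0:
            problems.append((i, "nonsensical"))
        # repr() keeps 29 and "29" distinct when comparing rows.
        key = (name, repr(age))
        if key in seen:
            problems.append((i, "duplicate"))
        seen.add(key)
    return problems

problems = find_problems(records)
for index, problem in problems:
    print(index, problem)
```

Reporting problems rather than fixing them in place is a deliberate choice here: whether to drop, correct, or keep a flagged row depends on the dataset.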
Kaggle competition datasets
National Health and Nutrition Examination Survey
The Star Wars API (SWAPI)
Why the ‘Boring’ Part of Data Science is Actually the Most Interesting
Data Science: A Kaggle Walkthrough Pt 1: Introduction
Data Science: A Kaggle Walkthrough Pt 2: Understanding the Data
Data Science: A Kaggle Walkthrough Pt 3: Cleaning Data