What is Data Cleaning? 3:51 with Alyssa Batula
In this video, we will discuss what is meant by cleaning or scrubbing a dataset, and why it’s an important step in data analysis.
- Data Cleaning -- The process of fixing or removing incorrect, incomplete, and irrelevant data from a dataset. Also called data cleansing, preparing, or scrubbing.
- Example -- A single observation, case, or member of a dataset, usually a row in a table.
- Feature -- A descriptive or measurable characteristic of an example in a dataset, usually a column in a table. Also called a variable.
- Raw Data -- Data that has been collected but not cleaned. Also called source, primary, or atomic data.
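To make the terms above concrete, here is a minimal sketch in plain Python. The table, the `feature_values` helper, the second vehicle, and every value in it are hypothetical and chosen only for illustration; none of them come from the course's practice dataset.

```python
# Hypothetical toy table: all names and values are illustrative assumptions,
# not taken from the course's practice dataset.
vehicles = [
    {"name": "Millennium Falcon", "length_m": 34.75, "crew_capacity": 4, "lightspeed": True},
    {"name": "Example Speeder", "length_m": 3.4, "crew_capacity": 1, "lightspeed": False},
]

# Each dict is one example (a row in the table).
examples = vehicles

# Each key is one feature (a column in the table).
features = list(vehicles[0].keys())


def feature_values(rows, feature):
    """Return one column: the value of `feature` for every example."""
    return [row[feature] for row in rows]


print(len(examples), "examples,", len(features), "features")
print(feature_values(vehicles, "crew_capacity"))
```

Storing each example as a dict keeps the sketch free of dependencies; in the course itself the same table would more likely live in a pandas DataFrame.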
Why the ‘Boring’ Part of Data Science is Actually the Most Interesting
Data Science: A Kaggle Walkthrough Pt 1: Introduction
Data Science: A Kaggle Walkthrough Pt 2: Understanding the Data
Data Science: A Kaggle Walkthrough Pt 3: Cleaning Data
[MUSIC] 0:00 Hi, I'm Alyssa, a computer programmer and data analyst. 0:09 Welcome to the Cleaning and Preparing Data course. 0:13 There are a lot of resources available for learning how to analyse data, 0:16 but most assume your dataset has already been cleaned. 0:20 In this course you'll get hands-on experience 0:23 learning how to clean a dataset, 0:26 also called data cleansing, preparing, or scrubbing. 0:28 This is an important step in making sure your data is relevant and accurate. 0:31 Data cleaning is typically the biggest time commitment 0:36 in any data analysis project, 0:39 often taking up as much as 80% of your time and effort. 0:41
As you go through this course, we'll explore a practice dataset, and 0:45 discuss the problems we can find during data scrubbing. 0:48 We'll talk about how to deal with these problems, and 0:52 how to decide which data to include or exclude from our dataset. 0:55 We'll use hands-on examples with a real dataset so 0:59 you can practice your new skills and get real-world experience. 1:02 By the end of this course, you should be able to prepare your own dataset for 1:06 use in a data analysis project. 1:10
You can adjust the speed of each video if you need to slow it down. 1:12 And feel free to pause a video so you can practice what you just learned. 1:16 It's also a good idea to read the teacher's notes for each video for 1:20 additional information and resources. 1:23 Remember to check the prerequisites for this course. 1:26 These videos will assume you're familiar with programming and 1:29 Python, as well as the NumPy and pandas libraries. 1:31 If you don't know Python, NumPy, or pandas yet, check the teacher's notes for 1:35 links to courses that can teach you what you need to know. 1:39
Now that we've covered the course essentials, let's get started. 1:43 Data cleaning, preparing, or scrubbing is the term for 1:46 fixing any issues in a dataset 1:48 so that all data is present, relevant, and correct. 
1:51 You usually do this when a dataset is first collected, or 1:54 when multiple sources of data are merged into a new, larger dataset. 1:57 It's also an ongoing process, as you can discover new problems with the data 2:01 at any point in your analysis. 2:05
A common way of arranging data is in table format, 2:08 like you would see in a spreadsheet or database query. 2:11 In this setup, each row represents an example or observation, and 2:14 each column represents a feature. 2:18 A feature is a descriptive or measurable characteristic, such as the length, 2:20 crew capacity, or lightspeed capability of fictional vehicles. 2:25 Each example contains all the feature values related to a single unit, 2:29 such as the Millennium Falcon. 2:33
Data preparation is a very important step before trying to get any information 2:35 out of a dataset. 2:39 The results of your analysis can only be as good as the quality of the data itself. 2:40 Any mistakes in the data will cause mistakes in your results. 2:45 A common way of phrasing this idea is garbage in, garbage out. 2:49
There are many sources of data available on the Internet. 2:53 While some of these datasets have already been cleaned, 2:56 you will likely need to create your own dataset to meet the needs of your specific 2:59 problem and questions. 3:02 You may need to gather raw or unclean data from various sources, 3:05 or you may be given a dataset that has been collected but not cleaned. 3:09 Even if a cleaned dataset is available, 3:14 it's often beneficial to look at the raw data yourself 3:17 and understand the steps taken during the data cleaning process. 3:20 Cleaning a dataset involves many decisions on how to deal with 3:24 problematic data entries, 3:27 and you may want to deal with them differently than the person who originally 3:29 cleaned the dataset. 3:32 I've included resources for finding data as well as further reading on data 3:34 cleaning in the teacher's notes for this video. 
3:38 The rest of this course will walk you through the process of scrubbing 3:41 your practice dataset. 3:44 In our next video, 3:45 we'll outline the types of problems you will find and correct throughout the course. 3:47
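As a rough illustration of what "fixing or removing incorrect, incomplete, and irrelevant data" can look like in pandas, here is a minimal sketch. The raw table, its column names, the `clean` function, and every value are assumptions made up for this example; the course's actual dataset and cleaning steps may differ, and each step below is a judgment call rather than the one correct choice.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: every name and value here is made up for illustration.
raw = pd.DataFrame({
    "name": ["Millennium Falcon", "Millennium Falcon", "Shuttle X", "Speeder Y", "Freighter Z"],
    "length_m": [34.75, 34.75, -20.0, np.nan, 50.0],
    "crew_capacity": [4, 4, 6, 1, 8],
    # Assumed to be irrelevant to the question being asked of the data.
    "paint_color": ["grey", "grey", "white", "red", "blue"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """One possible cleaning pass over the hypothetical table above."""
    # Irrelevant data: drop a feature the analysis does not need.
    out = df.drop(columns=["paint_color"])
    # Repeated examples: keep only the first copy of identical rows.
    out = out.drop_duplicates()
    # Incorrect data: a length cannot be zero or negative, so remove those rows.
    out = out[out["length_m"].isna() | (out["length_m"] > 0)]
    # Incomplete data: remove rows that are missing a length.
    out = out.dropna(subset=["length_m"])
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Removing rows is only one option. Someone else cleaning the same table might instead look up the correct length or fill in an estimate, which is why it helps to inspect the raw data and the cleaning steps yourself.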