Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Preview
Start a free Courses trial
to watch this video
Use tools in Google Sheets to clean a dataset.
Update
- "Remove Duplicates" in Google Sheets can now be found under
Data > Data Cleanup > Remove Duplicates
Resources
Duplicates Conditional Formatting Formula
=countif(A:A,A1)>1
Related Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign upRelated Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign up
[MUSIC]
0:00
Hello again,
0:04
in this video we're going to practice
cleaning a data set using Google Sheets.
0:05
If you haven't already, check the teacher's
notes for a link to this dataset so
0:10
you can follow along with me.
0:14
Pause me until you're ready.
0:16
Ready?
0:19
Okay, let's go.
0:20
First let's tackle the duplicates.
0:22
Google Sheets has a built in
feature now to remove duplicates.
0:25
You just have to choose a column that
should have only unique values and
0:29
that will remove them.
0:34
First, let's select the entire data set.
0:36
Then we're gonna go into data.
0:40
And we're gonna scroll
down to remove duplicates.
0:43
And then we don't wanna select all we just
want Column A the name column because
0:49
all of the names should be unique values.
0:54
There should only be one of
each Pokemon in our data set.
0:56
Don't forget that we have
a duplicate right here at the top.
1:01
I put that in there on purpose so
1:05
that we can watch what happens
when I hit Remove duplicates.
1:07
And if you paid attention to
our 2 duplicates right here,
1:12
it kept the first one that it came
across and deleted the second one.
1:16
This works great if the duplicates
contain the exact same
1:21
data to quickly remove
them from your data set.
1:26
Now, if you don't know if the data for
1:30
every column is an exact duplicate,
you can do this instead.
1:32
First, let's Control Z to undo and
bring our duplicates back.
1:37
And now we're gonna do some
conditional formatting.
1:43
By the way, conditional formatting
is probably going to be your new
1:47
best friend when it comes to
cleaning data in a spreadsheet.
1:51
So, I'm going to Google
how to find duplicates.
1:54
And we wanted to make sure
it's in Google Sheets.
2:00
So I'm gonna Enter.
2:03
Okay, let's see what we have.
2:08
Google Sheets' duplicates feature,
that's what we just did.
2:12
So let's see what else they
suggest that we can do.
2:15
Here we go,
highlight duplicates using colors.
2:19
So we're gonna open our Google Sheet,
2:22
we're gonna select the column
that we want to use.
2:25
The, we're gonna go to Format,
Conditional Formatting.
2:28
Wanna make sure we select
the correct range,
2:34
which would be our entire names column.
2:38
We're gonna choose Custom formula and
we're gonna put
2:42
in this formula which is
essentially gonna see if there
2:46
is more than one of a unique
element in the first column.
2:51
So let's do that now.
2:56
We wanna select the entire Name column
because all these values should be unique.
2:58
And we're gonna go format
conditional formatting.
3:03
Our range is already selected, but if you
need to change it, you can check this
3:05
select data range and edit and
make sure you have the correct range data.
3:10
We need custom formula is.
3:15
Paste in that formula you can see
it's already found our 2 duplicates
3:19
here at the top.
3:24
And then you can choose
what other color formatting
3:25
etc you wanna use I'm just
gonna leave it as green.
3:29
And I can close this out and
it'll keep this here.
3:33
So now this allows me to see
my duplicates, so there's one,
3:36
there's an Ekans and here's another Ekans.
3:42
I keep scrolling.
3:47
There's Farfetch'd and another Farfetch'd.
3:49
So now I can check through
these duplicates and
3:52
see if one is better to
keep than the other.
3:55
So for instance, it looks like both
of Farfetch'd are exactly the same.
3:58
So that one it doesn't
matter which one we keep.
4:04
I can get a quick look.
4:07
Okay, scroll up to the other one.
4:10
That one also looks like it doesn't
matter the exact duplicate.
4:14
So the big one here is this Blastoise.
4:17
We have 63 inches, which is not correct.
4:19
So it'd be easier for
us to Delete this first row.
4:23
And then for the others, let's say
we can just Delete the first one.
4:29
Now scroll till we get to Farfetch'd,
Delete.
4:36
Perfect with the duplicates removed,
4:40
let's focus on missing data
which includes empty cells or
4:44
cells with none or
not a number or unknown.
4:49
I'm going to use conditional
formatting again to find these cells.
4:53
I'm gonna go back to the top.
4:57
Select everything Format,
Conditional formatting.
5:00
And since we found all of our duplicates,
you can leave it here it'll
5:06
re highlight anything if you
accidentally make a duplicate.
5:10
Or you can Delete it because we're not
adding anything to the database to
5:13
the table.
5:17
So we shouldn't have a problem
with duplicates any more.
5:18
We're gonna add another rule,
we're gonna apply it to everything.
5:22
And we're gonna say is empty.
5:26
So that's gonna highlight all of our
empty ones, those are gonna be green.
5:28
So let's scroll until
we find an empty cell.
5:32
So there we go.
5:35
We have two empty cells and
it's highlighting them in green for us.
5:35
So let's hit Done.
5:39
Let's add another rule.
5:41
Let's do text contains and
let's do Nan and
5:43
let's do this a different color just so
it's easier to tell.
5:47
Do this just a blue.
5:52
Actually I'll do it yellow, okay.
5:56
Let's do a little bit of scrolling just
kind of give it a check to make sure it
5:59
found the right thing.
6:02
So here we go we got a Nan there Done.
6:03
The other thing and
I can see it right here we have Unknown so
6:08
let's do texts contains Unknown.
6:12
And let's do that red.
6:15
My if I spell it correctly.
6:18
There we go.
6:21
You see it's caught it down there.
6:22
And I hit Done.
6:25
So I have these three conditional
formatting rules that
6:26
are finding all of our missing data cells.
6:30
This is where some decision
making needs to happen.
6:34
Luckily for us, Pokemon information
is all easily searchable.
6:37
But let's pretend it's not for a second.
6:42
I find it's best to be
consistent with missing data.
6:45
So if we can't find the correct values,
6:49
I would make all the missing
number of values,
6:53
either Nan or leave them blank.
6:58
For text cells,
I would probably just leave them blank.
7:01
If most of the data is missing,
you will probably need to decide if
7:05
deleting the entire row
would be the better option.
7:10
For instance, if all of your analysis
questions relate to these two columns,
7:14
and the data for
these two columns is missing for this row,
7:19
then I would probably just remove the row.
7:23
Just delete it.
7:26
Since these values are easily searchable,
let's put in the correct values.
7:28
So I'm going to start at the top.
7:32
Okay, so we have Golbat.
7:42
And this is from Pokemon.com.
7:52
So this is the source of truth,
7:54
I guess you could say so height, 5 foot 3.
7:57
These are in inches so 5 times 12 is 63.
8:03
And then this one is oops
8:09
wait 121.3.
8:14
We need Machoke.
8:26
Now what I'm on here I'm gonna
see if we can just search
8:28
Here we go.
8:43
And what were we missing?
8:45
We're missing the type.
8:47
So type is fighting. Awesome,
keep scrolling.
8:49
We have Seel.
8:57
What type is Seel?
8:58
Seel is water.
9:07
We have Goldeen,
9:15
Also type water.
9:26
Moltres, and
9:36
we are missing,
I already forgot the Height or
9:43
the Weight actually 132.3.
9:48
And then Mew we're missing the Height.
9:55
Which is 1 foot 4 inches.
10:07
Which is 12 plus 4 which is 16.
10:11
Wonderful, I'm gonna
scroll back up through,
10:16
and we just have this one
Unknown Pokemon right here.
10:21
Now I'm gonna check because
they are relatively in order.
10:26
I'm gonna check what's after, Nidoking.
10:30
So if I go look at Nidoking and
you can see, it lets us do this.
10:33
So it has Clefairy as next.
10:41
And then Clefable, if you're a Pokemon
person is the next one in the list.
10:45
So I'm just gonna check so
it should be 24 for
10:53
Height 16.5 for Weight 24, 16.5 Fairy.
10:57
Steel and poison.
11:02
Yep, so this is Clefairy.
11:03
Perfect though it's probably
the hardest one to find.
11:12
And that's all of the values.
11:16
In the next video,
we'll continue cleaning this dataset.
11:18
Awesome job.
11:21
You need to sign up for Treehouse in order to download course files.
Sign upYou need to sign up for Treehouse in order to set up Workspace
Sign up