Use Python's Pandas to remove whitespace and duplicates.
Binder is no longer available for this project, please follow along by downloading the project files from the downloads tab.
Welcome back.
0:00
We're going to work on
cleaning the data set, but
0:01
this time we'll be using
pandas inside of JupyterLab.
0:04
You can find a link to a binder so you can
follow along in the teachers notes below.
0:08
This will open up in your browser.
0:12
There's also a link to download
the CSV if you'd rather open
0:15
up JupyterLab on your own machine.
0:20
Pause me until you've got everything
ready to go, I'll wait here.
0:23
Okay, let's get started.
0:27
First import pandas as pd.
0:29
And run it.
0:36
Then let's get our data,
0:38
data = pd.read_csv("pokemon.csv").
0:42
And let's check out the first 10 rows.
0:48
So data.head(10).
0:55
Perfect.
0:59
Let's also run info, so
data.info. And I can never remember,
1:02
that gives me that version of info, and we
need to do this version of info.
1:07
Perfect, that's what I want.
1:12
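The loading and inspection steps above can be sketched roughly like this. The rows and the exact column labels here are made up for illustration (the real pokemon.csv may differ), and io.StringIO is only a stand-in so the snippet runs without the file.

```python
import io

import pandas as pd

# Made-up stand-in for pokemon.csv; the real file's contents may differ.
csv_text = """Name,Height (inches),Weight (lbs),Type,Weaknesses
Bulbasaur,28,15.2 lbs,Grass,Fire
Charmander,24 inches,18.7 lbs,Fire,Water
Squirtle,20,19.8 lbs,Water,Grass
"""

# pd.read_csv accepts a path such as "pokemon.csv" or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# head(10) returns the first 10 rows, or fewer if the frame is shorter.
print(data.head(10))

# info is a method, so it needs parentheses to print the column summary.
data.info()
```

With the real file you would pass the path directly, as in pd.read_csv("pokemon.csv").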
So we can see all of the columns
are objects as their type, but
1:14
we know Height and Weight should
actually be numerical based columns.
1:19
Height should be some sort of integer and
Weight, some sort of float.
1:24
So we can tell we're gonna
need to do some fixing here.
1:29
So let's get started.
1:33
First, let's remove Whitespace.
1:35
And I'm actually gonna make a note
here of what we're doing and when just
1:37
in case you use this as notes or
reference later, so it's a little bit easier to find.
1:43
Okay, so Whitespace.
1:48
We're going to do whitespace first because
this can affect finding duplicates,
1:51
and it can also unnecessarily
increase the size of our data.
1:55
So I think we should tackle it first.
1:59
We can use strip, like this, .str.strip(),
on a column to remove leading and
2:01
trailing whitespace.
2:07
So we'll need to do this for
each column, so
2:10
data, Name is the name
of our first column and
2:13
they're also listed right here the names,
just in case you need to reference.
2:16
And we'll need to set it equal to
the change that we're making so
2:22
that the change is reflected.
2:25
data.Name, and then .str.strip().
2:28
And I'm just gonna hit
Enter because I'm gonna do
2:32
these all in here and
actually, let me do this.
2:36
We're gonna do copy, paste and then I
can come in here and I can do Height.
2:41
And it does have a space and
all of this in it.
2:45
So we need to make sure we
reflect that in here, inches.
2:48
Actually I'm not gonna copy paste cause
that'll remove my other copy paste,
2:54
inches, paste and we can do Weight.
3:02
And that's an lbs just in case
it's not clear Weight, lbs.
3:06
I'm doing Command or Control and
3:13
the right arrow to get to the end
here a little bit faster.
3:16
Some keyboards also have the end button.
3:20
It just helps you get to the end
of the line a little bit faster.
3:23
Then we need Type. I
didn't wanna run it yet.
3:27
I wanna do one more,
3:35
Weaknesses.
3:41
Okay, just to make sure we have one,
two, three, four, five, perfect.
3:45
Now I can run it and no errors,
which means that all worked out perfectly.
3:52
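A minimal sketch of the whitespace step, one line per column. The five column labels are assumed from what is said in the video, and the sample values are made up.

```python
import pandas as pd

# Made-up rows with stray spaces; column labels are assumed from the video.
data = pd.DataFrame({
    "Name": ["  Bulbasaur", "Charmander  ", " Squirtle "],
    "Height (inches)": [" 28", "24 ", "20"],
    "Weight (lbs)": ["15.2 ", " 18.7", "19.8"],
    "Type": ["Grass ", " Fire", "Water"],
    "Weaknesses": [" Fire", "Water ", " Grass "],
})

# Assign each result back so the change is reflected in the DataFrame.
# Bracket access is needed for labels that contain spaces or parentheses.
data["Name"] = data["Name"].str.strip()
data["Height (inches)"] = data["Height (inches)"].str.strip()
data["Weight (lbs)"] = data["Weight (lbs)"].str.strip()
data["Type"] = data["Type"].str.strip()
data["Weaknesses"] = data["Weaknesses"].str.strip()

print(data["Name"].tolist())  # ['Bulbasaur', 'Charmander', 'Squirtle']
```

Note that .str.strip() only touches leading and trailing whitespace; spaces inside a value are left alone.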
Next, let's find duplicates and
I'm gonna do the same thing,
3:59
a Markdown note, Duplicates,
just to keep us consistent.
4:03
So, we have what we did
to tackle Whitespace.
4:08
This is what we're gonna
do to tackle Duplicates.
4:11
Makes it easier for reference later.
4:14
Now, pandas has two functions for
Duplicates,
4:16
duplicated, and also drop_duplicates.
4:23
So we're gonna check
out what these both do.
4:28
Let's run duplicated on the dataset and
see what we get so data.duplicated and
4:32
you can see I'm just getting
a bunch of like boolean values.
4:40
They're all boolean values,
which isn't super helpful, right?
4:45
It's not super easy to tell where
the duplicates are with this information.
4:49
So let's do something
a little bit different.
4:55
Let's add a .sum here at the end.
4:57
And we can see it's
finding two duplicates.
5:04
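Here is a small sketch of what duplicated returns and why adding .sum() helps. The rows are made up, so the counts will not match the video's data set.

```python
import pandas as pd

# Made-up rows: the two Pikachu rows are identical, while the two
# Blastoise rows differ in the height column.
data = pd.DataFrame({
    "Name": ["Squirtle", "Blastoise", "Blastoise", "Pikachu", "Pikachu"],
    "Height (inches)": ["20", "63 inches", "63", "16", "16"],
})

# duplicated() returns a boolean Series: True marks a row that repeats
# an earlier row in every column.
flags = data.duplicated()
print(flags.tolist())  # [False, False, False, False, True]

# True counts as 1, so the sum is the number of flagged rows.
print(flags.sum())  # 1
```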
But I know there are more in the dataset.
5:07
So what I need to do is specify which
column it should check for repeats.
5:09
In our data set here, that's
the Name column. All of these
5:15
values should be unique.
5:20
Right now it's only flagging rows that
repeat across every column in the data set.
5:22
So it's actually not finding all of
the duplicates that we're looking for.
5:28
So what I need to do is
specify that column.
5:34
So it should look something like this:
5:37
duplicated, and we're gonna
5:44
do subset equals Name.
5:50
Now you can see we're finding 3, which,
if we remember back, kind of cheating-wise,
5:54
to cleaning the spreadsheet,
is the 3 duplicates we had there.
6:00
So that is the correct number.
6:05
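A sketch of how the subset argument changes the count, again on made-up rows rather than the real data set.

```python
import pandas as pd

# Made-up rows: both Blastoise rows share a name but not a height.
data = pd.DataFrame({
    "Name": ["Squirtle", "Blastoise", "Blastoise", "Pikachu", "Pikachu"],
    "Height (inches)": ["20", "63 inches", "63", "16", "16"],
})

# Without subset, a row must match an earlier row in every column.
count_all = data.duplicated().sum()

# With subset, only the listed columns are compared.
count_name = data.duplicated(subset=["Name"]).sum()

print(count_all)   # 1
print(count_name)  # 2
```

The Blastoise pair is only caught in the second count, because the two rows disagree outside the Name column.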
So right now this is great for showing me
how many are in the data set, but I care
6:08
more about like seeing each one so then
I can decide on how I want to remove it.
6:13
To find the duplicates,
we're going to use loc, as in location, l-o-c.
6:19
So it's gonna look like this.
6:23
I'm gonna leave that there and
we're gonna do this on a new line.
6:25
So I'm gonna do data.loc,
6:27
where data.duplicated
6:33
subset = ["Name"].
6:39
Okay, perfect.
6:46
So you can see our three
duplicates below in our results now.
6:47
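The loc step can be sketched like this: the boolean Series from duplicated acts as a row filter. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; the default integer index plays the role of the row
# numbers shown in the notebook output.
data = pd.DataFrame({
    "Name": ["Squirtle", "Blastoise", "Blastoise", "Pikachu", "Pikachu"],
    "Height (inches)": ["20", "63 inches", "63", "16", "16"],
})

# Passing a boolean Series to loc keeps only the rows marked True.
dupes = data.loc[data.duplicated(subset=["Name"])]

print(dupes)
print(dupes.index.tolist())  # [2, 4]
```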
Now, I can review this.
6:53
And I can see, for
instance, our Blastoise.
6:55
This is why I did a head of 10 up here,
so we can see,
6:57
it's bringing us the second instance.
7:02
Because this is our first duplicate up
here you can see the one that it grabbed
7:05
is the one that doesn't
have the 63 inches.
7:09
So it's finding the first one and
7:12
then it's bringing us the second
one when we run duplicated.
7:15
So you can see this is
what we're getting back.
7:20
Now if I review these duplicates
7:26
I can also add keep = "last".
7:31
And you can see that switched,
which one it got for us.
7:37
So before we hit Ctrl+Enter,
7:42
you can see we had 9, 48 and 91.
7:47
Now we have 8, 23 and 84.
7:50
So it's grabbing the first
instance instead.
7:55
So now we're getting
the first instance and
7:59
it's keeping the second
one in the data set.
8:02
So you have a little bit of flexibility
here depending on which ones you want
8:05
to keep.
8:09
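To make the keep option concrete, here is a sketch comparing the default with keep="last" on made-up rows. The index values differ from the ones in the video.

```python
import pandas as pd

# Made-up rows with two repeated names.
data = pd.DataFrame({
    "Name": ["Squirtle", "Blastoise", "Blastoise", "Pikachu", "Pikachu"],
    "Height (inches)": ["20", "63 inches", "63", "16", "16"],
})

# Default keep="first": the first occurrence is treated as the original,
# so the later occurrences are the ones flagged.
flagged_default = data.loc[data.duplicated(subset=["Name"])]

# keep="last": the last occurrence is treated as the original,
# so the earlier occurrences are flagged instead.
flagged_last = data.loc[data.duplicated(subset=["Name"], keep="last")]

print(flagged_default.index.tolist())  # [2, 4]
print(flagged_last.index.tolist())     # [1, 3]
```

Whichever rows show up in the filtered view are the ones that would be removed by drop_duplicates with the same arguments.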
Looking at the duplicates, I think I
want to keep those last instances,
8:10
because I don't want to keep the 63
inches, since it's incorrect, and
8:14
that saves me a little bit of work.
8:19
So now we need to drop the duplicates.
8:22
I'm gonna create this new cell here,
and I'm going to call
8:25
drop_duplicates and
pass in our keep last and subset name.
8:30
So it's going to look like this.
8:35
So I'm gonna do data.drop_duplicates,
8:40
subset = name, and we're going to keep
8:48
last.
8:55
And then we're also going
to do inplace=True,
8:59
which means we want this to
affect our actual data set.
9:04
We want this to replace what's
currently in the data set.
9:10
Hit Enter and just so
we can see if it's made the change,
9:15
we can do data.head again and pass in 10.
9:19
Before I run this, let me scroll up.
9:22
So remember, we had both of these and
9:25
now we're trying to get
rid of this number 8 here.
9:27
So we're keeping the last instance.
9:34
So if it finds two or even three or four,
9:36
it's going to keep the last one that
it finds instead of the first one.
9:39
If I run it, oops, we have name.
9:44
We need to make sure that is capitalized,
there we go.
9:52
Cuz our column name is capitalized.
9:57
So you have to keep it the same. So
you can see, it skips 8 now because 8
9:59
is now gone and now we can see what
the next Pokemon in our data set is.
10:04
So it removed the correct ones, and now
we don't have to worry about fixing
10:09
that 63 inches anymore.
10:14
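A sketch of the drop step, including the lowercase column name slip. The rows are made up, and catching the KeyError is only there so the snippet runs from top to bottom.

```python
import pandas as pd

# Made-up rows; in each repeated pair the first row is the one to discard.
data = pd.DataFrame({
    "Name": ["Squirtle", "Blastoise", "Blastoise", "Pikachu", "Pikachu"],
    "Height (inches)": ["20", "63 inches", "63", "16", "16"],
})

# Column labels are case-sensitive, so "name" does not match "Name".
error_raised = False
try:
    data.drop_duplicates(subset=["name"], keep="last")
except KeyError:
    error_raised = True

# inplace=True modifies data directly instead of returning a new frame.
data.drop_duplicates(subset=["Name"], keep="last", inplace=True)

print(data.index.tolist())  # [0, 2, 4]
print(data.head(10))
```

The original index labels are kept, which is why the output skips the numbers of the removed rows.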
One quick note though, and
I'm going to add this in Markdown here for
10:16
note taking, add this to this markdown.
10:21
If a name has Whitespace around it,
it will not be counted as a duplicate.
10:25
It has to be exact.
10:31
That's why we removed
the excess Whitespace first.
10:33
So, as a note,
10:37
I'm gonna do this. Note: if a name has
10:44
whitespace around it, it will not
10:51
be counted as a duplicate.
10:58
It has to be exact.
11:03
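The note above can be checked with a tiny made-up example: the same name with stray spaces is not counted as a duplicate until it is stripped.

```python
import pandas as pd

# Made-up names that differ only by surrounding whitespace.
names = pd.DataFrame({"Name": ["Pikachu", "Pikachu ", " Pikachu"]})

# The comparison is exact, so none of these count as duplicates yet.
before = names.duplicated(subset=["Name"]).sum()

# After stripping, all three values are the same string.
names["Name"] = names["Name"].str.strip()
after = names.duplicated(subset=["Name"]).sum()

print(before)  # 0
print(after)   # 2
```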
So now, I have that note there,
for reference in the future.
11:10
And perfect.
11:14
So far we've tackled Whitespace,
we've tackled duplicates, and
11:14
there's still more to go.
11:18
See you in the next video.
11:19