Cleaning A CSV

11:21 with Megan Amendola

Use Python's Pandas to remove whitespace and duplicates.

Video Transcript

Welcome back. 0:00

We're going to work on cleaning the data set, but 0:01

this time we'll be using pandas inside of JupyterLab. 0:04

You can find a link to a Binder so you can follow along in the Teacher's Notes below. 0:08

This will open up in your browser. 0:12

There's also a link to download the CSV if you'd rather open 0:15

up JupyterLab on your own machine. 0:20

Pause me until you've got everything ready to go, I'll wait here. 0:23

Okay, let's get started. 0:27

First import pandas as pd. 0:29

And run it. 0:36

Then let's get our data, 0:38

data = pd.read_csv("pokemon.csv"). 0:42

And let's check out the first 10 rows. 0:48

So data.head(10). 0:55

Perfect. 0:59

Let's also run info, so data.info, and I can't remember, that- 1:02

gives me that version of info. We need to do this version of info, with the parentheses. 1:07

Perfect, that's what I want. 1:12

So we can see all of the columns are objects as their type, but 1:14

we know Height and Weight should actually be numeric columns. 1:19

Height should be some sort of integer and Weight, some sort of float. 1:24

So we can tell we're gonna need to do some fixing here. 1:29
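
As a rough sketch of the loading and inspection steps so far: the snippet below reads a tiny made-up CSV from memory instead of pokemon.csv, so it runs anywhere. The column names and values are assumptions for illustration, and the stray unit text is just one possible reason a column would fail to load as numbers.

```python
import io
import pandas as pd

# Hypothetical stand-in for pokemon.csv; the real file's columns and values may differ.
csv_text = io.StringIO(
    "Name,Height,Weight\n"
    "Bulbasaur,28 in,15.2 lbs\n"
    "Ivysaur,39 in,28.7 lbs\n"
)

data = pd.read_csv(csv_text)

# head(n) returns the first n rows (here there are only two).
print(data.head(10))

# info() needs the parentheses; data.info on its own only refers to the method.
data.info()

# Values with extra text in them are read as strings, not numbers.
height_is_numeric = pd.api.types.is_numeric_dtype(data["Height"])
print(height_is_numeric)
```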

So let's get started. 1:33

First, let's remove whitespace. 1:35

And I'm actually gonna make a note here of what we're doing and when, just 1:37

in case you use this as notes or a reference later, so it's a little bit easier to find. 1:43

Okay, so Whitespace. 1:48

We're going to do whitespace first because this can affect finding duplicates, 1:51

and it can also unnecessarily increase the size of our data. 1:55

So I think we should tackle it first. 1:59

We can use strip, like this, .str.strip() on a column, to remove leading and 2:01

trailing whitespace. 2:07

So we'll need to do this for each column, so 2:10

data, Name is the name of our first column, and 2:13

the names are also listed right here, just in case you need to reference them. 2:16

And we'll need to set it equal to the change that we're making so 2:22

that the change is reflected. 2:25

data.Name, and then you'll do .str.strip(). 2:28

And I'm just gonna hit Enter because I'm gonna do 2:32

these all in here and actually, let me do this. 2:36

We're gonna do copy, paste, and then I can come in here and I can do Height. 2:41

And it does have a space and all of this in it. 2:45

So we need to make sure we reflect that in here, inches. 2:48

Actually, I'm not gonna copy paste 'cause that'll remove my other copy paste, 2:54

inches, paste, and we can do Weight. 3:02

And that's lbs, just in case it's not clear: Weight, lbs. 3:06

I'm doing Command or Control and 3:13

the right arrow to get to the end here a little bit faster. 3:16

Some keyboards also have an End key. 3:20

It just helps you get to the end of the line a little bit faster. 3:23

Then we need Type. I didn't wanna run it yet. 3:27

I wanna do one more, 3:35

Weaknesses. 3:41

Okay, just to make sure we have one, two, three, four, five, perfect. 3:45

Now I can run it and no errors, which means that all worked out perfectly. 3:52
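
A minimal sketch of the whitespace step, assuming a small made-up DataFrame with only two of the columns (the sample values are illustrative, not taken from the course file). Bracket notation is used here because it also works for column names that contain spaces.

```python
import pandas as pd

# Made-up sample with leading and trailing whitespace.
data = pd.DataFrame({
    "Name": ["  Bulbasaur", "Ivysaur  ", " Venusaur "],
    "Type": ["Grass ", " Grass", "Grass"],
})

# str.strip() returns a new Series, so assign it back to keep the change.
data["Name"] = data["Name"].str.strip()
data["Type"] = data["Type"].str.strip()

print(data["Name"].tolist())
print(data["Type"].tolist())
```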

Next, let's find duplicates, and I'm gonna do the same thing, 3:59

Markdown, Duplicates, just to keep us consistent. 4:03

So, we have what we did to tackle Whitespace. 4:08

This is what we're gonna do to tackle Duplicates. 4:11

Makes it easier for reference later. 4:14

Now, pandas has two methods for duplicates, 4:16

duplicated and also drop_duplicates. 4:23

So we're gonna check out what these both do. 4:28

Let's run duplicated on the data set and see what we get, so data.duplicated(), and 4:32

you can see I'm just getting a bunch of Boolean values. 4:40

They're all boolean values, which isn't super helpful, right? 4:45

It's not super easy to tell where the duplicates are with this information. 4:49

So let's do something a little bit different. 4:55

Let's add a .sum() here at the end. 4:57

And we can see it's finding two duplicates. 5:04

But I know there are more in the dataset. 5:07

So what I need to do is specify which column it should check for repeats. 5:09

In our data set here that's the Name column. All of these 5:15

values should be unique. 5:20

Right now it's only counting rows that repeat across every column of the data set. 5:22

So it's actually not finding all of the duplicates that we're looking for. 5:28

So what I need to do is specify that column. 5:34

So it should look something like this: 5:37

duplicated, and we're gonna 5:44

do subset equals Name. 5:50

Now you can see we're finding 3, which, if we remember back, kind of cheating, 5:54

to cleaning the spreadsheet, is right: we had 3 duplicates. 6:00

So that is the correct number. 6:05
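
To illustrate the difference between checking whole rows and checking one column, here is a sketch on a made-up DataFrame (the names and heights are placeholder values, not the course data):

```python
import pandas as pd

# Made-up rows: the two Pikachu rows match exactly,
# the two Blastoise rows share a name but differ in Height.
data = pd.DataFrame({
    "Name": ["Blastoise", "Blastoise", "Pikachu", "Pikachu", "Eevee"],
    "Height": ["63", "62", "16", "16", "12"],
})

# Without subset, a row only counts as a duplicate if every column matches.
full_row_duplicates = int(data.duplicated().sum())

# With subset, only the Name column is compared.
name_duplicates = int(data.duplicated(subset=["Name"]).sum())

print(full_row_duplicates, name_duplicates)
```

Summing works because each True in the Boolean result counts as 1.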

So right now this is great for showing me how many are in the data set, but I care 6:08

more about seeing each one, so then I can decide on how I want to remove it. 6:13

To find the duplicates, we're going to use loc, location, l-o-c. 6:19

So it's gonna look like this. 6:23

I'm gonna leave that there and we're gonna do this on a new line. 6:25

So I'm gonna do data.loc, 6:27

where data.duplicated, 6:33

subset = ["Name"]. 6:39

Okay, perfect. 6:46

So you can see our three duplicates below in our results now. 6:47

Now, I can review this. 6:53

And I can see, for instance, our Blastoise. 6:55

This is why I did a head of 10 up here, so we can see, 6:57

it's bringing us the second instance. 7:02

Because this is our first duplicate up here, you can see the one that it grabbed 7:05

is the one that doesn't have the 63 inches. 7:09

So it's finding the first one and 7:12

then it's bringing us the second one when we run duplicated. 7:15

So you can see this is what we're getting back. 7:20

Now if I review these duplicates, 7:26

I can also add keep = "last". 7:31

And you can see that switched which one it got for us. 7:37

So before we do Ctrl C, hit Enter. 7:42

You can see we have 9, 48 and 91. 7:47

Now we have 8, 23 and 84. 7:50

So it's grabbing the first instance instead. 7:55

So now we're getting the first instance and 7:59

it's keeping the second one in the data set. 8:02

So you have a little bit of flexibility here depending on which ones you want 8:05

to keep. 8:09
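
A small sketch of what keep changes, using a made-up DataFrame rather than the course's data set (the names, heights, and index positions below are assumptions for illustration only):

```python
import pandas as pd

# Made-up rows: Blastoise appears twice, at index 1 and index 2.
data = pd.DataFrame({
    "Name": ["Eevee", "Blastoise", "Blastoise", "Pikachu"],
    "Height": ["12", "63", "62", "16"],
})

# Default (keep="first"): the first occurrence is treated as the original,
# so the later occurrence is the one flagged as a duplicate.
flagged_later = data.loc[data.duplicated(subset=["Name"])]

# keep="last": the last occurrence is treated as the original,
# so the earlier occurrence is flagged instead.
flagged_earlier = data.loc[data.duplicated(subset=["Name"], keep="last")]

print(flagged_later.index.tolist())
print(flagged_earlier.index.tolist())
```

The rows that show up here are the ones that would be dropped, not the ones that would stay.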

Looking at the duplicates, I think I want to drop those first instances and keep the last ones, 8:10

because I don't want to keep the 63 inches, since it's incorrect, and 8:14

that saves me a little bit of work. 8:19

So now we need to drop the duplicates. 8:22

I'm gonna create this new cell here, and I'm going to call 8:25

drop_duplicates and pass in our keep last and subset Name. 8:30

So it's going to look like this. 8:35

So I'm gonna do data.drop_duplicates, 8:40

subset = name, and we're going to keep 8:48

last. 8:55

And then we're also going to do inplace=True, 8:59

which means we want this to affect our actual data set. 9:04

We want this to replace what's currently in the data set. 9:10

Hit Enter, and just so we can see if it's made the change, 9:15

we can do data.head again and pass in 10. 9:19

Before I run this, let me scroll up. 9:22

So remember, we had both of these and 9:25

now we're trying to get rid of this number 8 here. 9:27

So we're keeping the last instance. 9:34

So if it finds two or even three or four, 9:36

it's going to keep the last one that it finds instead of the first one. 9:39

If I run it, oops, we have name. 9:44

We need to make sure that is capitalized, there we go. 9:52

'Cause our column name is capitalized. 9:57

So you have to keep it the same. So you can see, it skips 8 now because 8 9:59

is now gone and now we can see what the next Pokemon in our data set is. 10:04

So it removed the correct ones, and now we don't have to worry about fixing 10:09

that 63 inches anymore. 10:14
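
The drop step can be sketched like this on a made-up DataFrame (sample values are assumptions, not the course data). It also shows what happens when the column name's capitalization does not match.

```python
import pandas as pd

# Made-up rows: Blastoise appears at index 1 and index 2.
data = pd.DataFrame({
    "Name": ["Eevee", "Blastoise", "Blastoise", "Pikachu"],
    "Height": ["12", "63", "62", "16"],
})

# Column names are case-sensitive, so "name" is not found.
lowercase_failed = False
try:
    data.drop_duplicates(subset=["name"], keep="last")
except KeyError as error:
    lowercase_failed = True
    print("Column not found:", error)

# keep="last" keeps the final occurrence of each name;
# inplace=True modifies data directly instead of returning a new DataFrame.
data.drop_duplicates(subset=["Name"], keep="last", inplace=True)

# The original index labels are kept, so the dropped row leaves a gap.
print(data.index.tolist())
```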

One quick note though, and I'm going to add this in Markdown here for 10:16

note taking, add this to this markdown. 10:21

If a name has Whitespace around it, it will not be counted as a duplicate. 10:25

It has to be exact. 10:31

That's why we removed the excess Whitespace first. 10:33
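
A quick sketch of why the order matters, using made-up names where one entry has a trailing space:

```python
import pandas as pd

# "Pikachu " has a trailing space, so it is not equal to "Pikachu".
data = pd.DataFrame({"Name": ["Pikachu", "Pikachu ", "Eevee"]})

duplicates_before = int(data.duplicated(subset=["Name"]).sum())

# After stripping, the two Pikachu entries match exactly.
data["Name"] = data["Name"].str.strip()
duplicates_after = int(data.duplicated(subset=["Name"]).sum())

print(duplicates_before, duplicates_after)
```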

So, as a note, 10:37

I'm gonna do this. Note: if a name has 10:44

whitespace around it, it will not 10:51

be counted as a duplicate. 10:58

It has to be exact. 11:03

So now I have that note there for reference in the future. 11:10

And perfect. 11:14

So far we've tackled Whitespace, we've tackled duplicates, and 11:14

there's still more to go. 11:18

See you in the next video. 11:19