Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Preview
Start a free Courses trial
to watch this video
Welcome! In this video, we'll introduce you to the merge() function and the arguments used to successfully merge two datasets.
Data files
Definitions
- DataFrame - a two-dimensional data structure, arranged in rows and columns
Similar terms
- DataFrame ⇔ dataset
- record ⇔ row
- field ⇔ column
Load data into pandas
billboard = pd.read_csv("Billboard_100_2017-2018.csv", index_col="ID")
spotify = pd.read_csv("Spotify_200_2017-2018.csv", index_col="ID")
Create DataFrames for Ariana Grande Billboard and Spotify song data
ariana_bill = billboard[billboard["Artists"].str.contains("Ariana Grande")]
ariana_spot = spotify[spotify["Artists"].str.contains("Ariana Grande")]
Merge Ariana Billboard dataset and Ariana Spotify dataset
ariana_combined = pd.merge(ariana_bill, ariana_spot, how='left', on=['Name', 'Artists', 'BB.Week'])
Additional Resources
- Pandas API: merge() function
Related Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign upRelated Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign up
Welcome.
0:00
In this video, we will set up
our data frames in Pandas,
0:01
then merge the data from two data
frames into one combined data frame.
0:04
Download the two CSV files from the
teacher's notes to make sure you remember
0:10
where you save them.
0:13
Anaconda navigator is open on my screen.
0:15
We'll run JupyterLab.
0:17
So I click the Launch button
in the JupyterLab cell.
0:19
It'll launch a new tab
in my browser window.
0:26
This is my file tree on the left,
0:29
I'll navigate to where I
downloaded my CSV files.
0:32
I created a folder specifically for
this workshop,
0:35
Documents > anaconda > combine-data.
0:39
CSV stands for comma separated values.
0:44
A CSV file is a text file.
0:47
Each line of text is a row, and the commas
separate each line into columns.
0:50
The first line of text
contains the headings and
0:57
each additional line is a record of data.
0:59
Billboard_100_2017-2018 shows
1:03
the Billboard 100 chart
data from 2017 to 2018.
1:08
JupyterLab displays the file as
a chart in rows and columns.
1:13
Spotify_200_2017-2018 contains
the Spotify 200 data from those years.
1:17
Back on my Launcher tab,
I want to create a new notebook.
1:31
So I'll click Python3 under Notebook.
1:34
I can rename the notebook now,
right-click or
1:38
Ctrl-click the notebook name,
choose Rename,
1:41
Then enter billboard_spotify.
1:49
Let's first import the Pandas package and
check the version.
1:56
import pandas as pd.
2:01
pd.__version__.
2:09
I'll click the Run button
here in the toolbar.
2:15
The keyboard shortcut to run
the code in the cell is Shift+Enter,
2:18
I'll use them moving forward.
2:21
Pressing Enter just adds
a new line within this cell,
2:23
like I did here with input and version.
2:26
But this shows that Pandas is installed,
your version may be higher.
2:29
Then let's load the data from each
CSV file into a new data frame.
2:34
Billboard = pd.read_csv
2:41
("Billboard_100_2017-
2:48
2018.csv",
2:59
index_col="ID").
3:05
spotify =
3:13
pd.read_csv("Spotify_200_2017-2018.csv,
3:15
index_col="ID").
3:31
The index_col tells Pandas that
the ID field is the index or
3:41
unique identifier for each record.
3:44
In other words, no two rows in a data
frame should have the same ID number.
3:48
We identify this field so that Pandas
doesn't add its own index column.
3:52
Let's use a couple of Pandas tools to
get a sense of what we just imported.
3:57
First .shape will show us
the dimensions of each data frame.
4:01
Billboard.shape, spotify.shape.
4:08
This shows us there are over 10,000
records in the Billboard data set,
4:16
with 7 columns other than the index.
4:20
There are more than 20,000 rows and 5
non-index columns in the Spotify data set.
4:23
Let's get a peek at the first few rows of
each data frame using the head method.
4:30
Billboard.head().
4:37
This shows us the column headings
and the first five rows of data.
4:42
By the way, if I wanted to see eight rows,
for example, I would say head(8),
4:47
like this spotify.head(8).
4:53
But five is the default if I
don't specify an argument.
5:04
And this is fine because I really only
care about the headings right now.
5:07
Both data frames have a name field,
which is the name of the song.
5:11
Both have an artist column, which lists
the recording artist of the song.
5:15
And both have a BB.Week field which is
the release date of the weekly Billboard
5:20
100 chart.
5:24
So for our purposes we will use
the combination of name, artists, and
5:26
BB.Week to merge our data frames.
5:30
Actually, we have a lot of data here.
5:34
Let's first work with a smaller sample of
the data just to make sure we're doing
5:36
it right.
5:41
If you remember 2017, 2018, there were
a few people who dominated American music,
5:41
Drake, Ariana Grande,
Imagine Dragons, few others.
5:48
They all had a lot of
songs on these charts.
5:53
Let's choose Ariana.
5:56
I'll make a data frame of just Ariana
Grande's songs on the Billboard chart.
5:59
Ariana_bill =
billboard[billboard["Artists"],
6:07
.str.contains("Ariana Grande").
6:18
Do the same with Spotify data, ariana_spot
6:29
=
spotify[spotify["Artists"].str.contains("-
6:37
Ariana Grande")].
6:46
By the way, make sure you
capitalize the A in artists,
6:52
the leading A in Ariana, and the G in
Grande, everything else is lowercase.
6:56
Our data is very case sensitive.
7:03
Now, let's check the dimensions of each
Ariana data frame, ariana_bill.shape,
7:07
Ariana_spot.shape.
7:21
A few 100 records is much more manageable.
7:26
If we find any errors we
can check them manually.
7:28
Let's talk about the merge function.
7:32
It has two required arguments, the left
data frame and the right data frame.
7:35
By default, it performs an inner
join on the index column.
7:40
But we don't want that,
we want a left join on the name,
7:44
artists, and BB.Week columns.
7:49
So let's see what that looks like.
7:53
Ariana_combined =
7:58
pd.merge(ariana_bill, ariana_spot,
8:02
how= 'left',
8:13
on=['Name'], 'Artist',
8:19
BB.Week]).
8:30
So we called our new data
frame ariana_combined.
8:36
ariana_bill is our left data set.
8:41
ariana_spot is our right data set.
8:44
The how argument is the type of join
we wanna perform, which is left.
8:48
The on argument is a column or
columns on which to perform the join.
8:53
Since we have three columns,
we need to enter them as a Python list.
8:58
Because this is a left join, I expect for
9:02
this new data frame to have the same
number of rows as the left data frame.
9:04
Let's see, arianna_combined.
9:09
shape.
9:13
Great, and now let's look at
the head of this new data set,
9:17
ariana_combined.head.
9:23
And now I see that my new data set has all
the columns from the Billboard data frame,
9:32
and a few extra columns taken
from the Spotify data frame.
9:36
Let's display this data frame again.
9:40
But since we only really care
about certain columns,
9:42
let's specify which columns we wanna view
and in which order.
9:44
ariana_combined[['Artists', 'Name',
9:51
'BB.Week', 'Weekly.Rank',
10:00
'Streams'}].
10:07
Now we have a better
understanding of this data merge.
10:15
Each record shows Ariana Grande as
the artist, the name of the song,
10:19
the Billboard chart date for that week,
the Billboard ranking for that week, and
10:22
the number of Spotify streams for
that week.
10:26
For any missing data,
there's a value of NaN or not a number.
10:30
This means that there's no matching
record between the two data sets.
10:35
It's a bit hard to find for
10:39
Ariana Grande, most of her songs that
charted on Billboard had a lot of streams.
10:40
There has to be a few so let's check for
10:45
songs in ariana_combined where
Streams is NaN, not a number.
10:48
We'll use the isnull() function.
10:53
ariana_combined[ariana_combined
10:57
['Streams'].isnull()
11:09
For example, the song Everyday,
11:19
it was number 85 on the Billboard
chart on March 4, 2017.
11:22
But we don't know how many Spotify
streams because it was not in the top 200
11:26
streaming songs for that week.
11:30
Okay, I've added some helpful
resources in the teacher's notes,
11:34
including some other optional arguments
available for the merge function.
11:37
Now here's my challenge to you.
11:41
We merged a portion of
our data frames together,
11:44
specifically the Ariana Grande
songs in our list.
11:46
I would like you to use the same method
demonstrated in this video to combine
11:51
the full Billboard and Spotify data sets,
call your new data frame bill_spot.
11:57
In the next video I'll
show you my solution, and
12:03
then we'll learn how to concatenate
two data sets within the Pandas tool.
12:06
See you there.
12:10
You need to sign up for Treehouse in order to download course files.
Sign upYou need to sign up for Treehouse in order to set up Workspace
Sign up