Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Preview
Start a free Courses trial
to watch this video
Welcome! In this video, we'll wrap up our workshop on combining data and I'll share some final thoughts on the dataset used throughout the workshop.
Alternative steps to combine data
# Merge Billboard 2017-2018 and Spotify 2017-2018
bs0 = pd.merge(billboard, spotify, how='left', on=['Name', 'Artists', 'BB.Week'])
# Merge Billboard 2019 and Spotify 2019
bs19 = pd.merge(billboard19, spotify19, how='left', on=['Name', 'Artists', 'BB.Week'])
# Concatenate 2017-2018 dataset and 2019 dataset
bs_final = pd.concat([bs0, bs19])
# Confirm dimensions are identical
bs_final.shape
bill_spot_final.shape
Additional Resources
- pandas API - alternate .merge() function
- pandas API - .join() function
- pandas API - .append() function - This has since been depreciated and we should be using concat
- Kaggle Data - Songs from Billboard 1999-2019
Related Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign upRelated Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign up
Congratulations! You now have a better
understanding on how to combine data using
0:01
Pandas and are ready to make strides
in your data analysis projects.
0:05
I do want to address a few things that may
impact how you approach future projects.
0:10
First, our data is case sensitive.
0:15
When I first reviewed this data set,
Ariana's song names were in the proper
0:18
case on the Billboard chart, but
in all lowercase on the Spotify chart.
0:22
If we were to try to merge
the data with this mismatch,
0:26
we would have been unsuccessful.
0:29
In other words, No Tears Left to Cry
in proper case is different from
0:32
no tears left to cry in all lowercase.
0:36
I corrected this issue by converting all
of her song titles in both datasets to
0:40
proper case, but
there's still much work to be done.
0:44
For example, let's take a look at songs by
Kendrick Lamar in my bill_spot dataset.
0:49
Bill_spot [bill_spot] Artist,
0:57
== Kendrick Lamar,
1:09
.head parentheses.
1:17
[SOUND] Love was a very
popular song in 2018, but
1:22
we don't seem to have any stream data for
it.
1:25
Let's look at the Spotify dataset.
1:28
Spotify[spotify[ Artists,
1:35
== Kendrick Lamar.
1:41
.head().
1:51
Do you see the problem?
1:57
That's right.
1:59
Love and LOVE in all caps
are not the same thing, right?
2:00
This data would benefit from some
additional cleaning and preparation.
2:07
For example, making sure the case of
the song names match between Billboard and
2:11
Spotify datasets.
2:16
Making sure the recording artists match,
making the song names match.
2:17
The good news is that the vast
majority of our data is just fine.
2:23
So we'll get decent
results with what we have.
2:27
So, this is a battle we
can fight another time.
2:30
Next, let's look at our
four initial datasets.
2:34
Billboard, spotify,
billboard19 and spotify19.
2:41
We concatenated the two Billboard sets,
then concatenated the two Spotify sets.
2:48
And finally,
2:54
merge the combined Billboard data frame
with the combined Spotify data frame.
2:55
We could have completed this
project in a different order.
3:01
Remember that we first merged the original
Billboard data with the original
3:04
Spotify data.
3:07
We could have merged the 2019 Billboard
data set with the 2019 Spotify data set.
3:09
From there,
you can concatenate the combined 2017,
3:15
2018 data with the 2019 data,
let me show you how.
3:19
I'll recreate my original merge.
3:27
I'll call it bs0 for short.
3:29
Bs0
3:33
= pd.merge(billboard,
3:34
spotify, how=
3:43
'left', on=['Name',
3:48
'Artists', 'BB.Week']).
3:54
Then merge billboard19 and
4:06
spotify19, call it bs19.
4:10
bs19 = pd.merge(billboard19,
4:14
spotify19, how='left',
4:23
on=['Name', 'Artists', 'BB.Week']).
4:33
Finally, we concatenate bs0 and
bs19 as bs_final.
4:48
bs_final, =
4:58
pd.concat(bs0, bs19).
5:02
bs_final.shape,
5:14
bill_spot_final.shape.
5:21
It's safe to say they're
virtually identical.
5:28
Lastly, there are other methods
available to combine data frames.
5:33
There's another merge() function
with a different signature.
5:38
There's a join() function which works
like merge(), but using the index columns.
5:42
And, there's append(),
which is very similar to concat().
5:47
You may learn more about these
other functions on your own and
5:52
decide to use those instead.
5:56
As my mother always tells me,
six in one hand,
5:58
half a dozen in the other,
it's all the same.
6:01
Actually there's one more thing.
6:05
I've attached a link to
the original data found on Kaggle,
6:07
a huge repository used by
the data science community.
6:11
The author use this data for
a college project.
6:15
I'd like to think they earned an A.
6:18
It includes 20 years of Billboard charts,
and RI certification data,
6:20
meaning gold, platinum,
and diamond records.
6:25
It also includes some Grammy Award
winners and Spotify streaming data.
6:30
So, if you wish to find a relationship
between Billboard number one songs and
6:35
diamond certified records, or if the most
stream songs on Spotify win Grammy Awards,
6:40
this dataset is a decent place to start.
6:45
I hope this was helpful.
6:50
If you have any questions, feel free to
post them in the question section for
6:51
any of the videos in this workshop.
6:55
You can also reach out to other
learners in the community or
6:57
contact the amazing Treehouse staff.
7:00
Until next time, Professor T... OUT!
7:03
You need to sign up for Treehouse in order to download course files.
Sign upYou need to sign up for Treehouse in order to set up Workspace
Sign up