Heads up! To view this whole video, sign in with your Courses account or enroll in your free 7-day trial. Sign In Enroll
Preview
Video Player
00:00
00:00
00:00
- 2x 2x
- 1.75x 1.75x
- 1.5x 1.5x
- 1.25x 1.25x
- 1.1x 1.1x
- 1x 1x
- 0.75x 0.75x
- 0.5x 0.5x
Welcome! In this video, we'll wrap up our workshop on combining data and I'll share some final thoughts on the dataset used throughout the workshop.
Alternative steps to combine data
# Merge Billboard 2017-2018 and Spotify 2017-2018
bs0 = pd.merge(billboard, spotify, how='left', on=['Name', 'Artists', 'BB.Week'])
# Merge Billboard 2019 and Spotify 2019
bs19 = pd.merge(billboard19, spotify19, how='left', on=['Name', 'Artists', 'BB.Week'])
# Concatenate 2017-2018 dataset and 2019 dataset
bs_final = pd.concat([bs0, bs19])
# Confirm dimensions are identical
bs_final.shape
bill_spot_final.shape
Additional Resources
- pandas API - alternate .merge() function
- pandas API - .join() function
- pandas API - .append() function - This has since been depreciated and we should be using concat
- Kaggle Data - Songs from Billboard 1999-2019
Related Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign upRelated Discussions
Have questions about this video? Start a discussion with the community and Treehouse staff.
Sign up
Congratulations! You now have a better
understanding on how to combine data using
0:01
Pandas and are ready to make strides
in your data analysis projects.
0:05
I do want to address a few things that may
impact how you approach future projects.
0:10
First, our data is case sensitive.
0:15
When I first reviewed this data set,
Ariana's song names were in the proper
0:18
case on the Billboard chart, but
in all lowercase on the Spotify chart.
0:22
If we were to try to merge
the data with this mismatch,
0:26
we would have been unsuccessful.
0:29
In other words, No Tears Left to Cry
in proper case is different from
0:32
no tears left to cry in all lowercase.
0:36
I corrected this issue by converting all
of her song titles in both datasets to
0:40
proper case, but
there's still much work to be done.
0:44
For example, let's take a look at songs by
Kendrick Lamar in my bill_spot dataset.
0:49
Bill_spot [bill_spot] Artist,
0:57
== Kendrick Lamar,
1:09
.head parentheses.
1:17
[SOUND] Love was a very
popular song in 2018, but
1:22
we don't seem to have any stream data for
it.
1:25
Let's look at the Spotify dataset.
1:28
Spotify[spotify[ Artists,
1:35
== Kendrick Lamar.
1:41
.head().
1:51
Do you see the problem?
1:57
That's right.
1:59
Love and LOVE in all caps
are not the same thing, right?
2:00
This data would benefit from some
additional cleaning and preparation.
2:07
For example, making sure the case of
the song names match between Billboard and
2:11
Spotify datasets.
2:16
Making sure the recording artists match,
making the song names match.
2:17
The good news is that the vast
majority of our data is just fine.
2:23
So we'll get decent
results with what we have.
2:27
So, this is a battle we
can fight another time.
2:30
Next, let's look at our
four initial datasets.
2:34
Billboard, spotify,
billboard19 and spotify19.
2:41
We concatenated the two Billboard sets,
then concatenated the two Spotify sets.
2:48
And finally,
2:54
merge the combined Billboard data frame
with the combined Spotify data frame.
2:55
We could have completed this
project in a different order.
3:01
Remember that we first merged the original
Billboard data with the original
3:04
Spotify data.
3:07
We could have merged the 2019 Billboard
data set with the 2019 Spotify data set.
3:09
From there,
you can concatenate the combined 2017,
3:15
2018 data with the 2019 data,
let me show you how.
3:19
I'll recreate my original merge.
3:27
I'll call it bs0 for short.
3:29
Bs0
3:33
= pd.merge(billboard,
3:34
spotify, how=
3:43
'left', on=['Name',
3:48
'Artists', 'BB.Week']).
3:54
Then merge billboard19 and
4:06
spotify19, call it bs19.
4:10
bs19 = pd.merge(billboard19,
4:14
spotify19, how='left',
4:23
on=['Name', 'Artists', 'BB.Week']).
4:33
Finally, we concatenate bs0 and
bs19 as bs_final.
4:48
bs_final, =
4:58
pd.concat(bs0, bs19).
5:02
bs_final.shape,
5:14
bill_spot_final.shape.
5:21
It's safe to say they're
virtually identical.
5:28
Lastly, there are other methods
available to combine data frames.
5:33
There's another merge() function
with a different signature.
5:38
There's a join() function which works
like merge(), but using the index columns.
5:42
And, there's append(),
which is very similar to concat().
5:47
You may learn more about these
other functions on your own and
5:52
decide to use those instead.
5:56
As my mother always tells me,
six in one hand,
5:58
half a dozen in the other,
it's all the same.
6:01
Actually there's one more thing.
6:05
I've attached a link to
the original data found on Kaggle,
6:07
a huge repository used by
the data science community.
6:11
The author use this data for
a college project.
6:15
I'd like to think they earned an A.
6:18
It includes 20 years of Billboard charts,
and RI certification data,
6:20
meaning gold, platinum,
and diamond records.
6:25
It also includes some Grammy Award
winners and Spotify streaming data.
6:30
So, if you wish to find a relationship
between Billboard number one songs and
6:35
diamond certified records, or if the most
stream songs on Spotify win Grammy Awards,
6:40
this dataset is a decent place to start.
6:45
I hope this was helpful.
6:50
If you have any questions, feel free to
post them in the question section for
6:51
any of the videos in this workshop.
6:55
You can also reach out to other
learners in the community or
6:57
contact the amazing Treehouse staff.
7:00
Until next time, Professor T... OUT!
7:03
You need to sign up for Treehouse in order to download course files.
Sign upYou need to sign up for Treehouse in order to set up Workspace
Sign up