✨ Earn college credits in Cybersecurity, JS, HTML, CSS and Python

🤑 Join the Treehouse affiliate program and earn 25% recurring commission!

Join our free workshop and get a Treehouse discount code! 💼 Register here!

New No-Code Track! 🚀 New videos dropping every week—start learning today!

🌟 Dreaming of a bright future? 🎓 Ask about the Treehouse Scholarship program! 🚀

✨ Earn college credits in Cybersecurity, JS, HTML, CSS and Python

🤑 Join the Treehouse affiliate program and earn 25% recurring commission!

Well done!

You have completed Combining Data for Analysis!

Sign up for Treehouse Back to Library

Preview

Sign up for Treehouse

Video Player

00:00

00:00

00:00

2x 2x
1.75x 1.75x
1.5x 1.5x
1.25x 1.25x
1.1x 1.1x
1x 1x
0.75x 0.75x
0.5x 0.5x

None
English

Use Up/Down Arrow keys to increase or decrease volume.

Conclusion and Final Thoughts

7:06 with Richard Tillies

Welcome! In this video, we'll wrap up our workshop on combining data and I'll share some final thoughts on the dataset used throughout the workshop.

Teacher's Notes
Questions?0
Video Transcript
Downloads
Workspaces

Congratulations! You now have a better understanding on how to combine data using 0:01

Pandas and are ready to make strides in your data analysis projects. 0:05

I do want to address a few things that may impact how you approach future projects. 0:10

First, our data is case sensitive. 0:15

When I first reviewed this data set, Ariana's song names were in the proper 0:18

case on the Billboard chart, but in all lowercase on the Spotify chart. 0:22

If we were to try to merge the data with this mismatch, 0:26

we would have been unsuccessful. 0:29

In other words, No Tears Left to Cry in proper case is different from 0:32

no tears left to cry in all lowercase. 0:36

I corrected this issue by converting all of her song titles in both datasets to 0:40

proper case, but there's still much work to be done. 0:44

For example, let's take a look at songs by Kendrick Lamar in my bill_spot dataset. 0:49

Bill_spot [bill_spot] Artist, 0:57

== Kendrick Lamar, 1:09

.head parentheses. 1:17

[SOUND] Love was a very popular song in 2018, but 1:22

we don't seem to have any stream data for it. 1:25

Let's look at the Spotify dataset. 1:28

Spotify[spotify[ Artists, 1:35

== Kendrick Lamar. 1:41

.head(). 1:51

Do you see the problem? 1:57

That's right. 1:59

Love and LOVE in all caps are not the same thing, right? 2:00

This data would benefit from some additional cleaning and preparation. 2:07

For example, making sure the case of the song names match between Billboard and 2:11

Spotify datasets. 2:16

Making sure the recording artists match, making the song names match. 2:17

The good news is that the vast majority of our data is just fine. 2:23

So we'll get decent results with what we have. 2:27

So, this is a battle we can fight another time. 2:30

Next, let's look at our four initial datasets. 2:34

Billboard, spotify, billboard19 and spotify19. 2:41

We concatenated the two Billboard sets, then concatenated the two Spotify sets. 2:48

And finally, 2:54

merge the combined Billboard data frame with the combined Spotify data frame. 2:55

We could have completed this project in a different order. 3:01

Remember that we first merged the original Billboard data with the original 3:04

Spotify data. 3:07

We could have merged the 2019 Billboard data set with the 2019 Spotify data set. 3:09

From there, you can concatenate the combined 2017, 3:15

2018 data with the 2019 data, let me show you how. 3:19

I'll recreate my original merge. 3:27

I'll call it bs0 for short. 3:29

Bs0 3:33

= pd.merge(billboard, 3:34

spotify, how= 3:43

'left', on=['Name', 3:48

'Artists', 'BB.Week']). 3:54

Then merge billboard19 and 4:06

spotify19, call it bs19. 4:10

bs19 = pd.merge(billboard19, 4:14

spotify19, how='left', 4:23

on=['Name', 'Artists', 'BB.Week']). 4:33

Finally, we concatenate bs0 and bs19 as bs_final. 4:48

bs_final, = 4:58

pd.concat(bs0, bs19). 5:02

bs_final.shape, 5:14

bill_spot_final.shape. 5:21

It's safe to say they're virtually identical. 5:28

Lastly, there are other methods available to combine data frames. 5:33

There's another merge() function with a different signature. 5:38

There's a join() function which works like merge(), but using the index columns. 5:42

And, there's append(), which is very similar to concat(). 5:47

You may learn more about these other functions on your own and 5:52

decide to use those instead. 5:56

As my mother always tells me, six in one hand, 5:58

half a dozen in the other, it's all the same. 6:01

Actually there's one more thing. 6:05

I've attached a link to the original data found on Kaggle, 6:07

a huge repository used by the data science community. 6:11

The author use this data for a college project. 6:15

I'd like to think they earned an A. 6:18

It includes 20 years of Billboard charts, and RI certification data, 6:20

meaning gold, platinum, and diamond records. 6:25

It also includes some Grammy Award winners and Spotify streaming data. 6:30

So, if you wish to find a relationship between Billboard number one songs and 6:35

diamond certified records, or if the most stream songs on Spotify win Grammy Awards, 6:40

this dataset is a decent place to start. 6:45

I hope this was helpful. 6:50

If you have any questions, feel free to post them in the question section for 6:51

any of the videos in this workshop. 6:55

You can also reach out to other learners in the community or 6:57

contact the amazing Treehouse staff. 7:00

Until next time, Professor T... OUT! 7:03