1 00:00:00,610 --> 00:00:01,570 Welcome. 2 00:00:01,570 --> 00:00:04,472 In this video, we will set up our data frames in Pandas, 3 00:00:04,472 --> 00:00:08,290 then merge the data from two data frames into one combined data frame. 4 00:00:10,010 --> 00:00:13,523 Download the two CSV files from the teacher's notes to make sure you remember 5 00:00:13,523 --> 00:00:14,180 where you save them. 6 00:00:15,380 --> 00:00:17,970 Anaconda navigator is open on my screen. 7 00:00:17,970 --> 00:00:19,300 We'll run JupyterLab. 8 00:00:19,300 --> 00:00:21,414 So I click the Launch button in the JupyterLab cell. 9 00:00:26,380 --> 00:00:28,580 It'll launch a new tab in my browser window. 10 00:00:29,910 --> 00:00:32,016 This is my file tree on the left, 11 00:00:32,016 --> 00:00:35,387 I'll navigate to where I downloaded my CSV files. 12 00:00:35,387 --> 00:00:39,092 I created a folder specifically for this workshop, 13 00:00:39,092 --> 00:00:41,964 Documents > anaconda > combine-data. 14 00:00:44,604 --> 00:00:47,900 CSV stands for comma separated values. 15 00:00:47,900 --> 00:00:50,010 A CSV file is a text file. 16 00:00:50,010 --> 00:00:54,394 Each line of text is a row, and the commas separate each line into columns. 17 00:00:57,616 --> 00:00:59,967 The first line of text contains the headings and 18 00:00:59,967 --> 00:01:02,040 each additional line is a record of data. 19 00:01:03,740 --> 00:01:08,001 Billboard_100_2017-2018 shows 20 00:01:08,001 --> 00:01:13,176 the Billboard 100 chart data from 2017 to 2018. 21 00:01:13,176 --> 00:01:17,942 JupyterLab displays the file as a chart in rows and columns. 22 00:01:17,942 --> 00:01:27,670 Spotify_200_2017-2018 contains the Spotify 200 data from those years. 23 00:01:31,953 --> 00:01:34,810 Back on my Launcher tab, I want to create a new notebook. 24 00:01:34,810 --> 00:01:36,700 So I'll click Python3 under Notebook. 25 00:01:38,320 --> 00:01:41,919 I can rename the notebook now, right-click or 26 00:01:41,919 --> 00:01:45,719 Ctrl-click the notebook name, choose Rename, 27 00:01:49,065 --> 00:01:53,202 Then enter billboard_spotify. 28 00:01:56,150 --> 00:01:59,847 Let's first import the Pandas package and check the version. 29 00:02:01,603 --> 00:02:04,185 import pandas as pd. 30 00:02:09,033 --> 00:02:12,865 pd.__version__. 31 00:02:15,030 --> 00:02:16,780 I'll click the Run button here in the toolbar. 32 00:02:18,210 --> 00:02:21,742 The keyboard shortcut to run the code in the cell is Shift+Enter, 33 00:02:21,742 --> 00:02:23,455 I'll use them moving forward. 34 00:02:23,455 --> 00:02:26,413 Pressing Enter just adds a new line within this cell, 35 00:02:26,413 --> 00:02:29,270 like I did here with input and version. 36 00:02:29,270 --> 00:02:32,830 But this shows that Pandas is installed, your version may be higher. 37 00:02:34,470 --> 00:02:38,168 Then let's load the data from each CSV file into a new data frame. 38 00:02:41,902 --> 00:02:46,700 Billboard = pd.read_csv 39 00:02:48,904 --> 00:02:58,904 ("Billboard_100_2017- 40 00:02:59,976 --> 00:03:05,680 2018.csv", 41 00:03:05,680 --> 00:03:13,063 index_col="ID"). 42 00:03:13,063 --> 00:03:15,512 spotify = 43 00:03:15,512 --> 00:03:25,512 pd.read_csv("Spotify_200_2017-2018.csv, 44 00:03:31,572 --> 00:03:37,562 index_col="ID"). 45 00:03:41,002 --> 00:03:44,996 The index_col tells Pandas that the ID field is the index or 46 00:03:44,996 --> 00:03:48,370 unique identifier for each record. 47 00:03:48,370 --> 00:03:51,500 In other words, no two rows in a data frame should have the same ID number. 48 00:03:52,510 --> 00:03:55,750 We identify this field so that Pandas doesn't add its own index column. 49 00:03:57,160 --> 00:04:01,250 Let's use a couple of Pandas tools to get a sense of what we just imported. 50 00:04:01,250 --> 00:04:04,893 First .shape will show us the dimensions of each data frame. 51 00:04:08,307 --> 00:04:16,580 Billboard.shape, spotify.shape. 52 00:04:16,580 --> 00:04:20,649 This shows us there are over 10,000 records in the Billboard data set, 53 00:04:20,649 --> 00:04:22,660 with 7 columns other than the index. 54 00:04:23,770 --> 00:04:28,720 There are more than 20,000 rows and 5 non-index columns in the Spotify data set. 55 00:04:30,040 --> 00:04:34,258 Let's get a peek at the first few rows of each data frame using the head method. 56 00:04:37,018 --> 00:04:42,373 Billboard.head(). 57 00:04:42,373 --> 00:04:47,210 This shows us the column headings and the first five rows of data. 58 00:04:47,210 --> 00:04:53,106 By the way, if I wanted to see eight rows, for example, I would say head(8), 59 00:04:53,106 --> 00:04:59,797 like this spotify.head(8). 60 00:05:04,254 --> 00:05:07,650 But five is the default if I don't specify an argument. 61 00:05:07,650 --> 00:05:10,380 And this is fine because I really only care about the headings right now. 62 00:05:11,840 --> 00:05:15,590 Both data frames have a name field, which is the name of the song. 63 00:05:15,590 --> 00:05:19,130 Both have an artist column, which lists the recording artist of the song. 64 00:05:20,170 --> 00:05:24,357 And both have a BB.Week field which is the release date of the weekly Billboard 65 00:05:24,357 --> 00:05:25,190 100 chart. 66 00:05:26,570 --> 00:05:30,625 So for our purposes we will use the combination of name, artists, and 67 00:05:30,625 --> 00:05:32,700 BB.Week to merge our data frames. 68 00:05:34,680 --> 00:05:36,900 Actually, we have a lot of data here. 69 00:05:36,900 --> 00:05:41,121 Let's first work with a smaller sample of the data just to make sure we're doing 70 00:05:41,121 --> 00:05:41,691 it right. 71 00:05:41,691 --> 00:05:48,341 If you remember 2017, 2018, there were a few people who dominated American music, 72 00:05:48,341 --> 00:05:53,090 Drake, Ariana Grande, Imagine Dragons, few others. 73 00:05:53,090 --> 00:05:56,060 They all had a lot of songs on these charts. 74 00:05:56,060 --> 00:05:59,221 Let's choose Ariana. 75 00:05:59,221 --> 00:06:03,535 I'll make a data frame of just Ariana Grande's songs on the Billboard chart. 76 00:06:07,276 --> 00:06:15,887 Ariana_bill = billboard[billboard["Artists"], 77 00:06:18,686 --> 00:06:25,560 .str.contains("Ariana Grande"). 78 00:06:29,242 --> 00:06:37,325 Do the same with Spotify data, ariana_spot 79 00:06:37,325 --> 00:06:46,473 = spotify[spotify["Artists"].str.contains("- 80 00:06:46,473 --> 00:06:52,645 Ariana Grande")]. 81 00:06:52,645 --> 00:06:56,991 By the way, make sure you capitalize the A in artists, 82 00:06:56,991 --> 00:07:03,600 the leading A in Ariana, and the G in Grande, everything else is lowercase. 83 00:07:03,600 --> 00:07:05,492 Our data is very case sensitive. 84 00:07:07,843 --> 00:07:15,508 Now, let's check the dimensions of each Ariana data frame, ariana_bill.shape, 85 00:07:21,638 --> 00:07:24,660 Ariana_spot.shape. 86 00:07:26,042 --> 00:07:28,030 A few 100 records is much more manageable. 87 00:07:28,030 --> 00:07:31,380 If we find any errors we can check them manually. 88 00:07:32,850 --> 00:07:35,000 Let's talk about the merge function. 89 00:07:35,000 --> 00:07:40,400 It has two required arguments, the left data frame and the right data frame. 90 00:07:40,400 --> 00:07:44,800 By default, it performs an inner join on the index column. 91 00:07:44,800 --> 00:07:49,322 But we don't want that, we want a left join on the name, 92 00:07:49,322 --> 00:07:52,130 artists, and BB.Week columns. 93 00:07:53,130 --> 00:07:54,229 So let's see what that looks like. 94 00:07:58,331 --> 00:08:02,457 Ariana_combined = 95 00:08:02,457 --> 00:08:11,401 pd.merge(ariana_bill, ariana_spot, 96 00:08:13,578 --> 00:08:15,763 how= 'left', 97 00:08:19,281 --> 00:08:28,045 on=['Name'], 'Artist', 98 00:08:30,685 --> 00:08:31,841 BB.Week]). 99 00:08:36,370 --> 00:08:41,199 So we called our new data frame ariana_combined. 100 00:08:41,199 --> 00:08:44,275 ariana_bill is our left data set. 101 00:08:44,275 --> 00:08:48,620 ariana_spot is our right data set. 102 00:08:48,620 --> 00:08:53,283 The how argument is the type of join we wanna perform, which is left. 103 00:08:53,283 --> 00:08:58,200 The on argument is a column or columns on which to perform the join. 104 00:08:58,200 --> 00:09:02,610 Since we have three columns, we need to enter them as a Python list. 105 00:09:02,610 --> 00:09:04,893 Because this is a left join, I expect for 106 00:09:04,893 --> 00:09:09,590 this new data frame to have the same number of rows as the left data frame. 107 00:09:09,590 --> 00:09:13,139 Let's see, arianna_combined. 108 00:09:13,139 --> 00:09:14,291 shape. 109 00:09:17,442 --> 00:09:23,165 Great, and now let's look at the head of this new data set, 110 00:09:23,165 --> 00:09:25,924 ariana_combined.head. 111 00:09:32,265 --> 00:09:36,405 And now I see that my new data set has all the columns from the Billboard data frame, 112 00:09:36,405 --> 00:09:39,350 and a few extra columns taken from the Spotify data frame. 113 00:09:40,520 --> 00:09:42,200 Let's display this data frame again. 114 00:09:42,200 --> 00:09:44,176 But since we only really care about certain columns, 115 00:09:44,176 --> 00:09:48,102 let's specify which columns we wanna view and in which order. 116 00:09:51,250 --> 00:10:00,355 ariana_combined[['Artists', 'Name', 117 00:10:00,355 --> 00:10:07,129 'BB.Week', 'Weekly.Rank', 118 00:10:07,129 --> 00:10:10,741 'Streams'}]. 119 00:10:15,758 --> 00:10:19,060 Now we have a better understanding of this data merge. 120 00:10:19,060 --> 00:10:22,664 Each record shows Ariana Grande as the artist, the name of the song, 121 00:10:22,664 --> 00:10:26,954 the Billboard chart date for that week, the Billboard ranking for that week, and 122 00:10:26,954 --> 00:10:29,340 the number of Spotify streams for that week. 123 00:10:30,650 --> 00:10:35,050 For any missing data, there's a value of NaN or not a number. 124 00:10:35,050 --> 00:10:38,100 This means that there's no matching record between the two data sets. 125 00:10:39,220 --> 00:10:40,584 It's a bit hard to find for 126 00:10:40,584 --> 00:10:45,540 Ariana Grande, most of her songs that charted on Billboard had a lot of streams. 127 00:10:45,540 --> 00:10:48,497 There has to be a few so let's check for 128 00:10:48,497 --> 00:10:53,503 songs in ariana_combined where Streams is NaN, not a number. 129 00:10:53,503 --> 00:10:57,798 We'll use the isnull() function. 130 00:10:57,798 --> 00:11:09,165 ariana_combined[ariana_combined 131 00:11:09,165 --> 00:11:16,158 ['Streams'].isnull() 132 00:11:19,752 --> 00:11:22,208 For example, the song Everyday, 133 00:11:22,208 --> 00:11:26,876 it was number 85 on the Billboard chart on March 4, 2017. 134 00:11:26,876 --> 00:11:30,841 But we don't know how many Spotify streams because it was not in the top 200 135 00:11:30,841 --> 00:11:32,500 streaming songs for that week. 136 00:11:34,510 --> 00:11:37,526 Okay, I've added some helpful resources in the teacher's notes, 137 00:11:37,526 --> 00:11:40,910 including some other optional arguments available for the merge function. 138 00:11:41,970 --> 00:11:44,210 Now here's my challenge to you. 139 00:11:44,210 --> 00:11:46,841 We merged a portion of our data frames together, 140 00:11:46,841 --> 00:11:49,610 specifically the Ariana Grande songs in our list. 141 00:11:51,240 --> 00:11:57,163 I would like you to use the same method demonstrated in this video to combine 142 00:11:57,163 --> 00:12:03,590 the full Billboard and Spotify data sets, call your new data frame bill_spot. 143 00:12:03,590 --> 00:12:06,214 In the next video I'll show you my solution, and 144 00:12:06,214 --> 00:12:10,980 then we'll learn how to concatenate two data sets within the Pandas tool. 145 00:12:10,980 --> 00:12:11,480 See you there.