1 00:00:00,000 --> 00:00:04,716 [MUSIC] 2 00:00:04,716 --> 00:00:07,187 In the data science and business software worlds, 3 00:00:07,187 --> 00:00:09,990 it's hard to escape the dreaded spreadsheet. 4 00:00:09,990 --> 00:00:12,350 Well, maybe you don't dread them as much as I do. 5 00:00:12,350 --> 00:00:15,460 I just never learned to use Office software appropriately. 6 00:00:15,460 --> 00:00:16,590 Okay, enough about me. 7 00:00:16,590 --> 00:00:19,430 Anyway, in those two worlds and many others, 8 00:00:19,430 --> 00:00:24,750 a lot of your work as a coder will be using CSV or comma separated value files. 9 00:00:24,750 --> 00:00:29,040 These files usually have a header row and then multiple rows below them with each 10 00:00:29,040 --> 00:00:32,040 columns of value separated from the previous column by a comma. 11 00:00:33,090 --> 00:00:35,890 Yeah, it's a spreadsheet without all of the Excel excellence. 12 00:00:37,040 --> 00:00:40,790 Python though has a pretty robust CSV module that lets us read and 13 00:00:40,790 --> 00:00:43,580 write tabular data in this popular format. 14 00:00:43,580 --> 00:00:46,760 Now if you've done the data science basics course you've already seen how to 15 00:00:46,760 --> 00:00:51,720 use CSVs, but it's worth focusing on this for all of you that have to use CSVs but 16 00:00:51,720 --> 00:00:55,350 don't need all of the information in that course. 17 00:00:55,350 --> 00:01:00,160 I have a CSV here from the Canada Science and Technology Museums Corporation. 18 00:01:00,160 --> 00:01:04,000 And has about a hundred of their artifacts in it but I 19 00:01:04,000 --> 00:01:09,120 can't display CSVs in work spaces and even if I could I don't really wanna read it. 20 00:01:09,120 --> 00:01:12,020 I wanna use it, so let's make a file for this. 21 00:01:12,020 --> 00:01:17,280 I'm gonna say new file, artifacts.py, 22 00:01:17,280 --> 00:01:21,200 Python's CSV module is where we wanna start, at least most of the time. 23 00:01:21,200 --> 00:01:24,590 I'll put a link or two in the teacher's notes for some other CSV libraries. 24 00:01:24,590 --> 00:01:27,500 The biggest problem that Python's built-in one has 25 00:01:27,500 --> 00:01:31,360 is that it's slightly lacking when it comes to Unicode support. 26 00:01:31,360 --> 00:01:33,360 But let's just see what we can do with it. 27 00:01:33,360 --> 00:01:34,480 So we're gonna import csv. 28 00:01:35,910 --> 00:01:38,830 And we're gonna, obviously the first thing I have to do is import the library. 29 00:01:38,830 --> 00:01:41,840 And then I wanna see about reading from our file. 30 00:01:41,840 --> 00:01:45,010 We need to open the file a little bit differently than we normally would do. 31 00:01:45,010 --> 00:01:48,190 Normally we would just get with you know a flag or something. 32 00:01:48,190 --> 00:01:51,870 And in this case we wanna give it a new line argument, so let's just say, 33 00:01:51,870 --> 00:01:57,120 with open museum.csv, 34 00:01:57,120 --> 00:02:00,280 and then when I say newline is equal to and just an empty thing here. 35 00:02:00,280 --> 00:02:04,650 The reason I want to do that is because if we have a new line inside 36 00:02:04,650 --> 00:02:08,570 of a quoted section then we want to be sure and 37 00:02:08,570 --> 00:02:13,300 be able to differentiate that from an actual new line marking a new line. 38 00:02:13,300 --> 00:02:15,040 So yeah. 39 00:02:15,040 --> 00:02:17,320 And we're just gonna call this csvfile. 40 00:02:17,320 --> 00:02:18,820 Oops, I forgot my colon. 41 00:02:18,820 --> 00:02:21,510 And so now we need to make our CSV reader. 42 00:02:21,510 --> 00:02:25,680 And we instantiate this reader so that we can tell it how to read from the file. 43 00:02:25,680 --> 00:02:32,030 So I'm gonna call this, I don't know, artreader, and it will be csv.reader. 44 00:02:32,030 --> 00:02:34,090 And it's gonna be a csv file. 45 00:02:34,090 --> 00:02:35,650 And the delimiter, 46 00:02:35,650 --> 00:02:40,980 I know this from looking at my, I think that's how you spell it, 47 00:02:40,980 --> 00:02:45,560 they actually used a pipe character as the delimiter instead of a comma. 48 00:02:45,560 --> 00:02:48,640 So really it's a pipe separated value file. 49 00:02:48,640 --> 00:02:51,270 Whatever, the idea is that something separates each thing. 50 00:02:51,270 --> 00:02:55,300 You kind of have to look at your csv files to see what they're using as a delimiter. 51 00:02:55,300 --> 00:02:59,490 All right, so our artreader object here is a generator. 52 00:02:59,490 --> 00:03:00,560 That means we can't index it. 53 00:03:00,560 --> 00:03:05,520 We can't do like artreader two but I only wanna look at a few of the records. 54 00:03:05,520 --> 00:03:07,810 So I'm gonna go ahead and turn this thing into a list. 55 00:03:07,810 --> 00:03:12,890 So I'm gonna say rows = list (artreader). 56 00:03:12,890 --> 00:03:14,830 And now I can loop over it. 57 00:03:14,830 --> 00:03:22,030 So for row in rows up to two, let's print a comma and a space. 58 00:03:22,030 --> 00:03:24,150 And we'll join whatever is in that row. 59 00:03:25,570 --> 00:03:27,530 So okay, cool. 60 00:03:27,530 --> 00:03:32,344 And now, let's go ahead and python artifacts.py. 61 00:03:33,450 --> 00:03:34,510 So, cool. 62 00:03:34,510 --> 00:03:37,220 There is the stuff that's in our file. 63 00:03:37,220 --> 00:03:40,550 So you can see our first row that we printed out there, 64 00:03:40,550 --> 00:03:41,570 those are our field names. 65 00:03:41,570 --> 00:03:45,560 So, artifact number, object name, category three, and so on. 66 00:03:45,560 --> 00:03:50,200 And then we get this thing here, that here's the number, it's a cover, 67 00:03:50,200 --> 00:03:51,530 it's paper. 68 00:03:51,530 --> 00:03:53,090 And so on. 69 00:03:53,090 --> 00:03:53,940 So that's cool. 70 00:03:53,940 --> 00:03:58,310 The problem with this is, I have to know, like, which column the thing I want is in. 71 00:03:58,310 --> 00:04:02,140 And that's not always the easiest thing, right? 72 00:04:02,140 --> 00:04:06,320 So let's turn these into dictionaries instead of just lists. 73 00:04:07,440 --> 00:04:10,180 And we can do that by changing our reader. 74 00:04:10,180 --> 00:04:12,690 So we have this reader here. 75 00:04:12,690 --> 00:04:14,800 Think of this as like a list reader and 76 00:04:14,800 --> 00:04:18,460 we can actually replace that with what's called DictReader. 77 00:04:18,460 --> 00:04:22,800 And DictReader reads and gives us dictionaries. 78 00:04:22,800 --> 00:04:24,980 So that's pretty cool, all right. 79 00:04:24,980 --> 00:04:26,970 So we'll just take those first two. 80 00:04:26,970 --> 00:04:28,600 Actually, let's skip that first one. 81 00:04:29,800 --> 00:04:31,760 And let's print out row and 82 00:04:31,760 --> 00:04:37,050 then let's print out group1, so whatever they have in their group1 field. 83 00:04:37,050 --> 00:04:40,350 Let's go ahead and run that again. 84 00:04:40,350 --> 00:04:42,610 And our two groups, or 85 00:04:42,610 --> 00:04:46,240 our two rows rather, both have Aviation as their group1. 86 00:04:46,240 --> 00:04:48,080 So that's pretty cool. 87 00:04:48,080 --> 00:04:51,190 We can give a list of field names to the DictReader, 88 00:04:51,190 --> 00:04:53,830 when we specify the DictReader right here. 89 00:04:53,830 --> 00:04:57,870 And it'll use those as the field names in order, but since we didn't do that, 90 00:04:57,870 --> 00:05:03,930 it actually reads in that first row, and uses those as the field names. 91 00:05:03,930 --> 00:05:05,820 So it's kind of smart, which is pretty cool. 92 00:05:05,820 --> 00:05:09,370 Let's change group one over there to manucity. 93 00:05:09,370 --> 00:05:10,960 Which is the manufactured city. 94 00:05:12,170 --> 00:05:15,490 If we run that, then we don't have a manufactured city for 95 00:05:15,490 --> 00:05:17,250 either of those two items. 96 00:05:17,250 --> 00:05:18,940 Alright, cool. 97 00:05:18,940 --> 00:05:21,130 This is for working with a reader. 98 00:05:21,130 --> 00:05:24,254 Just the thing that reads in the data and parses it. 99 00:05:24,254 --> 00:05:28,620 Let's talk about making something that writes out to a CSV. 100 00:05:28,620 --> 00:05:31,099 I'm actually going to do a new file for this. 101 00:05:32,850 --> 00:05:36,990 Just because I kind of want to keep these a little bit separate, right? 102 00:05:36,990 --> 00:05:39,950 So I'm gonna call this teachers.py, and again I'm gonna import CSV. 103 00:05:39,950 --> 00:05:48,220 And then I'm gonna do with open teachers.csv with the a format as csvfile, 104 00:05:48,220 --> 00:05:52,820 and I'm gonna say my field names are first name 105 00:05:55,190 --> 00:05:58,820 and last name and topic. 106 00:05:58,820 --> 00:06:03,370 You can, of course, put whatever fieldnames in here you want. 107 00:06:03,370 --> 00:06:05,710 You can have hundreds of these, that's completely up to you. 108 00:06:05,710 --> 00:06:07,526 I'm going to call this 'Teach Writer'. 109 00:06:08,530 --> 00:06:14,860 And we'll do csv.DictWriter and we're going to write to our csv files. 110 00:06:14,860 --> 00:06:19,700 And our field names are the field names variable. 111 00:06:20,990 --> 00:06:23,170 So then let's do teachwriter. 112 00:06:23,170 --> 00:06:29,070 Writeheader and so that writes out a row that has the header stuff in it. 113 00:06:30,280 --> 00:06:33,996 And then we'll do teachwriter.writerow. 114 00:06:36,274 --> 00:06:41,963 And we'll write first_name is Kenneth. 115 00:06:43,206 --> 00:06:49,463 Last_name is Love and topic is Python. 116 00:06:51,120 --> 00:06:53,390 And I'm gonna rack those up. 117 00:06:55,590 --> 00:06:56,590 All right, cool. 118 00:06:58,080 --> 00:06:59,100 I'm actually pretty good. 119 00:06:59,100 --> 00:07:01,990 If I had a list of dictionaries that already had all my things in it, 120 00:07:01,990 --> 00:07:04,260 then I could pass those in to the right rows method as well. 121 00:07:04,260 --> 00:07:06,530 Or I could call right row multiple times. 122 00:07:06,530 --> 00:07:10,060 So, like let's add in two rows here. 123 00:07:10,060 --> 00:07:16,700 I'm gonna add in Alena Holligan, and she teaches PHP. 124 00:07:16,700 --> 00:07:19,486 All right, so now let's run this. 125 00:07:19,486 --> 00:07:23,886 Python teachers.py. 126 00:07:23,886 --> 00:07:31,940 And then if I was to cat teachers.csv, 127 00:07:31,940 --> 00:07:38,150 then we've got our header row, and then we have the row for me and the row for Alena. 128 00:07:38,150 --> 00:07:39,480 So that's pretty cool, so 129 00:07:39,480 --> 00:07:42,460 reading and writing CSVs is fairly straightforward in Python. 130 00:07:42,460 --> 00:07:46,020 The main thing you have to remember to do is create these field names if you're 131 00:07:46,020 --> 00:07:49,630 writing I recommend using the DictReader and DictWriter. 132 00:07:49,630 --> 00:07:51,590 They're a little bit more useful.