1 00:00:00,368 --> 00:00:04,574 [MUSIC] 2 00:00:04,574 --> 00:00:08,310 Regular expressions let us match patterns against text. 3 00:00:08,310 --> 00:00:10,220 For example, if I had a pattern that said, 4 00:00:10,220 --> 00:00:13,790 I want to find the first time the word ghost is used in Charles Dickens's 5 00:00:13,790 --> 00:00:17,904 A Christmas Carol, I could do re.search(r'ghost', christmas_carol). 6 00:00:19,030 --> 00:00:22,550 But knowing that I can do it doesn't help you, does it? 7 00:00:22,550 --> 00:00:23,080 Before we get into 8 00:00:23,080 --> 00:00:26,050 workspaces, though, let's talk about the problem we're going to solve. 9 00:00:26,050 --> 00:00:30,160 I have a text file full of names, phone numbers, email addresses, etc. 10 00:00:30,160 --> 00:00:32,390 The problem is that it's kind of garbled and 11 00:00:32,390 --> 00:00:35,110 some of the people don't have all of their information. 12 00:00:35,110 --> 00:00:38,070 I'd like to get this sorted out, to where I can turn them into classes and 13 00:00:38,070 --> 00:00:40,880 make a nice interface for looking at my contacts. 14 00:00:40,880 --> 00:00:42,040 Regular expressions are made for 15 00:00:42,040 --> 00:00:44,740 processing text, so this should be really doable. 16 00:00:44,740 --> 00:00:45,240 Let's get to it. 17 00:00:46,280 --> 00:00:47,980 Well, I guess, first things first. 18 00:00:47,980 --> 00:00:50,520 We need to read the file. 19 00:00:50,520 --> 00:00:53,450 The file we want to read is this names.txt one. 20 00:00:53,450 --> 00:00:57,600 And let's go down here to our Python shell and see about doing that. 21 00:00:59,500 --> 00:01:04,830 So, we can read in a file using a really handy function that 22 00:01:04,830 --> 00:01:06,650 Python gives us called open. 23 00:01:07,730 --> 00:01:12,490 And we give it a name of the file that we wanna open, 24 00:01:12,490 --> 00:01:15,950 just in case because this file does have some UTF-8 characters.
25 00:01:15,950 --> 00:01:19,800 Let's actually give it an encoding of utf-8. 26 00:01:19,800 --> 00:01:23,800 That way it knows that it is utf-8. 27 00:01:23,800 --> 00:01:26,720 So we do that and we get the file open. 28 00:01:26,720 --> 00:01:28,070 I mean that's great, but you know, 29 00:01:28,070 --> 00:01:30,510 actually I don't wanna do this in the shell. 30 00:01:30,510 --> 00:01:32,710 I wanna do this in an actual script. 31 00:01:32,710 --> 00:01:34,650 So we've used open. 32 00:01:34,650 --> 00:01:39,530 Let's get out of here and let's do this as an actual script. 33 00:01:39,530 --> 00:01:41,920 So let's go up here and we'll make a new file. 34 00:01:41,920 --> 00:01:47,360 Let's call this address_book.py, cuz we're making an address book. 35 00:01:48,620 --> 00:01:49,270 Okay. 36 00:01:49,270 --> 00:01:53,000 So inside here let's go ahead and do just what we did before. 37 00:01:54,050 --> 00:02:00,180 So names_file, we're gonna open up names.txt, 38 00:02:00,180 --> 00:02:04,189 and we're gonna say that the encoding is utf-8 just in case. 39 00:02:05,260 --> 00:02:06,185 And so what we're doing here, 40 00:02:06,185 --> 00:02:10,105 names_file isn't the file, like, the contents of the file. 41 00:02:10,105 --> 00:02:13,770 names_file is a pointer to the file on the file system, 42 00:02:13,770 --> 00:02:15,400 which we can then do things to. 43 00:02:15,400 --> 00:02:18,010 We can read from it or close it or whatever. 44 00:02:18,010 --> 00:02:19,600 And in fact, we're gonna do that. 45 00:02:19,600 --> 00:02:20,660 We're gonna read from it. 46 00:02:20,660 --> 00:02:24,380 So let's do data = names_file.read(). 47 00:02:24,380 --> 00:02:29,460 So that puts all the contents of names_file into data. 48 00:02:30,530 --> 00:02:35,380 Now I know that names.txt isn't really that big of a file, it's fairly small.
49 00:02:35,380 --> 00:02:37,780 But if I knew it was a really big file or 50 00:02:37,780 --> 00:02:43,070 I didn't know how big it was, there's a slightly better way of handling this. 51 00:02:43,070 --> 00:02:45,030 And I'm gonna post that in the teacher's notes. 52 00:02:45,030 --> 00:02:51,840 I wanna do a more standard, less magical and fancy version here in the course. 53 00:02:51,840 --> 00:02:54,130 So, now we have the file opened and 54 00:02:54,130 --> 00:02:56,350 we've read the file, we have all the contents of it. 55 00:02:56,350 --> 00:02:58,680 We don't need that file anymore. 56 00:02:58,680 --> 00:03:00,940 So what we're gonna do now is we want to close it so 57 00:03:00,940 --> 00:03:05,650 that we're no longer pointing to it and the open file's resources are freed. 58 00:03:06,770 --> 00:03:08,580 So that's it. 59 00:03:08,580 --> 00:03:12,210 That's opening the file, reading the file, and then closing the file. 60 00:03:12,210 --> 00:03:16,450 I love when all of these actions are just really simple. 61 00:03:16,450 --> 00:03:23,360 So let's print out what we have in data, just to make sure that it's what we want. 62 00:03:23,360 --> 00:03:24,590 All right. 63 00:03:24,590 --> 00:03:28,480 So let's come down here to our console and run this. 64 00:03:29,640 --> 00:03:31,879 python address_book.py. 65 00:03:31,879 --> 00:03:34,770 And there we go. We've got all of our names. 66 00:03:34,770 --> 00:03:35,990 So that's great. 67 00:03:35,990 --> 00:03:37,250 That's everything. 68 00:03:37,250 --> 00:03:41,280 All right, so now we wanna start on our regex stuff. 69 00:03:41,280 --> 00:03:43,180 I'm gonna get rid of that print. 70 00:03:43,180 --> 00:03:46,650 And so, for everything that we're gonna do with regex, 71 00:03:46,650 --> 00:03:52,040 we need the regex library, which is called re for regular expressions. 72 00:03:52,040 --> 00:03:55,630 So, we're gonna do a match.
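The open, read, and close steps described above can be sketched roughly as follows. This is a minimal sketch, not the exact script from the video: the contents of names.txt are not shown in the transcript, so sample_text and the temporary directory are hypothetical stand-ins that let the example run on its own.

```python
import os
import tempfile

# Hypothetical stand-in for names.txt; the real file's contents are assumed.
sample_text = "Love, Kenneth\nDoe, Jane\n"

# Write the sample into a temporary directory so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
sample_file = open(path, "w", encoding="utf-8")
sample_file.write(sample_text)
sample_file.close()

# Open the file. names_file is a file object, not the file's contents.
names_file = open(path, encoding="utf-8")

# Read the whole file into a single string.
data = names_file.read()

# Close the file; the string in data remains available afterwards.
names_file.close()

print(data)
```

In the real script, the path would simply be "names.txt" and the block that writes the sample file would not be needed.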
73 00:03:55,630 --> 00:04:01,470 Let's try and match. If we look back at names.txt, the first name here is my name. 74 00:04:01,470 --> 00:04:04,450 So let's see if we can match part of my name. 75 00:04:04,450 --> 00:04:10,660 So let's do print re.match and Love. 76 00:04:10,660 --> 00:04:12,620 And we're gonna match that against data. 77 00:04:13,750 --> 00:04:15,190 And then, you know what, let's do another one here. 78 00:04:15,190 --> 00:04:22,220 Let's do re.match with Kenneth, and that's also against data. 79 00:04:22,220 --> 00:04:27,320 Now, why do our strings in here, in our matches, have an r in front of them? 80 00:04:27,320 --> 00:04:31,860 Well, that tells Python that it's a raw string as opposed to 81 00:04:31,860 --> 00:04:35,960 a regular string, a well done string. 82 00:04:35,960 --> 00:04:42,140 This saves us some very confusing repeated uses of the backslash character or 83 00:04:42,140 --> 00:04:43,930 the escape character. 84 00:04:43,930 --> 00:04:47,910 We'll talk about those in a later video if it comes up. 85 00:04:47,910 --> 00:04:52,050 After the r, we're just looking for the two major parts of my name. 86 00:04:52,050 --> 00:04:53,460 Let's run this and see what we get. 87 00:04:55,790 --> 00:05:00,180 So we come down here, python address_book.py, and 88 00:05:00,180 --> 00:05:03,670 we get back a match object for the first one. 89 00:05:03,670 --> 00:05:07,360 See, we've got a match object, and it matched Love. 90 00:05:07,360 --> 00:05:08,140 Cool. 91 00:05:08,140 --> 00:05:10,190 But we got None for the second one. 92 00:05:10,190 --> 00:05:12,189 Why did we get None? 93 00:05:12,189 --> 00:05:20,040 Well, if you remember, I said that match matches from the beginning of the string. 94 00:05:20,040 --> 00:05:25,140 Our string starts with Love, but it doesn't start with Kenneth. 95 00:05:25,140 --> 00:05:27,660 Kenneth comes after Love, comma, space. 96 00:05:27,660 --> 00:05:30,710 So the second match doesn't ever happen.
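The behaviour just described, where re.match only succeeds at the start of the string, can be seen in a short sketch. The data string here is a hypothetical stand-in; the transcript only tells us that the real data begins with "Love, Kenneth".

```python
import re

# Hypothetical stand-in for the contents read from names.txt.
data = "Love, Kenneth\nDoe, Jane\n"

# re.match only looks at the very beginning of the string.
last_match = re.match(r'Love', data)      # data starts with "Love", so this matches
first_match = re.match(r'Kenneth', data)  # "Kenneth" is not at the start, so this is None

print(last_match)
print(first_match)
```

The first print shows a match object and the second shows None, which is the result discussed above.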
97 00:05:31,790 --> 00:05:35,850 But if we come back and we change this match to search, 98 00:05:38,690 --> 00:05:40,370 then now we should be able to do it. 99 00:05:40,370 --> 00:05:44,710 And look, we've got two match objects now. 100 00:05:44,710 --> 00:05:47,450 We've got one matching Love and one matching Kenneth. 101 00:05:47,450 --> 00:05:50,350 So if you're matching at the beginning of a string, use match. 102 00:05:50,350 --> 00:05:54,680 If you're matching somewhere in the string, use search. 103 00:05:54,680 --> 00:05:57,010 Now you may have already guessed this, but 104 00:05:57,010 --> 00:06:00,940 if you didn't, we can do these as variables. 105 00:06:00,940 --> 00:06:02,020 They're just strings. 106 00:06:02,020 --> 00:06:10,502 So we could say last_name equals Love, first_name equals Kenneth. 107 00:06:10,502 --> 00:06:17,490 And then here, do last_name and first_name. 108 00:06:17,490 --> 00:06:24,310 And this would still work the same way and we still get our two matches. 109 00:06:24,310 --> 00:06:27,630 Sometimes that's a whole lot easier than writing them inside 110 00:06:27,630 --> 00:06:31,380 the re.match or re.search calls. 111 00:06:31,380 --> 00:06:32,670 Toward the end of this course, 112 00:06:32,670 --> 00:06:36,270 we're going to look at another way of making patterns as variables. 113 00:06:36,270 --> 00:06:40,570 For now though, I'm just gonna stick with doing it all in one step, 114 00:06:40,570 --> 00:06:41,740 as we first did. 115 00:06:41,740 --> 00:06:43,420 Feel free to make the variables if you want, though. 116 00:06:44,600 --> 00:06:49,900 All right, we can open, close, and read files and match very exact patterns. 117 00:06:49,900 --> 00:06:51,160 It might not seem like much, but 118 00:06:51,160 --> 00:06:55,240 these are definitely the first steps on a long path to regular expression magic.
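Switching from match to search, with the patterns held in variables, might look something like the sketch below. As before, data is a hypothetical stand-in for the file contents; last_name and first_name follow the variable names mentioned above.

```python
import re

# Hypothetical stand-in for the contents read from names.txt.
data = "Love, Kenneth\nDoe, Jane\n"

# Patterns are just strings, so they can live in variables.
last_name = r'Love'
first_name = r'Kenneth'

# re.search scans the whole string and returns the first place the pattern matches.
last_match = re.search(last_name, data)
first_match = re.search(first_name, data)

# start() reports where in the string each match begins.
print(last_match.group(), last_match.start())
print(first_match.group(), first_match.start())
```

Both calls return match objects here, because search is not limited to the beginning of the string.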
119 00:06:55,240 --> 00:06:58,460 In our next video, we're going to look at a more common search method and 120 00:06:58,460 --> 00:06:59,880 some more forgiving pattern options.