1 00:00:00,280 --> 00:00:01,570 Let's see how far we've come. 2 00:00:01,570 --> 00:00:05,470 We have exact matches, loose matches with escape sequences, counts and 3 00:00:05,470 --> 00:00:07,500 loose counts, and position. 4 00:00:07,500 --> 00:00:09,570 We can do a huge amount already. 5 00:00:09,570 --> 00:00:11,250 That's awesome. 6 00:00:11,250 --> 00:00:14,450 But what if I know exactly the characters that I want to match? 7 00:00:14,450 --> 00:00:17,430 Or I need to make sure a certain character isn't there? 8 00:00:17,430 --> 00:00:20,690 Python's regular expressions engine has a concept known as sets, and 9 00:00:20,690 --> 00:00:22,319 these help us achieve exactly that. 10 00:00:23,360 --> 00:00:26,120 We define a set of characters with square brackets. 11 00:00:26,120 --> 00:00:28,030 Any character in the brackets will be looked for and 12 00:00:28,030 --> 00:00:29,410 you can leave out duplicates. 13 00:00:29,410 --> 00:00:32,320 So if we want to find the word apple, we'd have a set with aple. 14 00:00:33,740 --> 00:00:35,720 We can also define ranges in our sets. 15 00:00:35,720 --> 00:00:39,907 If we want all of the lowercase letters, we can define a set like a-z. 16 00:00:39,907 --> 00:00:42,839 This is available for uppercase letters too, as well as numbers. 17 00:00:44,090 --> 00:00:46,490 And lastly, if we start a set off with a caret, 18 00:00:46,490 --> 00:00:49,050 it says to not match those characters. 19 00:00:49,050 --> 00:00:52,520 So if we want to make sure our pattern doesn't have a two, we can say caret two. 20 00:00:54,080 --> 00:00:55,940 We also need to look at two new flags. 21 00:00:55,940 --> 00:00:58,890 The ignore case flag, which lets us match against both upper and 22 00:00:58,890 --> 00:01:00,630 lower case letters at once. 23 00:01:00,630 --> 00:01:04,540 And the verbose flag, which lets us write our patterns out in a more natural way. 24 00:01:04,540 --> 00:01:07,310 All right, lots to do, so let's get back to it. 
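The ideas introduced above (sets, ranges, negated sets, and the two flags) can be sketched in a few lines of Python. This is only an illustration: the `sample` string and the variable names are made up for this note and do not come from the video.

```python
import re

# Hypothetical sample text, made up for this sketch.
sample = "Apple pie costs 2 dollars"

# A set matches any single character listed inside the brackets.
# Duplicates are unnecessary: "aple" covers every letter in "apple".
set_matches = re.findall(r"[aple]+", sample)

# Ranges: every run of lowercase letters, and every single digit.
lowercase_runs = re.findall(r"[a-z]+", sample)
digits = re.findall(r"[0-9]", sample)

# A leading caret negates the set: runs of anything except "2".
not_two = re.findall(r"[^2]+", sample)

# re.IGNORECASE (short form re.I) matches upper and lower case at once.
any_case = re.findall(r"[aple]+", sample, re.IGNORECASE)

# re.VERBOSE (short form re.X) ignores whitespace inside the pattern and
# allows comments, so a pattern can be spread over several lines.
verbose_matches = re.findall(r"""
    [a-z]+   # one or more lowercase letters
    \s       # one whitespace character
    [0-9]    # one digit
""", sample, re.VERBOSE)

for name, result in [
    ("set", set_matches),
    ("lowercase", lowercase_runs),
    ("digits", digits),
    ("not two", not_two),
    ("any case", any_case),
    ("verbose", verbose_matches),
]:
    print(name, result)
```

Note that a set matches individual characters, not the word itself, so `[aple]+` also picks up fragments such as the `lla` in "dollars".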
25 00:01:07,310 --> 00:01:10,790 So I think, now, I wanna get the email addresses. 26 00:01:10,790 --> 00:01:13,740 The pattern that we're going to use isn't going to work for 27 00:01:13,740 --> 00:01:18,720 100% of all the email addresses you will ever encounter out on the internet. 28 00:01:18,720 --> 00:01:22,630 So don't look at this as being a cure-all for finding email addresses. 29 00:01:22,630 --> 00:01:28,470 If you really wanna see just how far this has to go, search for 30 00:01:28,470 --> 00:01:32,790 email address regex or email address regular expression on Stack Overflow, 31 00:01:32,790 --> 00:01:37,050 and just enjoy the madness. 32 00:01:37,050 --> 00:01:40,220 Okay, but we're gonna do this, we're gonna use a set. 33 00:01:40,220 --> 00:01:41,600 Let's comment that out. 34 00:01:41,600 --> 00:01:47,020 So sets contain all of the characters 35 00:01:47,020 --> 00:01:50,326 that we're cool with finding, that we want the regular expression to find. 36 00:01:50,326 --> 00:01:55,000 So, an email address, or at least our email addresses that we had in 37 00:01:55,000 --> 00:01:59,960 our text file there, can have word characters in them. 38 00:01:59,960 --> 00:02:02,250 They can have at symbols. 39 00:02:02,250 --> 00:02:04,570 But that'll be a little bit later. 40 00:02:04,570 --> 00:02:09,210 They can have numbers, and they can have underscores, which are included in the \w. 41 00:02:09,210 --> 00:02:12,350 They can have hyphens, put that at the beginning. 42 00:02:12,350 --> 00:02:15,830 Just in case, since a hyphen specifies ranges, which we'll talk about in a minute. 43 00:02:15,830 --> 00:02:18,530 And they can have plus signs. 44 00:02:20,080 --> 00:02:22,270 And they can have dots, they can have periods. 45 00:02:22,270 --> 00:02:23,890 Okay, and then they can have an at symbol. 46 00:02:23,890 --> 00:02:26,550 They have to have an at symbol, cuz it's an email address. 
47 00:02:26,550 --> 00:02:32,410 And then for the end it's pretty much the same set of stuff except 48 00:02:32,410 --> 00:02:35,450 there won't be any plus signs cuz you can't have a plus sign in a domain name. 49 00:02:36,680 --> 00:02:41,220 So, there's our pattern against data, but 50 00:02:41,220 --> 00:02:46,100 the problem is we don't want just one of these to show up. 51 00:02:46,100 --> 00:02:51,660 We want multiples of any of those to show up, and then same here. 52 00:02:52,750 --> 00:02:55,390 We could have multiples of any of those, so 53 00:02:55,390 --> 00:02:59,290 we can mark our set as being available one or more times. 54 00:03:00,418 --> 00:03:07,600 So, all right, let's save that and let's try printing that out. 55 00:03:07,600 --> 00:03:09,450 And hey look at it, there they all are. 56 00:03:09,450 --> 00:03:10,450 That's pretty cool. 57 00:03:10,450 --> 00:03:11,430 We got them all. 58 00:03:11,430 --> 00:03:16,500 It's too bad they're email addresses and not Pokemon or, we'd, anyway. 59 00:03:16,500 --> 00:03:19,950 So let's do another set and 60 00:03:19,950 --> 00:03:23,226 see if we can get all the instances of the word treehouse. 61 00:03:23,226 --> 00:03:28,694 So, print re.findall. 62 00:03:28,694 --> 00:03:33,400 And if I do the word treehouse here, which is what I want to find, 63 00:03:34,940 --> 00:03:38,510 I don't have to repeat letters cuz an e is an e is an e. 64 00:03:38,510 --> 00:03:42,280 So we can take out these last two e's and 65 00:03:42,280 --> 00:03:44,310 then we just leave everything else in there. 66 00:03:44,310 --> 00:03:46,380 So it kinda looks like treehouse. 67 00:03:46,380 --> 00:03:50,360 And let's do a plus sign on this. 68 00:03:50,360 --> 00:03:54,830 And now I did this with all lowercase letters, but some of the places in here, 69 00:03:54,830 --> 00:03:58,100 we have Treehouse with an uppercase T. 70 00:03:58,100 --> 00:03:59,640 So let's mark this. 
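The email pattern described earlier (a set for the part before the at symbol, the at symbol itself, then a similar set without the plus sign, each repeated one or more times) can be sketched as follows. The exact pattern typed in the video is not shown in the transcript, so this is a reconstruction from the spoken description, and the `data` string is a made-up stand-in for the course's text file.

```python
import re

# Hypothetical stand-in for the text file used in the video.
data = (
    "Doe, Jane: jane.doe+news@example.com\n"
    "Roe, Rick: rick_roe-42@mail.example.org\n"
)

# Before the @: hyphen first (so it is not read as a range), word
# characters, digits, plus signs, and dots, one or more times.
# After the @: the same set without the plus sign.
email_pattern = r"[-\w\d+.]+@[-\w\d.]+"

emails = re.findall(email_pattern, data)
print(emails)
```

As the narration warns, this is not a complete email matcher. For instance, because the dot is in the second set, an address followed directly by a sentence-ending period would have that period included in the match.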
71 00:03:59,640 --> 00:04:02,030 Let's give this a flag that says, 72 00:04:02,030 --> 00:04:05,320 I don't care what the case is, of the thing that you match. 73 00:04:06,710 --> 00:04:09,310 So we will use the re.IGNORECASE flag. 74 00:04:10,710 --> 00:04:13,379 All right. Let's try that. 75 00:04:13,379 --> 00:04:15,780 Oh! Well. 76 00:04:15,780 --> 00:04:21,670 I mean, that worked, we did find stuff, and I do see Treehouse in there. 77 00:04:23,010 --> 00:04:25,030 But this isn't exactly what we wanted. 78 00:04:25,030 --> 00:04:29,100 So let's actually go back, if you remember we talked about word boundaries, so 79 00:04:29,100 --> 00:04:31,010 let's add in word boundaries to this. 80 00:04:31,010 --> 00:04:35,410 So we want \b on both sides. 81 00:04:35,410 --> 00:04:38,490 We want Treehouse to be this, like, standalone word. 82 00:04:39,910 --> 00:04:40,510 And, you know what, 83 00:04:40,510 --> 00:04:45,680 this IGNORECASE is really long, so I'm gonna actually take that out. 84 00:04:47,010 --> 00:04:48,880 There, so I'm just gonna use re.I. 85 00:04:48,880 --> 00:04:53,130 Exact same thing as IGNORECASE, it's just the shorthand version of it. 86 00:04:54,250 --> 00:04:55,060 So all right. 87 00:04:55,060 --> 00:04:58,220 Let's run this again and look at that. 88 00:04:58,220 --> 00:04:59,040 That's a lot better. 89 00:04:59,040 --> 00:05:01,340 We've got Treehouse, Treehouse. 90 00:05:01,340 --> 00:05:03,780 We've got se, which isn't exactly a match. 91 00:05:03,780 --> 00:05:08,170 The and us aren't exactly matches, but we also got Treehouse and Treehouse here. 92 00:05:08,170 --> 00:05:08,840 So, okay. 93 00:05:08,840 --> 00:05:09,810 A lot better. 94 00:05:09,810 --> 00:05:12,270 It's not perfect, but it's a lot better. 95 00:05:12,270 --> 00:05:15,670 So, we know if we look at Treehouse and 96 00:05:15,670 --> 00:05:19,610 we wanted to count all those letters, we know that Treehouse is nine letters long. 
97 00:05:19,610 --> 00:05:24,780 So, what if, instead of this plus sign, we said, 98 00:05:24,780 --> 00:05:28,210 find any of these letters so long as they're in a set of nine. 99 00:05:28,210 --> 00:05:30,020 There's always nine of them. 100 00:05:31,570 --> 00:05:34,720 So, let's try that again and check it out. 101 00:05:34,720 --> 00:05:36,480 We got Treehouse, Treehouse, Treehouse, 102 00:05:36,480 --> 00:05:40,270 Treehouse cuz we have four people listed that work at Treehouse. 103 00:05:40,270 --> 00:05:42,260 So, that's pretty awesome.
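The Treehouse search can be sketched like this, comparing the plus-sign version with the count of nine. The `text` string is made up for this note (the video runs against its own data file), so the stray short matches differ from the ones mentioned in the narration.

```python
import re

# Hypothetical sample sentence, made up for this sketch.
text = "Jane works at Treehouse and uses the treehouse site"

# Set of the letters in "treehouse" (duplicates dropped), one or more
# times, as a standalone word, ignoring case. Short words built only
# from those letters still slip through.
loose = re.findall(r"\b[trehous]+\b", text, re.I)

# Replacing + with {9} requires exactly nine characters from the set.
exact = re.findall(r"\b[trehous]{9}\b", text, re.I)

print(loose)
print(exact)
```

Keep in mind that `{9}` only checks the length and the allowed letters, so any other nine-letter word made from those letters would match as well.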