1 00:00:00,130 --> 00:00:05,240 Let's try a slightly harder one, slightly weirder one perhaps. 2 00:00:06,740 --> 00:00:08,020 So let's actually, let's see. 3 00:00:08,020 --> 00:00:14,890 Let's comment these two both out, and let's take our email address one. 4 00:00:16,590 --> 00:00:19,830 And I wanna match all the email addresses, just like we did before. 5 00:00:19,830 --> 00:00:25,986 But if the email address ends in .gov, I want to leave that part off. 6 00:00:25,986 --> 00:00:30,610 Just pretend I have a good reason for this cuz I, I really don't. 7 00:00:30,610 --> 00:00:35,840 So, all right, this sounds like a really good place for us to use a negative set. 8 00:00:35,840 --> 00:00:37,730 And we can also write this out. 9 00:00:37,730 --> 00:00:41,970 I mean, this is really, there's a lot to this. 10 00:00:41,970 --> 00:00:45,680 So, let's leave ourselves some comments, make this a little bit easier. 11 00:00:45,680 --> 00:00:48,650 So okay, first of all, yeah. 12 00:00:48,650 --> 00:00:51,590 We can definitely use a negative set here. 13 00:00:51,590 --> 00:00:56,170 So, let's make this a multiline string. 14 00:00:58,330 --> 00:01:00,690 And we gotta end that multiline string. 15 00:01:02,230 --> 00:01:06,490 You know what, we need to make this four spaces. 16 00:01:09,070 --> 00:01:10,400 There we go, all right. 17 00:01:11,470 --> 00:01:18,590 And we'll end our multiline string, and then we'll do stuff as usual against data. 18 00:01:18,590 --> 00:01:20,860 All right. So, let's take this. 19 00:01:20,860 --> 00:01:23,141 I actually don't wanna catch the part before. 20 00:01:23,141 --> 00:01:27,201 I just wanna get the e-mail address. 21 00:01:27,201 --> 00:01:32,027 So, let's do a \b and 22 00:01:32,027 --> 00:01:39,901 then an @, and then that part. 23 00:01:39,901 --> 00:01:43,101 And I don't care how many things are there.
24 00:01:43,101 --> 00:01:48,670 So find a word boundary, just leaving myself a little note here, 25 00:01:48,670 --> 00:01:52,330 an @, and then any number of characters. 26 00:01:55,610 --> 00:02:01,130 All right, then what I want to ignore is gov, and 27 00:02:01,130 --> 00:02:03,590 I don't wanna get that tab that's in there. 28 00:02:03,590 --> 00:02:07,200 You can't necessarily see it, but 29 00:02:07,200 --> 00:02:12,580 the space between each of these things is a tab character. 30 00:02:12,580 --> 00:02:14,900 And I know there's a tab character right here, and 31 00:02:14,900 --> 00:02:17,770 it just might catch it, so let's leave that off. 32 00:02:19,440 --> 00:02:21,550 So one or more of those is fine. 33 00:02:21,550 --> 00:02:29,270 And let's leave another comment here of ignore, wow, wow. 34 00:02:29,270 --> 00:02:36,070 Ignore one or more instances of the letters g, 35 00:02:36,070 --> 00:02:39,160 o, or v and a tab. 36 00:02:41,130 --> 00:02:41,630 All right. And 37 00:02:41,630 --> 00:02:49,760 then we have another b here, so match another word boundary, all right. 38 00:02:49,760 --> 00:02:50,548 And then we do data. 39 00:02:50,548 --> 00:02:57,620 Now, I've done a flag here, which is that I've done multiple lines. 40 00:02:57,620 --> 00:02:59,500 So I need to use this VERBOSE flag. 41 00:03:00,570 --> 00:03:04,610 And then, since we've got gov in there, and we've got it in lowercase. 42 00:03:04,610 --> 00:03:09,720 Just in case there was an uppercase version, I'd want to add on the flag re.I. 43 00:03:09,720 --> 00:03:15,470 And we add multiple flags with the pipe symbol in between each of the flags. 44 00:03:15,470 --> 00:03:17,280 It's a little weird. 45 00:03:17,280 --> 00:03:19,240 It's just something you get used to. 46 00:03:19,240 --> 00:03:20,560 You just kinda have to remember it. 47 00:03:21,760 --> 00:03:23,750 So, all right, let's try that out. 
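The pattern described so far can be sketched roughly like this. The `data` string is a small made-up sample standing in for the file used in the video (tab-separated, as mentioned above), and the name `matches` is just for illustration; the pattern follows the spoken description rather than the exact on-screen code.

```python
import re

# Hypothetical sample standing in for the video's data: fields are separated by tabs.
data = (
    "dave@teamtreehouse.com\t(555) 555-5554\n"
    "president.44@us.gov\t555 555-5551\n"
    "darth-vader@empire.gov\t(555) 555-4444\n"
)

matches = re.findall(r'''
    \b@[-\w\d.]*   # Find a word boundary, an @, and then any number of characters
    [^gov\t]+      # Ignore 1+ instances of the letters g, o, or v and a tab
    \b             # Match another word boundary
''', data, re.VERBOSE | re.I)  # Combine flags with the pipe symbol

print(matches)
```

On this sample, the .gov addresses come back with the "gov" left off, though the dot before it is still part of the match, because a dot is not in the negative set.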
48 00:03:25,720 --> 00:03:26,364 And there we go, 49 00:03:26,364 --> 00:03:30,090 we've got @teamtreehouse.com, @teamtreehouse.com, blah, blah, blah. 50 00:03:30,090 --> 00:03:34,500 And then we get over here, and we've got us, this was supposed to be us.gov, and 51 00:03:34,500 --> 00:03:36,710 we've got just us. 52 00:03:36,710 --> 00:03:39,320 And then we were supposed to have empire.gov, as we've got up here, and 53 00:03:39,320 --> 00:03:41,010 we just have empire. 54 00:03:41,010 --> 00:03:43,930 And we're supposed to have spain.gov, and we just got spain. 55 00:03:43,930 --> 00:03:46,180 So, that's pretty cool, 56 00:03:46,180 --> 00:03:50,130 we got all the email addresses, but we left off the .gov on three of them. 57 00:03:51,180 --> 00:03:54,020 So, I think that's pretty cool, pretty handy. 58 00:03:56,420 --> 00:04:00,630 Let's try another one with our VERBOSE flag, 59 00:04:00,630 --> 00:04:03,090 just to get used to doing our VERBOSE flag. 60 00:04:04,200 --> 00:04:05,170 Gonna comment this out. 61 00:04:05,170 --> 00:04:08,230 All right. 62 00:04:08,230 --> 00:04:14,730 So let's try another verbose pattern that will match our names. 63 00:04:14,730 --> 00:04:18,600 It'll also match our jobs, but it's still a good practice. 64 00:04:18,600 --> 00:04:23,041 So we're gonna do print(re.findall. 65 00:04:23,041 --> 00:04:26,740 And then we're gonna do a multi-line string, cuz we're gonna use verbose. 66 00:04:26,740 --> 00:04:33,770 So let's do \b -\w. 67 00:04:33,770 --> 00:04:40,998 So that would be Find a word boundary 1+ 68 00:04:40,998 --> 00:04:47,220 hyphens or word characters. 69 00:04:47,220 --> 00:04:48,170 We'll just say characters. 70 00:04:49,810 --> 00:04:52,210 And a comma cuz that comma's in there. 71 00:04:52,210 --> 00:04:54,030 It has to find that comma. 72 00:04:54,030 --> 00:04:57,300 And then let's have it find, find whitespace. 73 00:05:00,030 --> 00:05:01,070 Find 1 whitespace.
74 00:05:02,340 --> 00:05:06,036 And then let's have it find another hyphen, a w, or 75 00:05:06,036 --> 00:05:07,731 a space as part of our set. 76 00:05:07,731 --> 00:05:10,897 We'll talk about why that's different in just a second. 77 00:05:10,897 --> 00:05:19,082 1+ hyphens and characters, and explicit spaces. 78 00:05:21,120 --> 00:05:25,630 And then I want it to not find tabs or new line characters. 79 00:05:25,630 --> 00:05:29,971 Ignore tabs and newlines. 80 00:05:29,971 --> 00:05:34,489 And then we wanna close this, we're gonna run this against data, and 81 00:05:34,489 --> 00:05:36,031 we're gonna do re.X. 82 00:05:36,031 --> 00:05:36,631 All right. 83 00:05:36,631 --> 00:05:40,130 So let's talk about this one for a second before we run it. 84 00:05:40,130 --> 00:05:43,190 So, when we do the verbose flag, 85 00:05:43,190 --> 00:05:49,100 which re.X if you didn't guess is the short hand version of re.VERBOSE. 86 00:05:49,100 --> 00:05:54,250 When we do the verbose ones, the regular expression engine ignores all of 87 00:05:54,250 --> 00:05:57,260 the spaces that are just out in our pattern. 88 00:05:57,260 --> 00:06:00,087 So like, these spaces here and 89 00:06:00,087 --> 00:06:05,980 these spaces here are completely ignored, as is this comment. 90 00:06:05,980 --> 00:06:09,880 So we have to mark those with this \s. 91 00:06:09,880 --> 00:06:12,480 That, and, and that is whitespace. 92 00:06:12,480 --> 00:06:17,080 So that matches spaces, it matches tabs, it matches new lines. 93 00:06:17,080 --> 00:06:18,130 It matches all sorts of stuff. 94 00:06:18,130 --> 00:06:19,780 Actually, I don't remember if it matches new lines or 95 00:06:19,780 --> 00:06:24,220 not, but it matches spaces and tabs, and other characters like that. 96 00:06:24,220 --> 00:06:28,690 If you wanna go look up like, half tab or letter space and 97 00:06:28,690 --> 00:06:31,950 stuff like that, there's all sorts of these spaces that are available.
98 00:06:31,950 --> 00:06:33,410 So it matches all of those. 99 00:06:33,410 --> 00:06:36,880 But inside of a set, we can use an explicit space and 100 00:06:36,880 --> 00:06:40,110 that will only match spaces. 101 00:06:40,110 --> 00:06:42,630 It won't match tabs or newlines or whatever. 102 00:06:42,630 --> 00:06:46,090 And then down here we want to ignore tab and newline. 103 00:06:46,090 --> 00:06:50,730 Now, why didn't we have to use re.I in this one, or re.IGNORECASE? 104 00:06:50,730 --> 00:06:54,380 The reason's because we're not matching any explicit characters. 105 00:06:54,380 --> 00:06:58,780 We're not matching, like, the letter t, that may be uppercase or lowercase. 106 00:06:59,860 --> 00:07:03,590 Since we're not matching those things, we're matching more generic stuff like 107 00:07:03,590 --> 00:07:09,100 word characters, then we can use, or we can, we can leave off re.I. 108 00:07:09,100 --> 00:07:11,600 So let's run this and see what it does. 109 00:07:13,910 --> 00:07:15,840 And I forgot another character. 110 00:07:15,840 --> 00:07:18,270 We should have a plus sign there as well. 111 00:07:19,280 --> 00:07:20,250 So let's run that again. 112 00:07:21,860 --> 00:07:22,420 There we go. 113 00:07:22,420 --> 00:07:26,580 So now we've got Kenneth Love and Teacher Treehouse, Dave MacFarlane, or 114 00:07:26,580 --> 00:07:29,100 MacFarlane, Dave, Teacher Treehouse, and so on. 115 00:07:29,100 --> 00:07:32,650 So we got the names, and we got the where they work. 116 00:07:34,070 --> 00:07:39,040 So, of course, if we want to get Tim in there, we need to change this to a star. 117 00:07:40,260 --> 00:07:44,420 Run this again and we should get Tim. 118 00:07:44,420 --> 00:07:45,490 I don't see Tim actually. 119 00:07:47,220 --> 00:07:52,470 So Tim's not in there, but we will fix that later. 120 00:07:52,470 --> 00:07:55,800 We'll select everybody before we get to the end of this.
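As a rough sketch of the names pattern walked through above: the `data` string here is an invented two-row sample, not the file from the video, and the variable names are only for illustration. The last line also checks the point left open earlier, namely that `\s` does match a newline in Python.

```python
import re

# Hypothetical sample rows: "Last, First", an email, and "Job, Company", separated by tabs.
data = (
    "Love, Kenneth\tkenneth@teamtreehouse.com\tTeacher, Treehouse\n"
    "Arthur, King\tking_arthur@camelot.co.uk\tKing, Camelot\n"
)

names = re.findall(r'''
    \b[-\w]+,   # Find a word boundary, 1+ hyphens or characters, and a comma
    \s          # Find 1 whitespace (bare spaces in a verbose pattern are ignored)
    [-\w ]+     # 1+ hyphens and characters, and explicit spaces
    [^\t\n]     # Ignore tabs and newlines
''', data, re.X)  # re.X is the short form of re.VERBOSE

print(names)

# \s covers newlines as well as spaces and tabs.
newline_is_whitespace = re.fullmatch(r'\s', '\n') is not None
print(newline_is_whitespace)
```

Because the space inside `[-\w ]` sits in a set, verbose mode keeps it, which is why it can match the space in a multi-word name while a bare space elsewhere in the pattern is dropped.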
121 00:07:55,800 --> 00:07:59,520 As you can tell though, it really, really helps breaking up our patterns over 122 00:07:59,520 --> 00:08:00,450 multiple lines. 123 00:08:00,450 --> 00:08:04,470 And being able to annotate each line with a comment, so that we remember what we're 124 00:08:04,470 --> 00:08:09,890 doing, what we're looking for and how to make things again. 125 00:08:09,890 --> 00:08:12,370 We have a ton of choices now when we write patterns. 126 00:08:12,370 --> 00:08:14,930 They can be as flexible or strict as we need. 127 00:08:14,930 --> 00:08:18,050 Our next video will cover the real meat of what'll make our regular expressions 128 00:08:18,050 --> 00:08:20,030 capable of solving our immediate problem.