Sets
5:42 with Kenneth Love
Sets let us combine explicit characters and escape patterns into pieces that can be repeated multiple times. They also let us specify pieces that should be left out of any matches.
[abc] - this is a set of the characters 'a', 'b', and 'c'. It'll match any one of those characters. The order inside the brackets doesn't matter, and listing a character more than once has no effect.
[a-zA-Z] - ranges that'll match any letter in the English alphabet, lowercase or uppercase. Use [a-z] or [A-Z] on its own to match just one case.
[0-9] - range that'll match any digit from 0 to 9. You can change the ends to restrict the set.
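A minimal sketch of the three sets above, using Python's re module. The sample string and variable names are made up for illustration and are not from the course files.

```python
import re

# Hypothetical sample text, chosen only to exercise each set.
text = "cab 2047 Abc"

# [abc] matches one character at a time: 'a', 'b', or 'c' (lowercase only).
abc_chars = re.findall(r"[abc]", text)

# [a-zA-Z]+ matches runs of one or more letters in either case.
letter_runs = re.findall(r"[a-zA-Z]+", text)

# [0-9]+ matches runs of one or more digits.
digit_runs = re.findall(r"[0-9]+", text)

# Changing the ends restricts the set: [0-4] skips the 7.
low_digits = re.findall(r"[0-4]", text)

print(abc_chars)    # ['c', 'a', 'b', 'b', 'c']
print(letter_runs)  # ['cab', 'Abc']
print(digit_runs)   # ['2047']
print(low_digits)   # ['2', '0', '4']
```

Note that a set on its own matches a single character; the plus sign is what turns it into a run.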
Let's see how far we've come. We have exact matches, loose matches with escape sequences, counts and loose counts, and position. We can do a huge amount already. That's awesome. But what if I know exactly the characters that I want to match? Or I need to make sure a certain character isn't there? Python's regular expressions engine has a concept known as sets, and these help us achieve exactly that.

We define a set of characters with square brackets. Any character in the brackets will be looked for and you can leave out duplicates. So if we want to find the word apple, we'd have a set with aple. We can also define ranges in our sets. If we want all of the lowercase letters, we can define a set like a-z. This is available for uppercase letters too, as well as numbers. And lastly, if we start a set off with a caret, it says to not match those characters. So if we want to make sure our pattern doesn't have a two, we can say caret two.

We also need to look at two new flags. The ignore case flag, which lets us match against both upper and lower case letters at once. And the verbose flag, which lets us write our patterns out in a more natural way. All right, lots to do, so let's get back to it.

So I think, now, I wanna get the email addresses. The pattern that we're going to use isn't going to work for 100% of all the email addresses you will ever encounter out on the internet. So don't look at this as being a cure-all for finding email addresses. If you really wanna see just how far this has to go, search for email address regex or email address regular expression on Stack Overflow, and just enjoy the madness. Okay, but we're gonna do this, we're gonna use a set. Let's comment that out.
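As a rough sketch of the caret and the two flags mentioned in this part of the video: the strings below are invented for illustration, and the pattern is just one way to show re.IGNORECASE and re.VERBOSE working together.

```python
import re

# A set that starts with a caret matches anything NOT listed in it.
# [^2]+ finds runs of characters that are not the digit 2.
no_twos = re.findall(r"[^2]+", "1223242")

# Hypothetical sample text with mixed casing.
text = "Room 12, room 34, ROOM 56"

# re.IGNORECASE lets a lowercase pattern match any casing.
any_case = re.findall(r"room", text, re.IGNORECASE)

# re.VERBOSE ignores whitespace in the pattern and allows comments,
# so the pattern can be spread over several lines.
room_pattern = re.compile(r"""
    room     # the literal word
    \s       # one whitespace character
    [0-9]+   # one or more digits
""", re.VERBOSE | re.IGNORECASE)
rooms = room_pattern.findall(text)

print(no_twos)   # ['1', '3', '4']
print(any_case)  # ['Room', 'room', 'ROOM']
print(rooms)     # ['Room 12', 'room 34', 'ROOM 56']
```

Flags are combined with the | operator, as in the compile call above.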
So sets contain all of the characters that we're cool with finding, that we want the regular expression to find. So, an email address, or at least the email addresses that we had in our text file there, can have word characters in them. They can have at symbols. But that'll be a little bit later. They can have numbers, and they can have underscores, which are included in the \w. They can have hyphens; put that at the beginning, just in case, since the hyphen specifies ranges, which we'll talk about in a minute. And they can have plus signs. And they can have dots, they can have periods. Okay, and then they can have an at symbol. They have to have an at symbol, cuz it's an email address. And then for the end it's pretty much the same set of stuff, except there won't be any plus signs cuz you can't have a plus sign in a domain name.

So, there's our pattern against data, but the problem is we don't want just one of these to show up. We want multiples of any of those to show up, and then same here. We could have multiples of any of those, so we can mark our set as being available one or more times. So, all right, let's save that and let's try printing that out. And hey, look at it, there they all are. That's pretty cool. We got them all. It's too bad they're email addresses and not Pokemon or, we'd, anyway.

So let's do another set and see if we can get all the instances of the word treehouse. So, print re.findall. And if I do the word treehouse here, which these are the letters I want to find, I don't have to repeat letters cuz an e is an e is an e. So we can take out these last two e's and then we just leave everything else in there. So it kinda looks like treehouse. And let's do a plus sign on this.
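The email pattern is only described out loud here, so the following is a reconstruction from that description, not a copy of the on-screen code. The data string is a made-up stand-in for the course's text file. As the video says, this pattern won't cover every valid email address.

```python
import re

# Hypothetical stand-in for the course's text file.
data = "Ann: ann.lee@example.com, Bo: bo-smith+news@mail.example.org"

# Without a count, each set matches exactly one character,
# so we only get the characters touching the @ symbol.
single = re.findall(r"[-\w\d+.]@[-\w\d.]", data)

# Hyphen first (so it is not read as a range), then word characters,
# digits, plus signs, and dots. The domain side drops the plus sign.
# The + after each set means "one or more of these characters".
emails = re.findall(r"[-\w\d+.]+@[-\w\d.]+", data)

print(single)  # ['e@e', 's@m']
print(emails)  # ['ann.lee@example.com', 'bo-smith+news@mail.example.org']
```

Since \w already includes digits and the underscore, the \d in each set is redundant; it is kept here only to mirror the spoken walkthrough.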
And now I did this with all lowercase letters, but some of the places in here, we have Treehouse with an uppercase T. So let's mark this. Let's give this a flag that says, I don't care what the case is of the thing that you match. So we will use the re.IGNORECASE flag. All right. Let's try that. Oh! Well. I mean, that worked, we did find stuff, and I do see Treehouse in there. But this isn't exactly what we wanted.

So let's actually go back. If you remember, we talked about word boundaries, so let's add in word boundaries to this. So we want \b on both sides. We want Treehouse to be this, like, standalone word. And, you know what, this IGNORECASE is really long, so I'm gonna actually take that out. There, so I'm just gonna use re.I. Exact same thing as IGNORECASE, it's just the shorthand version of it. So all right. Let's run this again and look at that. That's a lot better. We've got Treehouse, Treehouse. We've got se, which isn't exactly a match. And us isn't exactly a match, but we also got Treehouse and Treehouse here. So, okay. A lot better. It's not perfect, but it's a lot better.

So, we know if we look at Treehouse and we wanted to count all those letters, we know that Treehouse is nine letters long. So, what if, instead of this plus sign, we said, find any of these letters so long as they're in a set of nine. There's always nine of them. So, let's try that again and check it out. We got Treehouse, Treehouse, Treehouse, Treehouse cuz we have four people listed that work at Treehouse. So, that's pretty awesome.
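The three attempts at matching Treehouse can be sketched like this. The sample sentence is invented, so the stray matches differ from the ones seen in the video, but the progression is the same.

```python
import re

# Hypothetical sample text, not the course's data file.
text = "Treehouse teachers use the treehouse"

# Attempt 1: any run of the letters in "treehouse", ignoring case.
# This also grabs fragments from inside other words.
loose = re.findall(r"[trehous]+", text, re.IGNORECASE)

# Attempt 2: add \b on both sides so only standalone words match.
# re.I is shorthand for re.IGNORECASE.
bounded = re.findall(r"\b[trehous]+\b", text, re.I)

# Attempt 3: swap the + for {9} to require exactly nine letters.
exact = re.findall(r"\b[trehous]{9}\b", text, re.I)

print(loose)    # ['Treehouse', 'te', 'hers', 'use', 'the', 'treehouse']
print(bounded)  # ['Treehouse', 'use', 'the', 'treehouse']
print(exact)    # ['Treehouse', 'treehouse']
```

The {9} version still isn't a true match for the word itself: it accepts any nine-letter word built only from those letters. For this kind of data that is usually close enough.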