Escape Hatches (5:23) with Kenneth Love
Exact matches are great, but most of the time you'll be looking for more general things. Let's talk about the escape sequences available in regular expressions that let us match conceptual things like whitespace and word boundaries.
\w - matches a Unicode word character. That's any letter, uppercase or lowercase, any digit, and the underscore character. In "new-releases-204", \w would match each of the letters in "new" and "releases" and the digits 2, 0, and 4. It wouldn't match the hyphens.
\W - is the opposite of \w and matches anything that isn't a Unicode word character. In "new-releases-204", \W would only match the hyphens.
\s - matches whitespace, so spaces, tabs, newlines, etc.
\S - matches everything that isn't whitespace.
\d - is how we match any digit from 0 to 9.
\D - matches anything that isn't a digit.
\b - matches word boundaries. What's a word boundary? It's the edge of a word: the position between a word character and a non-word character (such as whitespace or punctuation), or between a word character and the start or end of the string.
\B - matches any position that isn't the edge of a word.
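The sequences in the list above can be tried out with a short sketch like the following. It reuses the "new-releases-204" example from the notes; the variable names and the extra whitespace sample string are made up for illustration.

```python
import re

slug = "new-releases-204"  # example string from the notes above

# \w: every letter, digit, or underscore, one character at a time
word_chars = re.findall(r'\w', slug)
print("".join(word_chars))  # newreleases204

# \W: everything \w does not match, here only the two hyphens
non_word = re.findall(r'\W', slug)
print(non_word)  # ['-', '-']

# \d: digits only
digits = re.findall(r'\d', slug)
print(digits)  # ['2', '0', '4']

# \s: whitespace (illustrative sample string with a tab and a space)
whitespace = re.findall(r'\s', "tab\there now")
print(whitespace)  # ['\t', ' ']

# \b: a "w" at the start of a word; none of the words here start with "w"
starts_word = re.findall(r'\bw', slug)
print(starts_word)  # []

# \B: a "w" that is NOT at the start of a word, i.e. the one ending "new"
inside_word = re.findall(r'\Bw', slug)
print(inside_word)  # ['w']
```

Note that \b and \B match positions rather than characters, which is why they are paired with a literal "w" here.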
Two other escape sequences that we didn't cover in the video are \A and \Z. These match the beginning and the end of the string, respectively. As we'll learn later, though, ^ and $ are more commonly used and usually what you actually want.
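As a rough illustration of how the string anchors differ from ^ and $ in Python's re module, here is a minimal sketch. The sample strings and variable names are made up for the example.

```python
import re

text = "new-releases-204\n"  # illustrative string with a trailing newline

# \A matches only at the very beginning of the string
at_start = re.search(r'\Anew', text)
print(at_start.group())  # new

# \Z matches only at the very end, so the trailing newline gets in the way
strict_end = re.search(r'204\Z', text)
print(strict_end)  # None

# $ also matches just before a trailing newline
loose_end = re.search(r'204$', text)
print(loose_end.group())  # 204

# With re.MULTILINE, ^ matches at the start of every line; \A still does not
lines = "one\ntwo"
every_line = re.findall(r'^\w+', lines, re.MULTILINE)
only_first = re.findall(r'\A\w+', lines, re.MULTILINE)
print(every_line)  # ['one', 'two']
print(only_first)  # ['one']
```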
So far, our text is proving harder to get useful data out of than you probably hoped. But never fear, we have many, many more tricks up our sleeves. Often, when you're doing regex, you want to match bits that fit certain criteria. Like, it's a character that would be in a word, or it's a number. We have some special characters that let us match these. Let's take a look at them.

\w matches a Unicode word character. That's any letter, uppercase or lowercase, numbers, and the underscore character. \W is the opposite and matches anything that isn't a Unicode word character. \s matches whitespace. So spaces, tabs, newlines, etc. \S matches anything that isn't whitespace. \d is how we match any number from zero to nine. \D, as you can probably guess by now, matches anything that isn't a number. \b matches word boundaries. What's a word boundary? It's the edges of a word, defined by whitespace or the edges of the string. \B, of course, matches things that aren't the edges of a word. We'll end up using most, if not all, of these, so don't worry about memorizing them. I've also included them in the teacher's notes, of course.

Okay, ready to make our patterns more forgiving? Here we go. Okay, so looking at data, well, names.txt, which becomes data, I can see that each line starts with a last name and a first name. Well, okay, more or less. Tim down here doesn't have a last name. Anyway, let's use that to our advantage. Let's come over here and I'm gonna comment these out. And let's do print(re.match and we'll do \w and \w and we'll do this against data. And all right, we should be able to see the names. Yeah? Let's give that a try. None. What the heck, None? I mean, we defined the \w for words, and we definitely have words with a comma.
So it's really too bad that \w, right, right here, matches letters. It matches word characters. So, this is a word character, and this is a word character, and this is a word character, and this is a word character. It's too bad it matches that, instead of actual words. Now I could try and do a \w for every letter that appears, but that's ridiculous. We've got way too many of these that are very different lengths for me to try and do that.

So, instead, let's try to get the phone numbers. The phone numbers are fairly predictable, especially this part. We've got three numbers, a hyphen, and four numbers. So let's see if we can catch that instead. Let's see. \d\d\d. That's the first part, three numbers, and then one, two, three, four \d there. That should be easier, so let's try that. And we get None. The reason we got None on this is because we did match and we need to do search. So, save that, run that again. And hey, look. There we go. We got a match for the first phone number. We got the 555-5555. We got the traditional movie phone number. I mean, that only gave us our first line. Just my line here. We'll talk about how to do this for multiple lines later on.

So, what if I want the area code as well? I want the paren 555 paren. So, let's add in a parenthesis, then three more numbers, a parenthesis, and a space. Now we have a problem, though, because parentheses in regular expressions have a special meaning. They define a group, which we're gonna do in a later video. I don't want them to act as a group. I want them to act as actual parentheses, which means I need to escape them. So that's why I put in this backslash.
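The steps in the walkthrough so far can be sketched roughly as follows. The contents of names.txt are not shown in the transcript, so the `data` string below is a made-up stand-in, and the `\w, \w` pattern is an assumption about what was typed on screen.

```python
import re

# Hypothetical stand-in for the text read from names.txt
data = "Love, Kenneth  (555) 555-5555\nTim  (555) 555-1234\n"

# \w is ONE word character, so this asks for "character, comma, space,
# character". re.match only tries at the start, where "L" is followed
# by "o" rather than a comma.
name_attempt = re.match(r'\w, \w', data)
print(name_attempt)  # None

# re.match also fails here: the string starts with a name, not digits
phone_match = re.match(r'\d\d\d-\d\d\d\d', data)
print(phone_match)  # None

# re.search scans forward and returns the first place the pattern fits
phone_search = re.search(r'\d\d\d-\d\d\d\d', data)
print(phone_search.group())  # 555-5555

# Escaping the parentheses makes them literal characters, not a group
full_number = re.search(r'\(\d\d\d\) \d\d\d-\d\d\d\d', data)
print(full_number.group())  # (555) 555-5555
```

As in the video, search only returns the first match, so only the first line's number comes back.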
This is part of where doing the raw string helps, because trying to escape characters without raw strings often leads to you having multiple backslashes so that your backslash actually shows up as an escape character. It's just really weird. Stick with doing the raw strings. Okay, so let's try this one out. And there we go, we got the full phone number. Again, just from my line; we'll find it for more lines later on.

Using more accepting patterns is pretty much the only way to avoid madness when working with content that's not 100% predictable. The escape sequences are also really useful when defining regular expressions for, say, URLs, like Django does.

Okay. So now we can match, search, and find all the non-overlapping occurrences of our patterns. And we can use generic patterns and exact patterns. What if we're just tired of typing so much? What if we wanna say, yeah, this has at least one number, or, this chunk has to have at least five word characters? Well, come back for the next video then.
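To make the raw-string point concrete, here is a small sketch comparing the two spellings of the phone number pattern. The sample string and variable names are made up for illustration.

```python
import re

sample = "(555) 555-5555"  # illustrative phone number

# Raw string: each backslash reaches the regex engine as typed
raw_pattern = r'\(\d\d\d\) \d\d\d-\d\d\d\d'

# Regular string: every backslash has to be doubled to survive
# Python's own string escaping before the regex engine sees it
plain_pattern = '\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d'

same_pattern = raw_pattern == plain_pattern
print(same_pattern)  # True

found = re.search(raw_pattern, sample)
print(found.group())  # (555) 555-5555

# Without the escapes, the parentheses define a group and no longer
# match literal "(" and ")", so this pattern does not fit the sample
unescaped = re.search(r'(\d\d\d) \d\d\d-\d\d\d\d', sample)
print(unescaped)  # None
```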