Regular Expressions are common in just about every programming language, and developers have leaned on their power for many solutions. We use them for searching, replacing and validating that text meets certain formats.
[MUSIC]
0:00
Hello, I'm Craig and I'm a developer.
0:04
Regular expressions are an extremely
handy tool to have at your disposal.
0:08
They're also incredibly intimidating
looking if you don't know how or
0:12
why someone is using them.
0:15
Regular expressions are common in just
about every programming language.
0:17
And developers have leaned on
their power for many solutions.
0:20
We use them for searching, replacing, and
0:24
validating that text
meets certain formats.
0:26
Java has several methods that you
will encounter that expect a regular
0:28
expression as a parameter,
0:31
as well as a few very common patterns
that you will more than likely encounter.
0:32
We've used them in a couple
of our projects, and
0:37
I figured it was about time we did
a little deeper exploration of their use,
0:39
and add them to your bag of tricks.
0:42
Are you ready?
0:44
Let's get started.
0:45
Regular expressions, or regexes, as they
are often referred, are one of those
0:46
concepts that are hard to understand
until you actually need to use them.
0:50
So I was thinking we'd run
through some hands on examples so
0:54
you can get a vibe of
how they might be used.
0:57
Now, when I'm writing a regular
expression, I like to try and
1:00
find a visual editor for
my specific language.
1:02
And there's tons of these online.
1:05
So let's do a quick search for
one in Java.
1:06
So, let's say, Java.
1:09
Regular expression.
1:12
And we'll say, visual.
1:15
Let's just see what happens with that.
1:17
Okay, great, this first one here,
Java Visual Regular Expression Tester.
1:19
Perfect.
1:22
Okay, so the way the tool works is this.
1:24
You put a string in there
where it says Target text, and
1:26
then you write your regular expression or
the pattern that you're looking for.
1:29
So regex is all about pattern matching.
1:33
So let's assume that I was writing some
software that was gonna parse a cover
1:36
letter of a job applicant.
1:40
So let's say that it looked
something like this.
1:41
Let's say that it said I
am a full-stack developer.
1:43
I am fluent in JavaScript,
1:50
HTML, CSS and
1:57
PostgreSQL.
2:01
So let's say that we are wanting to find
people who are confident in their skills.
2:05
And so
we could type the literal word fluent.
2:10
And this is a regex.
2:14
And what's gonna happen
is it's gonna match.
2:15
And so if you see down here on the bottom,
it shows where we matched.
2:16
Okay, so see it's highlighted.
2:20
Fluent is highlighted there.
2:21
Now, that's probably not blowing you away.
2:24
The real power comes in
when we expand our pattern.
2:27
So, everybody knows that recruiters
are searching for that magic unicorn.
2:30
Also known as the full-stack developer.
2:34
But, the problem is, that it's spelled so
2:36
many ways, though sometimes it's
hyphenated, sometimes it's spaced.
2:38
Let's just go ahead and look for it.
2:42
So first if we come in here and
we type full-stack.
2:43
Cool, it matches.
2:48
But what if the cover letter was
written a little bit differently right?
2:49
What if they didn't hyphenate it?
2:52
What if they said full stack?
2:53
Okay, so now we're not matching.
2:55
Yikes, now what?
2:57
Well the good news is that
there's metacharacters that
2:58
can help explain what we're looking for.
3:01
So let's add one.
3:03
I'll show you this one first.
3:03
There's a character, the period, or
the dot and it means any character.
3:05
So let's go in here and we'll remove this,
and we'll search for any character.
3:10
Cool.
So now it's finding full stack and if we
3:15
put in here, we put in the hyphen, it's
also still matching on the full-stack.
3:17
But, what happens if we
concatenate those words?
3:22
Let's get rid of this, so
it's fullstack like that.
3:25
[SOUND] We're not matching again.
3:28
Now what do we do?
3:29
Well, there's another metacharacter,
and it's an asterisk, or
3:31
a star, and it means zero or more of these.
3:34
So let's go ahead and try that.
3:37
So we'll say .*, so
with the asterisk there also.
3:38
So, it also looks like that
will now handle multiple spaces
3:42
cuz it's saying any character and
it's also doing that.
3:46
It looks like we might be over-matching.
3:50
What if I did this?
3:53
I am full of baloney and
3:55
I eat pancakes by the stack.
3:59
If this starts jumping around here,
you can just kinda grab this and
4:05
make it go back.
4:08
CSS problem.
4:09
But you know that cuz you're full stack.
4:11
See that's matching that completely and
that is totally not what we want.
4:14
So, there are really only a couple of
characters that we'd allow in there.
4:18
And for that, we'd build what
is known as a character class.
4:21
Character classes are defined
by using square brackets.
4:24
And then you put any characters that
you want to be allowed inside there.
4:27
So, let's see.
4:31
We want to have a space and a dash.
4:32
Okay?
4:39
Cool.
So, now we're not matching anymore.
4:39
So, let's go ahead and
let's bring that back.
4:42
Let's say.
4:43
So that works, and that works, and if we
did something like this, it still works.
4:49
But, it's just those two and if I put
anything else in there, it doesn't work.
4:54
Awesome.
4:57
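Here is a minimal sketch of the full-stack experiment so far, written in Java rather than in the visual tester. The class name FullStackDemo and the helper found are made up for illustration, and the last pattern assumes the star is kept after the character class so the concatenated spelling passes too.

```java
import java.util.regex.Pattern;

public class FullStackDemo {
    // Returns true when the regex is found anywhere in the text,
    // which is roughly what the highlighting in a visual tester shows.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String baloney = "I am full of baloney and I eat pancakes by the stack.";

        // A literal hyphen only matches the hyphenated spelling.
        System.out.println(found("full-stack", "full stack"));
        // The dot matches any single character, so a space works.
        System.out.println(found("full.stack", "full stack"));
        // But the dot needs exactly one character, so this fails.
        System.out.println(found("full.stack", "fullstack"));
        // Dot-star matches zero or more of anything and over-matches.
        System.out.println(found("full.*stack", baloney));
        // The character class allows only a space or a hyphen.
        System.out.println(found("full[ -]stack", baloney));
        // Zero or more separators also accepts the concatenated spelling.
        System.out.println(found("full[ -]*stack", "fullstack"));
    }
}
```

The hyphen sits last inside the square brackets so it is read as a literal dash and not as a range.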
So the character class is nice.
5:03
I can put any sort of letter that
I wanna match in there, right.
5:05
So abcdefg and there you see that it's
matching any one of those letters.
5:07
You obviously don't wanna
type that out each time.
5:13
So there's a great thing called ranges.
5:15
So inside of the square brackets,
5:16
if I do a-z, that would be all
of the letters between a and z.
5:18
You'll notice that some of the letters
that aren't being highlighted, so
5:22
you can, actually it's case sensitive.
5:27
So if I do A through Z or
5:29
a through z, there we go, and
let's go ahead and put that * back.
5:32
There we go, and you'll see that now
instead of highlighting each letter,
5:38
it's highlighting each word,
5:40
because again, that's zero or
more matches is what the star means.
5:41
So that is saying,
attempt to match every run of zero or
5:45
more characters that match uppercase
A through Z, or lowercase a through z.
5:48
Now, basically that is just saying match
all the characters in a word, right?
5:52
Like, ignore white space and
special characters.
5:55
Well, guess what,
there's a better way to state that.
5:58
So if you just say \w*.
6:00
There we go.
6:04
That's a shortcut for word.
6:05
So you're starting to see why regular
expressions are hard to read.
6:06
There are special metacharacters that
would make no sense at all before
6:09
you've read about them or seen them.
6:14
I mean, what the heck is \w*?
6:15
You should not expect to look
at one of these expressions and
6:19
expect to know what it's doing,
especially right off the bat.
6:22
The trick is knowing that
it's okay to not know.
6:25
I know a few Perl programmers that can
write gnarly regexes without documentation
6:29
but literally every other
programmer I've worked with,
6:33
works with the documentation up.
6:36
So, let's pop one up.
6:38
I'm gonna use the Java documentation.
6:38
It's in the teacher's notes.
6:40
This is the pattern class.
6:44
And if we scroll down here,
6:46
there's a nice listing here of
all the different constructs.
6:47
Okay, so here's the character classes
that we were just playing with.
6:51
Okay, here's a cool one here.
6:54
It says negation.
6:56
So it looks like there's
a caret at the front.
6:57
So let's go ahead and flip back to our.
7:00
So we'll say highlight everything that
isn't a character based in a word.
7:03
Nice.
Now it's highlighting all the non-word
7:12
characters.
7:14
Awesome.
7:15
So the caret at the front inside
of a character class means not.
7:16
Okay, so let's flip back.
7:21
Okay, so
here's some predefined character classes.
7:24
Okay.
7:28
Here's the \w that we saw.
7:29
A word character, a through z.
7:30
Lowercase, uppercase.
7:32
Underscore and 0 through 9.
7:34
That's something that we
didn't think about in ours.
7:36
So the \w there,
the predefined character class, is great.
7:39
Oh look, there's one for
what we were just trying to say.
7:42
A non-word character and
that can be shortened to uppercase W.
7:45
So, if we come back here and
just say \W, we get the same thing.
7:49
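A short sketch of the character classes from this part, assuming a made-up helper called findAll that collects every highlighted match. Because the star can match zero characters, the helper skips empty matches; that filtering is my addition, not something the visual tester necessarily does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordClassDemo {
    // Collects every non-empty match of the regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            // A star quantifier can match zero characters; skip those.
            if (!matcher.group().isEmpty()) {
                matches.add(matcher.group());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Explicit ranges: uppercase and lowercase letters only.
        System.out.println(findAll("[A-Za-z]*", "I am fluent in CSS"));
        // \w also covers digits and the underscore.
        System.out.println(findAll("\\w*", "HTML5 and CSS_3"));
        // A caret first inside the brackets negates the class...
        System.out.println(findAll("[^\\w]", "a, b!"));
        // ...and \W is the predefined shorthand for the same thing.
        System.out.println(findAll("\\W", "a, b!"));
    }
}
```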
I also saw that we could work with
numbers in that documentation.
7:54
Why don't we change our example up a bit?
7:58
One quick case I can think of is, let's
say we have a website with a purchase form
8:01
and it's based in the United States and
8:06
they're not gonna do shipping
outside of the country.
8:08
So, let's use a US based zip code.
8:11
And those are five digits.
8:13
So, let's go ahead and we'll put
this is our number up here, 90210.
8:15
So remember that we can do ranges, right.
8:22
So we got 0 through 9.
8:24
Cool.
So that's finding each one.
8:28
Let's say that we wanted to find zero or
more repeating numbers.
8:30
We would do the star right?
8:33
Cool.
8:36
Now notice how it changed all green.
8:37
Now that means that the entire
pattern is matched.
8:39
So, I saw in those docs when
we were back there that
8:42
there is also one instead of zero
through nine, we have \d for digit.
8:45
Cool.
8:50
Now there is a bit of a problem.
8:51
What if there are fewer numbers,
I mean, like this, right?
8:53
902.
8:56
That is not a valid zip code,
but we're saying that it is.
8:57
Well, guess what, you can specify that.
9:01
It uses curly braces to be a quantifier.
9:04
So, what we really wanna have,
is we wanna say 5 digits.
9:07
So now, we'll see that it's not matching,
but if I add five digits, it does, and
9:12
if I add more, it matches just those five.
9:16
I think that we have a working validator.
9:21
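To make the quantifier step concrete, here is a small sketch that returns the first thing a pattern would highlight. The class name QuantifierDemo and the helper firstMatch are hypothetical names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // A range matches one digit at a time.
        System.out.println(firstMatch("[0-9]", "90210"));
        // The star happily accepts a too-short number.
        System.out.println(firstMatch("\\d*", "902"));
        // Exactly five digits: too few means no match at all.
        System.out.println(firstMatch("\\d{5}", "902"));
        // With extra digits, only the first five are matched.
        System.out.println(firstMatch("\\d{5}", "9021012"));
    }
}
```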
So let's go ahead and do a real quick
look on how we can do this in Java.
9:24
I've got a workspace link
in the teacher's notes.
9:28
Okay, so we've imported one of those
consoles and we're gonna go ahead and
9:32
say String zipCode.
9:37
Console.readLine.
9:39
We're gonna take input from that saying,
Enter your zipcode.
9:41
And if the zip code, which is a string.
9:50
Now all strings have a method
on them called matches.
9:54
So we're gonna say matches and
9:57
that parameter that we pass
it is a regular expression.
9:59
So let's go ahead and let's say backslash d,
curly brace, for the quantifier five.
10:03
So anytime that there are five.
10:09
Now there's something that
we need to know here.
10:11
This backslash-d actually means something
in Java: it means escape d.
10:13
But, we don't want that.
10:19
We want the regular expression,
so we need to double escape.
10:20
It's a little bit ugly, right?
10:23
So what we're saying is
actually write a backslash.
10:24
Don't try to do the special
character in Java.
10:27
Does that make sense?
10:30
Cuz we are writing in the regular
expression language and
10:31
we need to actually just write
a backslash, which is an escape character.
10:33
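A tiny sketch of what the double escape produces. The class name EscapeDemo and the constant FIVE_DIGITS are invented for illustration; the point is only that two backslashes in Java source become one backslash in the actual string handed to the regex engine.

```java
public class EscapeDemo {
    // In Java source this literal is the five characters: \ d { 5 }
    static final String FIVE_DIGITS = "\\d{5}";

    public static void main(String[] args) {
        System.out.println(FIVE_DIGITS);                  // prints \d{5}
        System.out.println(FIVE_DIGITS.length());         // prints 5
        System.out.println("90210".matches(FIVE_DIGITS)); // prints true
    }
}
```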
Okay.
10:39
So we'll say System.out.printf and
10:40
we'll say this is a valid zip code, right?
10:44
Because it's coming in.
10:50
And I'll do a new line.
10:52
And then we'll pass on the zip code.
10:56
So that's gonna replace this %s here.
10:59
And because the matches
will return true if it is.
11:01
And then otherwise, of course,
we will just copy this.
11:04
And we'll say, is not a valid zip code.
11:12
Okay, so let's go ahead and
we'll run that.
11:20
Again, that run statement,
bring this up a little bit.
11:23
Run statement is clear,
11:26
and we're gonna join a couple
of commands together.
11:30
The file's called Reggie.
11:35
And we'll start it, java Reggie.
11:39
So we'll clear the screen,
compile and then run the program.
11:42
So let's enter a valid zip code,
or an invalid one first.
11:47
So it'll say 902.
And it says 902 is not a valid zip code.
11:51
So this matches up here.
11:56
It returned false.
11:59
So if false, and then it came and
fell through there.
12:00
So let's do that again.
12:02
90210, it's a valid zip code.
12:06
Awesome.
12:08
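Roughly, the program at this point could look like the sketch below. It is not the exact file from the video: the class name ZipCheck and the method isValidZip are assumptions, and it loops over sample inputs instead of prompting on the console so it runs without interaction.

```java
public class ZipCheck {
    // String.matches only returns true when the whole string fits the pattern.
    static boolean isValidZip(String zipCode) {
        return zipCode.matches("\\d{5}");
    }

    public static void main(String[] args) {
        String[] samples = {"902", "90210"};
        for (String zipCode : samples) {
            if (isValidZip(zipCode)) {
                System.out.printf("%s is a valid zip code%n", zipCode);
            } else {
                System.out.printf("%s is not a valid zip code%n", zipCode);
            }
        }
    }
}
```

One design note: because String.matches checks the entire string, input with extra text around the digits is already rejected here, even though the same pattern would still highlight the five digits in a visual tester.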
All right, what if we put some
other stuff in there, right?
12:09
So let's assume that in
our target text here.
12:14
Let's say that we prompted for a zip code
and the person wrote next door to Dylan
12:18
in 90210, I wish.
12:26
Now that's not valid,
12:31
but we are saying that it is
because it is matching down there.
12:32
So it's matching on this.
12:34
So let's pop over to the docs and
see if there's a way to fix this problem.
12:38
We only want the text.
12:41
So let's scroll down here a little bit.
12:44
Oh, there's a thing here
called boundary matchers.
12:45
So it looks like caret marks
the beginning of the line and
12:50
dollar sign marks the end of the line.
12:53
Let's go ahead and put that in.
12:56
So if we put at the front of our
regular expression, let's put a caret.
12:59
Awesome.
Now it's not matching down here anymore.
13:07
Cool.
13:09
So let's go ahead and we can get
rid of the front of the line, so
13:09
we are saying it must start with that.
13:12
Okay, with the trailing garbage,
it still matches.
13:15
So let's put that end of line marker,
which was the dollar sign at the end here.
13:18
Awesome.
13:23
Now it doesn't match, but
13:24
if we take off this junk in the trunk,
boom, we got it again.
13:25
Great.
13:31
Now I wanna point out that
you're not going crazy.
13:32
We did in the past, just the recent past,
use that caret symbol to make it mean NOT.
13:34
And that was inside of
the character class.
13:39
And that only works as the first
character in the character class.
13:41
Okay?
It's one of those confusing things.
13:45
And context matters when
reading regular expressions.
13:46
So let's read it.
13:49
The line starts with five digits and
then the line ends.
13:50
Perfect.
13:54
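The boundary matchers can be tried in Java with Matcher.find, which searches inside the text the way the visual tester does. This is a sketch under that assumption; AnchorDemo and found are hypothetical names.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // True when the regex is found somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String wish = "next door to Dylan in 90210, I wish";

        // Unanchored: the digits in the middle are enough.
        System.out.println(found("\\d{5}", wish));
        // Caret: the input must start with five digits.
        System.out.println(found("^\\d{5}", wish));
        // Caret alone still allows trailing junk.
        System.out.println(found("^\\d{5}", "90210 junk"));
        // Caret and dollar sign: five digits and nothing else.
        System.out.println(found("^\\d{5}$", "90210 junk"));
        System.out.println(found("^\\d{5}$", "90210"));
    }
}
```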
Oh, one more thing.
13:55
The US zipcode pattern also allows for
13:57
a four digit block of
numbers after a hyphen.
14:00
It helps to describe
the geographic segment.
14:02
It's optional.
14:05
But, right now, if somebody used that,
let's say that they added that.
14:06
It would fail.
14:09
So let's help it pass.
14:12
So, that's a dash and
then four more digits, right?
14:14
Cool, so it's matching.
14:20
But now our original zip code doesn't.
14:22
We want that last bit to be optional.
14:26
Now, of course there is
a way to state that.
14:29
So, you group the optional bits
by wrapping them in parentheses, and
14:31
we follow that with a question mark.
14:36
So let's put the matcher back up there.
14:38
So, we're gonna wrap this in parentheses
and follow it with a question mark.
14:41
So that says this grouping
here is optional, okay?
14:46
So if we get rid of this.
14:50
It still works, and this also matches.
14:53
Cool, so
let's pop that back in our program.
14:58
Copy that.
15:02
I'm gonna come over to here.
15:04
Zip code matches that.
15:06
And remember we need to
escape our backslashes.
15:09
Cool.
So let's run it again and check.
15:16
I'm gonna use the up arrow.
15:18
90 fails.
15:22
And then 90210-5309.
15:26
That works.
15:28
And also,
the one in the middle short, 90210.
15:31
Awesome, we have a working
US postal code validator.
15:37
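Put together, the validator with the optional four-digit block might be sketched like this. The class name ZipPlusFour and the method isValidZip are assumed names, and the sample inputs are the ones tried above.

```java
public class ZipPlusFour {
    // Five digits, then an optional group: a dash followed by four digits.
    static boolean isValidZip(String zipCode) {
        return zipCode.matches("^\\d{5}(-\\d{4})?$");
    }

    public static void main(String[] args) {
        String[] samples = {"90", "90210-5309", "90210"};
        for (String zipCode : samples) {
            String verdict = isValidZip(zipCode) ? "is" : "is not";
            System.out.printf("%s %s a valid zip code%n", zipCode, verdict);
        }
    }
}
```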
Now that we've spent all that
time building the validator,
15:41
this is when the request comes in
to support international shipping.
15:44
But, don't fret.
15:47
Check this out.
15:49
Let's go ahead here and
we'll search for regular
15:55
expression postal code.
15:59
So here is a Stack Overflow post, and
let's go ahead and
16:06
just take a look at this.
16:10
Looking for the ultimate postal code and
zip code regex.
16:13
I'm looking for something that will cover
most and hopefully all of the world.
16:15
It's a great idea.
16:20
Let's keep on scrolling through here.
16:21
Well, here's a good one.
16:23
Let's keep on going down.
16:26
So here's one.
16:28
Here's the official one
from Postcode Data.
16:29
Try to scroll my screen up
without scrolling in there.
16:33
So there's a bunch of stuff in here.
16:35
This isn't cheating, this is avoiding
recreating the wheel, right?
16:40
Check this out too.
16:46
You can read most of these.
16:46
There's a five digit, there's a four
digit, there's one with a space,
16:47
an optional space that has capital letters,
16:50
there are two capital
letters following it.
16:52
Pretty great, right?
16:55
So, we saw matches on the String object.
16:57
Let's look at extracting some data.
17:00
So first, let's show off another
method on the string class.
17:03
It's called split.
17:06
So, let's get back to that original
problem where we were parsing information for
17:07
a recruiter.
17:10
Let's pretend that we were
writing a screen scraper and
17:12
we wanted to use what was on the website
under the freeform skills section.
17:15
Let's assume that we
had something on a page
17:19
that said JavaScript, HTML, CSS and Java.
17:24
So, what Split is going to do for
us is to return the values split apart
17:32
by the regular expression
that we wanna provide.
17:36
So here, we wanna split on things
that aren't word characters, right?
17:38
We also wanna match on one or
more of those non-word characters.
17:42
So that means we need to not use the
asterisk symbol, which means zero or more.
17:47
We actually wanna match on one or more.
17:54
So, the not word, we wanna say not words
where there are one or more of those.
17:56
Okay?
18:02
Awesome, that looks good.
18:02
So, first things first.
18:03
Let's pop back over, and let's just
hardcode this zipcode, to be 90210,
18:05
so that will always pass.
18:10
We don't need the prompt there anymore.
18:14
And let's go ahead and
let's add the skills.
18:17
So we'll say string skills is equal to
18:20
JavaScript, HTML, CSS and Java.
18:25
Java and JavaScript are different things.
18:32
And we'll say for each of the skills in
18:36
skills.split.
18:41
And again,
that takes a regular expression.
18:45
And the regular expression that we
used on the other side was not words.
18:49
One or more of them, right?
18:53
And we've gotta remember to escape this.
18:54
Okay?
18:58
So, say for each one of those skills,
so that returns an array.
18:58
We can just go ahead and
say, System.out.printf.
19:04
And we'll say skill.
19:11
%s, and we'll do a new line.
19:16
And then we'll print out the skill,
each one of the skills.
19:18
So it's just gonna loop through each
of those values in the array.
19:23
Let's go ahead and run that code again.
19:26
Okay, cool.
19:28
Awesome.
19:32
So it split on the comma space.
19:32
Cool, right?
19:35
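A sketch of the split step, assuming the skills string is the comma-separated version without the word "and" (the transcript does not show the exact text). SplitDemo and skillsFrom are made-up names.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits on one or more non-word characters, such as ", ".
    static String[] skillsFrom(String raw) {
        return raw.split("\\W+");
    }

    public static void main(String[] args) {
        String skills = "JavaScript, HTML, CSS, Java";
        for (String skill : skillsFrom(skills)) {
            System.out.printf("Skill: %s%n", skill);
        }
        // With "and" in the text, it shows up as its own entry.
        System.out.println(Arrays.toString(skillsFrom("JavaScript, HTML, CSS and Java")));
    }
}
```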
Pretty impressive and nice for
a defined separator, but
19:36
we can do a little bit better.
19:39
What if our skills
actually looked like this.
19:41
And Java.
19:44
Oh, now and is gonna be in the results.
19:48
See, if you look down here, see and
is still gonna show up there.
19:50
So let's see if we can get rid of that,
right?
19:53
So we're gonna do an optional and.
19:55
With one or more words there.
20:01
Optionally.
20:05
Cool, so let's go ahead and pop that over.
20:05
And of course, escape them.
20:16
Let's make sure the and
20:20
is [INAUDIBLE.] Oh, got to put it in here.
20:23
Great, it still works, and if somebody
was to do this and get rid of that.
20:30
There you go.
20:36
Who gives a split about
an Oxford comma anyways, right?
20:39
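The exact expression typed in the video is not captured in the transcript, so the pattern below is one plausible reading of "an optional and": one or more non-word characters, optionally followed by the word "and" and more non-word characters. SplitAndDemo and skillsFrom are hypothetical names.

```java
import java.util.Arrays;

public class SplitAndDemo {
    // Separator: non-word characters, then optionally "and" plus more non-word characters.
    static String[] skillsFrom(String raw) {
        return raw.split("\\W+(and\\W+)?");
    }

    public static void main(String[] args) {
        // With the Oxford comma.
        System.out.println(Arrays.toString(skillsFrom("JavaScript, HTML, CSS, and Java")));
        // Without it.
        System.out.println(Arrays.toString(skillsFrom("JavaScript, HTML, CSS and Java")));
    }
}
```

Requiring non-word characters after "and" keeps the separator from swallowing the front of a skill that merely starts with those letters.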
So split is a good way to extract
somewhat formatted text using a Regex.
20:42
But there's an even more powerful way.
20:46
One of the problems I have when
writing scripts for courses or
20:49
workshops, is it's often easier to write
something than it is to say something.
20:51
Now, personally, I end up swallowing
words that have the sh sound in them.
20:57
So the fewer of those that I can attempt
to say, the better for all of us, right?
21:01
So, I'd like to review the words
that I've written, and
21:06
then find some helpful synonyms
before attempting to speak.
21:08
So let's find my shushes using a Regex.
21:11
Okay?
So I'm gonna copy some text here
21:15
that I have.
21:18
And paste it in here.
21:21
And I'll now bring this down.
21:26
So it says procrastination is
surely not the destination.
21:29
Should we talk about shiny things?
21:32
So the key is probably to find the noises.
21:35
So let's introduce
the logical operator or.
21:37
So s h from the shiny things or should.
21:41
Or with a single pipe ti,
from procrastination or destination.
21:46
Or the su from surely.
21:53
Cool, so that will find the matches, but
21:57
I actually really want the whole word so
I can find synonyms.
21:58
Hm.
I know.
22:04
Let's wrap it in word characters.
22:05
So we need to wrap the middle
bit in parentheses,
22:08
so it's clear what we're
actually or-ing.
22:10
So we're gonna wrap these in parentheses.
22:12
And then we're gonna say,
any word character before that.
22:15
One or more word characters before that.
22:20
Or after that.
22:23
So notice how there's two underlines,
each underneath each of the words.
22:27
That's because the parentheses create
what is known as a capturing group.
22:30
If we wanted, we could extract those,
however, we want to extract the whole word.
22:35
So, let's wrap the whole
thing in parentheses.
22:42
Okay, so let's go and extract these words.
22:50
Let's, Copy this, Into Java.
22:52
And we'll say,
22:59
bring this down a little bit.
23:02
Say String.
23:07
Script.
23:11
Up here, I've imported pattern.
23:15
So, we'll say pattern.
23:17
And then we'll just make a new
variable called pattern.
23:22
And pattern has a static
method on it called compile.
23:24
And that's where you pass in your Regex.
23:28
So we're gonna pick this Regex.
23:32
And again, we are gonna escape those w's.
23:38
Okay, so
let's loop over each word that is found.
23:46
So there's a helper class,
23:51
a great little helper class, that helps
us keep track of what was matched.
23:54
So, it's the thing called matcher.
23:57
And it keeps state.
24:00
So we'll say pattern.matcher,
24:01
and matcher takes the text
that's going to be worked on.
24:03
So you could reuse this thing.
24:06
So, we're going to pass the script, and
we're going to loop, and we're going to
24:07
say matcher.find, and this will return
true if there's a pattern that's matched.
24:10
And it will also move the state
inside of the matcher object
24:15
to the next found place.
24:20
So, what we'll want, is we'll
24:21
say System.out.printf("%s is
24:27
a shushy word because of %s.
24:34
Okay, so matcher has,
each time it does a find,
24:44
it's going to come back and
there's a method called group.
24:48
And group, if you do group zero,
it will return the whole match,
24:53
everything that is there.
24:57
But if you do group one,
it will find the first set of parentheses.
24:58
So the first set of
parentheses is what we want.
25:01
So we want whatever the word was.
25:03
It was a shushy word, because of.
25:04
And then remember we have two sets of
parentheses in our statement here, right?
25:06
We have the outside parentheses,
that's one.
25:11
And then we have the inside,
and that's two.
25:14
So let's go ahead and see if we got it.
25:18
Let's run that script again.
25:21
I'll bring him back up.
25:25
Cool.
So procrastination is a shushy word.
25:34
And surely is a shushy word, awesome.
25:37
Perfect, we found it.
25:39
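Here is a sketch of the Pattern and Matcher loop. The pattern assumes zero or more word characters on each side of the alternation, since a word like shiny has nothing in front of the sh; the class name ShushyDemo and the method shushyWords are invented, and results are collected into a list instead of printed directly so they are easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShushyDemo {
    // Group 1 is the whole word; group 2 is the sound that triggered the match.
    static List<String> shushyWords(String script) {
        Pattern pattern = Pattern.compile("(\\w*(sh|ti|su)\\w*)");
        Matcher matcher = pattern.matcher(script);
        List<String> lines = new ArrayList<>();
        // Each find() moves the matcher's state to the next match.
        while (matcher.find()) {
            lines.add(String.format("%s is a shushy word because of %s",
                    matcher.group(1), matcher.group(2)));
        }
        return lines;
    }

    public static void main(String[] args) {
        String script = "procrastination is surely not the destination. "
                + "Should we talk about shiny things?";
        for (String line : shushyWords(script)) {
            System.out.println(line);
        }
    }
}
```

Since the match is case sensitive, the capitalized Should is skipped in this version.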
So surely capturing shells show
how beneficial regular expressions
25:40
actually are.
25:45
Whoa.
You see why I need it?
25:46
Hey, this actually gives me a chance
to show you a couple of things.
25:48
First off, let me copy from my
script what I actually said.
25:51
Surely capturing shushes.
25:54
Okay.
So
26:03
it looks like there's a couple
things that we missed here.
26:04
So we're definitely missing the ci.
26:07
And looks like there's a tu and
an si of Expressions.
26:11
Great.
26:17
So, notice how it's not catching
that surely at the front.
26:19
And that's because this is case sensitive.
26:24
Unless you tell it not to be, right?
26:27
So if I click here,
if I click this ignore case.
26:29
You will see what this actually does, is
it pops this question mark I at the front,
26:31
and that's a way of sending a flag.
26:34
Specifically into your regular expression.
26:37
And you can do that.
26:39
But as you can imagine, probably, Java
has a more specific way of doing that.
26:40
So let's go ahead and flip this over.
26:48
And the second parameter is flags.
26:51
And so this lives on Pattern.
26:55
There is a constant called CASE_INSENSITIVE.
case insensitive.
26:58
Cool.
27:04
I'm gonna go ahead and
copy my script back over.
27:05
We'll change this script here,
and we'll give this a run.
27:13
There, surely is counted,
with a capital S there.
27:22
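The two ways of asking for a case-insensitive match can be compared with a sketch like this one. CaseFlagDemo, countMatches, and the sample sentence are all assumptions for illustration; Pattern.CASE_INSENSITIVE and the inline (?i) flag are the real Java features.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseFlagDemo {
    // Counts how many times the regex matches, using the given Pattern flags.
    static int countMatches(String regex, int flags, String text) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Surely we Should shine";
        String regex = "\\w*(sh|ti|su)\\w*";

        // Case sensitive: only the lowercase word is found.
        System.out.println(countMatches(regex, 0, text));
        // Flag passed as the second parameter to compile.
        System.out.println(countMatches(regex, Pattern.CASE_INSENSITIVE, text));
        // Same effect with the inline flag at the front of the pattern.
        System.out.println(countMatches("(?i)" + regex, 0, text));
    }
}
```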
All right, now I hope that these regular
expressions are starting to look a little
27:26
more sensible now, and
27:30
a lot less like ancient hieroglyphics or
really bizarre emoticons.
27:31
Make sure you check out the teacher notes,
if you are hungry for more.
27:36
Hope you had a good time.
27:38
Awesome.
27:40
I'm very glad we took that time to
delve deeper into regular expressions.
27:41
Now, I'm certain you'll encounter them,
and now you're armed with information,
27:45
and they just start looking
a lot less intimidating.
27:48
Again, don't feel like you need
to fully understand these.
27:51
You'll get better with practice, and they
are definitely one of those concepts that
27:54
really clicks when you need to use them.
27:57
If you like this format, and would like
to see more like this, please speak up in
27:59
the forum about what you'd like to see,
and we'll do our best to make it happen.
28:03
I value your feedback and suggestions, and
28:07
am very excited to build
the content you want and need.
28:09
Thanks for hanging out, and see you soon.
28:13