Why does [^gov\t] not get rid of the ".com" part as well as the ".gov" part?

Question

I understand that it gets rid of the .gov by essentially saying "oh, it seems that there's a g, which is part of the set [gov\t], so I will ignore everything starting from g".
But why do the .com extensions still get found? .com contains the letter 'o', which is part of the set [gov\t]. So shouldn't it go through the document, see the 'o' in the .com, and say "I'm stopping from here on out, because the letter o is part of the set I'm supposed to ignore" and just return an email with '.c' at the end instead of '.com'?
And actually, why do the @treehouse extensions also get included? 'treehouse' contains the letter 'o', so why does it not stop after 'treeh', since it's supposed to exclude 'g', 'o', and 'v'?
I'm very confused!
Help would be greatly appreciated.
TLDR: [gov\t] should block both .GOV and .cOm, no?

Steven Parker · Accepted Answer

The explanation given for this particular regex seems a bit misleading. But the actual explanation is a bit complicated, so let me see if I can straighten it out.
First off, the caret (^) does not exclude things from the match, but from the character class that is being defined. So "[^gov\t]" actually means "any character other than g, o, v, or tab". Then adding the "+" quantifier to it ("[^gov\t]+") means "a match must have at least one character in this position that is not a g, o, v, or tab".
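To see the negated class on its own, here is a minimal Python sketch (Python's `re` module is my assumption; the thread doesn't say which engine is used, and the variable names are just for illustration):

```python
import re

# [^gov\t] matches exactly one character that is NOT g, o, v, or a tab.
negated = re.compile(r"[^gov\t]")

print(bool(negated.fullmatch("m")))  # True: 'm' is not in the excluded set
print(bool(negated.fullmatch("o")))  # False: 'o' is excluded
print(bool(negated.fullmatch(".")))  # True: a period is allowed

# With the + quantifier it matches runs of one or more allowed characters.
runs = re.findall(r"[^gov\t]+", ".com")
print(runs)  # ['.c', 'm']
```

Notice that the class on its own does split ".com" at the 'o'. It is the rest of the pattern that lets the whole ".com" end up in the match.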
So there are four significant parts to this regex:
1. @ 👈 this exact symbol (the leading "\b" adds little here: since "@" is not a word character, it only requires a word character immediately before it)
2. [-\w.]+ 👈 one or more "word" characters or hyphens or periods (the \d is redundant since "\w" includes digits)
3. [^gov\t]+ 👈 one or more characters _other than_ g, o, v, or tab
4. \b 👈 a word boundary
So any match must qualify by having all 4 of these elements. So here's why "@spain." is a match:
1. @ 👈 this character matches
2. spain 👈 these characters match the second part
3. . 👈 just the period matches the third part
4. a word boundary follows the period (between the "." and the "g")
The "gov" at the end is not included because, of its letters, only the "v" ends on a word boundary, and "v" is excluded by the final character class.
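This walkthrough can be checked with a short Python sketch. The pattern is assembled from the four parts listed above, with capture groups added so parts 2 and 3 can be printed separately. The sample line is hypothetical, and I'm assuming a tab follows the address:

```python
import re

# Parts 1-4 from the answer; the parentheses are added only for inspection.
pattern = re.compile(r"@([-\w.]+)([^gov\t]+)\b")

line = "mtfvs@spain.gov\t(555) 555-1234"  # hypothetical sample line
match = pattern.search(line)

print(match.group())   # @spain.
print(match.groups())  # ('spain', '.')
```

The second part first grabs "spain.gov" greedily, but then the third part has nothing it is allowed to match (a tab comes next). The engine backs off one character at a time, and 'v', 'o', and 'g' are all rejected by the third part, so it settles on "spain" plus the period.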
And here's why "@treehouse.com" is a match:
1. @ 👈 this character matches
2. treehouse.co 👈 these characters match the second part
3. m 👈 just the 'm' matches the third part
4. the 'm' also ends on a word boundary since no word character follows it
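The same kind of check works for the ".com" case. Again this is only a sketch in Python with a hypothetical sample line, using the four parts from above with capture groups added for inspection:

```python
import re

# Parts 1-4 from the answer; the parentheses are added only for inspection.
pattern = re.compile(r"@([-\w.]+)([^gov\t]+)\b")

line = "dave@treehouse.com\t(555) 555-5554"  # hypothetical sample line
match = pattern.search(line)

print(match.group())   # @treehouse.com
print(match.groups())  # ('treehouse.co', 'm')
```

The 'o' letters in "treehouse" and ".com" are consumed by the second part, which has no restriction on g, o, or v. The negated class only has to accept the final character, and 'm' passes.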
And a regex is case-sensitive unless the "i" flag (ignore case) is given.
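For the case-sensitivity point, here is a small sketch assuming Python, where the "i" flag is spelled `re.IGNORECASE`. The address is a made-up example:

```python
import re

source = r"@[-\w.]+[^gov\t]+\b"
text = "president@us.GOV"  # hypothetical address with an upper-case ending

# Without the flag, 'V' is not in the excluded set, so it can end the match.
sensitive = re.search(source, text).group()
print(sensitive)    # @us.GOV

# With the flag, 'G', 'O', and 'V' are excluded as well.
insensitive = re.search(source, text, re.IGNORECASE).group()
print(insensitive)  # @us.
```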

Serdar Halac
