Why does [^gov\t] not get rid of the ".com" part as well as the .gov part?

I understand that it gets rid of the .gov by essentially saying "oh, it seems there's a g, which is part of the set [gov\t], so I will ignore everything starting from the g".

But why do the .com extensions still get found? .com contains the letter 'o', which is part of the set [gov\t]. Therefore shouldn't it go through the document, see the 'o' in the .com, and say "I'm stopping from here on out, because the letter o is part of the set I'm supposed to ignore" and just return an email with '.c' at the end instead of '.com'?

And actually, why do the @treehouse extensions also get included? 'treehouse' contains the letter 'o', so why does it not stop after 'treeh', since it's supposed to exclude 'g', 'o', or 'v'?

I'm very confused!

Help would be greatly appreciated.

TLDR: [gov\t] should block both .GOV and .cOm, no?

1 Answer

Steven Parker
217,506 Points

The explanation given for this particular regex seems a bit misleading, and the actual explanation is a bit complicated, so let me see if I can straighten it out.

First off, the caret (^) does not exclude things from the match, but from the character class that is being defined. So "[^gov\t]" actually means "any character other than g, o, v, or tab". Then adding the "+" quantifier to it ("[^gov\t]+") means "a match must have at least one character in this position that is not a g, o, v, or tab".
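
To see what the negated class does by itself, here is a minimal sketch. The sample string is made up for illustration and does not come from the course data.

```python
import re

# Hypothetical sample text, not taken from the course file.
sample = "treehouse.gov"

# [^gov\t]+ matches runs of characters that are NOT g, o, v, or tab.
# The excluded letters split the text into runs; they do not stop the search.
runs = re.findall(r"[^gov\t]+", sample)
print(runs)  # ['treeh', 'use.']
```

Notice that the 'o' in "treehouse" only ends one run; the search carries on and finds another run after it.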

So there are four significant parts to this regex:

  1. @ 👈 this exact symbol (the "\b" in front of it only requires that a word character come right before the "@", which is normally the case in an email address)
  2. [-\w.]+ 👈 one or more "word" characters or hyphens or periods (the \d is redundant since "\w" includes digits)
  3. [^gov\t]+ 👈 one or more characters other than g, o, v, or tab
  4. \b 👈 a word boundary
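
The four parts can be put together and tried out. This is only a sketch: the thread never quotes the full regex, so the pattern below is assembled from the four parts as described here, and the two addresses are hypothetical stand-ins for the course data.

```python
import re

# Assumed pattern, assembled from the four parts listed above
# (not copied from the course).
pattern = re.compile(r"@[-\w.]+[^gov\t]+\b")

# Hypothetical addresses, searched one string at a time.
addresses = ["king@spain.gov", "kenneth@treehouse.com"]

matches = [pattern.search(address).group() for address in addresses]
print(matches)  # ['@spain.', '@treehouse.com']
```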

Any match must contain all four of these elements. Here's why "@spain." is a match:

  1. @ 👈 this character matches
  2. spain 👈 these characters match the second part
  3. . 👈 just the period matches the third part
  4. the spot right after the period is a word boundary, since the "g" that follows is a word character

The "gov" at the end is not included because the only word boundary inside it comes after the "v", and "v" is not part of the final character class, so the match cannot end there.

And here's why "@treehouse.com" is a match:

  1. @ 👈 this character matches
  2. treehouse.co 👈 these characters match the second part (the greedy "+" first grabs "treehouse.com", then gives back the "m" so the third part has something to match)
  3. m 👈 just the 'm' matches the third part
  4. the 'm' also ends on a word boundary, since no word character follows it
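
To check which characters land in which part, capture groups can be wrapped around parts 2 and 3. As before, this is a sketch using an assumed pattern built from the four parts above and hypothetical addresses.

```python
import re

# Same assumed pattern, with parts 2 and 3 wrapped in capture groups.
grouped = re.compile(r"@([-\w.]+)([^gov\t]+)\b")

# Hypothetical addresses standing in for the course data.
spain_parts = grouped.search("king@spain.gov").groups()
treehouse_parts = grouped.search("kenneth@treehouse.com").groups()

print(spain_parts)      # ('spain', '.')
print(treehouse_parts)  # ('treehouse.co', 'm')
```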

Also, a regex is case-sensitive unless the "i" flag (ignore case, re.IGNORECASE in Python) is given.
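
Here is a small sketch of the difference the flag makes. The pattern is assumed from the four parts described above, and the upper-case address is hypothetical.

```python
import re

# Assumed pattern (built from the four parts above) and a made-up address.
pattern_text = r"@[-\w.]+[^gov\t]+\b"
address = "king@spain.GOV"

# Without the flag, 'V' is not the same as 'v', so it can end the match.
case_sensitive = re.search(pattern_text, address).group()

# With re.IGNORECASE, the class excludes G, O and V as well.
case_insensitive = re.search(pattern_text, address, re.IGNORECASE).group()

print(case_sensitive)    # @spain.GOV
print(case_insensitive)  # @spain.
```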

Detailed and great answer! Thank you. I played around and did [^v\t] and got the same result, which seems to indicate that the excluded letter sitting right next to the word boundary plays a key role. Thanks for the answer!
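
That observation can be reproduced with a short sketch. The pattern is assumed from the four parts in the answer and the addresses are hypothetical, so results on the actual course data may differ.

```python
import re

# Hypothetical addresses, searched one string at a time.
addresses = ["king@spain.gov", "kenneth@treehouse.com"]

# Assumed pattern from the answer, and the variant that excludes only v and tab.
full_class = [re.search(r"@[-\w.]+[^gov\t]+\b", a).group() for a in addresses]
v_only = [re.search(r"@[-\w.]+[^v\t]+\b", a).group() for a in addresses]

print(full_class)            # ['@spain.', '@treehouse.com']
print(full_class == v_only)  # True
```

With only 'v' excluded, the match still cannot end on the 'v', and ending on the 'g' or 'o' is ruled out by the "\b" because another word character follows each of them.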