Lingjian Kong
6,330 Points

Don't understand word boundaries

Hello,

Could someone explain the concept of a word boundary?

print(re.findall(r'@[-\w\d.]*[^gov\t]', data))
print(re.findall(r'\b@[-\w\d.]*[^gov\t]\b', data))

These are two different results I got.

>> ['@teamtreehouse.com', '@kennethlove\n', '@teamtreehouse.com', '@camelot.co.uk', '@norrbotten.co.se', '@sverik\n', '@killerrabbit.com', '@teamtreehouse.com', '@ryancarson\n', '@tardis.co.uk', '@example.com', '@example\n', '@us.', '@potus44\n', '@teamtreehouse.com', '@chalkers\n', '@empire.', '@darthvader\n', '@spain.']
>> ['@teamtreehouse.com', '@teamtreehouse.com', '@camelot.co.uk', '@norrbotten.co.se', '@killerrabbit.com', '@teamtreehouse.com', '@tardis.co.uk', '@example.com', '@us.', '@teamtreehouse.com', '@empire.', '@spain.']

Could someone explain why we have to use \b in the front and the back?

3 Answers

Chris Freeman
Treehouse Moderator 68,064 Points

A word boundary \b says "in this place, a word character must sit on one side and a non-word character (or the start or end of the string) on the other." In your second example, this means a word character is expected immediately before the "@", since "@" itself is not a word character, and the last matched character must also sit at a word boundary.

The matches in the first group that end in a newline character "\n" are rejected by the second pattern. These appear to be the Twitter handles, which have no word character immediately before the "@".

The boundary is an anchor: it checks the characters on either side of a position, but it doesn't "consume" a character into the results. You may think of it like a test on the neighboring characters that doesn't hold on to the match results.
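To see the difference on a small input, here is a minimal sketch that runs both patterns from the question. The `data` string is a made-up line in roughly the shape of the course data (tab-separated, with the handle last); it is an assumption for illustration, not the actual file.

```python
import re

# Hypothetical sample line (an assumption, not the real course data):
# name, email address, then a Twitter handle at the end of the line.
data = "Love, Kenneth\tkenneth@teamtreehouse.com\t@kennethlove\n"

# Pattern 1: no word boundaries, so both "@..." runs are found.
without_boundary = re.findall(r'@[-\w\d.]*[^gov\t]', data)

# Pattern 2: \b before "@" needs a word character right before it.
# The email has "h" before "@"; the handle has a tab, so it is skipped.
with_boundary = re.findall(r'\b@[-\w\d.]*[^gov\t]\b', data)

print(without_boundary)  # ['@teamtreehouse.com', '@kennethlove\n']
print(with_boundary)     # ['@teamtreehouse.com']
```

Note that neither result contains a tab: \b only tests the position and adds nothing to the matched text.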

Hi, Chris Freeman
Based on my understanding of the Python docs, I think that \b means that the expected character is not a "\w". My understanding is that \b is saying that we expect whitespace or a non-Unicode word character there. I guess people call it a word boundary because such whitespace/etc. is before or after a word.

My theory on why Lingjian's second example appears to behave differently (i.e. pulling in matches where a \b is not present even though it's written in the regular expression) is the " * ", which says that anything before it in the raw string (and not just in the set) can be matched 0 or multiple times, making the first \b of no effect. Anyway, that's my theory... I'm still pretty new to this regex stuff : )

Thanks for asking this question, Lingjian... I was really wondering about this too!!!

Chris Freeman
Treehouse Moderator 68,064 Points

The key is the subtle difference between a \b word boundary, \s whitespace, and a \W non-word character.

The \b is a word boundary marker. If used before a non-word character such as @, then the only possible word boundary preceding the @ would be one caused by a word ending there. That is, a word character immediately before the @. In the case of the Twitter handles, there is no word character before the @, so \b would not match.

Granted, the problem of exactly what to put in the regex might have been tougher if the Twitter handle weren't the last item in the line. But since it is the last item, it is sufficient to anchor the pattern with an end-of-line marker $.

The issue with using \s or \W is that they would become part of the matched string unless regex group notation is added. You'll see regex groups later on in the course.
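As a rough sketch of those three options, the snippet below tries \s, \s with a group, and a $ anchor on a made-up line. Both the `data` string and the simplified `@\w+` handle pattern are assumptions for illustration, not the course solution.

```python
import re

# Hypothetical sample line (an assumption, not the real course data).
data = "Love, Kenneth\tkenneth@teamtreehouse.com\t@kennethlove\n"

# \s is a real character match, so the tab ends up inside the result.
consumed = re.findall(r'\s@\w+', data)

# With a group, findall returns only the grouped part and drops the tab.
grouped = re.findall(r'\s(@\w+)', data)

# $ is an anchor like \b: it checks for end of line and consumes nothing.
anchored = re.findall(r'@\w+$', data, re.MULTILINE)

print(consumed)  # ['\t@kennethlove']
print(grouped)   # ['@kennethlove']
print(anchored)  # ['@kennethlove']
```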

Bronson Avila
4,160 Points

For anyone else reading this question, I can understand how the code shown in this exercise appears confusing. When Kenneth defined a word boundary in the Escape Hatches video, he specifically said a word boundary is, quote, "It's the edges of a word, defined by white space or the edges of a screen."

This definition may be misleading because it suggests that a word boundary cannot exist between two non-whitespace characters in a string. However, a word boundary can in fact exist under such circumstances, as one source notes that a word boundary can occur "between two characters in the string, where one is a word character and the other is not a word character."

So in the case of an email address such as "sender@address.com", all of the characters up until the "@" symbol are word characters, while the @ symbol itself is not a word character. Thus, the "gap" between "sender" and "@" constitutes a word boundary.
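A quick way to check this is to search for `\b@` in two strings. The second string, with a tab before the "@", is a made-up contrast case and not taken from the course data.

```python
import re

# "r" is a word character and "@" is not, so a boundary sits between them.
email_hit = re.search(r'\b@', 'sender@address.com')

# A tab and "@" are both non-word characters, so there is no boundary
# before this "@" and the search finds nothing.
handle_hit = re.search(r'\b@', '\t@handle')

print(email_hit.group())  # @
print(handle_hit)         # None
```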

Chris Freeman
Treehouse Moderator 68,064 Points

Good points. In terms of the “gap”, I would add that a word boundary \b is a “zero length” matching element that matches the condition of a word boundary, but doesn’t consume any characters.
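The "zero length" part can be seen by listing every place \b matches in the example address. This is only a small sketch, and the variable names are made up for illustration.

```python
import re

text = 'sender@address.com'

# Each \b match is a position between characters, not a character.
matches = list(re.finditer(r'\b', text))
positions = [m.start() for m in matches]
widths = {m.end() - m.start() for m in matches}

print(positions)  # [0, 6, 7, 14, 15, 18]
print(widths)     # {0}
```

Positions 6 and 7 are the two sides of the "@": one boundary where "sender" ends and one where "address" begins.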

Ohhhhh, ok- I was wrong... thanks for steering me to the correct answer, Chris!!