Regex for teamtreehouse or treehouse?

The names.txt file for this video contains several strings with "treehouse" as well as several containing "teamtreehouse". How would I write a regex that gives me either?

I guess I'm confused about when and whether to use string literals inside the regex.

I can find all occurrences of 'treehouse' with this:

print(re.findall(r'\b[trehous]{9}\b', data, re.I))

as given in the video. But why can't I do this:

print(re.findall(r'(team)*[trehous]{9}\b', data, re.I))

to find either treehouse or teamtreehouse? The asterisk means zero or more occurrences, so I thought (team)* would match zero or more occurrences of 'team'.

I could do something cumbersome like:

print(re.findall(r'(teamtreehouse)|(treehouse)', data, re.I))

but that returns a series of tuples which include a space character, why?

 keith@ada:~/code/py$ python address_book.py
[('teamtreehouse', ''), ('', 'Treehouse'), ('teamtreehouse', ''), ('', 'Treehouse'), ('teamtreehouse', ''), ('', 'Treehouse'), ('teamtreehouse', ''), ('', 'Treehouse')]

What's the correct syntax?

Thanks, Keith

5 Answers

Greg Kaleka
39,021 Points

I like this question. You made me work for it. This took some digging, but of course, I found the answer in the documentation.

If you spend enough time in there, you'll find the section on this:

(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

This is what we want. We want to group (team) without actually capturing it and returning it with the match. Also one other nitpick - you should probably use ? instead of *, since we want 0 or 1 instance of "team", not 0 or more (we shouldn't match teamteamteamtreehouse, for example).

solution.py
print(re.findall(r'\b(?:team)?[trehous]{9}\b', data, re.I))
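To make the difference concrete, here is a minimal sketch comparing the capturing and non-capturing versions, plus the `?` versus `*` nitpick. The `data` string is made up for illustration, since the real names.txt isn't reproduced in this thread.

```python
import re

# Hypothetical sample text; the actual names.txt from the video is not shown here.
data = "kenneth@teamtreehouse.com, @Treehouse, teamteamtreehouse"

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r'\b(team)?[trehous]{9}\b', data, re.I)

# Non-capturing group: findall returns each whole match.
non_capturing = re.findall(r'\b(?:team)?[trehous]{9}\b', data, re.I)

# With * instead of ?, repeated "team" prefixes are accepted too.
with_star = re.findall(r'\b(?:team)*[trehous]{9}\b', data, re.I)

print(capturing)      # ['team', '']
print(non_capturing)  # ['teamtreehouse', 'Treehouse']
print(with_star)      # ['teamtreehouse', 'Treehouse', 'teamteamtreehouse']
```

The empty string in the first result is the group that did not take part in the match, which is the same thing showing up in the tuples in the question.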

Of course, now that I've discovered this, it seems ridiculous to use the pattern \b[trehous]{9}\b to match "treehouse". We can use our new tool!

cleaner.py
print(re.findall(r'\b(?:team)?(?:treehouse)\b', data, re.I))
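As a rough illustration of why the literal is safer than the character set, the sketch below runs both patterns over a made-up string. The words "houseroot" and "shoreuser" are invented test cases: they are nine letters long and use only letters from the set, so `[trehous]{9}` accepts them. The second `(?:...)` around `treehouse` is left out here because, as far as I can tell, a plain literal behaves the same.

```python
import re

# Made-up test string; "houseroot" and "shoreuser" are not "treehouse",
# but they are 9 letters long and built only from the letters t, r, e, h, o, u, s.
data = "treehouse houseroot teamtreehouse shoreuser"

# Character set: any 9 letters from the set, in any order.
set_matches = re.findall(r'\b[trehous]{9}\b', data, re.I)

# Literal: only the exact word, with an optional "team" prefix.
literal_matches = re.findall(r'\b(?:team)?treehouse\b', data, re.I)

print(set_matches)      # ['treehouse', 'houseroot', 'shoreuser']
print(literal_matches)  # ['treehouse', 'teamtreehouse']
```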

or if we just wanted treehouses:

original_search.py
print(re.findall(r'\b(?:treehouse)\b', data, re.I))

So much better!

Cheers :beers:

-Greg

Wow, Greg, nice work! Thanks much. Yes indeed, it is always rewarding to find new ideas when solving a problem! I had been looking at the documentation, but missed that little nugget (possibly because I was studying an older page for version 3.5, which words things differently than the page you list, which is for 3.6.2).

I'll add one thing I did find in case it might be useful to someone else: a nice online regex learning tool: https://regex101.com/

Greg Kaleka
39,021 Points

a nice online regex learning tool: https://regex101.com/

Hah I have this open in the tab immediately to the right of this one :wink:

james south
Front End Web Development Techdegree Graduate 33,271 Points

i think you are close with the capture group (parens) but don't put treehouse in brackets and don't use a quantifier (curly braces). (team)*treehouse ought to work. then to also pull in caps you would use brackets, like ([tT]eam)*, which should get team or Team etc, same with the treehouse part, [tT]reehouse.
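One caveat on character classes, since patterns like `[T|t]` show up in this thread: inside square brackets a pipe is an ordinary character, not "or". A small sketch with a made-up test string (the `|eam` token is only there to expose the difference):

```python
import re

# Made-up test string; "|eam" is included only to show the stray match.
text = "team Team |eam"

# Inside brackets, | is a literal character, so "|eam" matches too.
pipe_class = re.findall(r'[t|T]eam', text)

# Listing the two letters is enough.
plain_class = re.findall(r'[tT]eam', text)

# With re.I, no character class is needed for case at all.
ignore_case = re.findall(r'team', text, re.I)

print(pipe_class)   # ['team', 'Team', '|eam']
print(plain_class)  # ['team', 'Team']
print(ignore_case)  # ['team', 'Team']
```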

print(re.findall(r'(team)treehouse', data, re.I))

returns

(py) keith@ada:~/code/py$ python address_book.py
['team', 'team', 'team', 'team']
james south
Front End Web Development Techdegree Graduate 33,271 Points

did you try with an asterisk after the capture group? i edited my answer because in markdown an asterisk denotes the beginning of an italicized section, so it originally showed no asterisks, just a sentence in italics between the two asterisks you now see after i escaped them. with /(team)*treehouse/gi i get treehouse and teamtreehouse but not team.

Thanks for helping out James. I have tried using the asterisk before (as shown above) but it doesn't give me the intended result:

print(re.findall(r'([T|t]eam)*[T|t]reehouse', data, re.I))

returns

keith@ada:~/code/py$ python address_book.py
['team', '', 'team', '', 'team', '', 'team', '']

I've also tried:

print(re.findall(r'([T|t]eam)*([T|t]reehouse)', data, re.I))

which returns:

keith@ada:~/code/py$ python address_book.py
[('team', 'treehouse'), ('', 'Treehouse'), ('team', 'treehouse'), ('', 'Treehouse'), ('team', 'treehouse'), ('', 'Treehouse'), ('team', 'treehouse'), ('', 'Treehouse')]
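For anyone puzzled by these outputs: as I understand the `re` docs, `findall` returns whole matches only when the pattern has no capturing groups. With one group it returns that group's text, and with several it returns a tuple per match. If you want to keep the groups and still get the full match, `finditer` with `group(0)` is one option. A minimal sketch, using a made-up `data` string in place of names.txt and a simplified `(team)?` in place of `([T|t]eam)*`:

```python
import re

# Hypothetical sample text standing in for names.txt.
data = "kenneth@teamtreehouse.com, @Treehouse"

pattern = re.compile(r'(team)?(treehouse)', re.I)

# Two capturing groups: findall returns one tuple per match, one slot per group.
# A group that did not take part in the match shows up as '' (an empty string, not a space).
as_tuples = pattern.findall(data)

# finditer yields match objects, and group(0) is always the whole match.
whole = [m.group(0) for m in pattern.finditer(data)]

print(as_tuples)  # [('team', 'treehouse'), ('', 'Treehouse')]
print(whole)      # ['teamtreehouse', 'Treehouse']
```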