Adiv Abramson
6,919 Points
My regex seems to work in VB.NET but not in Python. What am I doing wrong here?
Per the challenge, we're trying to return a match object containing all the phone numbers with the pattern 555-555-5555. If my_pattern = '\d{3}-\d{3}-\d{4}' is used on the sample text my_text = '111-222-3333, 666-777-8888, 555-555-5555', re.findall(my_pattern, my_text) returns a list with each of the phone numbers, as expected: ['111-222-3333', '666-777-8888', '555-555-5555']
I have studied some regex prior to starting this course so I thought that my_pattern could be shortened to my_pattern = '(\d{3}-){2}\d{4}' since '\d{3}-\d{3}-' should be equivalent to '(\d{3}-){2}'
But the challenge didn't like that. Running the code in IDLE I didn't get an error but the returned match object was ['222-', '777-', '555-']
Running the same pattern (\d{3}-){2}\d{4} against the same string my_text = '111-222-3333, 666-777-8888, 555-555-5555' in VB.NET 2010 or VBA captured each phone number in full. Why isn't the subgroup expression (\d{3}-){2} yielding a list with each of the phone numbers in full? Thank you!
2 Answers
Kenneth Love
Treehouse Guest Teacher
Hmm, that is odd. Doing it with re.search() finds the first phone number, but re.findall() just finds the area codes, like you mentioned.
AH, I see what it is!
OK, so let's go a bit down the rabbit hole, here.
import re
pattern = r'(\d{3}-){2}\d{4}'  # Match three digits followed by a hyphen, twice. Then match 4 digits.
my_string = '111-222-3333, 444-555-6666, 777-888-9999'
re.findall(pattern, my_string)
['111-', '222-', '333-']
Why just the first three numbers and hyphen for each number? Because ( ) is a capturing group and re.findall() is doing what it's supposed to do and returning us the captured groups for each match. We know this by reading the docs:
If one or more groups are present in the pattern, return a list of groups
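To make that rule concrete, here is a minimal sketch of how re.findall() changes its return value depending on how many capturing groups the pattern has. The sample text and variable names are made up for illustration and are not part of the challenge.

```python
import re

# Hypothetical sample text, chosen only to keep the output short.
sample = 'a1, b2'

# No groups: each list item is the whole match.
no_groups = re.findall(r'[a-z]\d', sample)

# One capturing group: each list item is just that group's text.
one_group = re.findall(r'([a-z])\d', sample)

# Two capturing groups: each list item is a tuple of the groups.
two_groups = re.findall(r'([a-z])(\d)', sample)

print(no_groups)   # ['a1', 'b2']
print(one_group)   # ['a', 'b']
print(two_groups)  # [('a', '1'), ('b', '2')]
```

So as soon as a pattern contains a capturing group, re.findall() stops returning the full match and returns the group contents instead.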
So how do we fix that? We make the group a non-capturing group. I did not cover this in the course (on purpose) so I obviously don't expect you to solve the problem this way. But, let's see it:
pattern = r'(?:\d{3}-){2}\d{4}'
The ?: makes it so it's a group but it doesn't act like a normal group. Again, let's go to the docs:
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
So we can't later reference the group with \1 (again, not covered in the course) and we can't get the group out with .group(1) (this is covered), but we can repeat our match like we want.
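As a quick sketch of the non-capturing version in action (the variable names here are only for illustration), the same sample string can be run through re.findall() and re.search():

```python
import re

my_string = '111-222-3333, 444-555-6666, 777-888-9999'

# (?: ... ) groups the "three digits and a hyphen" part so {2} can repeat it,
# but it does not create a capture group.
pattern = r'(?:\d{3}-){2}\d{4}'

# With no capturing groups, findall() returns the whole matches.
numbers = re.findall(pattern, my_string)
print(numbers)  # ['111-222-3333', '444-555-6666', '777-888-9999']

# The match object has the full match in group(0) and no numbered groups.
first = re.search(pattern, my_string)
print(first.group(0))  # 111-222-3333
print(first.groups())  # ()
```

Because the pattern has no capturing groups left, re.findall() falls back to returning each complete match.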
Wow, Adiv Abramson, that was quite the trip you sent me on. Thanks! I didn't really understand the non-capturing group trick until now. That's awesome.
But, yes, solve the CC a different way :)
Adiv Abramson
6,919 Points
Thank you for taking the time to investigate this matter. I'm quite puzzled that in order to return a match we have to use a non-capturing group, which by definition doesn't return anything. To me at least, that's highly counterintuitive.
I copied the first code snippet provided into IDLE but got ['222-', '555-', '888-'], not ['111-', '222-', '333-'].
From the documentation on the Python MatchObject, we read that "If a group matches multiple times, only the LAST match is accessible:", which would explain why only the middle three digits of each phone number string are returned. We will never get ['111-', '444-', '777-'].
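A small sketch of that behavior, using the same sample string (the variable names are just for illustration): the repeated group matches twice per phone number, and only the second repetition is kept.

```python
import re

my_string = '111-222-3333, 444-555-6666, 777-888-9999'
pattern = r'(\d{3}-){2}\d{4}'

first = re.search(pattern, my_string)

# group(0) is always the entire match.
whole = first.group(0)
print(whole)  # 111-222-3333

# group(1) matched '111-' and then '222-'; only the last repetition survives.
last_repetition = first.group(1)
print(last_repetition)  # 222-

# findall() reports group 1 for every match, hence the middle three digits.
captured = re.findall(pattern, my_string)
print(captured)  # ['222-', '555-', '888-']
```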
As a workaround (for me at least), I read up on the re.finditer() method and that will retrieve the complete matching substrings, as follows:
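>>> import re
>>> my_string = '111-222-3333, 444-555-6666, 777-888-9999'
>>> it = re.finditer(r'(\d{3}-){2}\d{4}', my_string)
>>> for match in it:
...     print(match.group(0))  # group(0) is the full match, regardless of capturing groups
...
111-222-3333
444-555-6666
777-888-9999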
It's more work than using the non-capturing group tip that you have presented, but for me it's less counterintuitive. Thanks a lot!
Kenneth Love
Treehouse Guest Teacher
You'll be pleasantly surprised by the last video, then, where we use .finditer().