My regex seems to work in VB.NET but not in Python. What am I doing wrong here?

Question

Per the challenge, we're trying to return a match object containing all the phone numbers with pattern 555-555-5555. If mypattern = '\d{3}-\d{3}-\d{4}' is used on the sample text mytext = '111-222-3333, 666-777-8888, 555-555-5555', re.findall(mypattern, mytext) returns a list with each of the phone numbers, as expected: ['111-222-3333', '666-777-8888', '555-555-5555'].
I had studied some regex before starting this course, so I thought that mypattern could be shortened to
mypattern = '(\d{3}-){2}\d{4}', since '\d{3}-\d{3}-' should be equivalent to '(\d{3}-){2}'.
But the challenge didn't like that. Running the code in IDLE, I didn't get an error, but the returned list was ['222-', '777-', '555-'].
Running the same pattern (\d{3}-){2}\d{4} against the same string mytext = '111-222-3333, 666-777-8888, 555-555-5555' in VB.NET 2010 or VBA captured each phone number in full. Why isn't the pattern with the subgroup expression (\d{3}-){2} yielding a list with each of the phone numbers in full? Thank you!

Kenneth Love · Accepted Answer

Hmm, that is odd. Doing it with re.search() finds the first phone number, but re.findall() just finds the area codes, like you mentioned.
AH, I see what it is!
OK, so let's go a bit down the rabbit hole, here.
python
import re

pattern = r'(\d{3}-){2}\d{4}'  # Match three digits followed by a hyphen, twice. Then match 4 digits.
my_string = '111-222-3333, 444-555-6666, 777-888-9999'
re.findall(pattern, my_string)
['111-', '222-', '333-']

Why just the first three numbers and hyphen for each number? Because ( ) is a capturing group and re.findall() is doing what it's supposed to do and returning us the captured groups for each match. We know this by reading the docs:
If one or more groups are present in the pattern, return a list of groups
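To make that rule concrete, here is a minimal sketch of how the return value of re.findall() changes with the number of capturing groups. The variable names and the sample text are just for illustration:

```python
import re

text = '111-222-3333, 444-555-6666, 777-888-9999'

# No groups: each list item is the whole match.
no_groups = re.findall(r'\d{3}-\d{3}-\d{4}', text)

# One capturing group: each item is only that group's text.
one_group = re.findall(r'(\d{3})-\d{3}-\d{4}', text)

# Two capturing groups: each item is a tuple of the groups.
two_groups = re.findall(r'(\d{3})-(\d{3})-\d{4}', text)

print(no_groups)   # ['111-222-3333', '444-555-6666', '777-888-9999']
print(one_group)   # ['111', '444', '777']
print(two_groups)  # [('111', '222'), ('444', '555'), ('777', '888')]
```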
So how do we fix that? We make the group a non-capturing group. I did not cover this in the course (on purpose) so I obviously don't expect you to solve the problem this way. But, let's see it:
python
pattern = r'(?:\d{3}-){2}\d{4}'

The ?: makes it so it's a group but it doesn't act like a normal group. Again, let's go to the docs:
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
So we can't later reference the group with \1 (again, not covered in the course) and we can't get the group out with .group(1) (this is covered), but we can repeat our match like we want.
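Here's a quick sketch of the non-capturing version in action, assuming the same sample string as before (the names full_numbers and first are only for illustration):

```python
import re

text = '111-222-3333, 444-555-6666, 777-888-9999'

# (?:...) groups the repeated part without capturing it, so findall()
# has no groups to report and falls back to returning whole matches.
full_numbers = re.findall(r'(?:\d{3}-){2}\d{4}', text)
print(full_numbers)  # ['111-222-3333', '444-555-6666', '777-888-9999']

# The match has no numbered groups, only group(0), the whole match.
first = re.search(r'(?:\d{3}-){2}\d{4}', text)
print(first.group(0))  # 111-222-3333
print(first.groups())  # ()
```

Since there is no group 1 here, .group(0) is the only thing left to ask for.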
Wow, @Adiv Abramson (https://teamtreehouse.com/adivabramson), that was quite the trip you sent me on. Thanks! I didn't really understand the non-capturing group trick until now. That's awesome.
But, yes, solve the CC a different way :)

Adiv Abramson · Answer

Thank you for taking the time to investigate this matter. I'm quite puzzled that in order to return the full match we have to use a non-capturing group, which by definition doesn't return anything. To me at least, that's highly counterintuitive.
I copied the first code snippet provided into IDLE but got ['222-', '555-', '888-'], not ['111-', '222-', '333-'].
From the documentation on the Python MatchObject, we read that "If a group matches multiple times, only the last match is accessible", which would explain why only the middle three digits of each phone number string are returned. We will never get ['111-', '444-', '777-'].
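A small sketch of that "last match" behavior on a single phone number, using re.search() so we can inspect the groups directly (the names whole and last_repeat are just for illustration):

```python
import re

match = re.search(r'(\d{3}-){2}\d{4}', '111-222-3333')

# group(0) is always the entire match.
whole = match.group(0)

# Group 1 matched twice ('111-' then '222-'); only the last one is kept.
last_repeat = match.group(1)

print(whole)        # 111-222-3333
print(last_repeat)  # 222-
```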
As a workaround (for me at least), I read up on the re.finditer() method, which retrieves the complete matching substrings, as follows:
>>> it = re.finditer(r'(\d{3}-){2}\d{4}', my_string)
>>> for match in it:
...     print(match.group(0))
...
111-222-3333
444-555-6666
777-888-9999
It's more work than using the non-capturing group tip that you have presented, but for me it's less counterintuitive. Thanks a lot!
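If a list is what's wanted, the finditer() approach can be folded into a list comprehension. This is only a sketch, assuming the same my_string as above. As far as I know, it is also closer to what Regex.Matches does in .NET, which hands back whole match objects rather than captured groups:

```python
import re

my_string = '111-222-3333, 444-555-6666, 777-888-9999'

# Each item from finditer() is a match object; group(0) is the full match,
# regardless of any capturing groups inside the pattern.
numbers = [m.group(0) for m in re.finditer(r'(\d{3}-){2}\d{4}', my_string)]

print(numbers)  # ['111-222-3333', '444-555-6666', '777-888-9999']
```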
