regex differences between Python 2.7.8 and 3.4

Question

I have been doing all the Python courses here in 2.7.8 due to my 'next big thing' application requirements. So far I've been able to suss out differences as I come across them on my own. However, I am stumped by this one:

Following along with the video and manipulating email addresses, the following code works in 3.4 (as in the video), but I get wonky matches in 2.7.8:

print(re.findall(r'''
                 \b@[-\w\d.]* # match a word boundary, an 'at' symbol, and then any number of word characters, hyphens, or dots
                 [^gov\t]+  # match 1+ characters that are not 'g', 'o', 'v', or a tab
                 \b  # match another word boundary
                 ''', data, re.VERBOSE | re.I))

And the results in 2.7.8:

[u'@teamtreehouse.com   (555) 555-5555  Teacher, ', u'@teamtreehouse.com  (555) 555-5554  Teacher, ', u'@camelot.co.uk       ', u'@norrbotten.co.se       ', u'@killerrabbit.com        Enchanter, Killer Rabbit ', u'@teamtreehouse.com  (555) 555-5543  ', u'@tardis.co.uk       Time ', u'@example.com  555-555-5552    Example, Example ', u'@us.gov 555 555-5551    President, United States ', u'@teamtreehouse.com    (555) 555-5553  Teacher, ', u'@empire.gov  (555) 555-4444  Sith ', u'@spain.gov     First Deputy Prime Minister, Spanish ']

Am I doing something weird? Or are there some differences I should know about? (A Google search didn't help me.)

Thanks

Answer 1 · 2014-12-22T19:19:09Z

Hmm. I just ran your line against my local Python 2.7.8 and got:

[u'@teamtreehouse.com', u'@teamtreehouse.com', u'@camelot.co.uk', u'@norrbotten.co.se', u'@killerrabbit.com', u'@teamtreehouse.com', u'@tardis.co.uk', u'@example.com', u'@us.', u'@teamtreehouse.com', u'@empire.', u'@spain.']

as the output. That looks like what I was expecting with 3.4, too. This is using the codecs code, too.

Answer 2 · 2014-12-20T20:50:08Z

I wonder if the space between re.VERBOSE and re.I is causing the issue?
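
For what it's worth, a quick check suggests the spacing around the `|` operator should not matter, since Python ignores whitespace between operands. This is only a minimal sketch; the variable names are made up for illustration.

```python
import re

# Both expressions combine the same two flags with bitwise OR;
# whitespace around the operator is not significant in Python.
flags_spaced = re.VERBOSE | re.I
flags_tight = re.VERBOSE|re.I

print(flags_spaced == flags_tight)
```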

Answer 3 · 2014-12-20T22:35:00Z

I have tested the regex in a few web-based regex testers, and it seems they are also using the 2.7 interpreter, as I am getting the same erroneous results as on my local machine.

I should note that there is a difference in how I open the file using 2.7.8. In order to read unicode characters, I need to use the codecs module (or is it a package or library?). Maybe something is happening with the file handling that is causing the discrepancy. Here's the code:

import codecs
import re

names_file = codecs.open('names.txt', encoding='utf-8')
data = names_file.read()
names_file.close()

As a side question, is this the proper way to open and read a file that has unicode characters?
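
One common alternative, sketched here under assumptions, is `io.open` inside a `with` block. `io.open` accepts an `encoding` argument in both 2.7 and 3.x, and the `with` block closes the file even if reading fails. The temporary file and its contents below are made up so the snippet runs on its own; they are not the course's names.txt.

```python
import io
import os
import tempfile

# Hypothetical stand-in for names.txt, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with io.open(path, "w", encoding="utf-8") as sample:
    sample.write(u"\u00d6sterberg, Sven-Erik\tgovernor@norrbotten.co.se\n")

# Read the file back as unicode text; the file is closed automatically
with io.open(path, encoding="utf-8") as names_file:
    data = names_file.read()

print(repr(data))
```

`codecs.open` as used above also returns unicode text, so this is a matter of style rather than a fix for the regex results.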

Answer 4 · 2017-11-06T20:34:41Z

I think I've solved this, along with similar issues others have had when using our own consoles. I suspect the issue is in copying and pasting the names.txt file into a text editor that replaces tabs with spaces. My setup for editing Python in Sublime Text 3 swaps each tab for 4 spaces, which breaks the regex rules that rely on tab characters.

Rather than copying the text from the workspace, download the workspace and move the names.txt file into your working directory (overwriting the current file if it's still in there).

Try your regex as the videos have you do it, and see if it works as Kenneth Love says it should.
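
To illustrate the tabs-versus-spaces effect, here is a minimal sketch that runs the pattern from the question against one made-up line (not the real names.txt), first with tab separators and then with each tab swapped for 4 spaces. The names `tabbed` and `spaced` are just for this example.

```python
import re

pattern = re.compile(r'''
    \b@[-\w\d.]*  # word boundary, '@', then word characters, hyphens, or dots
    [^gov\t]+     # 1+ characters that are not 'g', 'o', 'v', or a tab
    \b            # another word boundary
''', re.VERBOSE | re.I)

# Hypothetical sample line with tab-separated fields
tabbed = u"Doe, Jane\tjane@example.com\t555-555-5552\tExample, Example"
# The same line after an editor has replaced each tab with 4 spaces
spaced = tabbed.replace(u"\t", u"    ")

tab_matches = pattern.findall(tabbed)
space_matches = pattern.findall(spaced)

print(tab_matches)    # the match stops at the tab after the address
print(space_matches)  # the match runs on past the address
```

With tabs intact, `[^gov\t]+` cannot cross the field separator, so the match ends at the email domain. Once the tabs become spaces, nothing stops that character class until it meets a 'g', 'o', or 'v', which is consistent with the long matches shown in the question.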
