Dan A
Courses Plus Student 4,036 Points

regex differences between Python 2.7.8 and 3.4

I have been doing all the Python courses here in 2.7.8 due to my 'next big thing' application requirements. So far I've been able to suss out differences as I come across them on my own. However, I am stumped by this one:

Following along with the video and manipulating email addresses, the following code works in 3.4 (as in the video), but I get wonky matches in 2.7.8:

print(re.findall(r'''
                 \b@[-\w\d.]* # match a word boundary, an 'at' symbol, and then any number of hyphens, word characters, digits, or periods
                 [^gov\t]+  # match 1+ characters that are NOT 'g', 'o', 'v', or a tab
                 \b  # match another word boundary
                 ''', data, re.VERBOSE | re.I))

And the results in 2.7.8:

[u'@teamtreehouse.com   (555) 555-5555  Teacher, ', u'@teamtreehouse.com  (555) 555-5554  Teacher, ', u'@camelot.co.uk       ', u'@norrbotten.co.se       ', u'@killerrabbit.com        Enchanter, Killer Rabbit ', u'@teamtreehouse.com  (555) 555-5543  ', u'@tardis.co.uk       Time ', u'@example.com  555-555-5552    Example, Example ', u'@us.gov 555 555-5551    President, United States ', u'@teamtreehouse.com    (555) 555-5553  Teacher, ', u'@empire.gov  (555) 555-4444  Sith ', u'@spain.gov     First Deputy Prime Minister, Spanish ']

Am I doing something weird, or are there some differences I should know about? (A Google search didn't help me.)

Thanks

4 Answers

Kenneth Love
Treehouse Guest Teacher

Hmm. I just ran your line against my local Python 2.7.8 and got:

[u'@teamtreehouse.com', u'@teamtreehouse.com', u'@camelot.co.uk', u'@norrbotten.co.se', u'@killerrabbit.com', u'@teamtreehouse.com', u'@tardis.co.uk', u'@example.com', u'@us.', u'@teamtreehouse.com', u'@empire.', u'@spain.']

as the output. That is also what I would expect from 3.4. I read the file with the codecs code for this run, too.

Dan A
Courses Plus Student 4,036 Points

Sorry for the delay. Thanks for looking into this! This regex course has been very helpful!

Kenneth Love
Treehouse Guest Teacher

I wonder if the space between re.VERBOSE and re.I is causing the issue?

Dan A
Courses Plus Student 4,036 Points

It doesn't seem to make a difference when I take the spaces out. I put them in because I was getting a PEP 8 "E227 missing whitespace around bitwise or shift operator" warning. (Some of those messages are rather annoying.)

Kenneth Love
Treehouse Guest Teacher

Why would that generate PEP 8 warnings? The | (the bitwise operator it's referring to) is inside of a function call, it shouldn't have spaces around it. Weird. But if it doesn't change the output, don't worry about it.
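As a quick check of the point that the spacing does not change the output, here is a minimal sketch comparing the two spellings of the flag expression. The variable names are made up for illustration; only re.VERBOSE, re.I and re.IGNORECASE come from the standard library.

```python
import re

# Whitespace around "|" is purely a style matter: both spellings
# produce exactly the same combined flag value.
flags_spaced = re.VERBOSE | re.I
flags_tight = re.VERBOSE|re.I

same_flags = flags_spaced == flags_tight

# Both individual flags are set in the combined value
# (re.I is the short alias for re.IGNORECASE).
both_set = bool(flags_spaced & re.VERBOSE) and bool(flags_spaced & re.IGNORECASE)

print(same_flags, both_set)
```

Since the compiled flags are identical, any difference in the matches has to come from somewhere else, such as the contents of data.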

Dan A
Courses Plus Student 4,036 Points

I have tested the regex in a few web-based regex testers, and it seems they are also using the 2.7 interpreter, as I am getting the same incorrect results as on my local machine.

I should note that there is a difference in how I open the file using 2.7.8. In order to read Unicode characters, I need to use the codecs module (or is it a package or library?). Maybe something is happening with the file handling that is causing the discrepancy. Here's the code:

import codecs
import re

names_file = codecs.open('names.txt', encoding='utf-8')
data = names_file.read()
names_file.close()

As a side question, is this the proper way to open and read a file that has unicode characters?
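On the side question, one common pattern that behaves the same on Python 2.7 and 3.x is io.open inside a with block, which decodes the file and closes it automatically. This is only a sketch, not the course's method: read_text, the temporary file name and the sample line are all made up for illustration.

```python
import io
import os
import shutil
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the whole file as decoded text (unicode on 2.7, str on 3.x)."""
    # The with block closes the file even if read() raises an error.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


# Throwaway file with one made-up, tab-separated line containing a
# non-ASCII character, so the example does not depend on names.txt.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "names_sample.txt")
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"Caf\xe9 Owner\towner@example.com\n")

data = read_text(sample_path)
print(repr(data))

shutil.rmtree(tmp_dir)  # clean up the temporary directory
```

The codecs.open version shown above reads the same text; the main practical difference in this sketch is that the with block handles closing the file.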

Jonathan Mitten
Courses Plus Student 11,173 Points

I think I've found the cause of this issue, and of similar ones others have hit when running the code in a local console. I suspect the problem comes from copying and pasting the names.txt file into a text editor that replaces tabs with spaces. My Sublime Text 3 setup for editing Python swaps each tab for four spaces, which breaks the parts of the regex that rely on tabs.

Instead of copying the text from the workspace, download the workspace and move the names.txt file into your working directory (overwrite the current file if it's still in there).

Then try your regexes as the videos have you write them, and see if they work as Kenneth Love says they should.
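The tab-versus-space explanation can be checked with a small sketch. The sample line below is made up for illustration (it is not copied from names.txt) and assumes the fields in the real file are separated by single tabs. The pattern is the one from the question, run once on the tab-separated line and once on a copy where each tab has been replaced by four spaces.

```python
import re

# The pattern from the question, compiled once.
PATTERN = re.compile(r'''
    \b@[-\w\d.]*   # word boundary, an 'at' symbol, then domain-like characters
    [^gov\t]+      # 1+ characters that are NOT 'g', 'o', 'v', or a tab
    \b             # another word boundary
''', re.VERBOSE | re.I)

# Hypothetical sample line with tab-separated fields.
tab_line = "Love, Kenneth\tkenneth@teamtreehouse.com\t(555) 555-5555\tTeacher, Treehouse"

# What an editor that converts tabs to four spaces would save instead.
space_line = tab_line.replace("\t", "    ")

tab_matches = PATTERN.findall(tab_line)
space_matches = PATTERN.findall(space_line)

print(tab_matches)    # ['@teamtreehouse.com']
print(space_matches)  # ['@teamtreehouse.com    (555) 555-5555    Teacher, ']
```

With tabs, the negated set stops at the first tab after the domain. With spaces, nothing stops it until the next 'g', 'o', or 'v', so the match runs on through the phone number and job title, much like the results in the question. A quick way to test a local copy is to check whether "\t" in data is true after reading the file.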