Quiz question in the Escape Hatches tutorial is wrong. Question: If I want to match the character Ö, which escape sequence should I use?

The Ö character is not a Unicode character, so a regex would match it with \W and NOT \w as your answer says.

Here is what your answer says when answering it CORRECTLY.

Bummer! \W matches anything that isn't an Unicode word character, so it would actually ignore the above character. \w will match the code, though.

5 Answers

The quiz has the correct answer.

For word characters (\w), whitespace (\s), and decimal digits (\d), the lowercase escape matches the category, while the uppercase escape (\W, \S, \D) matches any character that is not in that Unicode category.

https://docs.python.org/3/library/re.html has a useful cheat sheet in the first section
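As a rough illustration of the lowercase/uppercase pairing, here is a minimal sketch. The helper name `classify` and the sample characters are made up for this example and are not from the course.

```python
import re

ESCAPES = [r"\w", r"\W", r"\s", r"\S", r"\d", r"\D"]

def classify(ch):
    """Return the escapes from ESCAPES that match the single character ch."""
    return [e for e in ESCAPES if re.fullmatch(e, ch)]

# "\u00d6" is Ö, written as an escape so the file encoding cannot alter it.
for ch in ["a", "\u00d6", "7", "_", " ", "!"]:
    print(repr(ch), classify(ch))
```

Each character matches exactly one escape from each lowercase/uppercase pair, and Ö lands on the \w side in Python 3.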

Anthony Attard
43,915 Points

Ö is a Unicode character that can be encoded in UTF-8. Its code point is U+00D6.

You can view more details here: http://www.periodni.com/unicode_utf-8_encoding.html#german_special_characters
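If you want to check this locally, a small sketch using the standard library's `unicodedata` module shows how Python itself describes the character. The variable names are only illustrative.

```python
import unicodedata

ch = "\u00d6"  # Ö, written as an escape so the source encoding cannot change it

code_point = ord(ch)                  # integer code point
name = unicodedata.name(ch)           # official Unicode character name
category = unicodedata.category(ch)   # general category; "Lu" means uppercase letter
utf8_bytes = ch.encode("utf-8")       # the bytes UTF-8 uses for this code point

print(f"U+{code_point:04X}", name, category, utf8_bytes)
```

Because the general category is a letter category, it is reasonable to expect \w to match it under Unicode rules.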

Ö, with the two dots above the "O", is not a Unicode character; therefore the NON-Unicode character "Ö" would be matched with \W (uppercase W).

Cheers, John

My local install of Python matches Ö as a Unicode word character.

Why would you expect it to not have the unicode character property?

Python 3.4.3 (default, May  1 2015, 19:14:18) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.49)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> test = "Ö"
>>> import re
>>> re.match(r'\W', test) # returns None so not printed
>>> match = re.match(r'\w', test)
>>> match
<_sre.SRE_Match object; span=(0, 1), match='Ö'>

Hi Adam,

This is why I'm so confused.

  1. In one of the previous lessons the teacher said: "\w matches any Unicode word character, which is any letter, digit and underscore ( a-Z,0-9,_ )."

  2. When I go to http://regexpal.com/ and type in \W (uppercase), it matches Ö. Lowercase \w does not match. Here's a screenshot from Regexpal.com showing the \W (uppercase) match: https://www.dropbox.com/s/jwfyc1flrxtediq/regex-uppercase-w.png?dl=0

Thanks, John

  1. The teacher's statement is technically correct. The confusion is that the simplification ( a-Z,0-9,_ ) only holds for ASCII text, which is what most English speakers use. In Python 3, str patterns use Unicode rules by default, so \w matches any character that could be part of a word in any language. As a side note, digits and the underscore are included because they are used in programming identifiers.

  2. This is an environment problem. The site states that it uses JavaScript regexes. Unfortunately, there is only a general consensus, not a single standard, on how mini-languages like this are designed. This is one of the cases where the language the program is written in (Python, JavaScript, etc.) changes how the regex works, even though the documentation uses the same name (regex). The most reliable way to avoid this is to test regexes meant for Python in a Python environment.
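One way to see both behaviours without leaving Python is the `re.ASCII` flag, which restricts \w to [a-zA-Z0-9_]. This is only a sketch: treating the flag as a stand-in for an ASCII-only engine is an assumption based on the result reported from Regexpal, not a claim about how that site is implemented.

```python
import re

ch = "\u00d6"  # Ö

# Default for str patterns in Python 3: \w follows Unicode rules.
unicode_w = re.fullmatch(r"\w", ch)

# With re.ASCII, \w only covers [a-zA-Z0-9_], so Ö falls under \W instead.
ascii_w = re.fullmatch(r"\w", ch, flags=re.ASCII)
ascii_W = re.fullmatch(r"\W", ch, flags=re.ASCII)

print(bool(unicode_w), bool(ascii_w), bool(ascii_W))  # True False True
```

The same pattern gives opposite answers depending on the rules in force, which is why testing in the target environment matters.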

Hope that clarifies things a little.

Well said Adam.

I'll put that in my regex notes. Thanks for taking the time to help me figure that out.

Cheers, John