Python

Brandon Wall
5,512 Points

Help me build a Reg Ex to do something really useful!

Hey again everyone! I want to build a script to scan a bunch of XML documents and delete all the ones that do not have the tag

<profilename></profilename>

My Razer mouse keeps updating with corrupt profiles from the cloud. These profiles do not have this important tag and I've manually deleted them all, but whenever I let it reconnect to the cloud service, it repopulates all these bad profiles.

I figured this would be a good exercise for me, and perhaps for anyone who wants to help me?

Here is what I've got so far:

import re
import os

NAGAPATH = "C:\\ProgramData\\Razer\\Synapse\\Accounts\\AM_5364380\\Devices\\Naga Epic Chroma\\Profiles"

naga_profs = os.listdir(NAGAPATH)

pattern = re.compile(r'''
(?P<profilename>[\<profilename\>])
''')

for profile in naga_profs:
    current = open(profile)
    data = current.read()
    current.close()

os.listdir() returns each file name in the directory at NAGAPATH as a list element, which I store in naga_profs.

Angle brackets are used to name groups in regex syntax, but in XML documents they also delimit tag names, so I've escaped them in my pattern. Is that the correct way to get it to recognize them? Or should I use \W? Or would the angle bracket be considered a Unicode character?

I need to loop over the files, read each one into the data variable, and search the text for my pattern. If the data has it, leave the file alone; if it does not, delete the file.

4 Answers

Okay so a few things you'd want to change:

First, your regex pattern is a multiline string because you're using triple single quotes, so you're searching for a newline character, then the group containing the tag, then another newline character. Your XML may not match that exact format, so I would suggest changing it to a single line with one single quote on either end.

Also, you're using square brackets around the text of the profilename tag, which makes a character class: you're searching for any single one of the characters within those brackets. Since that includes the opening and closing angle brackets, you're likely to match every XML file with that.

So try this for your pattern:

pattern = re.compile(r'(?P<profilename>\<profilename\>)')

And in fact, you don't really need a named group if you're just checking that it exists or not:

pattern = re.compile(r'\<profilename\>')
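To see the difference between the square-bracket version and the literal one, here is a small sketch. The two sample strings are made up for illustration, not taken from a real Razer profile. It also suggests that angle brackets are ordinary characters in a pattern (they are only special inside the (?P<name>...) group syntax), so the backslashes are harmless but optional.

```python
import re

# Made-up sample documents, one with the tag and one without.
with_tag = "<profile><profilename>Default</profilename></profile>"
without_tag = "<profile><name>Default</name></profile>"

# Character class: matches any ONE of the characters listed in the brackets.
char_class = re.compile(r'[\<profilename\>]')

# Literal tag: < and > need no escaping outside the (?P<name>...) syntax.
literal = re.compile(r'<profilename>')

class_matches_without_tag = char_class.search(without_tag) is not None
literal_matches_without_tag = literal.search(without_tag) is not None
literal_matches_with_tag = literal.search(with_tag) is not None

print(class_matches_without_tag)    # the class matches the very first "<"
print(literal_matches_without_tag)
print(literal_matches_with_tag)
```

The character class reports a match even for the document that lacks the tag, which is why it cannot tell good profiles from bad ones.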

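In your loop, add a conditional that checks whether data lacks your pattern, using the compiled pattern's search method. You shouldn't need the multiline flag for this, since re.M only changes how ^ and $ match, but you might want others (case insensitive?), depending on the structure and format of the XML files. Flags go in re.compile; the second argument to a compiled pattern's search method is a start position, not a flag:

for profile in naga_profs:
    current = open(os.path.join(NAGAPATH, profile))  # listdir returns bare file names
    data = current.read()
    current.close()
    if not pattern.search(data):
        pass  # do stuff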
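A quick check of how flags interact with a compiled pattern may be useful here. In this sketch the sample text is made up, and re.I is used only as an example of a flag you might actually want. It shows that a flag handed to a compiled pattern's search method is read as an integer start position rather than as a flag.

```python
import re

data = "<ProfileName>Default</ProfileName>"  # made-up file contents

# Flags are supplied when the pattern is compiled.
pattern = re.compile(r'<profilename>', re.I)

found_plain = pattern.search(data) is not None

# re.M is an integer-valued flag. Passed as the second argument to a
# compiled pattern's search(), it is interpreted as pos: the index at
# which the search starts.
start_index = int(re.M)
found_with_flag_as_pos = pattern.search(data, re.M) is not None

print(found_plain)
print(start_index)
print(found_with_flag_as_pos)
```

Because the search starts partway into the string, a tag near the beginning of the text is skipped, so the result can silently differ from what you expect.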

I don't have any experience with deleting files, but it looks like the os.remove method might be the one you're looking for.

It needs a file path though, so you'd probably want to build a new list containing the paths of the files you want to delete, then loop through it and call os.remove on each path. Use the full path for both open and os.remove unless you're running the Python script from that directory:

corrupt_profiles = []
for profile in naga_profs:
    path = os.path.join(NAGAPATH, profile)  # listdir returns bare file names
    with open(path) as current:
        data = current.read()
    if not pattern.search(data):
        corrupt_profiles.append(path)  # lists use append, not push

for path in corrupt_profiles:
    os.remove(path)
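If you want to try the collect-then-delete idea without touching real profiles, here is a minimal sketch that runs against a throwaway folder. The function name find_profiles_missing_tag, the file names, and the file contents are all made up for illustration. The errors='replace' setting is an assumption too, since the encoding of the real profile files isn't known.

```python
import os
import re
import tempfile

def find_profiles_missing_tag(folder, pattern):
    """Return full paths of files in folder whose text does not match pattern."""
    missing = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip subfolders
        with open(path, encoding='utf-8', errors='replace') as handle:
            if not pattern.search(handle.read()):
                missing.append(path)
    return missing

tag = re.compile(r'<profilename>', re.I)

# Build a temporary folder with one good and one bad made-up profile.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'good.xml'), 'w', encoding='utf-8') as f:
        f.write('<Profile><ProfileName>Default</ProfileName></Profile>')
    with open(os.path.join(demo_dir, 'bad.xml'), 'w', encoding='utf-8') as f:
        f.write('<Profile></Profile>')

    to_delete = find_profiles_missing_tag(demo_dir, tag)
    for path in to_delete:
        os.remove(path)

    deleted_names = [os.path.basename(p) for p in to_delete]
    remaining = sorted(os.listdir(demo_dir))

print(deleted_names)
print(remaining)
```

Keeping the search in its own function means you can print the list and look it over before anything is removed.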

Let me know how you go with that!

Brandon Wall
5,512 Points

Oh man, you posted right as I posted. I will definitely go back and try it using a different approach now that I have a little guidance. It will be a good exercise in Reg Exs. Thank you for the help. I actually figured out that I needed to run the script in the directory with the XML files, and delete the last element of the list, which would be the script itself, because otherwise my open and read commands open and read in the script too.

I currently have a newborn screaming daughter, but once I quiet her down again I will put some more thought into your reply :D

Brandon Wall
5,512 Points

I did it, I actually did it! Turns out I didn't have to use Reg Exes at all. I spent so much time trying to install this module called parsel with pip to handle XML documents, and it turns out I didn't need that either, but I finally did it! Tagging Kenneth Love to check it out. This is very specific to a problem I'm having, but now when my mouse updates with bad profiles I can get rid of them quickly! Thanks for your awesome lessons!

import os

NAGAPATH = "C:\\Users\\brand\\Desktop\\TestProfs"

naga_profs = os.listdir(NAGAPATH)

del naga_profs[-1]

for profile in naga_profs:
    current = open(profile)
    data = current.read()
    current.close()
    if "<ProfileName>" in data:
        print("<ProfileName> Tag IN: ", profile)
    elif "<ProfileName>" not in data:
        answer = input("Deleting Naga Profile: {} due to missing tag Y/N?".format(profile))
        if answer == 'Y':
            os.remove(profile)
        elif answer == 'N':
            continue

I had to place the script in the directory so I didn't have to put full file path names in and could just use my profile variable. I used the del statement to delete the last item of the list, since that is the script, and I wouldn't want it to delete itself, now would I?

Kenneth Love
Treehouse Guest Teacher

Great job solving your problem!

If you have to deal with XML documents a lot, you probably want to look into lxml or BeautifulSoup.
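As a rough idea of what a parser-based check looks like, here is a sketch that uses the standard library's xml.etree.ElementTree instead of lxml or BeautifulSoup, so nothing needs installing. The function name has_profile_name and the sample strings are made up. It also assumes the profiles use no XML namespaces, which would change the tag names the parser reports.

```python
import xml.etree.ElementTree as ET

def has_profile_name(xml_text, tag='ProfileName'):
    """Return True if any element named tag exists in the document.

    Text that cannot be parsed as XML is reported as not having the tag.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # root.iter() walks the root element and every descendant.
    return any(element.tag == tag for element in root.iter())

good = '<Profile><ProfileName>Default</ProfileName></Profile>'
bad = '<Profile><Name>Default</Name></Profile>'
commented_out = '<Profile><!-- <ProfileName> --></Profile>'
truncated = '<Profile><ProfileName>'

print(has_profile_name(good))
print(has_profile_name(bad))
print(has_profile_name(commented_out))
print(has_profile_name(truncated))
```

Unlike a plain substring test, this ignores a tag that only appears inside a comment. Whether an unparsable file should count as corrupt is a judgment call, so you may prefer to report those separately rather than delete them.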

os.remove won't delete a file that is currently in use anyway, but it would probably throw an exception when it tried.

Since os.listdir() returns the list in an arbitrary order, I wouldn't rely on the script being the last item. Instead, use the following (assuming your script was named script.py):

naga_profs.remove('script.py')
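Another option, sketched here, is to keep only the names that look like profiles instead of removing the script by name. This assumes the profiles all end in .xml; the helper name only_xml and the sample listing are made up for illustration.

```python
def only_xml(names):
    """Keep the names ending in .xml, ignoring case, in their original order."""
    return [name for name in names if name.lower().endswith('.xml')]

# Made-up directory listing in no particular order.
listing = ['script.py', 'a.xml', 'notes.txt', 'B.XML']
profiles = only_xml(listing)

print(profiles)
```

With this approach the script can be renamed, or other files can sit in the folder, without the list needing manual cleanup.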

Unless you're 100% sure the tag names will always be in that format (TitleCase), you might want to make use of str.lower():

if "<profilename>" in data.lower():

And make it easier to answer yes or no by also converting to lower or uppercase:

if answer.upper() == 'Y':

Finally, you don't need an elif ... continue, since that is the default behaviour at the end of a block of code in a loop. You don't need to do anything unless the answer is Y.
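Put together, the prompt handling can be sketched as a tiny helper. The name is_yes is made up, and accepting "yes" as well as "y" is an extra assumption beyond what is described above.

```python
def is_yes(answer):
    """Return True for y or yes in any letter case, ignoring surrounding spaces."""
    return answer.strip().lower() in ('y', 'yes')

# Anything that is not a yes simply falls through; no elif or continue needed.
for reply in ['Y', ' y ', 'Yes', 'N', '']:
    if is_yes(reply):
        print('would delete for reply', repr(reply))
```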

Brandon Wall
5,512 Points

For these particular XML files all tag names are title cased. If I looked for a lowercase version, would it still find it? Could I add a flag to make the check case insensitive? I believe I remember that bit from the Reg Ex lessons, the re.I flag. I didn't read that bit about the arbitrary order thing. So far, every time I check the naga_profs variable it lists all the XML files with the .py file last, but I probably shouldn't count on that. I've actually made some more edits so that the script can be run and used by others experiencing the same problem. I'll post it again after I make the revisions you have suggested.

Brandon Wall
5,512 Points

Here is the RE version; the non-RE version is similar. I had added a bunch of extra code to concatenate path names together and have a user enter their device name, but I realized that the name of the actual account folder may vary from user to user and is likely not static, so I omitted all that complicated but slightly impressive-looking code for something much simpler. With no arguments, os.listdir() lists the current working directory, which is the script's folder when the script is run from there. So anyone facing a similar issue with their Razer Synapse compatible device need only drop the script into the folder with their profiles and run it with Python from that folder.

As you can see, I call remove() twice because two scripts now exist together: one is the Reg Ex version and the other is the vanilla version :D

import os
import re


naga_profs = os.listdir()

naga_profs.remove('prof_name_check.py')
naga_profs.remove('prof_check_re.py')

pattern = re.compile(r'\<ProfileName\>')


for profile in naga_profs:
    with open(profile) as current:
        data = current.read()
    # search() takes a start position as its second argument, not a flag
    if pattern.search(data):
        print("<ProfileName> Tag IN: ", profile)
    else:
        answer = input("Deleting Razer Profile: {} due to missing tag Y/N?".format(profile))
        if answer.upper() == 'Y':
            os.remove(profile)

Non RE:

import os

naga_profs = os.listdir()

naga_profs.remove('prof_name_check.py')
naga_profs.remove('prof_check_re.py')

for profile in naga_profs:
    with open(profile) as current:
        data = current.read()
    if "<ProfileName>" in data:
        print("<ProfileName> Tag IN: ", profile)
    else:
        answer = input("Deleting Razer Profile: {} due to missing tag Y/N?".format(profile))
        if answer.upper() == 'Y':
            os.remove(profile)
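One detail worth checking about os.listdir() with no arguments: it appears to list the current working directory, which is only the script's folder when the script is started from there. The sketch below demonstrates this in a temporary folder; the file name profile.xml is made up.

```python
import os
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as demo_dir:
    # Create one empty made-up profile file in the temporary folder.
    open(os.path.join(demo_dir, 'profile.xml'), 'w').close()

    os.chdir(demo_dir)
    try:
        no_arg_listing = os.listdir()                 # no argument given
        explicit_listing = os.listdir(os.getcwd())    # same folder, named explicitly
    finally:
        os.chdir(start_dir)  # step back out before the folder is cleaned up

print(no_arg_listing)
print(explicit_listing)
```

If the script might be launched from somewhere else, passing os.path.dirname(os.path.abspath(__file__)) to os.listdir() is one way to point it at the script's own folder instead.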

Thank you for all of your help, feel free to leave more feedback as you see fit. I definitely feel powerful, I've managed to create something useful, I've managed to tell my computer what to do. I read something online about GUI vs Command Line interfaces that said "When we're little we use pictures and point at things, but when we grow up we learn to read and write." Tech is my passion and I'm definitely feeling like I'm learning to read and write instead of point and click, this is awesome.