Brandon Wall
5,512 Points
Help me build a Reg Ex to do something really useful!
Hey again everyone! I want to build a script to scan a bunch of XML documents and delete all the ones that do not have the tag
<profilename></profilename>
My Razer mouse keeps updating with corrupt profiles from the cloud. These profiles do not have this important tag. I've manually deleted them all, but whenever I let it reconnect to the cloud service it repopulates all these bad profiles.
I figured this would be a good exercise for me and perhaps for anyone that wants to help me?
Here is what I got so far:
import re
import os
NAGAPATH = "C:\\ProgramData\\Razer\\Synapse\\Accounts\\AM_5364380\\Devices\\Naga Epic Chroma\\Profiles"
naga_profs = os.listdir(NAGAPATH)
pattern = re.compile(r'''
(?P<profilename>[\<profilename\>])
''')
for profile in naga_profs:
    current = open(profile)
    data = current.read()
    current.close()
os.listdir() will store each file name in the directory at NAGAPATH as a list element into naga_profs
This is what I got so far. Angle brackets are used to name groups in regex syntax, but in XML documents they also delimit tag names, so I've escaped them in my pattern. Is that the correct way to get it to recognize them? Or should I use \W? Or would the angle bracket be considered a Unicode character?
On each iteration I need to search the text in the data variable for my pattern: if the file has it, leave it alone; if it does not, delete the file.
4 Answers
Iain Simmons
Treehouse Moderator 32,305 Points
Okay so a few things you'd want to change:
First, your regex pattern is a multiline string, because you're using the 3 single quotes, so you're searching for a string with a newline character, then the group containing the tag, then another newline character. Your XML may not match that exact format, so I would suggest changing it to a single line and only using the one single quote on either end.
Also, you're using square brackets around the string for the profilename tag, which turns it into a character class: you're searching for any one of the characters within those square brackets. Since that includes the opening and closing angle brackets, you're likely to match every XML file with that.
So try this for your pattern:
pattern = re.compile(r'(?P<profilename>\<profilename\>)')
And in fact, you don't really need a named group if you're just checking that it exists or not:
pattern = re.compile(r'\<profilename\>')
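To see the difference between these patterns, here is a small sketch. The two sample strings are made up for illustration, and the tag is written in the TitleCase form the real files turn out to use; it is not taken from an actual Razer profile.

```python
import re

# Hypothetical sample documents (not real Razer profiles).
good = "<Profile><ProfileName>Default</ProfileName></Profile>"
bad = "<Profile><Name>Default</Name></Profile>"

# Square brackets make a character class: any ONE of these characters matches.
char_class = re.compile(r'[\<profilename\>]')

# Without re.VERBOSE, the newlines inside the triple quotes are part of the pattern.
with_newlines = re.compile(r'''
<ProfileName>
''')

# A plain single-line pattern matches the literal tag.
literal = re.compile(r'<ProfileName>')

print(bool(char_class.search(bad)))      # matches the very first "<"
print(bool(with_newlines.search(good)))  # no newlines in the text, so no match
print(bool(literal.search(good)), bool(literal.search(bad)))
```

Escaping the angle brackets is harmless but not required; outside of the (?P<name>...) syntax they have no special meaning in a Python regex.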
In your loop, you would just add a conditional that checks whether data does not contain your pattern, using the compiled pattern's search method. If you need any flags (case insensitive, for example), pass them to re.compile: the second argument of pattern.search is a start position, not a flags value. You shouldn't need the multiline flag here, since re.M only changes how ^ and $ match:
for profile in naga_profs:
    current = open(os.path.join(NAGAPATH, profile))
    data = current.read()
    current.close()
    if not pattern.search(data):
        # do stuff
        pass
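One detail worth checking when flags come into play: on a compiled pattern, the second positional argument of search() is a start position rather than a flags value, so flags such as re.I belong in re.compile. A minimal sketch with a made-up sample string:

```python
import re

# Hypothetical sample text; the opening tag sits at index 0.
data = "<ProfileName>Default</ProfileName>"

# Flags go into re.compile; re.I makes the match case insensitive.
pattern = re.compile(r'<profilename>', re.I)

found = pattern.search(data)          # searches from index 0
skipped = pattern.search(data, re.M)  # re.M has the integer value 8, so this means pos=8

print(found is not None)  # the opening tag is found
print(skipped is None)    # the search started past the opening tag and missed it
```

So passing re.M to pattern.search does not raise an error, it just quietly skips the first eight characters of the text.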
I don't have any experience with deleting files, but it looks like the os.remove function might be the one you're looking for.
It needs a file path though, so you'd probably want to build a new list containing the paths of the files you want to delete, then loop through it and call os.remove on each path (you'll need the full path if you're not running the Python script from that directory):
corrupt_profiles = []
for profile in naga_profs:
    # build the full path once and use it for both reading and deleting
    path = os.path.join(NAGAPATH, profile)
    current = open(path)
    data = current.read()
    current.close()
    if not pattern.search(data):
        corrupt_profiles.append(path)
for path in corrupt_profiles:
    os.remove(path)
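The collect-then-delete idea could be wrapped up roughly like this. It is only a sketch: the function name find_profiles_missing_tag, the UTF-8 encoding, and the two sample files are assumptions for illustration, and it runs against a throwaway temporary folder so nothing real gets deleted.

```python
import os
import re
import tempfile

def find_profiles_missing_tag(folder, pattern):
    """Return full paths of files in folder whose text does not match pattern."""
    missing = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip sub-folders
        # Assumption: the profiles are readable as UTF-8 text.
        with open(path, encoding="utf-8", errors="replace") as handle:
            data = handle.read()
        if not pattern.search(data):
            missing.append(path)
    return missing

pattern = re.compile(r'<ProfileName>')

# Hypothetical sample data in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    samples = {
        "good.xml": "<Profile><ProfileName>Default</ProfileName></Profile>",
        "bad.xml": "<Profile></Profile>",
    }
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    corrupt_profiles = find_profiles_missing_tag(folder, pattern)
    for path in corrupt_profiles:
        os.remove(path)
    remaining = os.listdir(folder)

print(remaining)
```

Building the list first and deleting afterwards also makes it easy to print the list and look it over before anything is removed.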
Let me know how you go with that!
Brandon Wall
5,512 Points
I did it, I actually did it! Turns out I didn't have to use Reg Exes at all. I spent so much time trying to install a module called parsel with pip to handle XML documents, and it turns out I didn't need that either, but I finally did it! Tagging Kenneth Love to check it out. This is very specific to a problem I'm having, but now when my mouse updates with bad profiles I can get rid of them quickly! Thanks for your awesome lessons!
import os
NAGAPATH = "C:\\Users\\brand\\Desktop\\TestProfs"
naga_profs = os.listdir(NAGAPATH)
del naga_profs[-1]
for profile in naga_profs:
    current = open(profile)
    data = current.read()
    current.close()
    if "<ProfileName>" in data:
        print("<ProfileName> Tag IN: ", profile)
    elif "<ProfileName>" not in data:
        answer = input("Deleting Naga Profile: {} due to missing tag Y/N?".format(profile))
        if answer == 'Y':
            os.remove(profile)
        elif answer == 'N':
            continue
I had to place the script in the profiles directory so that I didn't have to put full file paths in and could just use my profile variable. I also used the del statement to delete the last item of the list, since that is the script itself and I wouldn't want it to delete itself, now would I?
Kenneth Love
Treehouse Guest Teacher
Great job solving your problem!
If you have to deal with XML documents a lot, you probably want to look into lxml or BeautifulSoup.
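As a rough idea of what parsing the XML instead of searching the raw text looks like, here is a sketch using the standard library's xml.etree.ElementTree rather than lxml or BeautifulSoup, purely to avoid an extra install. The helper name has_profile_name and the sample strings are made up, and it assumes the profiles do not use XML namespaces.

```python
import xml.etree.ElementTree as ET

def has_profile_name(xml_text, tag="ProfileName"):
    """True if the text parses as XML and contains at least one <tag> element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # assumption: unparseable files count as missing the tag
    # iter() walks the root element and all of its descendants.
    return next(root.iter(tag), None) is not None

# Hypothetical sample documents.
print(has_profile_name("<Profile><ProfileName>Default</ProfileName></Profile>"))
print(has_profile_name("<Profile><Name>Default</Name></Profile>"))
print(has_profile_name("this is not XML"))
```

A parser only counts real elements, so the tag name appearing inside a comment or a text value would not be mistaken for the tag itself.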
Iain Simmons
Treehouse Moderator 32,305 Points
os.remove won't delete a file that is currently in use anyway (on Windows, at least), but it would probably throw an exception when it tried.
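If an exception from os.remove is a concern, one option is to catch OSError so that a single stubborn file does not stop the whole run. A minimal sketch; the helper name try_remove and the temporary file are assumptions for illustration:

```python
import os
import tempfile

def try_remove(path):
    """Delete path and return True, or report the problem and return False."""
    try:
        os.remove(path)
    except OSError as error:  # covers PermissionError and FileNotFoundError
        print("Could not delete {}: {}".format(path, error))
        return False
    return True

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "profile.xml")
    open(target, "w").close()    # create an empty throwaway file
    first = try_remove(target)   # the file exists, so it is deleted
    second = try_remove(target)  # already gone, so os.remove raises FileNotFoundError

print(first, second)
```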
Since os.listdir() returns the list in an arbitrary order, I wouldn't rely on the script being the last item. Instead, use the following (assuming your script was named script.py):
naga_profs.remove('script.py')
Unless you're 100% sure the tag names will always be in that format (TitleCase), you might want to make use of str.lower():
if "<profilename>" in data.lower():
And make it easier to answer yes or no by also converting to lower or uppercase:
if answer.upper() == 'Y':
Finally, you don't need an elif ... continue, since that is the default behaviour at the end of a block of code in a loop. You don't need to do anything unless the answer is Y.
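Put together, those suggestions might look something like the sketch below. The three helper names are hypothetical, and script.py stands in for whatever the script is actually called.

```python
def candidate_profiles(filenames, skip=("script.py",)):
    """Drop the script(s) by name instead of relying on list order."""
    return [name for name in filenames if name not in skip]

def has_tag(data, tag="<ProfileName>"):
    """Case-insensitive check: lower-case both sides before comparing."""
    return tag.lower() in data.lower()

def confirmed(answer):
    """Accept 'y' or 'Y', ignoring surrounding whitespace."""
    return answer.strip().upper() == "Y"

# Hypothetical directory listing and file contents.
print(candidate_profiles(["script.py", "a.xml", "b.xml"]))
print(has_tag("<profile><PROFILENAME>x</PROFILENAME></profile>"))
print(confirmed(" y "), confirmed("N"), confirmed(""))
```

Lower-casing both the tag and the data answers the case question directly: a lowercase search only finds a TitleCase tag if the text being searched has been lower-cased too.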
Brandon Wall
5,512 Points
For these particular XML files all tag names are title cased. If I looked for a lowercase version would it still find it? Could I add a flag to make the check case insensitive? I believe I remember that bit in the Reg Ex lessons, the re.I flag. I didn't read that bit about the arbitrary order. So far, every time I check the naga_profs variable it seems to list all the XML files with the .py file last, but I probably shouldn't count on that. I've actually made some more edits now so that the script can be run and used by others experiencing the same problem. I'll post it again after I make the revisions you have suggested.
Brandon Wall
5,512 Points
Here is the RE version; the non RE version is similar. I had added a bunch of extra code to concatenate path names together and have a user enter their device name, but I realized that the name of the actual account folder may vary from user to user and is likely not static, so I dropped all that complicated but slightly impressive looking code for something much simpler. With no arguments, os.listdir() lists the current working directory, which is the script's folder when the script is run from there. Therefore anyone facing a similar issue with their Razer Synapse compatible device need only drop the script into the folder with their profiles and run it with Python from that folder.
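A quick way to convince yourself that os.listdir() with no arguments follows the current working directory is the sketch below, which uses a throwaway temporary folder and a made-up file name.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical profile file in a temporary folder.
    open(os.path.join(folder, "profile.xml"), "w").close()

    previous = os.getcwd()
    os.chdir(folder)            # pretend the script was started from this folder
    try:
        from_cwd = os.listdir()  # no argument: lists the current working directory
    finally:
        os.chdir(previous)       # always restore the original working directory

    explicit = os.listdir(folder)

print(from_cwd, explicit)
```

If the script might be launched from somewhere else, a common alternative is to list os.path.dirname(os.path.abspath(__file__)), which points at the folder containing the script regardless of where it was started.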
As you can see, I call remove() twice because two scripts now exist together: one is the Reg Ex version and the other is the vanilla version :D
import os
import re
naga_profs = os.listdir()
naga_profs.remove('prof_name_check.py')
naga_profs.remove('prof_check_re.py')
pattern = re.compile(r'\<ProfileName\>')
for profile in naga_profs:
    current = open(profile)
    data = current.read()
    current.close()
    # no flags here: the second argument of pattern.search is a start position
    if pattern.search(data):
        print("<ProfileName> Tag IN: ", profile)
    else:
        answer = input("Deleting Razer Profile: {} due to missing tag Y/N?".format(profile))
        if answer.upper() == 'Y':
            os.remove(profile)
Non RE:
import os
naga_profs = os.listdir()
naga_profs.remove('prof_name_check.py')
naga_profs.remove('prof_check_re.py')
for profile in naga_profs:
    current = open(profile)
    data = current.read()
    current.close()
    if "<ProfileName>" in data:
        print("<ProfileName> Tag IN: ", profile)
    else:
        answer = input("Deleting Razer Profile: {} due to missing tag Y/N?".format(profile))
        if answer.upper() == 'Y':
            os.remove(profile)
Thank you for all of your help, feel free to leave more feedback as you see fit. I definitely feel powerful, I've managed to create something useful, I've managed to tell my computer what to do. I read something online about GUI vs Command Line interfaces that said "When we're little we use pictures and point at things, but when we grow up we learn to read and write." Tech is my passion and I'm definitely feeling like I'm learning to read and write instead of point and click, this is awesome.
Brandon Wall
5,512 Points
Oh man, you posted right as I posted. I will definitely go back and try it using a different approach now that I have a little guidance. It will be a good exercise in Reg Exes. Thank you for the help. I actually figured out that I needed to run the script in the directory with the XML files and delete the last element of the list, which would be the script itself, because otherwise my open and read commands would open and read the script too.
I currently have a newborn screaming daughter, but once I quiet her down again I will put some more thought into your reply :D