Python

Regular Expression for Credit Cards

Hello,

I'm working on an independent project that extracts telephone numbers and credit card numbers from multiple text files and compiles them into another file. To this end, I found some regex cookbook patterns for the most common credit card numbers, along with a regex for phone numbers.

I was able to extract the phone numbers successfully, but I'm having trouble getting the credit card numbers. Can someone explain what is wrong with my re.findall() call?

For clarity's sake, whenever I run re.findall() with the credit card pattern, it returns an empty list ([]). I tested it against three text files containing these numbers:

testfile1.txt = {215-763-6263 
5544 5533 5533 6633
321-456-9824
4234 9832 8932 8921}

testfile2.txt = {456-674-2312, 
5678 9854 9454 4543 ,
2356 6435 0984, 
421-567-3232 }

testfile3.txt = {378282246310005 
371449635398431 
378734493671000 
30569309025904
38520000023237 
6011111111111117 
6011000990139424
555555555554444
5105105105105100 
4111111111111111 
4012888888881881
4222222222222 }
import os

import re

path = "./test"
our_library = os.listdir(path)

ccardlist = open("credit card numbers.txt", 'w')
telephonelist = open("telephone numbers.txt", 'w')

for item in our_library:
    file = os.path.join(path, item)
    txt = open(file, 'r')
    data = txt.read()


    telephone = re.findall(r'\(?\d{3}\)?-?\s?\d{3}-\d{4}', data)

    ccard = re.findall(r'''4\d{12}(\d{3})?
                        (5[1-5]\d{4}|677189)\d{10}
                        3(0[0-5]|[68]\d)\d{11}
                        3[47]\d{13}
                        (6011|65\d{2})\d{12}
                        ''', data, re.X)


    ccardlist.write("The list of credit card numbers in {} is : \n".format(file) + str(ccard) + "\n\n")

    telephonelist.write("The list of telephone numbers in {} is : \n".format(file) + str(telephone) + "\n\n")

    txt.close()
Chris Freeman
Treehouse Moderator 68,468 Points

The structure of your data is unclear when you use the set assignment syntax. Are the numbers simply plain text on different lines?

3 Answers

Chris Freeman
Treehouse Moderator 68,468 Points

A regex matches as a whole or not at all. Your findall pattern lists several number patterns one after another, so it can only match if all of the targeted numbers appear back to back in exactly that order. For example, I first flattened out your patterns a bit for clarity:

    ccard = re.findall(r'''4\d{12}\d{3}?
                        5[1-5]\d{4}\d{10}
                        677189\d{10}
                        30[0-5]\d{11}
                        3[68]\d\d{11}
                        3[47]\d{13}
                        6011\d{12}
                        65\d{2}\d{12}
                        ''', data, re.X)

Since the re.X flag simply tells the regex engine to ignore whitespace in the pattern (which is what lets you spread it across multiple lines), it is actually the same as:

    ccard2 = re.findall(r'''4\d{12}\d{3}?5[1-5]\d{4}\d{10}677189\d{10}30[0-5]\d{11}3[68]\d\d{11}3[47]\d{13}6011\d{12}65\d{2}\d{12}''', data, re.X)

Which would only match something like this:

400000000000011151000011111111116771890000000000305000000000003601111111111134000000000000060110000000000006500111111111111
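
To see this concretely, here is a small sketch (the names `concatenated`, `sample`, and `single_card` are just for illustration, they are not from your script). It checks that the flattened pattern matches one long run of digits built from one piece per sub-pattern, but finds nothing in a string that holds only a single card number:

```python
import re

# The flattened pattern from above: eight sub-patterns in a row, no alternation.
concatenated = re.compile(
    r'''4\d{12}\d{3}?
        5[1-5]\d{4}\d{10}
        677189\d{10}
        30[0-5]\d{11}
        3[68]\d\d{11}
        3[47]\d{13}
        6011\d{12}
        65\d{2}\d{12}
    ''', re.X)

# One piece per sub-pattern, joined into a single run of digits.
sample = "".join([
    "4000000000000111",
    "5100001111111111",
    "6771890000000000",
    "30500000000000",
    "36011111111111",
    "340000000000000",
    "6011000000000000",
    "6500111111111111",
])

single_card = "4111111111111111"

print(concatenated.fullmatch(sample) is not None)  # True: the whole run matches
print(concatenated.findall(single_card))           # []: one card alone never matches
```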

You can break it up into an OR'd list of possible patterns by adding the OR symbol "|" at the end of each line except the last:

    ccard = re.findall(r'''4\d{12}\d{3}?|
                        5[1-5]\d{4}\d{10}|
                        677189\d{10}|
                        30[0-5]\d{11}|
                        3[68]\d\d{11}|
                        3[47]\d{13}|
                        6011\d{12}|
                        65\d{2}\d{12}
                        ''', data, re.X)

This will produce the output file:

The list of credit card numbers in ./test/testfile1.txt is : 
[]

The list of credit card numbers in ./test/testfile2.txt is : 
[]

The list of credit card numbers in ./test/testfile3.txt is : 
['378282246310005', '371449635398431', '378734493671000', '30569309025904', '38520000023237', '6011111111111117', '6011000990139424', '5105105105105100', '4111111111111111', '4012888888881881']
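
Note that the 13-digit number '4222222222222' from testfile3.txt is not in that list. In `4\d{12}\d{3}?` the `?` comes right after the `{3}` quantifier, which makes that quantifier lazy instead of making the three digits optional, so the pattern still needs 16 digits. A minimal sketch of the difference, assuming an optional non-capturing group `(?:\d{3})?` is what was intended (the names `lazy`, `optional`, and `capturing` are only for illustration):

```python
import re

lazy = re.compile(r'4\d{12}\d{3}?')           # {3}? is a lazy "exactly 3": still 16 digits
optional = re.compile(r'4\d{12}(?:\d{3})?')   # optional group: 13 or 16 digits
capturing = re.compile(r'4\d{12}(\d{3})?')    # same match, but with a capturing group

data = "4222222222222\n4111111111111111\n"

print(lazy.findall(data))       # ['4111111111111111']
print(optional.findall(data))   # ['4222222222222', '4111111111111111']
print(capturing.findall(data))  # ['', '111']
```

The last line shows why the capturing groups were flattened out: when a pattern contains a group, findall returns the group's contents rather than the whole match.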

If you also want to get the card numbers from testfile1.txt, change the patterns to allow optional whitespace between the groups of four digits. With the first two alternatives changed that way, this pattern will match the card numbers in testfile1.txt:

    ccard = re.findall(r'''4\d{3}\s?\d{4}\s?\d{4}\s?\d{1}\d{3}?|
                        5[1-5]\d{2}\s?\d{4}\s?\d{4}\s?\d{4}|
                        677189\d{10}|
                        30[0-5]\d{11}|
                        3[68]\d\d{11}|
                        3[47]\d{13}|
                        6011\d{12}|
                        65\d{2}\d{12}
                        ''', data, re.X)

This produces:

The list of credit card numbers in ./test/testfile1.txt is : 
['5544 5533 5533 6633', '4234 9832 8932 8921']
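
The other six alternatives still expect unbroken digits. One possible way to avoid adding `\s?` to every alternative by hand is to pull out candidate digit runs first, strip the separators, and then test each candidate against the compact patterns. This is only a sketch: it assumes a card number is never separated from neighbouring digits by just a single space or hyphen, it writes the optional Visa digits as `(?:\d{3})?` so 13-digit numbers pass, and the names `CARD`, `CANDIDATE`, and `find_cards` are made up for the example.

```python
import re

# Compact card patterns (equivalent forms of the alternatives above),
# checked with fullmatch() so the whole candidate must fit one alternative.
CARD = re.compile(r'''4\d{12}(?:\d{3})?|
                      5[1-5]\d{14}|
                      677189\d{10}|
                      30[0-5]\d{11}|
                      3[68]\d{12}|
                      3[47]\d{13}|
                      6011\d{12}|
                      65\d{14}
                   ''', re.X)

# A run of 13 to 16 digits, with at most one space or hyphen between digits.
CANDIDATE = re.compile(r'\b\d(?:[ -]?\d){12,15}\b')


def find_cards(text):
    """Return card-like numbers found in text, with separators removed."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r'[ -]', '', match.group())
        if CARD.fullmatch(digits):
            found.append(digits)
    return found


data = "215-763-6263 \n5544 5533 5533 6633\n321-456-9824\n4234 9832 8932 8921"
print(find_cards(data))  # ['5544553355336633', '4234983289328921']
```

The trade-off is that the numbers come back normalized (no spaces), which may or may not be what you want in the output file.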

Additionally, it is uncommon to mix string format() with concatenation. The two can be combined:

    ccardlist.write("The list of credit card numbers in {} is : \n{}\n\n"
                    "".format(file, ccard))  # <-- a convenient way to wrap a format line

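As a small illustration of the same idea (a sketch only; `report_entry` is a hypothetical helper, not part of the original script), the message can be built by a single format() call and reused for both output files:

```python
def report_entry(kind, filename, numbers):
    """Build one report block using a single format() call."""
    return ("The list of {} numbers in {} is : \n{}\n\n"
            "".format(kind, filename, numbers))


entry = report_entry("credit card", "./test/testfile1.txt",
                     ["5544 5533 5533 6633", "4234 9832 8932 8921"])
print(entry, end="")
```

The script could then call something like ccardlist.write(report_entry("credit card", file, ccard)) and telephonelist.write(report_entry("telephone", file, telephone)).
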
Chris,

Thanks for your response. I'm just trying to write a simple script where I can look through a litany of emails loaded with telephone numbers and credit card numbers, and write those numbers into different .txt files. The emails are stored as .txt in a folder I've named test.

I just generated those numbers at random and found others to test against.

-Christopher

Chris,

Thanks so much for your help! This makes so much sense.

-Christopher