Regular Expression for Credit Cards

Question

Hello,

I'm doing an independent project where I'm trying to extract telephone numbers and credit card numbers from multiple text files and compile them into another file. To this end, I found some regex cookbooks to capture the most common credit card number formats, along with a regex for capturing phone numbers.

I was able to extract the phone numbers successfully, but I'm having difficulty getting the credit card numbers. Can someone explain what the issue is with my re.findall() call?

For clarity's sake, whenever I run re.findall() with the credit card pattern, it returns an empty list ([]). I tested it against three text files containing these numbers:

testfile1.txt = {215-763-6263 
5544 5533 5533 6633
321-456-9824
4234 9832 8932 8921}

testfile2.txt = {456-674-2312, 
5678 9854 9454 4543 ,
2356 6435 0984, 
421-567-3232 }

testfile3.txt = {378282246310005 
371449635398431 
378734493671000 
30569309025904
38520000023237 
6011111111111117 
6011000990139424
555555555554444
5105105105105100 
4111111111111111 
4012888888881881
4222222222222 }

import os

import re

path = "./test"
our_library = os.listdir(path)

ccardlist = open("credit card numbers.txt", 'w')
telephonelist = open("telephone numbers.txt", 'w')

for item in our_library:
    file = os.path.join(path, item)
    txt = open(file, 'r')
    data = txt.read()


    telephone = re.findall(r'\(?\d{3}\)?-?\s?\d{3}-\d{4}', data)

    ccard = re.findall(r'''4\d{12}(\d{3})?
                        (5[1-5]\d{4}|677189)\d{10}
                        3(0[0-5]|[68]\d)\d{11}
                        3[47]\d{13}
                        (6011|65\d{2})\d{12}
                        ''', data, re.X)


    ccardlist.write("The list of credit card numbers in {} is : \n".format(file) + str(ccard) + "\n\n")

    telephonelist.write("The list of telephone numbers in {} is : \n".format(file) + str(telephone) + "\n\n")

    txt.close()

Answer 1 · 2016-03-09T06:10:02Z

A regex matches its whole pattern or nothing at all. Your findall pattern lists several number formats one after another with nothing separating them, so it can only match if all of those numbers appear back to back in exactly that order. For example, I first flattened your pattern a bit for clarity:

    ccard = re.findall(r'''4\d{12}\d{3}?
                        5[1-5]\d{4}\d{10}
                        677189\d{10}
                        30[0-5]\d{11}
                        3[68]\d\d{11}
                        3[47]\d{13}
                        6011\d{12}
                        65\d{2}\d{12}
                        ''', data, re.X)

Since the re.X (re.VERBOSE) flag makes the regex engine ignore whitespace in the pattern, including the line breaks, this is actually the same as:

    ccard2 = re.findall(r'''4\d{12}\d{3}?5[1-5]\d{4}\d{10}677189\d{10}30[0-5]\d{11}3[68]\d\d{11}3[47]\d{13}6011\d{12}65\d{2}\d{12}''', data, re.X)

Which would only match something like this:

400000000000011151000011111111116771890000000000305000000000003601111111111134000000000000060110000000000006500111111111111
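
To see the effect in isolation, here is a minimal sketch with just two of the formats. The sample numbers are taken from testfile3.txt; the variable names are only for illustration.

```python
import re

# Two formats written on separate lines WITHOUT "|" between them.
verbose = r'''4\d{15}
              6011\d{12}
           '''
# The same pattern written on one line.
flat = r'4\d{15}6011\d{12}'

visa = "4111111111111111"
discover = "6011111111111117"

# Numbers on separate lines: no match, because the pattern demands
# a 4... number immediately followed by a 6011... number.
alone = re.findall(verbose, visa + "\n" + discover, re.X)

# Both numbers glued together with nothing in between: one long match.
glued = re.findall(verbose, visa + discover, re.X)

# The flattened pattern behaves identically.
same = re.findall(flat, visa + discover) == glued

print(alone)   # []
print(glued)   # one 32-character match
print(same)    # True
```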

You can break it up into an OR'd list of possible patterns by adding the alternation symbol "|" at the end of each line except the last:

    ccard = re.findall(r'''4\d{12}\d{3}?|
                        5[1-5]\d{4}\d{10}|
                        677189\d{10}|
                        30[0-5]\d{11}|
                        3[68]\d\d{11}|
                        3[47]\d{13}|
                        6011\d{12}|
                        65\d{2}\d{12}
                        ''', data, re.X)

This will produce the output file:

The list of credit card numbers in ./test/testfile1.txt is : 
[]

The list of credit card numbers in ./test/testfile2.txt is : 
[]

The list of credit card numbers in ./test/testfile3.txt is : 
['378282246310005', '371449635398431', '378734493671000', '30569309025904', '38520000023237', '6011111111111117', '6011000990139424', '5105105105105100', '4111111111111111', '4012888888881881']
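
One thing to watch: the 13-digit number 4222222222222 is missing from that list. In the flattened pattern, `\d{3}?` does not mean "optionally three digits". A `?` after `{3}` only makes the quantifier lazy, so three digits are still required. The original `(\d{3})?` was optional, but a capturing group changes what findall returns. A small sketch of the difference, assuming the intent is to accept both 13- and 16-digit numbers starting with 4:

```python
import re

data = "4111111111111111 4222222222222"

# `\d{3}?` is a lazy "exactly three digits", so only the 16-digit number matches.
lazy = re.findall(r'4\d{12}\d{3}?', data)

# With a capturing group, findall returns the group's text instead of the
# whole match (an empty string when the group did not take part).
capturing = re.findall(r'4\d{12}(\d{3})?', data)

# A non-capturing optional group accepts 13 or 16 digits and
# findall returns the whole match.
optional = re.findall(r'4\d{12}(?:\d{3})?', data)

print(lazy)       # ['4111111111111111']
print(capturing)  # ['111', '']
print(optional)   # ['4111111111111111', '4222222222222']
```

So if the 13-digit form matters, `(?:\d{3})?` is probably what you want on the first line of the pattern.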

If you also want to get the credit card numbers from testfile1.txt, change the patterns to allow optional whitespace between the groups of four digits. This pattern will match the card numbers in testfile1.txt:

    ccard = re.findall(r'''4\d{3}\s?\d{4}\s?\d{4}\s?\d{1}\d{3}?|
                        5[1-5]\d{2}\s?\d{4}\s?\d{4}\s?\d{4}|
                        677189\d{10}|
                        30[0-5]\d{11}|
                        3[68]\d\d{11}|
                        3[47]\d{13}|
                        6011\d{12}|
                        65\d{2}\d{12}
                        ''', data, re.X)

This produces:

The list of credit card numbers in ./test/testfile1.txt is : 
['5544 5533 5533 6633', '4234 9832 8932 8921']
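
The same optional-whitespace idea can be applied to other prefixes as well. Below is a rough sketch, not a complete card matcher: it covers only three 16-digit formats, the sample text is made up from the test files, and `ccard_pattern` and `normalized` are names I picked for illustration. It also uses inline `#` comments, which is the other thing re.X allows.

```python
import re

# Hypothetical sample: testfile1.txt contents plus one unspaced number.
data = """215-763-6263
5544 5533 5533 6633
321-456-9824
4234 9832 8932 8921
6011111111111117
"""

ccard_pattern = re.compile(r'''
      4\d{3}      \s? \d{4} \s? \d{4} \s? \d{4}   # 16 digits starting with 4
    | 5[1-5]\d{2} \s? \d{4} \s? \d{4} \s? \d{4}   # 16 digits starting with 51-55
    | 6011        \s? \d{4} \s? \d{4} \s? \d{4}   # 16 digits starting with 6011
''', re.X)

found = ccard_pattern.findall(data)

# Strip the whitespace so spaced and unspaced numbers look the same.
normalized = [re.sub(r'\s', '', number) for number in found]

print(found)
print(normalized)
```

Keep in mind that `\s` also matches a newline, so digits split across two lines could be joined into one match; `[ ]?` would restrict the separator to a literal space, even under re.X. The spaced numbers in testfile2.txt begin with 56 and 23, which none of these prefixes cover, so that file would still come back empty.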

Additionally, it is uncommon to mix str.format() with string concatenation. The two can be combined into a single format call:

    ccardlist.write("The list of credit card numbers in {} is : \n{}\n\n"
                    "".format(file, ccard))  # <-- a convenient way to wrap a format line

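If the wrapped line looks odd, here is a quick check that it produces the same text as the original concatenation. The values of `file` and `ccard` are placeholders standing in for the loop variables in your script.

```python
# Placeholder values standing in for the loop variables.
file = "./test/testfile1.txt"
ccard = ['5544 5533 5533 6633', '4234 9832 8932 8921']

# Adjacent string literals are joined by the parser before .format() runs,
# so the placeholders in the first literal are still filled in.
wrapped = ("The list of credit card numbers in {} is : \n{}\n\n"
           "".format(file, ccard))

# The original style: format() for the file name, concatenation for the list.
mixed = ("The list of credit card numbers in {} is : \n".format(file)
         + str(ccard) + "\n\n")

print(wrapped == mixed)  # True
```
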
Answer 2 · 2016-03-08T21:14:10Z

Chris,

Thanks for your response. I'm just trying to write a simple script where I can look through a litany of emails loaded with telephone numbers and credit card numbers, and write those numbers into different .txt files. The emails are stored as .txt in a folder I've named test.

I just generated those numbers at random and found others to test against.

-Christopher

Answer 3 · 2016-03-09T13:05:20Z

Chris,

Thanks so much for your help! This makes so much sense.

-Christopher
