Welcome to the Treehouse Community

The Treehouse Community is a meeting place for developers, designers, and programmers of all backgrounds and skill levels to get support. Collaborate here on code errors or bugs that you need feedback on, or asking for an extra set of eyes on your latest project. Join thousands of Treehouse students and alumni in the community today. (Note: Only Treehouse students can comment or ask questions, but non-students are welcome to browse our conversations.)

Looking to learn something new?

Treehouse offers a seven day free trial for new students. Get access to thousands of hours of content and a supportive community. Start your free trial today.

Data Analysis Cleaning and Preparing Data Handling Bad Data Simple Data Issues

Cody Stephenson
Cody Stephenson
8,272 Points

Here are my function-based attempts to trim whitespace and get other information.

I didn't want to go through every potentially bad column one at a time so I tried to organize some of the information first.

# Let's turn the column names to a list for ease of use
obj_columns = demo.select_dtypes(include=['object'])
obj_columns.columns

Gives us

Index(['RIDSTATR', 'RIAGENDR', 'RIDRETH1', 'DMQMILIT', 'DMDBORN', 'DMDCITZN',
       'DMDYRSUS', 'DMDEDUC3', 'DMDEDUC2', 'DMDSCHOL', 'DMDMARTL'],
      dtype='object')

I trusted that I could trim whitespace before printing out all of the information about those columns (which would have been a lot) so I did that with

for col in obj_columns:
    demo.loc[:, col] = demo.loc[:, col].str.strip()

and then just spot-checked one of the columns before and after.

Then to get all of the information I wanted to do the rest of the fixes I basically created a reference output cell with the Feature (column name), Number of unique entries, then a list of those entries using

for col in obj_columns:
    print(col)
    print(len(demo[col].unique()))
    print(demo[col].unique())

Which gave something like

RIDSTATR
11
['Exam' 'Both' 'exam' 'Both Interviewed and MEC examined'
 'Both Interviewed and MCE examined' 'Interview Only' nan 'Only Interview'
 'Interviewed Only' 'Interview' 'interview']
RIAGENDR
5
['Female' 'Male' 'F' 'M' nan]
RIDRETH1
6
['Non-Hispanic Black' 'Non-Hispanic White'
 'Other Race - Including Multi-Racial' 'Mexican American' 'Other Hispanic'
 nan]
...

And from there I was able to reference that cell and the codebook to build the replacement dictionary.

The dictionary step still felt like a lot of manual work, but maybe that's the way it is or there are advanced techniques for getting those replacements.