Welcome to the Treehouse Community

Want to collaborate on code errors? Have bugs you need feedback on? Looking for an extra set of eyes on your latest project? Get support with fellow developers, designers, and programmers of all backgrounds and skill levels here with the Treehouse Community! While you're at it, check out some resources Treehouse students have shared here.

Looking to learn something new?

Treehouse offers a seven day free trial for new students. Get access to thousands of hours of content and join thousands of Treehouse students and alumni in the community today.

Start your free trial

Data Analysis Cleaning and Preparing Data Handling Bad Data Simple Data Issues

Cody Stephenson
Cody Stephenson
8,361 Points

Here are my function-based attempts to trim whitespace and get other information.

I didn't want to go through every potentially bad column one at a time so I tried to organize some of the information first.

# Let's turn the column names to a list for ease of use
obj_columns = demo.select_dtypes(include=['object'])
obj_columns.columns

Gives us

Index(['RIDSTATR', 'RIAGENDR', 'RIDRETH1', 'DMQMILIT', 'DMDBORN', 'DMDCITZN',
       'DMDYRSUS', 'DMDEDUC3', 'DMDEDUC2', 'DMDSCHOL', 'DMDMARTL'],
      dtype='object')

I trusted that I could trim whitespace before printing out all of the information about those columns (which would have been a lot) so I did that with

for col in obj_columns:
    demo.loc[:, col] = demo.loc[:, col].str.strip()

and then just spot-checked one of the columns before and after.

Then to get all of the information I wanted to do the rest of the fixes I basically created a reference output cell with the Feature (column name), Number of unique entries, then a list of those entries using

for col in obj_columns:
    print(col)
    print(len(demo[col].unique()))
    print(demo[col].unique())

Which gave something like

RIDSTATR
11
['Exam' 'Both' 'exam' 'Both Interviewed and MEC examined'
 'Both Interviewed and MCE examined' 'Interview Only' nan 'Only Interview'
 'Interviewed Only' 'Interview' 'interview']
RIAGENDR
5
['Female' 'Male' 'F' 'M' nan]
RIDRETH1
6
['Non-Hispanic Black' 'Non-Hispanic White'
 'Other Race - Including Multi-Racial' 'Mexican American' 'Other Hispanic'
 nan]
...

And from there I was able to reference that cell and the codebook to build the replacement dictionary.

The dictionary step still felt like a lot of manual work, but maybe that's the way it is or there are advanced techniques for getting those replacements.