Find the column with the highest percentage of missing information in demographics

Hi, although I've finished the practice question, I was hoping someone could share a simpler or easier way to solve this problem. My solution is as follows:

valid_entries = demo.count()
total_rows = len(demo.index)
missing_data = total_rows - valid_entries
missing_data.head()
missing_percentage = missing_data / total_rows * 100
missing_percentage.head()

missing_percentage_array = np.array(list(missing_percentage[:,]))
max_missing_perc_index = np.where(missing_percentage_array == missing_percentage.max())
np.array(list(missing_percentage.index))[max_missing_perc_index]

I'm quite certain there's an easier method to solve this and would love to know! For instance, I was able to find the maximum missing percentage directly from the Series (missing_percentage), but I couldn't find its corresponding row label. So instead I converted the values to a np.array, found the position of the largest percentage, and used that position to look up the corresponding row label in the index, which I had separately converted to a np.array.

Thanks and greatly appreciated!

1 Answer

Alex Koumparos
Python Development Techdegree Student 36,887 Points

Hi Jason,

Using just the methods we've already seen, once you've got your missing_percentage Series you can do this:

>>> missing_percentage.sort_values(ascending=False).index[0]
'DMARACE'

Exploring Pandas a bit further, there is a built-in method called idxmax() that does exactly what we want:

>>> missing_percentage.idxmax()
'DMARACE'
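
For anyone who wants to try this without the course data, here is a minimal sketch of the whole calculation on a small made-up DataFrame. The values and the SEQN and DMDEDUC column names are placeholders I've assumed for illustration; only DMARACE comes from the exercise. It also swaps the count-based arithmetic for isna().mean(), which should give the same percentages in one step:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's `demo` DataFrame; the values are made up.
demo = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "DMARACE": [np.nan, np.nan, np.nan, 2.0],
    "DMDEDUC": [1.0, np.nan, 3.0, 4.0],
})

# isna() marks missing cells as True; the mean of each boolean column
# is the fraction of missing values in that column.
missing_percentage = demo.isna().mean() * 100

# idxmax() returns the index label (here, a column name) of the largest value.
worst_column = missing_percentage.idxmax()
print(worst_column, missing_percentage[worst_column])  # DMARACE 75.0
```

Note that if two columns tie for the highest percentage, idxmax() returns only the first one it encounters.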

Hope that helps.

Cheers.

Alex