Why can we assume rows are only duplicated once?

Why can we assume rows are only duplicated once in this dataset? Is there a different way to resolve duplicated rows when they might appear 3 times or more?

1 Answer

This should not be a problem, because itertools.product() handles it for us. To explain in more detail, let's say we have a row that has not one but two duplicates, so rows 1, 2, and 3 share the same SEQN and are considered the same person:

dup_rows = [1, 2, 3]

This means that the rows at indexes 1, 2, and 3 belong to the same person and should be merged into one row:

import itertools

for (row1, row2) in itertools.product(dup_rows, repeat=2):
    # fill the gaps (NaNs) in row1 with the values row2 has for those columns
    demo.iloc[row1, :] = demo.iloc[row1, :].fillna(demo.iloc[row2, :], axis=0)

The pairs of row indexes itertools.product() gives us are the following:

(1, 1)
(1, 2)
(1, 3)
(2, 1)
(2, 2)
(2, 3)
(3, 1)
(3, 2)
(3, 3)

This tells us that we first fill all the NaNs in the first duplicate with its own values (the (1, 1) pair, which is useless since it just reinserts the same NaNs), then with the values from the second duplicate (1, 2), then with the third (1, 3), and so on for the next row: (2, 1), (2, 2), (2, 3). This set of all ordered pairs is called a Cartesian product.
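
To see the effect end to end, here is a minimal, self-contained sketch. The toy demo DataFrame below (and its AGE and INCOME columns) is made up purely for illustration; only SEQN and the merging loop come from the lesson:

import itertools

import numpy as np
import pandas as pd

# toy stand-in for the course's demo DataFrame: rows 1, 2 and 3 share a SEQN,
# and each of them is missing different values
demo = pd.DataFrame({
    "SEQN":   [11111.0, 22222.0, 22222.0, 22222.0],
    "AGE":    [30.0, np.nan, 42.0, np.nan],
    "INCOME": [7.0, 5.0, np.nan, np.nan],
})
dup_rows = [1, 2, 3]  # positional indexes of the duplicated rows

for (row1, row2) in itertools.product(dup_rows, repeat=2):
    # fill the gaps (NaNs) in row1 with the values row2 has for those columns
    demo.iloc[row1, :] = demo.iloc[row1, :].fillna(demo.iloc[row2, :])

print(demo.iloc[dup_rows])  # rows 1, 2 and 3 now all show AGE 42.0 and INCOME 5.0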

When we only use the last entry, in this case row 3, everything up to (2, 3) is wasted work, because those pairs only merge the earlier duplicates with each other.

In summary, the merging strategy shown works even if there are more than 2 duplicates, but it quickly gets inefficient, because it merges every duplicate with every other duplicate (of that same person, of course).
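
If that overhead ever becomes a concern, one possible alternative (just a sketch under the same assumptions about demo and dup_rows as above, not the course's code) is to fill only the row you intend to keep from each of the other duplicates, which takes len(dup_rows) - 1 merges instead of len(dup_rows) squared:

keep = dup_rows[-1]  # the duplicate we actually keep, row 3 in the example above

for other in dup_rows[:-1]:
    # pull missing values into the kept row from each earlier duplicate
    demo.iloc[keep, :] = demo.iloc[keep, :].fillna(demo.iloc[other, :])

# the earlier duplicates can then be dropped, e.g. demo.drop(dup_rows[:-1])
# (drop uses index labels, so this matches positions only with a default RangeIndex)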