Why can we assume rows are only duplicated once?

Why can we assume rows are only duplicated once in this dataset? Is there a different way to resolve duplicated rows when they might appear 3 times or more?

1 Answer

This should not be a problem, because itertools.product() handles it for us. To explain in more detail, let's say we have a row that has not one but two duplicates, so rows 1, 2, and 3 share the same SEQN and are considered the same person:

dup_rows = [1, 2, 3]

This means that the rows at indexes 1, 2, and 3 belong to the same person and should be merged into one row:

import itertools

for (row1, row2) in itertools.product(dup_rows, repeat=2):
    # fill the gaps (NaNs) in row1 with the values row2 has for those columns
    demo.iloc[row1, :] = demo.iloc[row1, :].fillna(demo.iloc[row2, :], axis=0)

The pairs of row indexes itertools.product() gives us are the following:

(1, 1)
(1, 2)
(1, 3)
(2, 1)
(2, 2)
(2, 3)
(3, 1)
(3, 2)
(3, 3)

This tells us that we first fill all the NaNs in the first duplicate with its own values (the (1, 1) pair, which is useless since it just reinserts the same NaNs), then with the values from the second duplicate (1, 2), then with the third (1, 3), and so on for the next row: (2, 1), (2, 2), (2, 3). This set of all ordered pairs is called a Cartesian product.
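
To see the effect end to end, here is a minimal, self-contained sketch. The toy demo DataFrame below (and its AGE and INCOME columns) is made up purely for illustration; only SEQN and the merging loop come from the lesson:

import itertools

import numpy as np
import pandas as pd

# toy stand-in for the course's demo DataFrame: rows 1, 2 and 3 share a SEQN,
# and each of them is missing different values
demo = pd.DataFrame({
    "SEQN":   [11111.0, 22222.0, 22222.0, 22222.0],
    "AGE":    [30.0, np.nan, 42.0, np.nan],
    "INCOME": [7.0, 5.0, np.nan, np.nan],
})
dup_rows = [1, 2, 3]  # positional indexes of the duplicated rows

for (row1, row2) in itertools.product(dup_rows, repeat=2):
    # fill the gaps (NaNs) in row1 with the values row2 has for those columns
    demo.iloc[row1, :] = demo.iloc[row1, :].fillna(demo.iloc[row2, :])

print(demo.iloc[dup_rows])  # rows 1, 2 and 3 now all show AGE 42.0 and INCOME 5.0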

When we only use the last entry, in this case row 3, everything up to (2, 3) is wasted work, because those pairs only merge the earlier duplicates with each other.

In summary, the merging strategy shown works even if there are more than 2 duplicates, but it quickly gets inefficient, because it merges every duplicate with every other duplicate (of that same person, of course).
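
If that overhead ever becomes a concern, one possible alternative (just a sketch under the same assumptions about demo and dup_rows as above, not the course's code) is to fill only the row you intend to keep from each of the other duplicates, which takes len(dup_rows) - 1 merges instead of len(dup_rows) squared:

keep = dup_rows[-1]  # the duplicate we actually keep, row 3 in the example above

for other in dup_rows[:-1]:
    # pull missing values into the kept row from each earlier duplicate
    demo.iloc[keep, :] = demo.iloc[keep, :].fillna(demo.iloc[other, :])

# the earlier duplicates can then be dropped, e.g. demo.drop(dup_rows[:-1])
# (drop uses index labels, so this matches positions only with a default RangeIndex)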