
I have the following table, in which some values are NaN. Assume the columns are highly correlated. Comparing row 0 with row 5, I can infer that the missing value in col2 of row 5 should be 4.0. The same applies to rows 1 and 4. For row 6, however, there is no perfectly matching sample, so I should take the most similar row, in this case row 0, and change the NaN to 3.0. How should I approach this? Is there a pandas function that can do it?

import numpy as np
import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

Output:

    col1    col2    col3    col4
0   3.0     4.0     7.0     7.0
1   2.0     3.0     8.0     8.0
2   8.0     6.0     9.0     9.0
3   4.0     NaN     NaN     NaN
4   2.0     3.0     NaN     NaN
5   3.0     NaN     7.0     7.0
6   NaN     5.0     7.0     6.0

1 Answer


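Try the code below for the desired output:

df = example.copy()  # the DataFrame from the question
s = df.values
# t[i, j] is True when row j agrees with row i in every column where row j is not NaN
t = np.all((s == s[:, None]) | np.isnan(s), -1)
# Get the (i, j) pair for each match; dropna() covers pandas versions where stack() keeps NaN entries
idx = pd.DataFrame(t).where(t).stack().dropna().index
# Reorder df to the idx we just got, and keep both levels of idx as the new index
df = df.reindex(idx.get_level_values(1))
df.index = idx
# 'first' picks the first non-NaN value in each group
df.groupby(level=[0]).transform('first').groupby(level=1).first()

Applying groupby with 'first' twice, once per index level, gives the desired output.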
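The mask `t` relies on NumPy broadcasting, which can be hard to read at a glance. Below is a minimal sketch on a made-up three-row array (not the question's data) that shows what each piece contributes. The names `equal` and `missing` are introduced here purely for illustration.

```python
import numpy as np

# Three toy rows: row 1 is row 0 with one value missing; row 2 is unrelated.
s = np.array([[3.0, 4.0, 7.0],
              [3.0, np.nan, 7.0],
              [2.0, 3.0, 8.0]])

# s[:, None] has shape (3, 1, 3), so the comparison broadcasts to (3, 3, 3):
# element [i, j, k] compares s[j, k] with s[i, k].
equal = s == s[:, None]

# np.isnan(s) has shape (3, 3) and broadcasts along the first axis,
# so element [i, j, k] says whether s[j, k] is NaN.
missing = np.isnan(s)

# t[i, j] is True when row j agrees with row i wherever row j has a value.
t = np.all(equal | missing, axis=-1)
print(t)
```

Here `t[0, 1]` is True (row 1 fits inside row 0), while `t[1, 0]` is False, because row 0 has a value where row 1 has NaN. That asymmetry is why a complete row can act as a donor for an incomplete one but not the other way round.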

Output of the final groupby expression:

   col1  col2  col3  col4
0   3.0   4.0   7.0   7.0
1   2.0   3.0   8.0   8.0
2   8.0   6.0   9.0   9.0
3   4.0   NaN   NaN   NaN
4   2.0   3.0   8.0   8.0
5   3.0   4.0   7.0   7.0
6   NaN   5.0   7.0   6.0
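The exact-match approach leaves row 6 untouched, although the question also asks for a fallback to the most similar row. One possible sketch of such a fallback follows. It is not part of the original answer, and `fill_from_nearest_row` is a hypothetical helper, not a pandas function. It assumes that "most similar" means the smallest mean absolute difference over the columns both rows have values for; other distance measures may suit real data better.

```python
import numpy as np
import pandas as pd


def fill_from_nearest_row(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill each NaN from the most similar row that has a value there."""
    values = df.to_numpy(dtype=float)
    filled = values.copy()
    observed = ~np.isnan(values)

    for i, j in zip(*np.nonzero(~observed)):
        # Columns that are observed both in row i and in each candidate row.
        shared = observed[i] & observed
        n_shared = shared.sum(axis=1)
        # Assumed similarity measure: mean absolute difference over shared columns.
        diff = np.where(shared, np.abs(values - values[i]), 0.0)
        dist = np.full(len(values), np.inf)
        np.divide(diff.sum(axis=1), n_shared, out=dist, where=n_shared > 0)
        # A donor row must have a value in column j (this also excludes row i itself).
        dist[~observed[:, j]] = np.inf
        if np.isfinite(dist).any():
            filled[i, j] = values[np.argmin(dist), j]

    return pd.DataFrame(filled, index=df.index, columns=df.columns)


example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

result = fill_from_nearest_row(example)
print(result)
```

On the example data this sets col1 of row 6 to 3.0, as the question expects. Unlike the exact-match code, it also fills row 3 from its closest neighbour, so it is worth checking whether that more aggressive behaviour is wanted before using it.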
