# Filling missing values with values from most similar row

I have the following table, in which some values are NaN. Assume the columns are highly correlated. Comparing row 0 and row 5, I infer that the value in col2 of row 5 should be 4.0. The same reasoning applies to rows 1 and 4. For row 6, however, there is no perfectly matching row, so I should take the most similar row (in this case row 0) and replace the NaN with 3.0. How should I approach this? Is there a pandas function that can do it?

import numpy as np

import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],

"col2": [4,3,6, np.nan, 3, np.nan, 5],

"col3": [7,8,9, np.nan, np.nan, 7, 7],

"col4": [7,8,9,np.nan, np.nan, 7, 6]})

Output:

col1    col2    col3    col4

0   3.0     4.0     7.0     7.0

1   2.0     3.0     8.0     8.0

2   8.0     6.0     9.0     9.0

3   4.0     NaN     NaN     NaN

4   2.0     3.0     NaN     NaN

5   3.0     NaN     7.0     7.0

6   NaN     5.0     7.0     6.0

Try the code below for the desired output (here df is the example DataFrame from the question):

s=df.values

t=np.all((s==s[:,None])|np.isnan(s),-1)

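idx=pd.DataFrame(t).where(t).stack().dropna().index

This gets the matching pairs for each row: t[i, j] is True when row j agrees with row i in every column where row j is not NaN, and idx holds those (i, j) pairs. The dropna() call guards against pandas versions whose stack keeps NaN entries.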
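To make the broadcasting step concrete, here is a minimal sketch that builds the same mask for the question's table and lists the matching pairs. It uses np.argwhere instead of the where/stack chain purely for readability; the name pairs is illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    "col1": [3, 2, 8, 4, 2, 3, np.nan],
    "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
    "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
    "col4": [7, 8, 9, np.nan, np.nan, 7, 6],
})

s = example.to_numpy()

# s[:, None] has shape (7, 1, 4), so the comparison broadcasts to (7, 7, 4):
# entry [i, j, k] asks whether row j equals row i in column k.
# np.isnan(s) broadcasts over i and marks the NaN cells of row j.
t = np.all((s == s[:, None]) | np.isnan(s), axis=-1)

# Collect the (i, j) pairs where t is True, in row-major order.
pairs = [(int(i), int(j)) for i, j in np.argwhere(t)]
print(pairs)
```

For this table the only off-diagonal pairs are (0, 5) and (1, 4): row 5 can be completed from row 0, and row 4 from row 1. Rows 3 and 6 match only themselves.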

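df=df.reindex(idx.get_level_values(1))

df.index=idx

Reorder df to the idx we just got and label each row with its (i, j) pair, so that both levels are available for grouping.

Here, first picks up the first non-NaN value in each column.

df.groupby(level=0).transform('first').groupby(level=1).first()

Grouping twice with first gives the desired output: the first groupby copies row i's values into every row j that matches it, and the second collapses the result back to one row per original index.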
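The steps above can be collected into one runnable sketch. The function name fill_from_matching_rows is hypothetical, and np.nonzero is used here in place of the where/stack chain to build the (i, j) pairs; the grouping logic is the same double groupby with first.

```python
import numpy as np
import pandas as pd


def fill_from_matching_rows(df):
    """Fill NaNs from rows that agree on every non-NaN column (hypothetical helper)."""
    s = df.to_numpy(dtype=float)
    # t[i, j] is True when row j equals row i wherever row j is not NaN.
    t = np.all((s == s[:, None]) | np.isnan(s), axis=-1)
    i, j = np.nonzero(t)
    # One candidate row per (i, j) pair, labelled with both positions.
    candidates = df.iloc[j].copy()
    candidates.index = pd.MultiIndex.from_arrays([i, j])
    # Within each group i, copy the first non-NaN value to every candidate row.
    donated = candidates.groupby(level=0).transform("first")
    # Collapse back to one row per original position j.
    out = donated.groupby(level=1).first()
    out.index = df.index
    return out


example = pd.DataFrame({
    "col1": [3, 2, 8, 4, 2, 3, np.nan],
    "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
    "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
    "col4": [7, 8, 9, np.nan, np.nan, 7, 6],
})

result = fill_from_matching_rows(example)
print(result)
```

Rows 4 and 5 are completed from rows 1 and 0 respectively; rows without an exact match keep their NaNs.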

Out:

    col1    col2    col3    col4

0   3.0     4.0     7.0     7.0

1   2.0     3.0     8.0     8.0

2   8.0     6.0     9.0     9.0

3   4.0     NaN     NaN     NaN

4   2.0     3.0     8.0     8.0

5   3.0     4.0     7.0     7.0

6   NaN     5.0     7.0     6.0
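The output above still leaves NaNs in rows 3 and 6, because no other row matches them exactly. For the "most similar row" case from the question, one possible approach is a nearest-row fill. The sketch below is an assumption, not part of the answer above: it measures similarity as the mean squared difference over the columns both rows have, and the helper name fill_from_nearest_row is hypothetical. scikit-learn's KNNImputer implements a related idea if an extra dependency is acceptable.

```python
import numpy as np
import pandas as pd


def fill_from_nearest_row(df):
    """Fill each NaN from the most similar row that has a value in that column."""
    values = df.to_numpy(dtype=float)
    filled = values.copy()
    for i, j in np.argwhere(np.isnan(values)):
        best_dist, best_val = np.inf, np.nan
        for r in range(len(values)):
            # A donor row must have a value in the column being filled.
            if np.isnan(values[r, j]):
                continue
            # Compare only the columns that both rows have.
            shared = ~np.isnan(values[i]) & ~np.isnan(values[r])
            if not shared.any():
                continue
            dist = np.mean((values[i, shared] - values[r, shared]) ** 2)
            # Strict comparison: on ties, the earliest row wins.
            if dist < best_dist:
                best_dist, best_val = dist, values[r, j]
        filled[i, j] = best_val
    return pd.DataFrame(filled, index=df.index, columns=df.columns)


example = pd.DataFrame({
    "col1": [3, 2, 8, 4, 2, 3, np.nan],
    "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
    "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
    "col4": [7, 8, 9, np.nan, np.nan, 7, 6],
})

nearest = fill_from_nearest_row(example)
print(nearest)
```

Distances are always computed on the original values, so a filled cell never influences another fill. On this table, row 6 gets 3.0 in col1, as the question expects, and exact matches (distance 0) are still preferred wherever they exist. The nested loop is quadratic in the number of rows, which is fine for small tables but would need vectorizing for large ones.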
