Filling missing values with values from most similar row

Question

asked Jul 10, 2019 in Data Science by sourav (17.6k points)

I have the following table, in which some values are NaN. Assume the columns are highly correlated. Comparing row 0 and row 5, I would say that the value in col2 of row 5 should be 4.0. The same goes for row 1 and row 4. For row 6, however, there is no perfectly matching sample, so I should take the most similar row, in this case row 0, and change the NaN to 3.0. How should I approach this? Is there a pandas function that can do it?

import numpy as np
import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

Output:
   col1  col2  col3  col4
0   3.0   4.0   7.0   7.0
1   2.0   3.0   8.0   8.0
2   8.0   6.0   9.0   9.0
3   4.0   NaN   NaN   NaN
4   2.0   3.0   NaN   NaN
5   3.0   NaN   7.0   7.0
6   NaN   5.0   7.0   6.0

1 Answer

Shlok Pandey · Answer 1 · 2019-07-12T06:08:39+0000

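The code below fills each NaN from a row that matches exactly on every shared non-NaN value (df is a copy of the example frame from the question):

df = example.copy()
s = df.values
# t[i, j] is True when every non-NaN value of row j equals the value in row i
t = np.all((s == s[:, None]) | np.isnan(s), -1)
# (i, j) pairs for which t is True
idx = pd.DataFrame(t).where(t).stack().index
# pull in the matching row j for each pair
df = df.reindex(idx.get_level_values(1))
# label the rows with the (i, j) pairs; groupby(level=1) below needs both levels
df.index = idx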
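To see which rows the mask actually pairs up, the broadcasting step can be run on its own. This is a minimal sketch, assuming the question's data; the names `compatible` and `pairs` are illustrative, and `np.argwhere` is swapped in for the `where(t).stack()` idiom so the result does not depend on how `stack()` treats the NaN entries.

```python
import numpy as np
import pandas as pd

# Same data as the `example` frame in the question.
example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

s = example.to_numpy()

# compatible[i, j] is True when every non-NaN value of row j
# equals the value in the same column of row i.
compatible = np.all((s == s[:, None]) | np.isnan(s), axis=-1)

# (i, j) index pairs where the mask is True, in row-major order.
pairs = np.argwhere(compatible)
print(pairs.tolist())
# [[0, 0], [0, 5], [1, 1], [1, 4], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]]
```

Rows 0 and 1 pick up rows 5 and 4 as partners, while rows 3 and 6 pair only with themselves, so an exact-match approach has nothing to fill them from.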

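Here, first picks the first non-NaN value in each column of a group:

df.groupby(level=[0]).transform('first').groupby(level=1).first()

The first groupby (level 0) fills each group of compatible rows with the first non-NaN value per column; the second (level 1) collapses the repeated rows back to one row per original index. Rows 3 and 6 have no exact match, so their NaNs remain.

Out[217]:
   col1  col2  col3  col4
0   3.0   4.0   7.0   7.0
1   2.0   3.0   8.0   8.0
2   8.0   6.0   9.0   9.0
3   4.0   NaN   NaN   NaN
4   2.0   3.0   8.0   8.0
5   3.0   4.0   7.0   7.0
6   NaN   5.0   7.0   6.0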
