2 views

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a large matrix or Efficient way to get highly correlated pairs from large data set in Python or R), but I am wondering how to do it with pandas? In my case the matrix is 4460x4460, so can't do it visually.

by (41.4k points)

First unstack and then order the DataFrame:

import pandas as pd

import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()

so = s.sort_values(kind="quicksort")

print so[-4470:-4460]

Here is the output:

2192  1522    0.636198

1522  2192    0.636198

3677  2027    0.641817

2027  3677    0.641817

242   130     0.646760

130   242     0.646760

1171  2733    0.670048

2733  1171    0.670048

1000  2000    0.742340

2000  1000    0.742340

dtype: float64

If You want to learn data science with python visit this data science online course by intellipaat.