Correlation among multiple categorical variables (Pandas)

Question

asked Mar 24, 2021 in Python by laddulakshana (16.4k points)
closed Jul 7, 2023 by Balram111

I have a dataset made up of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function

DataFrame.corr(method='pearson', min_periods=1)

only computes correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I need to aggregate it myself to perform a chi-square test or something like it, and I am not sure which function to use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):

      cat1  cat2  cat3
cat1| coef  coef  coef
cat2| coef  coef  coef
cat3| coef  coef  coef

Do you have an idea of how to do this with pd.pivot_table or something similar?

Thanks in advance

4 Answers

answered Jul 7, 2023 by Balram111 (25.7k points)

Best answer

pd.pivot_table() is not the right tool here: it aggregates values and does not compute association measures. DataFrame.corr() does not help either, because Pearson, Kendall, and Spearman coefficients are defined for numerical (or at least ordered) data, not for unordered categories.

To measure the association between two categorical variables, you can use the chi-square test of independence. Here's an approach for one pair of variables; repeat it for every pair to fill the full matrix:

Create a contingency table using pd.crosstab() to count the occurrences of each combination of categorical variables.

contingency_table = pd.crosstab(df['cat1'], df['cat2'])

Apply the chi-square test of independence using scipy.stats.chi2_contingency() to obtain the chi-square statistic, p-value, and other relevant information.

from scipy.stats import chi2_contingency

chi2, p_value, _, _ = chi2_contingency(contingency_table)

You can convert the chi-square statistic into a measure of association like Cramer's V for a better understanding of the strength of the relationship between the variables.

import numpy as np
n = contingency_table.sum().sum()
# Note the parentheses: the denominator is n * (min(rows, cols) - 1)
phi_c = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))

Create a new DataFrame, with one row and one column per variable, to store the coefficients.

correlation_df = pd.DataFrame(np.nan, index=df.columns, columns=df.columns)

Fill in the entry for this pair (the matrix is symmetric), then repeat for the other pairs.

correlation_df.loc['cat1', 'cat2'] = correlation_df.loc['cat2', 'cat1'] = phi_c

Once every pair has been processed, correlation_df holds Cramer's V for each pair of categorical variables, derived from the chi-square test of independence. Values range from 0 (no association) to 1 (perfect association).
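The steps above can be wrapped in a loop over all pairs of columns. The following is a minimal sketch, not a definitive implementation: the helper names cramers_v and cramers_v_matrix and the demo data are made up for illustration, and the uncorrected form of Cramer's V is assumed.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Uncorrected Cramer's V for two categorical Series (hypothetical helper)."""
    table = pd.crosstab(x, y)
    # A constant column yields a single row or column; V is undefined there.
    if min(table.shape) < 2:
        return np.nan
    # correction=False disables Yates' correction, which SciPy would
    # otherwise apply to 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))


def cramers_v_matrix(df):
    """Symmetric DataFrame of pairwise Cramer's V, one row/column per variable."""
    cols = list(df.columns)
    # The diagonal is set to 1.0 (each variable against itself).
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        out.loc[a, b] = out.loc[b, a] = cramers_v(df[a], df[b])
    return out


# Made-up data: cat2 is a relabelling of cat1, cat3 is independent of both.
demo = pd.DataFrame({
    "cat1": list("abcabcabc"),
    "cat2": list("xyzxyzxyz"),
    "cat3": list("pppqqqrrr"),
})
result = cramers_v_matrix(demo)
print(result.round(3))
```

With 22 variables this loop runs 231 crosstabs, which is usually fast enough; the explicit loop is kept here because it is easy to read and to adapt.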

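Note: Cramer's V does not require more than two categories per variable. For two binary variables (a 2x2 table) it reduces to the absolute value of the phi coefficient, provided you pass correction=False to chi2_contingency(), which otherwise applies Yates' continuity correction to 2x2 tables. If you need a signed measure for binary variables, you might consider the phi coefficient itself or the tetrachoric correlation coefficient.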
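For the two-category case, here is a small sketch comparing Cramer's V with and without Yates' continuity correction against the phi coefficient computed directly from the 2x2 counts. The data and the column names smoker and cough are made up for illustration. SciPy's chi2_contingency() applies the correction by default when the table has one degree of freedom.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up binary data (hypothetical column names)
binary = pd.DataFrame({
    "smoker": ["y", "y", "y", "n", "n", "n", "n", "y"],
    "cough":  ["y", "y", "n", "n", "n", "n", "y", "y"],
})
table = pd.crosstab(binary["smoker"], binary["cough"])
n = table.to_numpy().sum()
k = min(table.shape) - 1

# Cramer's V from the uncorrected and the Yates-corrected statistic
chi2_raw = chi2_contingency(table, correction=False)[0]
chi2_yates = chi2_contingency(table)[0]
v_raw = np.sqrt(chi2_raw / (n * k))
v_yates = np.sqrt(chi2_yates / (n * k))

# Phi coefficient straight from the four cell counts
(a, b), (c, d) = table.to_numpy()
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(v_raw, v_yates, phi)
```

On this toy table the uncorrected V matches |phi|, while the Yates-corrected value is smaller.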

I hope this helps you visualize the correlation between your categorical variables using a chi-square-based approach!

hari_sh · Answer 1 · 2021-03-24T05:56:50+0000

You can try using pd.factorize:

df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
Out[32]:
     a    c    d
a  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0

Data input:

df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})
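To see what the factorized values look like, here is a small sketch with a made-up Series. pd.factorize assigns integer codes in order of first appearance, so the numbers are labels rather than magnitudes, and a Pearson coefficient computed on them depends on that numbering.

```python
import pandas as pd

# Made-up labels
labels = pd.Series(["red", "green", "red", "blue"])

# codes: one integer per row; uniques: the label each code stands for
codes, uniques = pd.factorize(labels)
print(list(codes), list(uniques))

# The same labels in reverse row order receive different codes
codes_rev, uniques_rev = pd.factorize(labels.iloc[::-1])
print(list(codes_rev), list(uniques_rev))
```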

Update:

import pandas as pd
from scipy.stats import chisquare

# Data input for this update
df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})
df=df.apply(lambda x : pd.factorize(x)[0])+1
pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df])
Out[123]:
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

Similu · Answer 2 · 2023-07-07T16:40:47+0000

To visualize the correlation between categorical variables in a heatmap, you can use the chi-square test of independence. Here's a concise approach:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Create contingency table
contingency_table = pd.crosstab(df['cat1'], df['cat2'])

# Apply chi-square test
chi2, p_value, _, _ = chi2_contingency(contingency_table)

# Compute Cramer's V (note the parentheses around min(...) - 1)
n = contingency_table.sum().sum()
phi_c = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))

# Create correlation DataFrame, labelled by variable name
correlation_df = pd.DataFrame([[1.0, phi_c], [phi_c, 1.0]], index=['cat1', 'cat2'], columns=['cat1', 'cat2'])

# Print correlation DataFrame
print(correlation_df)

In this concise version, the contingency table is created using pd.crosstab(). Then, the chi-square test is applied using chi2_contingency(), and Cramer's V is computed from the chi-square statistic. Cramer's V is a single number for the pair cat1/cat2, so repeating the calculation for every pair of columns gives the full matrix to draw as a heatmap.
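Once a square matrix of coefficients exists, drawing the heatmap is a separate step. Below is a minimal sketch using matplotlib; the matrix values are placeholders made up for illustration, and the name assoc is hypothetical. seaborn.heatmap would be an alternative if seaborn is installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values standing in for real pairwise coefficients
names = ["cat1", "cat2", "cat3"]
assoc = pd.DataFrame(
    [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=names, columns=names,
)

fig, ax = plt.subplots(figsize=(4, 3.5))
# Fix the colour scale to [0, 1], the range of Cramer's V
im = ax.imshow(assoc.to_numpy(), vmin=0, vmax=1, cmap="viridis")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)

# Write each coefficient inside its cell
for i in range(len(names)):
    for j in range(len(names)):
        value = assoc.iloc[i, j]
        ax.text(j, i, f"{value:.2f}", ha="center", va="center",
                color="white" if value < 0.5 else "black")

fig.colorbar(im, ax=ax, label="Cramer's V")
fig.tight_layout()
# In practice: fig.savefig("heatmap.png"), or plt.show() with an interactive backend
```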

Anamika Chakravarty · Answer 3 · 2023-07-07T16:42:28+0000

To create a correlation heatmap for categorical variables, you can use the chi-square test and Cramer's V statistic. Here's a concise version:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Create contingency table
contingency_table = pd.crosstab(df['cat1'], df['cat2'])

# Apply chi-square test and compute Cramer's V
chi2, _, _, _ = chi2_contingency(contingency_table)
n = contingency_table.sum().sum()
# The denominator is n * (min(rows, cols) - 1)
cramer_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))

# Create correlation DataFrame, labelled by variable name
correlation_df = pd.DataFrame([[1.0, cramer_v], [cramer_v, 1.0]], index=['cat1', 'cat2'], columns=['cat1', 'cat2'])

# Print correlation DataFrame
print(correlation_df)

In this concise version, the contingency table is created using pd.crosstab(). The chi-square test is applied using chi2_contingency() to calculate the chi-square statistic. Cramer's V is then computed from the chi-square statistic, the number of observations n, and the smaller dimension of the contingency table. The result describes one pair of variables; computing it for every pair yields the matrix that a heatmap would display.
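As a cross-check on the manual formula, recent SciPy versions ship scipy.stats.contingency.association, which computes Cramer's V directly from a table of counts. The sketch below assumes SciPy 1.7 or later and uses made-up data; it compares the built-in result with the hand-computed one.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Made-up data
demo = pd.DataFrame({
    "cat1": list("aabbccabca"),
    "cat2": list("xxyyzzxyzy"),
})
table = pd.crosstab(demo["cat1"], demo["cat2"])

# Manual Cramer's V from the uncorrected chi-square statistic
chi2 = chi2_contingency(table, correction=False)[0]
n = table.to_numpy().sum()
manual = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Built-in equivalent; it expects an integer array of counts
builtin = association(table.to_numpy(), method="cramer")

print(manual, builtin)
```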
