
Simply speaking, how can I apply quantile normalization to a large pandas DataFrame (roughly 2,000,000 rows) in Python?

PS. I know there is a package named rpy2 that can run R in a subprocess and use R's quantile normalization. But the truth is that R does not compute the correct result when I use a data set like the one below:

5.690386092696389541e-05,
2.051450375415418849e-05,
1.963190184049079707e-05,
1.258362869906251862e-04,
1.503352476021528139e-04,
6.881341586355676286e-06


Quantile normalization can be done easily in Python using pandas, with the following method.

Create a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df

Out:

   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value across the columns can be calculated as follows (method='first' breaks ties, so every value gets a distinct rank within its column):

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean

Out:

1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64
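As a sanity check (not part of the original answer), the same per-rank means can be obtained by sorting each column and averaging across columns, which is the textbook construction of the quantile-normalization reference distribution:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

# Mean of the sorted columns, rank by rank: [2.0, 3.0, 4.666..., 5.666...]
sorted_means = np.sort(df.to_numpy(), axis=0).mean(axis=1)

# The groupby-based computation from the answer above.
rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

print(np.allclose(sorted_means, rank_mean.to_numpy()))  # True
```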

Then the resulting Series, rank_mean, can be used as a mapping from ranks to the normalized values (here method='min' assigns tied values the same, lowest rank, so ties map to the same normalized value):

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

Out:

         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
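For a DataFrame in the millions of rows, the stack/groupby approach can be slow. A minimal sketch of an equivalent NumPy-based version that produces the same result (the function name quantile_normalize is my own, not from any library):

```python
import numpy as np
import pandas as pd

def quantile_normalize(df):
    # Reference distribution: mean of the sorted columns, rank by rank.
    # This equals the per-rank means computed with the groupby above.
    reference = np.sort(df.to_numpy(), axis=0).mean(axis=1)
    # Map each value to the reference value at its rank
    # (method='min' gives tied values the same, lowest rank).
    ranks = df.rank(method='min').astype(int).to_numpy() - 1
    return pd.DataFrame(reference[ranks], index=df.index, columns=df.columns)

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
print(quantile_normalize(df))
```

Sorting dominates the cost, so this runs in O(n log n) per column and avoids building the large stacked intermediate Series.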