What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? That is, how do I write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times?

Edit: The key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index, you lose all that information. I want the resulting df to be the same as the original, except with the order of rows or the order of columns changed.

Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:

for 1...n:
    for each col in df: shuffle column
return new_df

But hopefully more efficient than naive looping. This does not work for me:

def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values), axis=axis)
    return shuffled_df

df = pandas.DataFrame({'A': range(10), 'B': range(10)})
shuffle(df, 5)

1 Answer


NumPy's random.permutation function returns a randomly permuted copy of a sequence.

So, passing a permutation of df.index to df.reindex reorders the rows while keeping their labels:

In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [2]: df
Out[2]:
   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9

In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
   A  B
0  0  0
5  5  5
6  6  6
3  3  3
8  8  8
7  7  7
9  9  9
1  1  1
2  2  2
4  4  4
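
The reindex call above moves whole rows, so the pairing between A and B is kept. If the goal is what Edit2 describes (each column shuffled on its own, labels left in place), something along these lines might work. This is only a sketch under a few assumptions: NumPy 1.20 or newer (for Generator.permuted), all columns sharing one dtype, and a seed parameter that is not in the question and was added here just to make runs repeatable.

```python
import numpy as np
import pandas as pd


def shuffle(df, n=1, axis=0, seed=None):
    """Return a copy of df with values shuffled independently along `axis`.

    axis=0 shuffles the values within each column; axis=1 shuffles the
    values within each row. Index and column labels keep their order.
    `seed` is an illustrative extra argument, not part of the question.
    """
    rng = np.random.default_rng(seed)
    # Work on a copy of the underlying array so the input frame is untouched.
    values = df.to_numpy(copy=True)
    for _ in range(n):
        # permuted() shuffles every slice along `axis` independently
        # and returns a new array.
        values = rng.permuted(values, axis=axis)
    # Re-attach the original labels to the shuffled values.
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({'A': range(10), 'B': range(10)})
shuffled = shuffle(df, 5, seed=0)
print(shuffled)
```

Shuffling n times gives the same kind of result as shuffling once, so n is kept only to match the signature in the question. For reordering whole columns instead of whole rows, df.reindex(columns=np.random.permutation(df.columns)) follows the same pattern as the answer above.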
