I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

import numpy as np

import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]

a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]

b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]

c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]

df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)

df = df.rename_axis('ID')

gives

label   A B      C

ID

1       NaN 0.2    NaN

2       NaN NaN    0.5

3       NaN 0.2    0.5

4       0.1 0.2    NaN

5       0.1 0.2    0.5

6       0.1 NaN    0.5

7       0.1 NaN    NaN

I would like to convert this to a NumPy array, like so:

array([[ nan, 0.2, nan],

[ nan, nan, 0.5],

[ nan, 0.2, 0.5],

[ 0.1, 0.2, nan],

[ 0.1, 0.2, 0.5],

[ 0.1, nan, 0.5],
[ 0.1, nan, nan]])

How can I do this?

As a bonus, is it possible to preserve the dtypes, like this?

array([[ 1, nan, 0.2, nan],

[ 2, nan, nan, 0.5],

[ 3, nan, 0.2, 0.5],

[ 4, 0.1, 0.2, nan],

[ 5, 0.1, 0.2, 0.5],

[ 6, 0.1, nan, 0.5],

[ 7, 0.1, nan, nan]],

dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

or similar?

by (106k points)

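To convert a pandas DataFrame into a NumPy array, use its .values attribute. It is an attribute, not a method, so write df.values without parentheses; df.values() raises a TypeError. In your code, append .values after the rename_axis() call and you get the NumPy array. On pandas 0.24 and later, df.to_numpy() is the recommended equivalent.

The code is below: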

import numpy as np

import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]

a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]

b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]

c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]

df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)

df = df.rename_axis('ID').values

print(df)
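The same conversion can be sketched with DataFrame.to_numpy(), assuming pandas 0.24 or later. The name arr is only an illustration; it keeps the array separate from df so the DataFrame stays available afterwards.

```python
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]

df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index).rename_axis('ID')

# Columns only: the 'ID' index is not part of the result.
arr = df.to_numpy()

print(arr.shape)  # (7, 3)
print(arr.dtype)  # float64
```

Because all three columns are floats, the result is a plain two-dimensional float64 array with NaN where the DataFrame had missing values.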

by (32.3k points)

I would just chain DataFrame.reset_index() with the DataFrame.values attribute to get the NumPy representation of the dataframe, including the index:

In [8]: df

Out[8]:

A         B         C

0 -0.982726  0.150726  0.691625

1  0.617297 -0.471879  0.505547

2  0.417123 -1.356803 -1.013499

3 -0.166363 -0.957758  1.178659

4 -0.164103  0.074516 -0.674325

5 -0.340169 -0.293698  1.231791

6 -1.062825  0.556273  1.508058

7  0.959610  0.247539  0.091333

[8 rows x 3 columns]

In [9]: df.reset_index().values

Out[9]:

array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],

[ 1.        ,  0.61729734, -0.47187926,  0.50554728],

[ 2.        ,  0.4171228 , -1.35680324, -1.01349922],

[ 3.        , -0.16636303, -0.95775849,  1.17865945],

[ 4.        , -0.16410334,  0.0745164 , -0.67432474],

[ 5.        , -0.34016865, -0.29369841,  1.23179064],

[ 6.        , -1.06282542,  0.55627285,  1.50805754],

[ 7.        ,  0.95961001,  0.24753911,  0.09133339]])
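Note that the index shows up as 0., 1., 2. and so on: a plain ndarray has a single dtype, so the integer index is upcast to float alongside the float columns. A minimal sketch of that behaviour, using a small hypothetical frame (df_small is not the data from the transcript above):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with a default integer index.
df_small = pd.DataFrame({'A': [0.5, 1.5], 'B': [2.5, 3.5]})

# reset_index() turns the index into an int64 column named 'index'.
with_index = df_small.reset_index().to_numpy()

print(with_index.dtype)  # float64: the integer column was upcast
print(with_index[:, 0])  # [0. 1.]
```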

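In order to get the dtypes we'd need a structured array instead of a plain ndarray. Note that view only reinterprets the underlying float64 bytes, so it cannot turn the float index column into real integers; building the array from row tuples converts each field properly:

In [10]: np.array([tuple(row) for row in df.reset_index().values], dtype=[('index', int), ('A', float), ('B', float), ('C', float)])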

Out[10]:

array([( 0, -0.98272574,  0.150726  ,  0.69162512),

( 1,  0.61729734, -0.47187926,  0.50554728),

( 2,  0.4171228 , -1.35680324, -1.01349922),

( 3, -0.16636303, -0.95775849,  1.17865945),

( 4, -0.16410334,  0.0745164 , -0.67432474),

( 5, -0.34016865, -0.29369841,  1.23179064),

( 6, -1.06282542,  0.55627285,  1.50805754),

( 7,  0.95961001,  0.24753911,  0.09133339)],

dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
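As an alternative to spelling out the dtype by hand, pandas provides DataFrame.to_records(), which returns a NumPy record array with one named field per column and includes the index by default. The sketch below applies it to the DataFrame from the question; the exact integer width of the ID field depends on the platform, so it is not shown.

```python
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]

df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index).rename_axis('ID')

# Each column keeps its own dtype; the index becomes the first field.
rec = df.to_records()

print(rec.dtype.names)  # ('ID', 'A', 'B', 'C')
print(rec['ID'])        # [1 2 3 4 5 6 7]
```

Fields are accessed by name, for example rec['A'], and the ID field stays an integer rather than being upcast to float.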