However, numpy has a higher chance of being less wrong than pandas.
There is no fundamental difference between float32 and float64; however, for float32 the problems can already be observed for much smaller data sets than for float64.
It is not really well defined how the mean should be calculated: the usual mathematical definition is only unambiguous for infinitely precise numbers, not for the floating-point operations our PCs actually perform.
So what is the "right" formula?
mean = (x0 + x1 + ... + xn) / n

or

mean = ((x0 + x1) + (x2 + x3) + ...) / n

or

mean = 1.0/n * (x0 + x1 + ... + xn)
and so on...
Obviously, when calculated on modern hardware, they will all give different results - ideally one would pick the formula which makes the smallest error compared to the theoretical exact value (the one calculated with infinite precision).
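As a quick sketch of this point (the data below are arbitrary float32 values around -9.8, and np.cumsum is used only as a convenient way to force a strictly left-to-right summation in float32):

    import numpy as np

    rng = np.random.default_rng(0)                        # arbitrary fixed seed
    x = rng.normal(-9.8, 0.1, 10**6).astype(np.float32)
    n = np.float32(len(x))

    print(np.cumsum(x)[-1] / n)         # (x0 + x1 + ...)/n, summed strictly left to right
    print(x.sum() / n)                  # ((x0+x1) + (x2+x3) + ...)/n - np.sum is pairwise
    print(x.astype(np.float64).mean())  # reference, computed in float64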
Numpy uses a (slightly modified) pairwise summation, i.e. (((x1+x2)+(x3+x4))+(...)), which, even if not perfect, is known to be quite good. Bottleneck, on the other hand, uses the naive summation x1+x2+x3+...:
    REDUCE_ALL(nanmean, DTYPE0)
    {
        ...
        WHILE {
            FOR {
                ai = AI(DTYPE0);
                if (ai == ai) {      /* ai == ai is false only for NaN */
                    asum += ai;      /* <---- HERE WE GO: one naive running sum */
                    count += 1;
                }
            }
            NEXT
        }
        ...
    }
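Stripped of the macros, the loop does roughly the following (a rough Python rendering for illustration, not bottleneck's actual code; naive_nanmean is just a name I picked):

    import numpy as np

    def naive_nanmean(a):
        # one running sum in the input dtype, NaNs skipped via ai == ai
        asum = a.dtype.type(0)
        count = 0
        for ai in a:
            if ai == ai:        # false only for NaN
                asum += ai      # naive accumulation - the problematic step
                count += 1
        return asum / count if count else np.nan

    print(naive_nanmean(np.array([1.0, np.nan, 2.0], dtype=np.float32)))  # -> 1.5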
From this we can easily see what is going on: after some steps, bottleneck adds one large number (the sum of all previous elements, proportional to -9.8*number_of_steps) and one small one (about -9.8), which leads to quite a rounding error of about big_number*eps, with eps being around 1e-7 for float32. This means that after 10^6 summations we could have a relative error of about 10% (eps*10^6; this is an upper bound).
For float64, with eps being about 1e-16, the relative error after 10^6 summations would be only about 1e-10. That might seem precise to us, but measured against the achievable precision it is still a fiasco!
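To make these orders of magnitude tangible (a small sketch; np.cumsum again only serves to force the naive, strictly sequential summation in the array's own dtype):

    import numpy as np

    x32 = np.full(10**6, -9.8, dtype=np.float32)    # a million identical values
    x64 = x32.astype(np.float64)
    exact = np.float64(x32[0])                      # true mean of identical values

    naive32 = np.cumsum(x32)[-1] / np.float32(len(x32))
    naive64 = np.cumsum(x64)[-1] / len(x64)

    print(abs(naive32 - exact) / abs(exact))   # float32: roughly eps*n in the worst case
    print(abs(naive64 - exact) / abs(exact))   # float64: tiny, but still far above eps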
Numpy on the other hand (at least for the series at hand) will add two elements which are of almost equal magnitude. In this case the upper bound for the resulting relative error is eps*log_2(n), which is
- at most 2e-6 for float32 and 10^6 elements
- at most 2e-15 for float64 and 10^6 elements.
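For illustration, here is a textbook version of pairwise (cascade) summation - numpy's real implementation is a blocked and unrolled variant of the same idea, so this is a sketch of the principle, not numpy's code:

    import numpy as np

    def pairwise_sum(a):
        # split the array in half, sum both halves recursively, add the two
        # partial sums - the partial sums have similar magnitude, so the
        # final additions lose very little precision
        if len(a) <= 128:               # short base block, summed sequentially
            return np.cumsum(a)[-1]
        mid = len(a) // 2
        return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

    x = np.full(10**6, -9.8, dtype=np.float32)
    exact = np.float64(x[0]) * len(x)

    print(abs(pairwise_sum(x) - exact) / abs(exact))   # within the ~eps*log_2(n) bound
    print(abs(np.sum(x) - exact) / abs(exact))         # numpy's built-in pairwise sum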
From the above there are, among others, the following noteworthy implications:
- if the mean of the distribution is 0, then pandas and numpy are almost equally precise: the magnitude of the running sum stays around 0.0, so there is no big difference between the summands which would lead to a large rounding error in the naive summation.
- if one knows a good estimate for the mean, it might be more robust to calculate the sum of x'_i = x_i - mean_estimate, because the x'_i will have a mean of 0.0 (see the sketch after this list).
- something like x = (.333*np.ones(1000000)).astype(np.float32) is enough to trigger the strange behaviour of pandas' version - no need for randomness, and we know what the result should be, don't we? It is important that 0.333 cannot be represented precisely in floating point.
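A sketch of the last two points (np.cumsum once more stands in for the naive, bottleneck-style summation; the seed, the noise level and the estimate -9.8 are arbitrary choices):

    import numpy as np

    # deterministic float32 data are already enough to see the drift
    x = (0.333 * np.ones(10**6)).astype(np.float32)
    print(np.cumsum(x)[-1] / np.float32(len(x)),   # naive, bottleneck-style mean
          np.float64(x[0]))                        # exact mean of identical values

    # subtracting a rough mean estimate keeps the running sum small
    rng = np.random.default_rng(0)
    y = rng.normal(-9.8, 0.1, 10**6).astype(np.float32)
    exact = y.astype(np.float64).mean()

    estimate = np.float32(-9.8)                    # any decent guess will do
    naive = np.cumsum(y)[-1] / np.float32(len(y))
    shifted = estimate + np.cumsum(y - estimate)[-1] / np.float32(len(y))

    print(abs(naive - exact), abs(shifted - exact))   # the shifted sum is far closer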
NB: The above holds for 1-dimensional numpy arrays. The situation is more complicated when summing along an axis of a multi-dimensional numpy array, as numpy sometimes switches to naive summation. For a more detailed investigation see this SO-post, which also explains @Mark Dickinson's observation, i.e. that the results of

    np.ones((2, 10**8), dtype=np.float32).mean(axis=1)

are accurate, while those of

    np.ones((10**8, 2), dtype=np.float32).mean(axis=0)

aren't.
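A direct way to reproduce this (careful: each of the two arrays needs about 800 MB; the explanation of the outputs follows the linked post, and the exact values may depend on the numpy version):

    import numpy as np

    a = np.ones((2, 10**8), dtype=np.float32)
    b = np.ones((10**8, 2), dtype=np.float32)

    print(a.mean(axis=1))   # reduction along the contiguous axis uses pairwise
                            # summation -> [1. 1.]
    print(b.mean(axis=0))   # reduction along the outer axis falls back to a naive
                            # running sum per column; the float32 sum saturates
                            # around 2**24, so the means come out far below 1.0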