However, numpy has a higher chance of being less wrong than pandas.
There is no fundamental difference between float32 and float64; however, for float32 the problems can already be observed for much smaller data sets than for float64.
It is not really well defined how the mean should be calculated: the usual mathematical definition is only unambiguous for infinitely precise numbers, not for the floating-point operations our PCs actually perform.
So what is the "right" formula?
mean = (x0 + x1 + ... + xn) / n

or

mean = ((x0 + x1) + (x2 + x3) + ...) / n

or

mean = 1.0/n * (x0 + x1 + ... + xn)
and so on...
Obviously, when calculated on modern hardware, they will all give different results - ideally one would pick the formula which makes the smallest error compared to the theoretical exact value (the one calculated with infinite precision).
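As a quick sketch of this point (the data below are arbitrary float32 values around -9.8, and np.cumsum is used only as a convenient way to force a strictly left-to-right summation in float32):

    import numpy as np

    rng = np.random.default_rng(0)                        # arbitrary fixed seed
    x = rng.normal(-9.8, 0.1, 10**6).astype(np.float32)
    n = np.float32(len(x))

    print(np.cumsum(x)[-1] / n)         # (x0 + x1 + ...)/n, summed strictly left to right
    print(x.sum() / n)                  # ((x0+x1) + (x2+x3) + ...)/n - np.sum is pairwise
    print(x.astype(np.float64).mean())  # reference, computed in float64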
Numpy uses a (slightly modified) pairwise summation, i.e. (((x1+x2)+(x3+x4))+(...)), which, even if not perfect, is known to be quite good. Bottleneck, on the other hand, uses the naive summation x1+x2+x3+...:
    REDUCE_ALL(nanmean, DTYPE0)
    {
        ...
        WHILE {
            FOR {
                ai = AI(DTYPE0);
                if (ai == ai) {      /* ai == ai is false only for NaN */
                    asum += ai;      /* <---- HERE WE GO: one naive running sum */
                    count += 1;
                }
            }
            NEXT
        }
        ...
    }
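Stripped of the macros, the loop does roughly the following (a rough Python rendering for illustration, not bottleneck's actual code; naive_nanmean is just a name I picked):

    import numpy as np

    def naive_nanmean(a):
        # one running sum in the input dtype, NaNs skipped via ai == ai
        asum = a.dtype.type(0)
        count = 0
        for ai in a:
            if ai == ai:        # false only for NaN
                asum += ai      # naive accumulation - the problematic step
                count += 1
        return asum / count if count else np.nan

    print(naive_nanmean(np.array([1.0, np.nan, 2.0], dtype=np.float32)))  # -> 1.5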
From this we can easily see what is going on: after some steps, bottleneck adds one large number (the sum of all previous elements, proportional to -9.8*number_of_steps) and one small one (about -9.8), which leads to quite a rounding error of about big_number*eps, with eps being around 1e-7 for float32. This means that after 10^6 summations we could have a relative error of about 10% (eps*10^6; this is an upper bound).
For float64, with eps being about 1e-16, the relative error after 10^6 summations would be only about 1e-10. That might seem precise to us, but measured against the achievable precision it is still a fiasco!
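To make these orders of magnitude tangible (a small sketch; np.cumsum again only serves to force the naive, strictly sequential summation in the array's own dtype):

    import numpy as np

    x32 = np.full(10**6, -9.8, dtype=np.float32)    # a million identical values
    x64 = x32.astype(np.float64)
    exact = np.float64(x32[0])                      # true mean of identical values

    naive32 = np.cumsum(x32)[-1] / np.float32(len(x32))
    naive64 = np.cumsum(x64)[-1] / len(x64)

    print(abs(naive32 - exact) / abs(exact))   # float32: roughly eps*n in the worst case
    print(abs(naive64 - exact) / abs(exact))   # float64: tiny, but still far above eps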
Numpy on the other hand (at least for the series at hand) will add two elements which are of almost equal magnitude. In this case the upper bound for the resulting relative error is eps*log_2(n), which is
- at most 2e-6 for float32 and 10^6 elements
- at most 2e-15 for float64 and 10^6 elements.
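For illustration, here is a textbook version of pairwise (cascade) summation - numpy's real implementation is a blocked and unrolled variant of the same idea, so this is a sketch of the principle, not numpy's code:

    import numpy as np

    def pairwise_sum(a):
        # split the array in half, sum both halves recursively, add the two
        # partial sums - the partial sums have similar magnitude, so the
        # final additions lose very little precision
        if len(a) <= 128:               # short base block, summed sequentially
            return np.cumsum(a)[-1]
        mid = len(a) // 2
        return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

    x = np.full(10**6, -9.8, dtype=np.float32)
    exact = np.float64(x[0]) * len(x)

    print(abs(pairwise_sum(x) - exact) / abs(exact))   # within the ~eps*log_2(n) bound
    print(abs(np.sum(x) - exact) / abs(exact))         # numpy's built-in pairwise sum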
From the above there are, among others, the following noteworthy implications:
- if the mean of the distribution is 0, then pandas and numpy are almost equally precise: the magnitude of the running sum stays around 0.0, so there is no big difference between the summands which would lead to a large rounding error in the naive summation.
- if one knows a good estimate for the mean, it might be more robust to calculate the sum of x'_i = x_i - mean_estimate, because the x'_i will have a mean of 0.0 (see the sketch after this list).
- something like x = (.333*np.ones(1000000)).astype(np.float32) is enough to trigger the strange behaviour of pandas' version - no need for randomness, and we know what the result should be, don't we? It is important that 0.333 cannot be represented precisely in floating point.
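A sketch of the last two points (np.cumsum once more stands in for the naive, bottleneck-style summation; the seed, the noise level and the estimate -9.8 are arbitrary choices):

    import numpy as np

    # deterministic float32 data are already enough to see the drift
    x = (0.333 * np.ones(10**6)).astype(np.float32)
    print(np.cumsum(x)[-1] / np.float32(len(x)),   # naive, bottleneck-style mean
          np.float64(x[0]))                        # exact mean of identical values

    # subtracting a rough mean estimate keeps the running sum small
    rng = np.random.default_rng(0)
    y = rng.normal(-9.8, 0.1, 10**6).astype(np.float32)
    exact = y.astype(np.float64).mean()

    estimate = np.float32(-9.8)                    # any decent guess will do
    naive = np.cumsum(y)[-1] / np.float32(len(y))
    shifted = estimate + np.cumsum(y - estimate)[-1] / np.float32(len(y))

    print(abs(naive - exact), abs(shifted - exact))   # the shifted sum is far closer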
NB: The above holds for 1-dimensional numpy arrays. The situation is more complicated when summing along an axis of a multi-dimensional numpy array, as numpy sometimes switches to naive summation. For a more detailed investigation see this SO-post, which also explains @Mark Dickinson's observation, i.e. that the results of

    np.ones((2, 10**8), dtype=np.float32).mean(axis=1)

are accurate, while those of

    np.ones((10**8, 2), dtype=np.float32).mean(axis=0)

aren't.
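A direct way to reproduce this (careful: each of the two arrays needs about 800 MB; the explanation of the outputs follows the linked post, and the exact values may depend on the numpy version):

    import numpy as np

    a = np.ones((2, 10**8), dtype=np.float32)
    b = np.ones((10**8, 2), dtype=np.float32)

    print(a.mean(axis=1))   # reduction along the contiguous axis uses pairwise
                            # summation -> [1. 1.]
    print(b.mean(axis=0))   # reduction along the outer axis falls back to a naive
                            # running sum per column; the float32 sum saturates
                            # around 2**24, so the means come out far below 1.0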