Closed
Labels: Needs Info (clarification about behavior needed to assess issue)
Description
Minimal example
Requires bottleneck, numpy and pandas to be installed:
>>> import numpy as np
>>> import pandas as pd
>>> import bottleneck as bn
>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean() # wrong
0.5
>>> bn.nanmean(data) # wrong
0.5
>>> data.mean() # correct
1.0
Problem description
The mean() of a large float32 Series is wrong when bottleneck is installed. Uninstalling bottleneck or converting the data to float64 is a valid workaround. xarray is, or has been, affected too; see pydata/xarray#1346.
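The workarounds can be sketched roughly as follows. This is a minimal sketch, not a recommendation from the pandas maintainers. The `compute.use_bottleneck` option is an addition not mentioned above; it is assumed here to have the same effect as uninstalling bottleneck.

```python
import numpy as np
import pandas as pd

data = np.ones(2**25, dtype=np.float32)
series = pd.Series(data)

# Workaround 1: convert to float64 before reducing.
mean_f64 = series.astype(np.float64).mean()

# Workaround 2 (assumption: equivalent to uninstalling bottleneck):
# keep float32, but ask pandas not to dispatch to bottleneck.
with pd.option_context("compute.use_bottleneck", False):
    mean_no_bn = series.mean()

print(mean_f64, mean_no_bn)
```

Converting to float64 doubles the memory used by the data, so disabling bottleneck may be preferable for very large Series.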
Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow, so I am not sure whether this should still be considered a bug in bottleneck. Either way, since it seems quite severe, I want to raise attention here too.
Update: This is not an overflow but a numerical (rounding) error, which becomes very large because bottleneck does not use pairwise summation.
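To illustrate why the result is exactly 0.5, here is a small sketch of sequential versus pairwise accumulation. The helpers `naive_sum` and `pairwise_sum` are hypothetical and written only for this illustration; they are not bottleneck's or NumPy's actual code. To keep the pure-Python loop fast, the experiment is scaled down to float16, which stalls at 2**11 the same way float32 stalls at 2**24.

```python
import numpy as np

def naive_sum(values):
    """Add elements one by one, keeping the accumulator in the array's dtype."""
    acc = values.dtype.type(0)
    for v in values:
        acc = acc + v  # rounded to the array's dtype after every addition
    return acc

def pairwise_sum(values, block=8):
    """Recursively sum halves so that partial sums stay similar in size."""
    if len(values) <= block:
        return naive_sum(values)
    mid = len(values) // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)

# float32 has a 24-bit significand: 2**24 + 1 is not representable,
# so adding 1.0 to 2**24 rounds back to 2**24.
f32_stuck = np.float32(2**24) + np.float32(1.0) == np.float32(2**24)

# Scaled-down analog: float16 has an 11-bit significand and stalls at 2**11.
small = np.ones(2**12, dtype=np.float16)
naive_mean = float(naive_sum(small)) / small.size
pairwise_mean = float(pairwise_sum(small)) / small.size

print(f32_stuck, naive_mean, pairwise_mean)  # True 0.5 1.0
```

Once the sequential accumulator reaches 2**24 in float32, every further `+ 1.0` is rounded away, so the sum of 2**25 ones ends at 2**24 and the mean comes out as 0.5.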
Bottleneck's implementation of mean().
Related issues
- Same thing in xarray: "bottleneck : Wrong mean for float32 array" (pydata/xarray#1346)
- Same thing in aospy: "Inaccuracy of some operations (e.g. stacked average) when underlying data is float32" (spencerahill/aospy#217)
- Similar bug in pandas with bottleneck, but related to int rather than float: "BUG: int64 overflow/wrap around with sum()" (#15453) and "int64 overflow/wrap around with nansum()" (pydata/bottleneck#163)
- Another very similar but older bug in pandas with bottleneck, also related to int rather than float: "Bug in pd.Series.mean()" (#6915) and "BUG: nansum platform overflow" (pydata/bottleneck#83)
- Something different, not to be confused with this issue, involving race conditions when reading netCDF files: "Issues (wrong result) when computing the mean on a NetCDF file" (dask/dask#2095)
- Probably not related, but who knows: "Unexpected behavior for bn.move_std with float32 array" (pydata/bottleneck#164)
Expected Output
>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean()
1.0
Output of pd.show_versions()
$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
...
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-13-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: None
pip: 19.0.2
setuptools: 40.8.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: 0.11.3
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None