Wrong result for float32 Series when using bottleneck #25307

@lumbric

Minimal example

Requires bottleneck, numpy and pandas to be installed:

>>> import numpy as np
>>> import pandas as pd
>>> import bottleneck as bn
>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean()  # wrong
0.5
>>> bn.nanmean(data)  # wrong
0.5
>>> data.mean()  # correct
1.0

Problem description

Series.mean() returns a wrong result for a large float32 Series when bottleneck is installed. Uninstalling bottleneck or casting the data to float64 works around the problem. xarray is, or has been, affected as well; see pydata/xarray#1346.
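The two workarounds can be sketched as follows. The helper names `mean_as_float64` and `mean_without_bottleneck` are made up for illustration. The second helper assumes that pandas' `compute.use_bottleneck` option controls whether bottleneck is used for this reduction in the installed pandas version, which is an alternative to uninstalling bottleneck outright. A small array is used only to exercise the calls; it is not large enough to trigger the wrong result.

```python
import numpy as np
import pandas as pd


def mean_as_float64(series):
    """Upcast to float64 before averaging so the accumulator has more precision."""
    return series.astype(np.float64).mean()


def mean_without_bottleneck(series):
    """Compute the mean with pandas' bottleneck option switched off for this call.

    Assumption: the `compute.use_bottleneck` option governs the code path used
    by Series.mean() in the installed pandas version.
    """
    with pd.option_context("compute.use_bottleneck", False):
        return series.mean()


# Small float32 Series of ones, just to show how the helpers are called.
data = pd.Series(np.ones(2**20, dtype=np.float32))
print(float(mean_as_float64(data)))
print(float(mean_without_bottleneck(data)))
```

The float64 upcast doubles the memory needed for the data, so disabling bottleneck may be preferable for very large Series.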

Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow, so I am not sure whether this still counts as a bug in bottleneck. Since it seems quite severe, I want to raise attention here too.

Update: This is not an overflow but a numerical (rounding) error, which becomes very large because bottleneck does not use pairwise summation.
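To illustrate why the result is exactly 0.5, here is a minimal sketch in plain NumPy scalars. It is not bottleneck's or NumPy's actual code; `sequential_sum_f32` and `pairwise_sum_f32` are hypothetical helpers, and the block size of 8 is an arbitrary choice. A float32 has a 24-bit significand, so a single float32 accumulator stops growing once it reaches 2**24: adding 1.0 rounds back to 2**24. Summing 2**25 ones that way yields 2**24, and 2**24 / 2**25 = 0.5.

```python
import numpy as np


def sequential_sum_f32(values, start=0.0):
    """Add values one by one into a single float32 accumulator."""
    total = np.float32(start)
    for v in values:
        total = np.float32(total + np.float32(v))
    return total


def pairwise_sum_f32(values):
    """Sum by recursively halving the input so partial sums stay similar in size."""
    n = len(values)
    if n <= 8:  # arbitrary small block size for this sketch
        return sequential_sum_f32(values)
    mid = n // 2
    return np.float32(pairwise_sum_f32(values[:mid]) + pairwise_sum_f32(values[mid:]))


# 2**24 + 1 is not representable in float32, so the sum rounds back to 2**24.
limit = np.float32(2**24)
stuck = limit + np.float32(1.0)
print(float(stuck))  # 16777216.0

# Start just below the limit to show the saturation without 2**24 loop iterations.
ones = np.ones(2**12, dtype=np.float32)
saturated = sequential_sum_f32(ones[:8], start=2**24 - 4)
print(float(saturated))  # 16777216.0, although the exact sum is 16777220

# Pairwise summation of the same kind of data is exact here.
pairwise = pairwise_sum_f32(ones)
print(float(pairwise))  # 4096.0
```

With pairwise summation over 2**25 ones, the two top-level halves each sum to 2**24, and 2**24 + 2**24 = 2**25 is exactly representable, so the mean comes out as 1.0.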

Bottleneck's implementation of mean().

Related issues

Expected Output

>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean()
1.0

Output of pd.show_versions()

$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
...
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-13-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.2
setuptools: 40.8.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: 0.11.3
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: Needs Info (Clarification about behavior needed to assess issue)