Correlation inconsistencies between Series and DataFrame #20954
Comments
So it seems we are taking the name into account when aligning these for Series [45]. The DataFrame corr effectively ignores this, e.g. [46].
So I think it's reasonable to align (e.g. you match index values) but ignore the name. Would take a PR for this; it might be slightly tricky as the magic is done in
I was looking into this, when I realized that I made a mistake reporting the bug. In A further remark: I was still confused why the results of the correlation were not consistent. Theoretically the result for
Whenever at least one vector of data has a standard deviation of zero, the resulting correlation should always show the same results. In the example above
which explains these inconsistencies. I guess this bug report can be closed unless this numerical problem needs to be discussed any further. I am sorry for the inconvenience.
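The numerical problem mentioned above can be illustrated without pandas. The following is a minimal sketch, not the pandas implementation: the helpers `mean` and `pearson` are hypothetical, and the input values are made up for illustration. It shows that a constant vector gives an exactly zero denominator only when its mean is computed without rounding error; otherwise tiny nonzero deviations remain and a finite value comes out instead of NaN.

```python
import math

def mean(values):
    # Plain left-to-right accumulation of the values, then division by the count.
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def pearson(x, y):
    """Naive Pearson correlation; returns NaN only if the denominator is exactly zero."""
    mx, my = mean(x), mean(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    sxy = sxx = syy = 0.0
    for a, b in zip(dx, dy):
        sxy += a * b
        sxx += a * a
        syy += b * b
    denom = math.sqrt(sxx * syy)
    if denom == 0.0:
        return float("nan")
    return sxy / denom

# Constant vector whose mean is computed exactly: zero denominator, so NaN.
r_exact = pearson([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])

# Constant vector whose mean picks up rounding error: the denominator is
# tiny but nonzero, so the guard above does not trigger.
r_noisy = pearson([0.1, 0.1, 0.1], [1.0, 2.0, 3.0])
mean_is_exact = mean([0.1, 0.1, 0.1]) == 0.1

print(r_exact, r_noisy, mean_is_exact)
```

Whether a given library returns NaN or a finite number for such data therefore depends on how it accumulates sums internally, which is one plausible source of the differing results reported here.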
Well, this happens because of the formula for correlation, which is as follows:
Now, per your given table, subtracting a_mean from a gives 0, so the denominator becomes 0; hence the answer is
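The formula from this comment did not survive on the page. Assuming the commenter meant the standard Pearson correlation coefficient, with $a$ and $b$ the two series and $\bar{a}$, $\bar{b}$ their means (the symbols are chosen here for illustration), it reads:

$$
r_{ab} = \frac{\sum_{i} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i} (a_i - \bar{a})^2}\,\sqrt{\sum_{i} (b_i - \bar{b})^2}}
$$

If every $a_i$ equals $\bar{a}$, both the numerator and the first factor of the denominator are zero, so the expression is the undefined form $0/0$.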
After doing some experimentation I can conclude with great confidence that the inconsistency between See my answer here https://stackoverflow.com/a/75833486/7012917
Sample Code
Problem description
For some reason pandas.DataFrame.corr() and pandas.Series.corr(other) show different behavior. In general, the correlation between two Series is not defined when one Series does not have varying values, e.g. s_a or s_c, as the denominator of the correlation function evaluates to zero, resulting in a division by zero. However, the correlation function defined in DataFrame somehow manages to evaluate something, as shown in the following result:
The above results also do not match those obtained when working with Series, although a match should be expected(?). Note that I have explicitly put NaNs at the identities, since e.g. s_b.corr(s_b) does yield an error.
Another problem is that by using the existing data instead of newly created series, we get different results.
I hope I did not miss anything.
Expected Output
Both methods in Series and DataFrame should produce the same output.
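The original sample code is not preserved on this page, so the following is only a rough sketch of the kind of comparison described. The data and the names `s_a`, `s_b`, and `df` are assumptions made for illustration; the results depend on the pandas version, so no expected values are given.

```python
import pandas as pd

# Hypothetical data: s_a is constant (zero variance), s_b varies.
s_a = pd.Series([1.0, 1.0, 1.0, 1.0], name="a")
s_b = pd.Series([1.0, 2.0, 3.0, 4.0], name="b")
df = pd.concat([s_a, s_b], axis=1)

# Pairwise correlation via the Series method.
series_result = s_a.corr(s_b)

# The same pair taken from the DataFrame correlation matrix.
frame_matrix = df.corr()
frame_result = frame_matrix.loc["a", "b"]

print("Series.corr:   ", series_result)
print("DataFrame.corr:", frame_result)
```

If the two methods are consistent, both printed values should agree (both NaN, or both the same number) for any input.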
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None