Skip to content

BUG: Index.sort_values and Series.sort_values reverse duplicate order when ascending=False #35922

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
AlexKirko opened this issue Aug 27, 2020 · 3 comments · Fixed by #37310
Closed
3 tasks done
Assignees
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Index Related to the Index class or subclasses

Comments

@AlexKirko
Copy link
Member

AlexKirko commented Aug 27, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

IN:
import pandas as pd
ind = pd.Index([1, 1, 2, 3])

ind, indexer = ind.sort_values(ascending=False, return_indexer=True)
indexer
OUT:
array([3, 2, 1, 0], dtype=int64)

Problem description

Came across this one while fixing #35584 in #35604. sort_values reverses duplicate order when ascending=False. This is clearest when calling Index.sort_values, because it can return an indexer, but it's also true for Series.sort_values and it propagates to DataFrame.sort_values.

#35604 will make sorting in descending order stable for most Index types (leveraging nargsort from sorting.py), but the problem will remain for datetime-like index types and for Series and will require fixing.

Expected Output

array([3, 2, 0, 1], dtype=int64)

Duplicates should maintain order when descending=False. This will also let us leverage the same sorting algorithm both for Index and Series.

Additional use cases

Some additional use cases from the PR.

s = pd.Series(["A", "AA", "BB", "CAC"], dtype="object")
s.sort_values(ascending=False, key=lambda ser: ser.str.len())

OUT:
3    CAC
2     BB
1     AA
0      A
dtype: object

I don't think that swapping is expected here.

Then consider that you might be sorting a DataFrame with several columns, and a column with duplicates might be the first one. In this case you likely wouldn't expect a descending sort to change duplicate order. Or you could be using something like nlargest and get weirdness because there is a descending sort in there and it swaps elements.

Obviously, we could get by with a convention that we always revert duplicate order with a descending sort by being careful, but I believe keeping duplicate order is cleaner. In cases where it doesn't matter, it's the same, and when it does matter (as in nlargest and the like), you don't need to remember that you need extra reversals.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d0ca4b3
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+4054.gd0ca4b347
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.23.3
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : 0.6.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.3.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

@AlexKirko AlexKirko added Bug Needs Triage Issue that has not been reviewed by a pandas team member Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Index Related to the Index class or subclasses and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 27, 2020
@AlexKirko
Copy link
Member Author

take

@simonjayhawkins
Copy link
Member

@AlexKirko i'm getting the expected output on master.

>>> import pandas as pd
>>> ind = pd.Index([1, 1, 2, 3])
>>>
>>> ind, indexer = ind.sort_values(ascending=False, return_indexer=True)
>>> indexer
array([3, 2, 0, 1], dtype=int64)
>>> pd.__version__
'1.2.0.dev0+314.g44e933a765'
>>>

did #35604 fix this?

@AlexKirko
Copy link
Member Author

@simonjayhawkins Sorry for taking so long to get back to you. I got sick and was out of commision for a couple of weeks.

#35604 fixed this for non-datetime-like Index subtypes. It still needs fixing for other subclasses and the Series object. I'll get back to working on it this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants