Came across this one while fixing #35584 in #35604. sort_values reverses duplicate order when ascending=False. This is clearest when calling Index.sort_values, because it can return an indexer, but it's also true for Series.sort_values and it propagates to DataFrame.sort_values.
#35604 will make sorting in descending order stable for most Index types (leveraging nargsort from sorting.py), but the problem will remain for datetime-like index types and for Series and will require fixing.
Expected Output
array([3, 2, 0, 1], dtype=int64)
Duplicates should maintain their original order when ascending=False. This will also let us use the same sorting algorithm for both Index and Series.
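To make the difference concrete, here is a minimal pure-Python sketch, not the pandas implementation. The input list and both helper names are hypothetical; the input was chosen so that the stable result matches the indexer shown above. It contrasts reversing a stable ascending argsort with a descending argsort that keeps ties in their original order.

```python
def argsort_reversed_ascending(values):
    """Stable ascending argsort, then reversed: ties come out in reverse order."""
    ascending = sorted(range(len(values)), key=values.__getitem__)
    return ascending[::-1]


def argsort_stable_descending(values):
    """Descending argsort that keeps ties in their original order."""
    # Python's sorted() stays stable even when reverse=True.
    return sorted(range(len(values)), key=values.__getitem__, reverse=True)


values = [1, 1, 2, 3]  # hypothetical input: duplicates at positions 0 and 1

print(argsort_reversed_ascending(values))  # [3, 2, 1, 0] (duplicates swapped)
print(argsort_stable_descending(values))   # [3, 2, 0, 1] (duplicates kept in order)
```

The first helper mirrors the behaviour described in this report, and the second mirrors the expected output.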
As an additional use case from the PR, consider sorting a DataFrame by several columns where the first column contains duplicates. In this case you likely wouldn't expect a descending sort to change the order of the duplicates. Or you could be using something like nlargest and get surprising results, because it performs a descending sort internally and that sort swaps equal elements.
Obviously, we could get by with a convention that a descending sort always reverses duplicate order, as long as we are careful, but I believe keeping duplicate order is cleaner. In cases where it doesn't matter, the result is the same, and when it does matter (as in nlargest and the like), you don't need to remember to add extra reversals.
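A small sketch of the nlargest concern, again in plain Python rather than pandas. The rows and the two helper functions are made up for illustration. When the top-n selection is built on a reversed ascending sort, a tie on the sort key picks the later row first, whereas a stable descending sort keeps the earlier one.

```python
# Hypothetical (label, score) rows; "b" and "c" tie on the score.
rows = [("a", 2), ("b", 3), ("c", 3), ("d", 1)]


def top_n_reversed(rows, n):
    """Top n by score via ascending sort + reversal: tied rows are flipped."""
    return sorted(rows, key=lambda r: r[1])[::-1][:n]


def top_n_stable(rows, n):
    """Top n by score via a stable descending sort: tied rows keep their order."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


print(top_n_reversed(rows, 1))  # [('c', 3)]
print(top_n_stable(rows, 1))    # [('b', 3)]
```

With the stable variant, callers get the first occurrence among equal keys without having to add a compensating reversal.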
AlexKirko added the Bug, Needs Triage, Algos, and Index labels and removed the Needs Triage label on Aug 27, 2020.
@simonjayhawkins Sorry for taking so long to get back to you. I got sick and was out of commission for a couple of weeks.
#35604 fixed this for non-datetime-like Index subtypes. It still needs fixing for other subclasses and the Series object. I'll get back to working on it this week.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d0ca4b3
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None
pandas : 0.26.0.dev0+4054.gd0ca4b347
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200714
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.23.3
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : 0.6.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.3.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0