-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: multi-column sort_values speedup #17141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Speedup achieved for sorting by multiple columns: lexsort over multiple columns replaced with mergesorting columns in reverse order than specified in "by" argument of "sort_values".
Codecov Report
@@ Coverage Diff @@
## master #17141 +/- ##
==========================================
- Coverage 91.02% 91.01% -0.02%
==========================================
Files 162 163 +1
Lines 49539 49552 +13
==========================================
+ Hits 45095 45101 +6
- Misses 4444 4451 +7
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add asv's for several columns (2 and pick one with 5 say), use different dtypes for the columns (don't all have to be different, but use datatime, int, string).
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -247,7 +247,7 @@ Performance Improvements | |||
~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
- Improved performance of instantiating :class:`SparseDataFrame` (:issue:`16773`) | |||
|
|||
- Improved performance of :function:`sort_values` function of :class:`DataFrame` when sorting by multiple columns (:issue:`16995`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:func:`DataFrame.sort_values`
pandas/core/frame.py
Outdated
@@ -3418,48 +3417,48 @@ def sort_values(self, by, axis=0, ascending=True, inplace=False, | |||
if is_sequence(ascending) and len(by) != len(ascending): | |||
raise ValueError('Length of ascending (%d) != length of by (%d)' % | |||
(len(ascending), len(by))) | |||
if not isinstance(ascending, (tuple, list)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_list_like
indexer = _ensure_platform_int(indexer) | ||
else: | ||
from pandas.core.sorting import nargsort | ||
for i, by_step in enumerate(by[::-1]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do
new_data = self._data
prior to the start of the loop (then just use new_data
it always exists), then you don't need the if/else conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one I had to change to new_data = self because later I use an xs indexer. There might be nicer way of writing the code (without xs) but I need to dive more into internals of Pandas, this is my first attempt to help with development.
- fixed doc syntax - added asv tests - changed initialization of `new_data` variable. Suggested change to `self._data` was not suitable, because later we use `new_data.xs` so `new_data` should be NDFrame instance.
Thanks @jreback for feedback and hints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls show the resulting comparisons vs master
|
||
def setup(self): | ||
N = 1000000 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't need to be so big 100000 is ok
# self.df.iloc[10:20] = np.nan | ||
|
||
def time_frame_sort_values_by_two_columns(self): | ||
self.df.sort_values(by=['D', 'E']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a few more that only have those dtypes
e.g. only string only int only dt and use categorical string as well
can you rebase / update |
Hello @mficek! Thanks for updating the PR.
Comment last updated on August 29, 2017 at 14:30 Hours UTC |
Speedup achieved for sorting by multiple columns: lexsort over multiple columns replaced with mergesorting columns in reverse order than specified in "by" argument of "sort_values".
Please find below some results from ASV benchmark. As I observed before, when string-typed column is sorted, then my proposed method is slow. I have to find some time and benchmark the sorting code to find out if that is caused by copying of underlaying data or rather missing lexsort, which is in original version. I'm not sure if making some "hybrid" sort based on column data-types would be a good way to continue.
|
@mficek The cost of factorization (IOW we turn columns into Categoricals) is higher for strings, particularly less dense strings. We need to hash them which is compartively slow (compared to integer factoriation). Though this shouldn't impact too much. Can you show a profile output for sorting say [strings|ints] and [ints|ints] with the same cardinality.? |
So the result of sorting str|int and int|int with same cardinality is as follows:
Not nice. @jreback, any suggestions? |
I profiled the int-int and str-int functions. Bellow is the profiler output. I'll try to figure out why np.argsort takes four times longer for str-int benchmark than for int-int.
|
@mficek hmm that looks odd
|
@jreback I spotted this discussion, observing that np.sort might be faster than np.argsort. Do you think this could be a way? |
@mficek did you have a link w.r.t. #17141 (comment) ? |
Sorry, @jreback, it's here: http://numpy-discussion.10968.n7.nabble.com/argsort-speed-td36423.html |
so i think the speed diff is coming from the partial sortedness that is already present you could could not do the multi sort of this or for all int columns i suppose not sure you can actually use np.sort as we always deal with the indexers (we need to eventually do a take) but it maybe possible to factorize then use np.sort on the factorized values |
pls rebase and update when you can |
@jreback Hi Jeff, I'm running out of time to finish this issue. What I can propose is a "hacky" solution with sort speedup for numeric (and categorical and dates) values only. I suggest to check for objects with fallback to current sorting logic, and sorting as coded now in this branch for numeric-like values. If you agree, I'll proceed with this. Hm? |
@mficek yep that would be totally fine. We can open an issue for the remaining case to investigate later. |
can you rebase and update as above |
- :attr:`Series.dt` no longer performs frequency inference, yielding a large speedup when accessing the attribute (:issue:`17210`) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.21.1
can you rebase |
can you rebase / show asv's |
@jreback Hi Jeff, I give up :-D Today I rebased the code and the speedup simply does not work. For random numbers, I get pretty nice speedup (around 0.3-4 of original code), but for repeated numbers it's twice as long :( I don't have any ideas how to proceed further. Moreover, I don't understand why sorting one numeric column takes longer with my code, I'm not aware of any changes in the logic of sorting for one-column sort.
|
I messed up with the branches, If you want to see the code, I can submit a new PR. Sorry for the hassle. |
@mficek sorry about that. could you submit a PR just with the new asvs? (these have been changed around a lot recently). with fixed seeds. Then we can separately evaluate whether the fix is good or not. |
Speedup achieved for sorting by multiple columns: lexsort over multiple columns
replaced with mergesort of columns in reverse order than specified in "by"
argument of DataFrame.sort_values()
git diff upstream/master -u -- "*.py" | flake8 --diff