[Review] Add fast path for multi-column sorting #229
Conversation
Codecov Report
```diff
@@            Coverage Diff            @@
##              main     #229      +/-   ##
===========================================
- Coverage   100.00%   99.61%    -0.39%
===========================================
  Files           64       64
  Lines         2590     2609       +19
  Branches       361      367        +6
===========================================
+ Hits          2590     2599        +9
- Misses           0        8        +8
- Partials         0        2        +2
```
Continue to review full report at Codecov.
Cool, I like that! Once we are sure about the testing (as you have said) we can merge.
Regarding the multi-column sorting in dask: one problem is (as you mentioned) the quantile method; another one, however, is how to handle the index, as Dask currently does not support multi-index (multi-column) divisions (as far as I recall). How do you handle this in dask-cudf?
Sorry, stupid me. The code of course works for a single column.
I'm going to be taking this PR over from @quasiben - could I get workflow approval?
Thanks @charlesbluca, I approved so you should be able to get CI runs now.
@nils-braun if you have time I think this is ready for another review.
FYI I am currently working on implementing descending sort support in dask-cudf's `sort_values`; if that gets merged first, we could update the sorting logic here to reflect this change.
In some cases, dask-cudf's sorting doesn't match up with dask-sql's; for example:

```python
import cudf
import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"user_id": [2, 1, 2, 3], "b": [3, 3, 1, 3]})

c = Context()
c.create_table("gdf", df, gpu=True, npartitions=2)
c.create_table("pdf", df, gpu=False, npartitions=2)

got, expect = (c.sql(
    """
    SELECT
        *
    FROM %s
    ORDER BY user_id
    """ % s
).compute() for s in ("gdf", "pdf"))

print(got)
print(expect)
```
```
   user_id  b
0        1  3
0        2  3
1        2  1
2        3  3
   user_id  b
0        1  3
0        2  1
1        2  3
2        3  3
```
I'm not sure if dask-sql's sorting of rows with identical values in the sort-by column is intentional, but for now I worked around this by testing the SQL-sorted dask-cudf table against the same table sorted by calling
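A minimal sketch of that workaround (the exact call is elided above; the table setup and `assert_eq` usage here are assumptions, not the PR's actual test code):

```python
import cudf
import dask_cudf
from dask.dataframe.utils import assert_eq
from dask_sql import Context

# Sketch: compare the SQL-sorted result against the *same* dask-cudf table
# sorted directly with sort_values, so both sides order tied rows the same way.
gdf = dask_cudf.from_cudf(
    cudf.DataFrame({"user_id": [2, 1, 2, 3], "b": [3, 3, 1, 3]}), npartitions=2
)

c = Context()
c.create_table("gdf", gdf)

got = c.sql("SELECT * FROM gdf ORDER BY user_id").compute()
expect = gdf.sort_values(by="user_id").compute()

assert_eq(got, expect, check_index=False)
```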
```python
if dask_cudf is not None and isinstance(df, dask_cudf.DataFrame):
    if (all(sort_ascending) or not any(sort_ascending)) and not any(
        sort_null_first[1:]
    ):
```
Really not a fan of these single partition checks, but this is reflective of cuDF's `sort_values` behavior, which can only do:

- a list for `ascending` with no support for `na_position`
- a single boolean for `ascending` with support for `na_position`

Ideally we would want cuDF to match Pandas' behavior and support `na_position` when providing a list for `ascending`, which would simplify this whole block to:

```python
if df.npartitions == 1 and not any(sort_null_first[1:]):
    return df.map_partitions(
        M.sort_values,
        by=sort_columns,
        ascending=sort_ascending,
        na_position="first" if sort_null_first[0] else "last",
    )
```
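For reference, pandas already accepts this combination today, which is what the comment above is asking cuDF to match (a toy illustration, not code from this PR):

```python
import pandas as pd

# pandas supports a per-column `ascending` list together with `na_position`.
df = pd.DataFrame({"a": [2.0, 1.0, None, 2.0], "b": [1, 3, 2, 0]})
print(df.sort_values(by=["a", "b"], ascending=[True, False], na_position="first"))
```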
Can you include these comments in the code so we can go back and fix at a later date? Also, can you file an issue with cuDF to help track support for `na_position` with a list for `ascending`? Hmm, are these cudf or dask-cudf issues?
These are cuDF-specific - right now ascending sort / null positioning are still WIP in dask-cudf.
Ah right. Ok, then maybe comments with links to those PRs and we can continue iterating
Also something that might be worth discussing is the eventual upstreaming of this logic - in the long run, it might make sense to have these `map_partitions` calls done implicitly in `df.sort_values` if we see that `df.npartitions == 1`. This would allow us to simplify the logic here to just a single `sort_values` call that is optimal regardless of partition count.
I discussed that possibility for Dask briefly here:
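A rough sketch of what that implicit handling could look like inside a `sort_values` implementation (hypothetical code, not the actual dask or dask-cudf source):

```python
from dask.utils import M


def sort_values(df, by, ascending=True, na_position="last"):
    # Hypothetical sketch: a single-partition collection is already globally
    # sorted by a plain partition-wise sort, so the shuffle path can be skipped.
    if df.npartitions == 1:
        return df.map_partitions(
            M.sort_values, by=by, ascending=ascending, na_position=na_position
        )
    # The multi-partition case would go through the usual divisions/shuffle
    # logic, omitted in this sketch.
    raise NotImplementedError("multi-partition path omitted in this sketch")
```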
rerun tests
Closes #9158. Adds support for descending sort in dask-cudf's `sort_values`, which should open up the use cases for it in dask-contrib/dask-sql#229.

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9250
This PR adds an optional fast path for multi-column sorting for libraries like dask-cudf, which has a working implementation. The implementation currently does not support the ascending/na_position kwargs -- we might be able to do something similar to what is currently implemented in `_sort_first_column`, where we check for NAs/descending. For now, though, if the user needs those options we fall back to the default implementation.
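A rough sketch of the dispatch described above, assuming hypothetical helper names (`apply_fast_sort` and `default_sort` are placeholders, not the PR's actual functions):

```python
try:
    import dask_cudf
except ImportError:  # GPU support is optional; the fast path only applies to dask-cudf
    dask_cudf = None


def apply_fast_sort(df, sort_columns, sort_ascending, sort_null_first, default_sort):
    # Sketch: take the multi-column fast path only when dask-cudf can handle it,
    # i.e. a plain all-ascending sort with NULLs in their default position.
    if (
        dask_cudf is not None
        and isinstance(df, dask_cudf.DataFrame)
        and all(sort_ascending)
        and not any(sort_null_first)
    ):
        return df.sort_values(by=sort_columns)
    # Descending columns or NULLs-first requests fall back to the default
    # (first-column based) implementation, as described above.
    return default_sort(df, sort_columns, sort_ascending, sort_null_first)
```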
We can add tests for this PR after #227 is resolved.
Lastly, if we want to resolve multi-column sorting for Dask itself, we could use a similar implementation to dask-cudf's, which relies on multi-column quantiles. This would mean adding a `quantiles` method to pandas.

cc @rjzamora
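A small pandas sketch of the quantile idea, just to illustrate the missing piece (the helper name and exact approach are assumptions; a real distributed implementation would compute approximate per-partition quantiles rather than sorting the whole frame):

```python
import numpy as np
import pandas as pd


def multi_column_divisions(df, by, npartitions):
    # Sketch: take the rows sitting at evenly spaced quantile positions of the
    # lexicographic ordering of the sort columns and use them as division
    # boundaries for a shuffle. pandas has no built-in multi-column quantile
    # helper, which is the gap mentioned above.
    ordered = df.sort_values(by=by).reset_index(drop=True)
    positions = np.linspace(0, len(ordered) - 1, npartitions + 1).round().astype(int)
    return ordered.loc[positions, by]


df = pd.DataFrame({"user_id": [2, 1, 2, 3], "b": [3, 3, 1, 3]})
print(multi_column_divisions(df, ["user_id", "b"], npartitions=2))
```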