Skip to content

[Review] Add fast path for multi-column sorting #229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Oct 7, 2021

Conversation

quasiben
Copy link
Contributor

This PR adds an optional fast path for multi-column sorting for libraries like Dask-cuDF was has a work implementation. The implementation currently does not support ascending/na_position kwargs -- we might be able to do something similar to what is currently implemented in _sort_first_column where we check for NAs/Descending. For now, though, if the users needs those options the fall back is to the default implementation.

We can add tests for this PR after #227 is resolved.

Lastly, if we want to resolve multi-column sorting for Dask, we could use similar implementation in dask-cudf which relies on multi-column quantiles . This would mean adding a quantiles method to pandas

In [30]: df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})

In [31]: cdf = cudf.from_pandas(df)

In [32]: cdf[['c', 'a']].quantiles([0, 0.5, 1])
Out[32]:
     c   a
0.0  0   1
0.5  2  12
1.0  5  11

In [33]: df[['c', 'a']].quantile([0, 0.5, 1])
Out[33]:
       c     a
0.0  0.0   0.0
0.5  2.0   2.0
1.0  5.0  12.0

In [34]:

cc @rjzamora

@codecov-commenter
Copy link

codecov-commenter commented Aug 27, 2021

Codecov Report

Merging #229 (49d6cf2) into main (4d5f7dd) will decrease coverage by 0.38%.
The diff coverage is 50.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##              main     #229      +/-   ##
===========================================
- Coverage   100.00%   99.61%   -0.39%     
===========================================
  Files           64       64              
  Lines         2590     2609      +19     
  Branches       361      367       +6     
===========================================
+ Hits          2590     2599       +9     
- Misses           0        8       +8     
- Partials         0        2       +2     
Impacted Files Coverage Δ
dask_sql/physical/utils/sort.py 83.05% <50.00%> (-16.95%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d5f7dd...49d6cf2. Read the comment docs.

Copy link
Collaborator

@nils-braun nils-braun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I like that! Once we are sure about the testing (as you have said) we can merge.

Regarding the multi-column sorting in dask: one problem is (as you mentioned) the quantile method, another one is however: how to handle the index, as Dask does currently not support multi-index (multi-column) divisions (as far as I recall). How do you handle this in dask-cudf?

@nils-braun
Copy link
Collaborator

nils-braun commented Aug 28, 2021

(side note: why is the code coverage still 100%... Shouldn't the sort_values function not be implemented? Strange. I am currently not on my laptop, but I will need to check later)

Sorry, stupid me. The code of course works for a single column.

@charlesbluca
Copy link
Collaborator

I'm going to be taking this PR over from @quasiben - could I get workflow approval?

@quasiben
Copy link
Contributor Author

I'm going to be taking this PR over from @quasiben - could I get workflow approval?

Thanks @charlesbluca I approved so you should be able to get CI runs now

@quasiben
Copy link
Contributor Author

quasiben commented Sep 2, 2021

@nils-braun if you have time I think this is ready for another review

@quasiben quasiben changed the title [WIP] Add fast path for multi-column sorting [Review] Add fast path for multi-column sorting Sep 7, 2021
@charlesbluca
Copy link
Collaborator

FYI I am currently working on implementing ascending in Dask's sort_values:

If that gets merged first, we could update the sorting logic here to reflect this change

@charlesbluca
Copy link
Collaborator

In some cases, dask-cudf's sorting doesn't match up with dask-sql's; for example:

import cudf
import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"user_id": [2, 1, 2, 3], "b": [3, 3, 1, 3]})

c = Context()

c.create_table("gdf", df, gpu=True, npartitions=2)
c.create_table("pdf", df, gpu=False, npartitions=2)

got, expect = (c.sql(
    """
SELECT
    *
FROM %s
ORDER BY user_id
""" % s
).compute() for s in ("gdf", "pdf"))

print(got)
print(expect)

   user_id  b
0        1  3
0        2  3
1        2  1
2        3  3
   user_id  b
0        1  3
0        2  1
1        2  3
2        3  3

I'm not sure if the dask-sql's sorting of rows with identical values in the sort-by column is intentional, but for now I worked around this by testing the SQL sorted dask-cudf table against the same table sorted by calling sort_values(by=..., ignore_index=True) directly.

if dask_cudf is not None and isinstance(df, dask_cudf.DataFrame):
if (all(sort_ascending) or not any(sort_ascending)) and not any(
sort_null_first[1:]
):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really not a fan of these single partition checks, but this is reflective of cuDF's sort_values behavior, which can only do:

  • a list for ascending with no support for na_position
  • a single boolean for ascending with support for na_position

Ideally we would want cuDF to match Pandas' behavior and support na_position when providing a list for ascending, which would simplify this whole block to:

     if df.npartitions == 1 and not any(sort_null_first[1:]):
         return df.map_partitions(
             M.sort_values,
             by=sort_columns,
             ascending=sort_ascending,
             na_position="first" if sort_null_first[0] else "last",
         )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include these comments in the code so we can go back an fix at a later date. Also, can you file an issue with cuDF to help track support for na_position with a list for ascending ? Hmm, are these cudf or dask-cudf issues ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are cuDF specific - right now ascending sort / null positioning are still WIP in dask-cudf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. Ok, then maybe comments with links to those PRs and we can continue iterating

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also something that might be worth discussing is the eventual upstreaming of this logic - in the long run, it might make sense to have these map_partition calls done implicitly in df.sort_values if we see that df.npartitions == 1. This would allow us to simplify the logic here to just a single sort_values call that is optimal regardless of partition count.

I discussed that possibility for Dask briefly here:

dask/dask#8225 (comment)

@quasiben
Copy link
Contributor Author

quasiben commented Oct 7, 2021

rerun tests

@charlesbluca charlesbluca merged commit dcf93ee into dask-contrib:main Oct 7, 2021
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Oct 13, 2021
Closes #9158 

Adds support for descending sort in dask-cudf's `sort_values`, which should open up the use cases for it in dask-contrib/dask-sql#229.

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9250
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants