Skip to content

[FEA] Support na_position for cudf.DataFrame.sort_values when ascending is sequence of booleans #9400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
charlesbluca opened this issue Oct 7, 2021 · 0 comments · Fixed by #9455
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@charlesbluca
Copy link
Member

Is your feature request related to a problem? Please describe.
It is not currently possible to provide sort_values with a list of boolean for ascending as well as a non-default value for na_position. Doing so raises a warning, and does not take the value of na_position into account:

In [1]: import cudf

In [2]: gdf = cudf.DataFrame({'a': [1, 1, 2, 2, 2, None], 'b': [1, 2, 3, 4, 5, 6]})

In [3]: gdf.sort_values(by=["a", "b"], ascending=[True, True], na_position="first")
/raid/charlesb/dev/rapids/cudf/python/cudf/cudf/core/frame.py:3128: UserWarning: When using a sequence of booleans for `ascending`, `na_position` flag is not yet supported and defaults to treating nulls as greater than all numbers
  warnings.warn(
Out[3]: 
      a  b
0     1  1
1     1  2
2     2  3
3     2  4
4     2  5
5  <NA>  6

Describe the solution you'd like
It would be nice if cuDF had the same behavior as Pandas here, which is able to handle both the list of ascending values and the na_position:

In [4]: pdf = gdf.to_pandas()

In [5]: pdf.sort_values(by=["a", "b"], ascending=[True, True], na_position="first")
Out[5]: 
     a  b
5  NaN  6
0  1.0  1
1  1.0  2
2  2.0  3
3  2.0  4
4  2.0  5

Describe alternatives you've considered
I'm pretty sure cuDF still properly sorts the columns with null values, so it should be possible to split the resulting dataframe up into null and non-null parts and concat it together to achieve the desired null positioning - I imagine this is sub-optimal.

Additional context
This is coming up while doing work on dask-sql sorting - in some cases, we can rely directly on cuDF/Pandas sort_values, but additional checks need to be made for cuDF due to this behavior. See dask-contrib/dask-sql#229 (comment) for more details.

@charlesbluca charlesbluca added feature request New feature or request Needs Triage Need team to review and classify labels Oct 7, 2021
@beckernick beckernick added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Oct 8, 2021
@isVoid isVoid self-assigned this Oct 15, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 16, 2021
closes #9400 

This PR augments `order_by` cython API to accept a list of `null_precedence` arguments per column to support `na_position` working with a list of `ascending`.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9455
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants