Skip to content

[REVIEW] Fast path when possible for non numeric aggregation #236

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Oct 12, 2021

Conversation

VibhuJawa
Copy link
Collaborator

@VibhuJawa VibhuJawa commented Sep 3, 2021

This PR adds fast path when possible for non numeric aggregation. The PR speeds up aggregations by avoiding the apply based path here.

Motivating Example

%%time
df =  pd.DataFrame({'grp_col': np.arange(0,100_000).repeat(3),
                    'p_date': [np.datetime64('2019-08', 'D'),np.datetime64('2019-07', 'D'),np.datetime64('2018-07', 'D')]*100_000,
                     })

df = dd.from_pandas(df, npartitions=4)
dc.create_table('table_1', df)

query = '''
        select 
            MIN(p_date) as f_date
        from 
            table_1
        group by
            grp_col
        '''
res = dc.sql(query)
res.compute()

Mainline

CPU times: user 39.5 s, sys: 451 ms, total: 39.9 s
Wall time: 37.1 s

With this PR (236):

CPU times: user 3.84 s, sys: 180 ms, total: 4.02 s
Wall time: 1.58 s

@VibhuJawa VibhuJawa marked this pull request as ready for review September 3, 2021 18:43
@VibhuJawa VibhuJawa changed the title [WIP] Fast path when possible for non numeric aggregation [REVIEW] Fast path when possible for non numeric aggregation Sep 3, 2021
@VibhuJawa
Copy link
Collaborator Author

The PR should be ready for an inital review.

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look great and have a huge performance gains for these specific cases. Added a few comments inline.

It would be great to include some test cases to test_groupby as well that add code coverage for this case.

@VibhuJawa VibhuJawa changed the title [REVIEW] Fast path when possible for non numeric aggregation [WIP] Fast path when possible for non numeric aggregation Sep 9, 2021
@VibhuJawa
Copy link
Collaborator Author

VibhuJawa commented Sep 23, 2021

The changes look great and have a huge performance gains for these specific cases. Added a few comments inline.

Thanks , have addressed all of your reviews, I think this is ready for another review.

It would be great to include some test cases to test_groupby as well that add code coverage for this case.

I have included test cases for StringArray here, ea48c7a in test_compatibility.py .

Adding similar tests was not possible due to not being able to test equality with datetime columns due to a pandas issue with testing equality b/w string and datetime columns. (Raised the issue here: pd.testing pandas-dev/pandas#43707) .

Added Datetime tests with commit 6598259

For test cases corresponding to cudf/dask_cudf , i think we should first let https://github.com/dask-contrib/dask-sql/pull/229/files land to prevent any merge conflicts, Happy to figure out a workaround if we would really like dask-cudf specific test cases.

@VibhuJawa VibhuJawa changed the title [WIP] Fast path when possible for non numeric aggregation [REVIEW] Fast path when possible for non numeric aggregation Sep 23, 2021
@charlesbluca
Copy link
Collaborator

rerun tests

@codecov-commenter
Copy link

codecov-commenter commented Oct 12, 2021

Codecov Report

Merging #236 (3eb6f4b) into main (767a23c) will decrease coverage by 0.11%.
The diff coverage is 85.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #236      +/-   ##
==========================================
- Coverage   99.61%   99.50%   -0.12%     
==========================================
  Files          64       64              
  Lines        2608     2622      +14     
  Branches      367      372       +5     
==========================================
+ Hits         2598     2609      +11     
- Misses          8        9       +1     
- Partials        2        4       +2     
Impacted Files Coverage Δ
dask_sql/physical/rel/logical/aggregate.py 97.87% <85.71%> (-2.13%) ⬇️
dask_sql/physical/rex/core/call.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 767a23c...3eb6f4b. Read the comment docs.

@galipremsagar
Copy link
Collaborator

rerun tests

Copy link
Collaborator

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; thanks for the work here @VibhuJawa 😄

@charlesbluca charlesbluca merged commit 2868ba3 into dask-contrib:main Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants