PERF: DataFrame.iloc[int] for EA dtypes #54508


Merged: mroeschke merged 2 commits into pandas-dev:main on Aug 23, 2023


Member

@lukemanley commented Aug 12, 2023

Performance improvement for DataFrame.iloc[int] when the DataFrame is backed by ExtensionArray (EA) dtypes. Most visible on wide frames.

import pandas as pd
import numpy as np

data = np.random.randn(4, 10_000)

df_wide = pd.DataFrame(data, dtype="float64[pyarrow]")
%timeit df_wide.iloc[1]

# 1.33 s ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 98.6 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR


df_wide = pd.DataFrame(data, dtype="Float64")
%timeit df_wide.iloc[1]

# 97.9 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- main
# 51.2 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR

Also visible with DataFrame reductions of EA dtypes:

import pandas as pd
import numpy as np

data = np.random.randn(4, 10_000)

df_wide = pd.DataFrame(data, dtype="float64[pyarrow]")
%timeit df_wide.sum(axis=0)

# 3.16 s ± 57.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
# 1.66 s ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

df_wide = pd.DataFrame(data, dtype="Float64")
%timeit df_wide.sum(axis=0)

# 1.49 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
# 1.17 s ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR
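The %timeit lines above need IPython. As a rough sketch for rerunning the same measurements from a plain Python script, the stdlib timeit module can be used instead. The helper name best_time, the reduced column count, and the repeat/number values are assumptions chosen for illustration, not part of this PR, and the numbers it prints will not match the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Fewer columns than the 10_000 used above so the script finishes quickly
# (an assumption for this sketch, not the PR's benchmark size).
data = np.random.randn(4, 500)
df_wide = pd.DataFrame(data, dtype="Float64")  # masked EA dtype, no pyarrow needed


def best_time(func, repeat=3, number=3):
    """Hypothetical helper: best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


iloc_seconds = best_time(lambda: df_wide.iloc[1])
sum_seconds = best_time(lambda: df_wide.sum(axis=0))

print(f"iloc[1]:     {iloc_seconds * 1e3:.2f} ms")
print(f"sum(axis=0): {sum_seconds * 1e3:.2f} ms")

# Keep one result around to sanity-check what was timed.
row = df_wide.iloc[1]
```

Swapping the dtype string for "float64[pyarrow]" (with pyarrow installed) should exercise the other case from the description.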

@lukemanley added the Indexing, Performance, and ExtensionArray labels Aug 12, 2023
@lukemanley added this to the 2.1 milestone Aug 12, 2023
cls = dtype.construct_array_type()
result = cls._empty((n,), dtype=dtype)
if isinstance(dtype, ExtensionDtype):
    result = np.empty(n, dtype=object)
Member

i expect this will be bad for e.g. DatetimeTZDtype

Member Author

I just tried it for DatetimeTZDtype("ns", "UTC") and it seems to be ok - and more performant in that case as well.

I think it's ok since each element is pulled out individually which ensures wrapping in Timestamp.
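A minimal sketch of the point above, assuming the row is assembled by pulling one scalar per column into an object-dtype buffer and converting once at the end. The two-column frame and the variable names are made up for illustration; this is not the code from the PR.

```python
import numpy as np
import pandas as pd

dtype = pd.DatetimeTZDtype("ns", "UTC")
df = pd.DataFrame(
    {
        "a": pd.date_range("2023-01-01", periods=3, tz="UTC").astype(dtype),
        "b": pd.date_range("2023-02-01", periods=3, tz="UTC").astype(dtype),
    }
)

loc = 1
buffer = np.empty(df.shape[1], dtype=object)
for i in range(df.shape[1]):
    # Scalar access on a tz-aware column returns a pd.Timestamp,
    # so the object buffer holds Timestamps rather than raw integers.
    buffer[i] = df.iloc[:, i].array[loc]

# One conversion back to the tz-aware extension dtype at the end.
row = pd.array(buffer, dtype=dtype)
print(row)
```

Because every element in the buffer is already a tz-aware Timestamp, the final conversion keeps the timezone information.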

@jbrockmendel
Member

Is the issue that the relevant EA.__setitem__ methods are non-performant?

@lukemanley
Member Author

> Is the issue that the relevant EA.__setitem__ methods are non-performant?

Yes. In the case of pyarrow, it is really non-performant to iteratively set each element.
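To make the comparison concrete, here is a hedged sketch that contrasts setting elements one at a time on an extension array with filling an object-dtype NumPy buffer and converting once. The function names and the array size are hypothetical, the values come from a plain ndarray rather than from per-column arrays, and the dtype falls back to "Float64" when pyarrow is not installed, so the timings are only indicative.

```python
import timeit

import numpy as np
import pandas as pd

try:
    import pyarrow  # noqa: F401
    dtype_name = "float64[pyarrow]"
except ImportError:
    # Fallback so the sketch still runs without pyarrow (assumption for illustration).
    dtype_name = "Float64"

values = np.random.randn(1_000)


def fill_via_setitem(values, dtype_name):
    """Set each element directly on the extension array."""
    result = pd.array(np.zeros(len(values)), dtype=dtype_name)
    for i, value in enumerate(values):
        result[i] = value  # one EA.__setitem__ call per element
    return result


def fill_via_object_buffer(values, dtype_name):
    """Fill an object ndarray, then convert to the extension dtype once."""
    buffer = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        buffer[i] = value  # cheap ndarray item assignment
    return pd.array(buffer, dtype=dtype_name)


setitem_seconds = timeit.timeit(lambda: fill_via_setitem(values, dtype_name), number=1)
buffer_seconds = timeit.timeit(lambda: fill_via_object_buffer(values, dtype_name), number=1)
print(f"{dtype_name}: setitem loop {setitem_seconds:.4f} s, object buffer {buffer_seconds:.4f} s")
```

Both functions produce the same values; only the cost of getting there differs.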

@mroeschke merged commit e8b9749 into pandas-dev:main Aug 23, 2023
@mroeschke
Member

Thanks @lukemanley

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 23, 2023
mroeschke pushed a commit that referenced this pull request Aug 23, 2023
…dtypes) (#54700)

Backport PR #54508: PERF: DataFrame.iloc[int] for EA dtypes

Co-authored-by: Luke Manley <[email protected]>
@jbrockmendel
Member

I'm not wild about this. Seems to be papering over a hacky __setitem__ implementation for the ArrowEA, which really should just be immutable.

@lukemanley
Member Author

> I'm not wild about this. Seems to be papering over a hacky __setitem__ implementation for the ArrowEA, which really should just be immutable.

Fair enough. The __setitem__ performance for ArrowEA was pretty rough here. It was copying the new array for every element set. We could revert, but that would reintroduce that behavior. Is there some change you would like to see in the near term? (revert, TODO note, special-case ArrowEA, something else?)

@lukemanley deleted the perf-fast-xs-ea branch September 6, 2023 00:54
@jbrockmendel
Member

> Is there some change you would like to see in the near term? (revert, TODO note, special-case ArrowEA, something else?)

Maybe a TODO note pointing back at the relevant part of this thread?
