PERF: DataFrame.iloc[int] for EA dtypes #54508

lukemanley · 2023-08-12T01:41:30Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.1.0.rst file if fixing a bug or adding a new feature.

Improves the performance of DataFrame.iloc[int] when the frame is backed by ExtensionArray (EA) dtypes. The gain is most visible on wide frames.

import pandas as pd
import numpy as np

data = np.random.randn(4, 10_000)

df_wide = pd.DataFrame(data, dtype="float64[pyarrow]")
%timeit df_wide.iloc[1]

# 1.33 s ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 98.6 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR


df_wide = pd.DataFrame(data, dtype="Float64")
%timeit df_wide.iloc[1]

# 97.9 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- main
# 51.2 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR
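The `%timeit` lines above require IPython. As a rough plain-Python equivalent, the sketch below uses the standard-library `timeit` module. The helper name `time_row_lookup` is made up for illustration, and the example sticks to the masked `Float64` dtype so that pyarrow is not required. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_row_lookup(dtype, n_rows=4, n_cols=10_000, number=3):
    """Return the mean seconds per `df.iloc[1]` call (hypothetical helper)."""
    data = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data, dtype=dtype)
    # timeit.timeit returns the total time for `number` calls
    total = timeit.timeit(lambda: df.iloc[1], number=number)
    return total / number


if __name__ == "__main__":
    for dtype in ("float64", "Float64"):
        seconds = time_row_lookup(dtype)
        print(f"{dtype}: {seconds * 1e3:.1f} ms per df.iloc[1]")
```

Passing `"float64[pyarrow]"` as `dtype` should reproduce the first case, assuming pyarrow is installed.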

Also visible with DataFrame reductions of EA dtypes:

import pandas as pd
import numpy as np

data = np.random.randn(4, 10_000)

df_wide = pd.DataFrame(data, dtype="float64[pyarrow]")
%timeit df_wide.sum(axis=0)

# 3.16 s ± 57.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
# 1.66 s ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

df_wide = pd.DataFrame(data, dtype="Float64")
%timeit df_wide.sum(axis=0)

# 1.49 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- main
# 1.17 s ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

jbrockmendel · 2023-08-12T08:02:02Z

pandas/core/internals/managers.py

-            cls = dtype.construct_array_type()
-            result = cls._empty((n,), dtype=dtype)
+        if isinstance(dtype, ExtensionDtype):
+            result = np.empty(n, dtype=object)


i expect this will be bad for e.g. DatetimeTZDtype

I just tried it for DatetimeTZDtype("ns", "UTC") and it seems to be ok - and more performant in that case as well.

I think it's ok since each element is pulled out individually which ensures wrapping in Timestamp.
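To illustrate the point about Timestamp wrapping, here is a minimal sketch using only public pandas API. It is not the code in `BlockManager.fast_xs`; the helper `row_from_columns` is hypothetical. It pulls one scalar out of each column into an object ndarray and then builds the extension array once at the end.

```python
import numpy as np
import pandas as pd


def row_from_columns(columns, loc, dtype):
    """Gather element `loc` of every column into one array (hypothetical helper)."""
    buf = np.empty(len(columns), dtype=object)
    for i, col in enumerate(columns):
        # Scalar access on a tz-aware datetime array returns a pd.Timestamp,
        # so the object buffer holds Timestamps rather than raw integers.
        buf[i] = col[loc]
    # Build the extension array once from the collected scalars.
    return pd.array(buf, dtype=dtype)


dtype = pd.DatetimeTZDtype("ns", "UTC")
base = pd.date_range("2023-01-01", periods=3, tz="UTC")
columns = [(base + pd.Timedelta(days=k)).array for k in range(4)]

row = row_from_columns(columns, 1, dtype)
print(type(columns[0][1]).__name__, row.dtype)
```

Because every element passes through scalar `__getitem__`, the timezone information travels with each value into the final array.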

jbrockmendel · 2023-08-12T08:02:40Z

Is the issue that the relevant EA.__setitem__ methods are non-performant?

lukemanley · 2023-08-12T10:03:34Z

Is the issue that the relevant EA.__setitem__ methods are non-performant?

Yes. In the case of pyarrow, it is really non-performant to iteratively set each element.
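For reference, the two fill strategies under discussion can be sketched side by side. This is an illustration only: both function names are made up, `_empty` is private pandas API taken from the removed diff lines, and the example uses the masked `Float64` dtype so it runs without pyarrow. It shows that the two paths produce the same values and makes no timing claim of its own.

```python
import numpy as np
import pandas as pd


def fill_via_setitem(scalars, dtype):
    """Allocate an empty EA, then set elements one by one (old pattern)."""
    cls = dtype.construct_array_type()
    result = cls._empty((len(scalars),), dtype=dtype)  # private API, per the diff
    for i, value in enumerate(scalars):
        result[i] = value  # one EA.__setitem__ call per element
    return result


def fill_via_object_buffer(scalars, dtype):
    """Collect scalars in an object ndarray, then convert once (new pattern)."""
    buf = np.empty(len(scalars), dtype=object)
    for i, value in enumerate(scalars):
        buf[i] = value  # plain ndarray assignment
    return pd.array(buf, dtype=dtype)


dtype = pd.Float64Dtype()
scalars = [1.5, pd.NA, 3.0]
a = fill_via_setitem(scalars, dtype)
b = fill_via_object_buffer(scalars, dtype)
print(a.isna().tolist() == b.isna().tolist())
```

With an array type whose `__setitem__` copies on every call, the first function does one copy per element, while the second pays a single conversion cost at the end.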

mroeschke · 2023-08-23T00:23:58Z

Thanks @lukemanley

jbrockmendel · 2023-08-31T16:56:44Z

I'm not wild about this. Seems to be papering over a hacky __setitem__ implementation for the ArrowEA, which really should just be immutable.

lukemanley · 2023-09-01T01:18:15Z

I'm not wild about this. Seems to be papering over a hacky __setitem__ implementation for the ArrowEA, which really should just be immutable.

Fair enough. The __setitem__ performance for ArrowEA was pretty rough here. It was copying the array every time an element was set. We could revert, but that would reintroduce the behavior. Is there some change you would like to see in the near term? (revert, TODO note, special-case ArrowEA, something else?)

jbrockmendel · 2023-09-06T15:34:34Z

Is there some change you would like to see in the near term? (revert, TODO note, special-case ArrowEA, something else?)

Maybe a TODO note pointing back at the relevant part of this thread?

improve perf of fast_xs for EA dtypes

3acae7b

lukemanley added the Indexing, Performance, and ExtensionArray labels Aug 12, 2023

lukemanley added this to the 2.1 milestone Aug 12, 2023

whatsnew

8c4bfbc

jbrockmendel reviewed Aug 12, 2023

lukemanley mentioned this pull request Aug 16, 2023

ENH: enable setitem dim2 test to work for EA with complex128 dtype #54445

Open

3 tasks

mroeschke approved these changes Aug 23, 2023

mroeschke merged commit e8b9749 into pandas-dev:main Aug 23, 2023

meeseeksmachine mentioned this pull request Aug 23, 2023

Backport PR #54508 on branch 2.1.x (PERF: DataFrame.iloc[int] for EA dtypes) #54700

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 23, 2023

Backport PR pandas-dev#54508: PERF: DataFrame.iloc[int] for EA dtypes

4fa0d5f

mroeschke pushed a commit that referenced this pull request Aug 23, 2023

Backport PR #54508 on branch 2.1.x (PERF: DataFrame.iloc[int] for EA …

968b517

…dtypes) (#54700) Backport PR #54508: PERF: DataFrame.iloc[int] for EA dtypes Co-authored-by: Luke Manley <[email protected]>

lukemanley deleted the perf-fast-xs-ea branch September 6, 2023 00:54

lukemanley mentioned this pull request Sep 6, 2023

Add TODO note to BlockManager.fast_xs for EA dtypes #55039

Merged
