
PERF: Slow performance of to_dict (#46470) #46487

Merged

Conversation

@RogerThomas (Contributor) commented Mar 23, 2022

Improves performance of the to_dict method for DataFrames and Series.

For orient=index and orient=list, performance has decreased, but this is because, prior to this PR, we were not coercing values to native Python types; we now are. (I assume we want this, but maybe not?)
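As a rough illustration of the coercion in question (a minimal sketch; it assumes the private helper pandas.core.dtypes.cast.maybe_box_native mentioned later in this thread, whose import path may vary across versions):

```python
import numpy as np
import pandas as pd
from pandas.core.dtypes.cast import maybe_box_native  # private pandas helper

# An object column can hold numpy scalars; .tolist() hands them back
# unchanged, so without an explicit boxing step the to_dict output may
# contain np.int64 / np.float64 rather than builtin int / float.
s = pd.Series([np.int64(1), "hello", np.float64(3.3)], dtype=object)
print([type(v) for v in s.tolist()])           # numpy scalar types survive
print([type(maybe_box_native(v)) for v in s])  # boxed to int / str / float
```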

N: 5,000,000

With object columns (3 int columns, 3 float columns, 3 string columns)
dict: old: 16.27s, new: 16.50s
list: old: 1.33s, new: 6.28s
split: old: 20.40s, new: 19.77s
records: old: 27.31s, new: 19.75s
index: old: 17.20s, new: 28.44s
tight: old: 20.95s, new: 18.88s

Without object columns (3 int columns, 3 float columns)
dict: old: 7.49s, new: 7.48s
list: old: 1.04s, new: 1.07s
split: old: 12.98s, new: 6.32s
records: old: 17.64s, new: 7.20s
index: old: 14.13s, new: 14.44s
tight: old: 12.87s, new: 6.62s
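
A rough sketch of how timings like these can be reproduced (not necessarily the script behind the numbers above; the column layout is an assumption, orient="tight" needs pandas >= 1.4, and a 5,000,000-row frame with nine columns needs several GB of RAM):

```python
import time

import numpy as np
import pandas as pd

N = 5_000_000
df = pd.DataFrame(
    {
        **{f"int_{i}": np.arange(N) for i in range(3)},                   # int columns
        **{f"float_{i}": np.random.rand(N) for i in range(3)},            # float columns
        **{f"str_{i}": np.repeat(["x", "y"], N // 2) for i in range(3)},  # object columns
    }
)

for orient in ["dict", "list", "split", "records", "index", "tight"]:
    start = time.perf_counter()
    df.to_dict(orient=orient)
    print(f"{orient}: {time.perf_counter() - start:.2f}s")
```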

@RogerThomas force-pushed the perf-gh-46470-slow-performance-of-to-dict branch from 397bc47 to 4f0f872 on March 23, 2022 21:15
@RogerThomas (Contributor Author)

@rhshadrach there are lots of failures on the builds that seem unrelated to my changes, am I missing something?

@RogerThomas force-pushed the perf-gh-46470-slow-performance-of-to-dict branch from 9aa150f to 70ed030 on March 25, 2022 08:32
@phofl (Member) left a comment

Could you please avoid doing the refactoring and the changes in parallel? This makes it really hard to review

@RogerThomas (Contributor Author)

@phofl yeah, fair enough, but as I was writing tests I realised that a few of the other orients weren't coercing to the correct dtypes, so I started to fix them in this PR while my head was in that space, and then the method started getting very long. But yeah, I take your point for sure.

@rhshadrach (Member)

Could you please avoid doing the refactoring and the changes in parallel? This makes it really hard to review

+1. @RogerThomas can you leave as one function for now and refactor in a follow up (or vice-versa)?

@RogerThomas (Contributor Author)

@rhshadrach @phofl, sure thing. Just to be sure before I start, do you want me to:

a) keep the code I added that handles correct dtype coercion for some orient types and just move the code back into the original method, or
b) just add the code that speeds up the records orient, move this code back into the original method, and leave the other orient types alone?

@rhshadrach (Member)

@RogerThomas - I think it's best to have separate PRs for performance and correctness. Best to get correctness first, then improve performance (in general).

@RogerThomas (Contributor Author)

@rhshadrach sorry, not sure I follow; the ticket this is supposed to close is purely a performance issue. Are you saying I should create a PR with the fix for the other orient types and then continue with this PR to address the slow performance of to_dict records?

@rhshadrach (Member)

@RogerThomas - yes, that's correct. If they are independent, then any order is okay. If they are not independent, then best practice is to get correct behavior and then work on performance.

@RogerThomas (Contributor Author)

Ok thanks @rhshadrach, do I need to create a separate GitHub issue for the PR to fix the other orient types?

@rhshadrach (Member)

@RogerThomas - I prefer to, but in general, no. E.g. you can specify the PR # in the whatsnew when doing a bugfix instead of an issue #.

@RogerThomas (Contributor Author)

Ok thanks @rhshadrach, I'll try to get to both in the next few days.

@rhshadrach added the IO (Data IO issues that don't fit into a more specific label) and Performance (Memory or execution speed performance) labels on Apr 22, 2022
@RogerThomas force-pushed the perf-gh-46470-slow-performance-of-to-dict branch from 70ed030 to 96ac6fa on April 22, 2022 11:59
@RogerThomas (Contributor Author)

@rhshadrach @phofl I've removed the helper function and updated the tight, records, split, list, and index orient types to only use maybe_box_native when necessary, leading to decent performance improvements. These can be seen in the table below, which gives the % speed-up for each orient type and number of object columns, on a 1,000,000-row DataFrame with 10 columns, averaged over 5 to_dict calls.

# Object Cols     dict     list    split  records    index    tight
0                1.367   29.082   76.557   61.491   48.133   45.578
1                0.013   38.321   66.922   34.831   43.742   38.486
2                0.175   35.425   56.186   31.715   38.070   30.966
3                0.036   31.251   47.548   27.564   20.055   24.693
4               -0.176   28.231   38.317   26.378   22.535   20.204
5               -0.132   26.137   32.546   23.783   18.309   12.240
6                2.231   23.064   22.454   19.934   12.076   10.850
7               -1.064   19.406   16.208   18.046   10.371    6.304
8               -1.623   17.361   11.560   16.953   -2.514    4.505
9               -0.752   13.337    5.260   12.818   -0.294   -1.968
10              -0.378    1.603   -0.339   -0.871    3.154   -1.243
mean            -0.028   23.929   33.929   24.786   19.421   17.329
min             -1.623    1.603   -0.339   -0.871   -2.514   -1.968
max              2.231   38.321   76.557   61.491   48.133   45.578

How to interpret this table:
For example, we see a 35.425% speed-up for a 1,000,000-row, 10-column DataFrame, averaged over 5 to_dict(orient="list") runs, where 2 of the columns are object dtype and the other 8 are float columns.
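
A sketch of the measurement loop behind a table like this one (the frame construction and timing harness below are assumptions; the percentages come from running the same loop against the old and new code and comparing the means):

```python
import time

import numpy as np
import pandas as pd

N_ROWS, N_COLS, N_RUNS = 1_000_000, 10, 5
orients = ["dict", "list", "split", "records", "index", "tight"]

mean_seconds = {}
for n_obj in range(N_COLS + 1):
    # n_obj object (string) columns, the remaining columns float
    data = {f"obj_{i}": np.repeat(["x", "y"], N_ROWS // 2) for i in range(n_obj)}
    data.update({f"flt_{i}": np.random.rand(N_ROWS) for i in range(n_obj, N_COLS)})
    df = pd.DataFrame(data)
    for orient in orients:
        times = []
        for _ in range(N_RUNS):
            start = time.perf_counter()
            df.to_dict(orient=orient)
            times.append(time.perf_counter() - start)
        mean_seconds[(n_obj, orient)] = sum(times) / N_RUNS

# % speed-up = 100 * (old_mean - new_mean) / old_mean, from two runs of this
# loop, one on the old code and one on this branch.
```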

@rhshadrach (Member) left a comment

Very nice perf gain! Some requests/questions below. I only made comments on the first case; similar remarks apply to the other ones as well. Can you also run the ASVs for this? From within asv_bench:

asv continuous -f 1.1 upstream/main HEAD -b ^frame_methods.ToDict
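
(Roughly, asv continuous benchmarks both upstream/main and the PR HEAD, -f 1.1 limits the report to benchmarks whose timing ratio changes by more than that factor, and -b restricts the run to benchmarks matching the given regular expression.)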

for t in self.itertuples(index=False, name=None)
]
else:
data = [list(t) for t in self.itertuples(index=False, name=None)]
@jreback (Contributor)

can you share code between any of these cases? e.g. make a helper function
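
A purely hypothetical illustration of the kind of shared helper being asked for (the name _iter_rows and the exact split are assumptions, not the refactor the PR adopted; maybe_box_native is pandas' private boxing helper):

```python
from typing import Any, Iterator

import pandas as pd
from pandas.core.dtypes.cast import maybe_box_native  # private pandas helper


def _iter_rows(df: pd.DataFrame, box_native: bool) -> Iterator[list[Any]]:
    # Yield each row as a list, boxing numpy scalars to builtin Python
    # types only when requested (e.g. when object columns are present).
    for row in df.itertuples(index=False, name=None):
        yield [maybe_box_native(v) for v in row] if box_native else list(row)


# The "records" branch could then zip each row with the column names into a
# dict, while the "list"/"split" branches could consume the rows directly.
```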

@RogerThomas (Contributor Author)

@jreback done

{
"a": [1, "hello", 3],
"b": [1.1, "world", 3.3],
},
@jreback (Contributor)

this hits all of the newly added code?

@RogerThomas (Contributor Author)

@jreback would you have time to discuss the above?

@RogerThomas (Contributor Author)

@jreback @rhshadrach any chance we could pick this up? The project I'm working on could really do with the optimisations.

@rhshadrach (Member)

@RogerThomas - certainly; can you resolve the conflict and I will take a look.

@RogerThomas (Contributor Author)

@rhshadrach done

@rhshadrach (Member) left a comment

Thanks @RogerThomas - from what I can tell the only outstanding issues are (a) potentially moving the entire to_dict implementation to pandas.io and (b) tests. For (a), I think this is a good idea and would make for a good followup if you'd like to tackle it @RogerThomas, otherwise I plan to.

For the tests, I'm seeing ~350 tests that hit the to_dict of either DataFrame or Series. Some of the more explicit tests of to_dict:

test_to_dict_timestamp
test_to_dict_index_not_unique_with_index_orient
test_to_dict_invalid_orient
test_to_dict_short_orient_raises
test_to_dict
test_to_dict_errors
test_to_dict_not_unique_warning
test_to_dict_box_scalars
test_to_dict_tz
test_to_dict_index_dtypes
test_to_dict_numeric_names
test_to_dict_wide
test_to_dict_orient_dtype
test_to_dict_scalar_constructor_orient_dtype
test_to_dict_mixed_numeric_frame
test_to_dict_orient_tight
test_to_dict_returns_native_types
test_to_dict_index_false_error
test_to_dict_index_false

Looking through these, it appears to me the addition of testing object dtype gives good coverage of the code here.

I plan to run the to_dict ASVs on here this evening as a last step in approving, but otherwise looks good to me.

cc @jreback

@rhshadrach (Member)

ASVs look great!

       before           after         ratio
     [a7da45d8]       [aa9863d0]
-      28.2±0.4ms       20.8±0.5ms     0.74  frame_methods.ToDict.time_to_dict_datetimelike('index')
-     20.0±0.03ms      14.5±0.03ms     0.72  frame_methods.ToDict.time_to_dict_datetimelike('list')
-      23.8±0.6ms      17.1±0.09ms     0.72  frame_methods.ToDict.time_to_dict_datetimelike('dict')
-     11.8±0.06ms       8.24±0.2ms     0.70  frame_methods.ToDict.time_to_dict_ints('index')
-      23.1±0.7ms       15.0±0.4ms     0.65  frame_methods.ToDict.time_to_dict_datetimelike('split')
-      28.5±0.5ms       17.5±0.3ms     0.61  frame_methods.ToDict.time_to_dict_datetimelike('records')
-     7.98±0.01ms      4.37±0.03ms     0.55  frame_methods.ToDict.time_to_dict_ints('dict')
-      7.05±0.1ms      2.55±0.04ms     0.36  frame_methods.ToDict.time_to_dict_ints('split')
-      12.9±0.1ms       4.51±0.1ms     0.35  frame_methods.ToDict.time_to_dict_ints('records')
-     3.74±0.04ms        552±0.9μs     0.15  frame_methods.ToDict.time_to_dict_ints('list')
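
For readers unfamiliar with the benchmark names above, an ASV benchmark class in this style looks roughly like the sketch below (data shapes and parameter lists are illustrative, not copied from pandas' asv_bench suite):

```python
import numpy as np
import pandas as pd


class ToDict:
    # asv calls every time_* method once per parameter value
    params = ["dict", "list", "split", "records", "index"]
    param_names = ["orient"]

    def setup(self, orient):
        self.int_df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 10)))
        self.datetimelike_df = pd.DataFrame(
            {
                "ts": pd.date_range("2000-01-01", periods=10_000, freq="s"),
                "td": pd.timedelta_range("1s", periods=10_000, freq="s"),
            }
        )

    def time_to_dict_ints(self, orient):
        self.int_df.to_dict(orient=orient)

    def time_to_dict_datetimelike(self, orient):
        self.datetimelike_df.to_dict(orient=orient)
```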

@RogerThomas (Contributor Author)

Thanks @rhshadrach!! Just to be clear then, is the last blocker moving the entire to_dict method to pandas.io?

@jreback (Contributor) commented Nov 8, 2022

moving to pandas.io is a follow up

are there sufficient tests here?

@rhshadrach (Member)

are there sufficient tests here?

As far as I can tell, yes. I commented on this in #46487 (review). Are you looking for something more specific?

@RogerThomas (Contributor Author)

@jreback @rhshadrach the tests I added here, I believe, give us decent coverage across a range of dtypes

@rhshadrach (Member)

@jreback - friendly ping.

@RogerThomas (Contributor Author)

@rhshadrach is there anything I can do to help speed this up? I'd really like to get it in and am afraid it's going to go stale and fall by the wayside.

@rhshadrach (Member)

@jreback - friendly ping.

@jreback (Contributor) left a comment

looks fine
need to move the note
also prob good idea to factor this code outside of frame.py in a follow up
@rhshadrach pls merge when good by you

@@ -960,6 +960,7 @@ Performance improvements
- Performance improvement when setting values in a pyarrow backed string array (:issue:`46400`)
- Performance improvement in :func:`factorize` (:issue:`46109`)
- Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`)
- Performance improvement in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when using any non-object dtypes (:issue:`46470`)
@jreback (Contributor)

need to move the note to 2.0

@RogerThomas (Contributor Author)

Thanks @jreback, I've moved the whatsnew entry to 2.0.0; let me know if there's anything else.

@rhshadrach (Member) left a comment

lgtm

@rhshadrach (Member)

Thanks @RogerThomas - great work! I've opened #49845 as a followup; would you have any interest in tackling this?

@RogerThomas (Contributor Author)

Thanks @rhshadrach, for sure, I'll do that

Labels: IO (Data IO issues that don't fit into a more specific label), Performance (Memory or execution speed performance), Stale

Successfully merging this pull request may close these issues: PERF: Slow performance of to_dict("records")

5 participants