
PERF: Slow performance of to_dict (#46470) #46487

Merged

Conversation

@RogerThomas (Contributor) commented Mar 23, 2022

Improves performance of the to_dict method for DataFrames and Series.

For orient=index and orient=list, performance has decreased, but this is because, prior to this PR, we were not coercing values to native Python types; we now are. (I assume we want this, but maybe not?)
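As a rough illustration of the coercion in question (a minimal sketch; it assumes the private helper pandas.core.dtypes.cast.maybe_box_native mentioned later in this thread, whose import path may vary across versions):

```python
import numpy as np
import pandas as pd
from pandas.core.dtypes.cast import maybe_box_native  # private pandas helper

# An object column can hold numpy scalars; .tolist() hands them back
# unchanged, so without an explicit boxing step the to_dict output may
# contain np.int64 / np.float64 rather than builtin int / float.
s = pd.Series([np.int64(1), "hello", np.float64(3.3)], dtype=object)
print([type(v) for v in s.tolist()])           # numpy scalar types survive
print([type(maybe_box_native(v)) for v in s])  # boxed to int / str / float
```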

N: 5,000,000

With object columns (3 int columns, 3 float columns, 3 string columns)
dict: old: 16.27s, new: 16.50s
list: old: 1.33s, new: 6.28s
split: old: 20.40s, new: 19.77s
records: old: 27.31s, new: 19.75s
index: old: 17.20s, new: 28.44s
tight: old: 20.95s, new: 18.88s

Without object columns (3 int columns, 3 float columns)
dict: old: 7.49s, new: 7.48s
list: old: 1.04s, new: 1.07s
split: old: 12.98s, new: 6.32s
records: old: 17.64s, new: 7.20s
index: old: 14.13s, new: 14.44s
tight: old: 12.87s, new: 6.62s
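
A rough sketch of how timings like these can be reproduced (not necessarily the script behind the numbers above; the column layout is an assumption, orient="tight" needs pandas >= 1.4, and a 5,000,000-row frame with nine columns needs several GB of RAM):

```python
import time

import numpy as np
import pandas as pd

N = 5_000_000
df = pd.DataFrame(
    {
        **{f"int_{i}": np.arange(N) for i in range(3)},                   # int columns
        **{f"float_{i}": np.random.rand(N) for i in range(3)},            # float columns
        **{f"str_{i}": np.repeat(["x", "y"], N // 2) for i in range(3)},  # object columns
    }
)

for orient in ["dict", "list", "split", "records", "index", "tight"]:
    start = time.perf_counter()
    df.to_dict(orient=orient)
    print(f"{orient}: {time.perf_counter() - start:.2f}s")
```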

@RogerThomas force-pushed the perf-gh-46470-slow-performance-of-to-dict branch from 397bc47 to 4f0f872 on March 23, 2022 21:15
@RogerThomas (Contributor Author)

@rhshadrach there are lots of failures on the builds that seem unrelated to my changes, am I missing something?

@RogerThomas force-pushed the perf-gh-46470-slow-performance-of-to-dict branch from 9aa150f to 70ed030 on March 25, 2022 08:32
@phofl (Member) left a comment

Could you please avoid doing the refactoring and the changes in parallel? This makes it really hard to review

@RogerThomas (Contributor Author)

@phofl yeah, fair enough, but as I was writing tests I realised that a few of the other orients weren't coercing to the correct dtypes, so I started to fix them in this PR while my head was in that space, and then the method started getting very long. But yeah, I take your point for sure.

@rhshadrach (Member)

Could you please avoid doing the refactoring and the changes in parallel? This makes it really hard to review

+1. @RogerThomas can you leave as one function for now and refactor in a follow up (or vice-versa)?

@RogerThomas (Contributor Author)

@rhshadrach @phofl, sure thing. Just to be sure before I start, do you want me to:

a) keep the code I added that handles correct dtype coercion for some orient types and just move the code back into the original method, or
b) just add the code that speeds up the records orient, move this code back into the original method, and leave the other orient types alone?

@rhshadrach (Member)

@RogerThomas - I think it's best to have separate PRs for performance and correctness. Best to get correctness first, then improve performance (in general).

@RogerThomas (Contributor Author)

@rhshadrach sorry, not sure I follow; the ticket this is supposed to close is purely a performance issue. Are you saying I should create a PR with the fix for the other orient types and then continue with this PR to address the slow performance of to_dict records?

@rhshadrach (Member)

@RogerThomas - yes, that's correct. If they are independent, then any order is okay. If they are not independent, then best practice is to get correct behavior and then work on performance.

@RogerThomas (Contributor Author)

Ok thanks @rhshadrach, do I need to create a separate GitHub issue for the PR to fix the other orient types?

@rhshadrach (Member)

@RogerThomas - I prefer to, but in general, no. E.g. you can specify the PR # in the whatsnew when doing a bugfix instead of an issue #.

@RogerThomas (Contributor Author)

Ok thanks @rhshadrach, I'll try to get to both in the next few days.

@rhshadrach added the IO (Data IO issues that don't fit into a more specific label) and Performance (Memory or execution speed performance) labels on Apr 22, 2022
@RogerThomas force-pushed the perf-gh-46470-slow-performance-of-to-dict branch from 70ed030 to 96ac6fa on April 22, 2022 11:59
@RogerThomas (Contributor Author)

@rhshadrach @phofl I've removed the helper function and updated the tight, records, split, list, and index orient types to only use maybe_box_native when necessary, leading to decent performance improvements. These can be seen in the table below, which gives the % speed-up for each orient type and number of object columns, on a 1,000,000-row DataFrame with 10 columns, averaged over 5 to_dict calls.

# Object Cols     dict     list    split  records    index    tight
0                1.367   29.082   76.557   61.491   48.133   45.578
1                0.013   38.321   66.922   34.831   43.742   38.486
2                0.175   35.425   56.186   31.715   38.070   30.966
3                0.036   31.251   47.548   27.564   20.055   24.693
4               -0.176   28.231   38.317   26.378   22.535   20.204
5               -0.132   26.137   32.546   23.783   18.309   12.240
6                2.231   23.064   22.454   19.934   12.076   10.850
7               -1.064   19.406   16.208   18.046   10.371    6.304
8               -1.623   17.361   11.560   16.953   -2.514    4.505
9               -0.752   13.337    5.260   12.818   -0.294   -1.968
10              -0.378    1.603   -0.339   -0.871    3.154   -1.243
mean            -0.028   23.929   33.929   24.786   19.421   17.329
min             -1.623    1.603   -0.339   -0.871   -2.514   -1.968
max              2.231   38.321   76.557   61.491   48.133   45.578

How to interpret this table:
For example, we see a 35.425% speed-up for a 1,000,000-row, 10-column DataFrame, averaged over 5 to_dict(orient="list") runs, where 2 of the columns are object dtype and the other 8 are float columns.
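
A sketch of the measurement loop behind a table like this one (the frame construction and timing harness below are assumptions; the percentages come from running the same loop against the old and new code and comparing the means):

```python
import time

import numpy as np
import pandas as pd

N_ROWS, N_COLS, N_RUNS = 1_000_000, 10, 5
orients = ["dict", "list", "split", "records", "index", "tight"]

mean_seconds = {}
for n_obj in range(N_COLS + 1):
    # n_obj object (string) columns, the remaining columns float
    data = {f"obj_{i}": np.repeat(["x", "y"], N_ROWS // 2) for i in range(n_obj)}
    data.update({f"flt_{i}": np.random.rand(N_ROWS) for i in range(n_obj, N_COLS)})
    df = pd.DataFrame(data)
    for orient in orients:
        times = []
        for _ in range(N_RUNS):
            start = time.perf_counter()
            df.to_dict(orient=orient)
            times.append(time.perf_counter() - start)
        mean_seconds[(n_obj, orient)] = sum(times) / N_RUNS

# % speed-up = 100 * (old_mean - new_mean) / old_mean, from two runs of this
# loop, one on the old code and one on this branch.
```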

@rhshadrach (Member) left a comment

Very nice perf gain! Some requests/questions below. I only made comments on the first case; similar remarks apply to the other ones as well. Can you also run the ASVs for this? From within asv_bench:

asv continuous -f 1.1 upstream/main HEAD -b ^frame_methods.ToDict
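
(Roughly, asv continuous benchmarks both upstream/main and the PR HEAD, -f 1.1 limits the report to benchmarks whose timing ratio changes by more than that factor, and -b restricts the run to benchmarks matching the given regular expression.)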

for t in self.itertuples(index=False, name=None)
]
else:
data = [list(t) for t in self.itertuples(index=False, name=None)]
@jreback (Contributor)

can you share code between any of these cases? e.g. make a helper function
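
A purely hypothetical illustration of the kind of shared helper being asked for (the name _iter_rows and the exact split are assumptions, not the refactor the PR adopted; maybe_box_native is pandas' private boxing helper):

```python
from typing import Any, Iterator

import pandas as pd
from pandas.core.dtypes.cast import maybe_box_native  # private pandas helper


def _iter_rows(df: pd.DataFrame, box_native: bool) -> Iterator[list[Any]]:
    # Yield each row as a list, boxing numpy scalars to builtin Python
    # types only when requested (e.g. when object columns are present).
    for row in df.itertuples(index=False, name=None):
        yield [maybe_box_native(v) for v in row] if box_native else list(row)


# The "records" branch could then zip each row with the column names into a
# dict, while the "list"/"split" branches could consume the rows directly.
```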

@RogerThomas (Contributor Author)

@jreback done

{
"a": [1, "hello", 3],
"b": [1.1, "world", 3.3],
},
@jreback (Contributor)

this hits all of the newly added code?

@RogerThomas (Contributor Author)

@jreback would you have time to discuss the above?

@RogerThomas (Contributor Author)

@jreback @rhshadrach any chance we could pick this up? The project I'm working on could really do with the optimisations.

@rhshadrach (Member)

@RogerThomas - certainly; can you resolve the conflict and I will take a look.

@RogerThomas (Contributor Author)

@rhshadrach done

@rhshadrach (Member) left a comment

Thanks @RogerThomas - from what I can tell the only outstanding issues are (a) potentially moving the entire to_dict implementation to pandas.io and (b) tests. For (a), I think this is a good idea and would make for a good followup if you'd like to tackle it @RogerThomas, otherwise I plan to.

For the tests, I'm seeing ~350 tests that hit the to_dict of either DataFrame or Series. Some of the more explicit tests of to_dict:

test_to_dict_timestamp
test_to_dict_index_not_unique_with_index_orient
test_to_dict_invalid_orient
test_to_dict_short_orient_raises
test_to_dict
test_to_dict_errors
test_to_dict_not_unique_warning
test_to_dict_box_scalars
test_to_dict_tz
test_to_dict_index_dtypes
test_to_dict_numeric_names
test_to_dict_wide
test_to_dict_orient_dtype
test_to_dict_scalar_constructor_orient_dtype
test_to_dict_mixed_numeric_frame
test_to_dict_orient_tight
test_to_dict_returns_native_types
test_to_dict_index_false_error
test_to_dict_index_false

Looking through these, it appears to me the addition of testing object dtype gives good coverage of the code here.

I plan to run the to_dict ASVs on here this evening as a last step in approving, but otherwise looks good to me.

cc @jreback

@rhshadrach (Member)

ASVs look great!

       before           after         ratio
     [a7da45d8]       [aa9863d0]
-      28.2±0.4ms       20.8±0.5ms     0.74  frame_methods.ToDict.time_to_dict_datetimelike('index')
-     20.0±0.03ms      14.5±0.03ms     0.72  frame_methods.ToDict.time_to_dict_datetimelike('list')
-      23.8±0.6ms      17.1±0.09ms     0.72  frame_methods.ToDict.time_to_dict_datetimelike('dict')
-     11.8±0.06ms       8.24±0.2ms     0.70  frame_methods.ToDict.time_to_dict_ints('index')
-      23.1±0.7ms       15.0±0.4ms     0.65  frame_methods.ToDict.time_to_dict_datetimelike('split')
-      28.5±0.5ms       17.5±0.3ms     0.61  frame_methods.ToDict.time_to_dict_datetimelike('records')
-     7.98±0.01ms      4.37±0.03ms     0.55  frame_methods.ToDict.time_to_dict_ints('dict')
-      7.05±0.1ms      2.55±0.04ms     0.36  frame_methods.ToDict.time_to_dict_ints('split')
-      12.9±0.1ms       4.51±0.1ms     0.35  frame_methods.ToDict.time_to_dict_ints('records')
-     3.74±0.04ms        552±0.9μs     0.15  frame_methods.ToDict.time_to_dict_ints('list')
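
For readers unfamiliar with the benchmark names above, an ASV benchmark class in this style looks roughly like the sketch below (data shapes and parameter lists are illustrative, not copied from pandas' asv_bench suite):

```python
import numpy as np
import pandas as pd


class ToDict:
    # asv calls every time_* method once per parameter value
    params = ["dict", "list", "split", "records", "index"]
    param_names = ["orient"]

    def setup(self, orient):
        self.int_df = pd.DataFrame(np.random.randint(0, 100, size=(10_000, 10)))
        self.datetimelike_df = pd.DataFrame(
            {
                "ts": pd.date_range("2000-01-01", periods=10_000, freq="s"),
                "td": pd.timedelta_range("1s", periods=10_000, freq="s"),
            }
        )

    def time_to_dict_ints(self, orient):
        self.int_df.to_dict(orient=orient)

    def time_to_dict_datetimelike(self, orient):
        self.datetimelike_df.to_dict(orient=orient)
```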

@RogerThomas (Contributor Author)

Thanks @rhshadrach!! Just to be clear then, is the last blocker moving the entire to_dict method to pandas.io?

@jreback (Contributor) commented Nov 8, 2022

moving to pandas.io is a follow up

are there sufficient tests here?

@rhshadrach (Member)

are there sufficient tests here?

As far as I can tell, yes. I commented on this in #46487 (review). Are you looking for something more specific?

@RogerThomas (Contributor Author)

@jreback @rhshadrach the tests I added here, I believe, give us decent coverage across a range of dtypes

@rhshadrach (Member)

@jreback - friendly ping.

@RogerThomas (Contributor Author)

@rhshadrach is there anything I can do to help speed this up? I'd really like to get it in and am afraid it's going to go stale and fall by the wayside.

@rhshadrach (Member)

@jreback - friendly ping.

@jreback (Contributor) left a comment

looks fine
need to move the note
also prob good idea to factor this code outside of frame.py in a follow up
@rhshadrach pls merge when good by you

@@ -960,6 +960,7 @@ Performance improvements
- Performance improvement when setting values in a pyarrow backed string array (:issue:`46400`)
- Performance improvement in :func:`factorize` (:issue:`46109`)
- Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`)
- Performance improvement in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when using any non-object dtypes (:issue:`46470`)
@jreback (Contributor)

need to move the note to 2.0

@RogerThomas (Contributor Author)

Thanks @jreback, I've moved the whatsnew entry to 2.0.0; let me know if there's anything else.

@rhshadrach (Member) left a comment

lgtm

@rhshadrach (Member)

Thanks @RogerThomas - great work! I've opened #49845 as a followup; would you have any interest in tackling this?

@RogerThomas (Contributor Author)

Thanks @rhshadrach, for sure, I'll do that

Labels: IO (Data IO issues that don't fit into a more specific label), Performance (Memory or execution speed performance), Stale

Successfully merging this pull request may close these issues: PERF: Slow performance of to_dict("records")

5 participants