-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: Slow performance of to_dict (#46470) #46487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Slow performance of to_dict (#46470) #46487
Conversation
397bc47
to
4f0f872
Compare
@rhshadrach there are lots of failures on the builds that seem unrelated to my changes, am I missing something? |
9aa150f
to
70ed030
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please avoid doing the refactoring and the changes in parallel? This makes it really hard to review
@phofl yeah very enough, but as I was writing tests i realised that a few of the other |
+1. @RogerThomas can you leave as one function for now and refactor in a follow up (or vice-versa)? |
@rhshadrach @phofl, sure thing. Just to be sure before I start, do you want me to; a) Keep the code that I added that handles correct dtype coercing for some orient types and just move the code back into the original method |
@RogerThomas - I think it's best to have separate PRs for performance and correctness. Best to get correctness first, then improve performance (in general). |
@rhshadrach sorry not sure i follow, the ticket that this is supposed to close is purely a performance issue. Are you saying i should create a pr with the fix for the other orient types and then continue with this pr to address the slow performance of to_dict records? |
@RogerThomas - yes, that's correct. If they are independent, then any order is okay. If they are not independent, then best practice is to get correct behavior and then work on performance. |
Ok thanks @rhshadrach, do i need to create a separate github issue for the pr to fix the other orient types? |
@RogerThomas - I prefer to, but in general, no. E.g. you can specify the PR # in the whatsnew when doing a bugfix instead of an issue #. |
Ok thanks @rhshadrach, I'll try and get to both in the next few days |
70ed030
to
96ac6fa
Compare
@rhshadrach @phofl i've removed the helper function, i've updated
How to interpret this table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice perf gain! Some requests/questions below. I only make comments on the first case, similar remarks apply to other ones as well. Can you also run asvs for this. From within asv_bench:
asv continuous -f 1.1 upstream/main HEAD -b ^frame_methods.ToDict
pandas/core/frame.py
Outdated
for t in self.itertuples(index=False, name=None) | ||
] | ||
else: | ||
data = [list(t) for t in self.itertuples(index=False, name=None)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share code between any of these cases? e.g. make a helper function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback done
{ | ||
"a": [1, "hello", 3], | ||
"b": [1.1, "world", 3.3], | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this hits all of the new added code?
@jreback would you have time to discuss the above? |
@jreback @rhshadrach any chance we could pick this up, the project I'm working on could really do with the optimisations |
@RogerThomas - certainly; can you resolve conflict and will look. |
…andas-devgh-46470-slow-performance-of-to-dict
@rhshadrach done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RogerThomas - from what I can tell the only outstanding issues are (a) potentially moving the entire to_dict
implementation to pandas.io
and (b) tests. For (a), I think this is a good idea and would make for a good followup if you'd like to tackle it @RogerThomas, otherwise I plan to.
For the tests, I'm seeing ~350 tests that hit the to_dict
of either DataFrame or Series. Some of the more explicit tests of to_dict:
test_to_dict_timestamp
test_to_dict_index_not_unique_with_index_orient
test_to_dict_invalid_orient
test_to_dict_short_orient_raises
test_to_dict
test_to_dict_errors
test_to_dict_not_unique_warning
test_to_dict_box_scalars
test_to_dict_tz
test_to_dict_index_dtypes
test_to_dict_numeric_names
test_to_dict_wide
test_to_dict_orient_dtype
test_to_dict_scalar_constructor_orient_dtype
test_to_dict_mixed_numeric_frame
test_to_dict_orient_tight
test_to_dict_returns_native_types
test_to_dict_index_false_error
test_to_dict_index_false
Looking through these, it appears to me the addition of testing object
dtype gives good coverage of the code here.
I plan to run the to_dict
ASVs on here this evening as a last step in approving, but otherwise looks good to me.
cc @jreback
ASVs look great!
|
Thanks @rhshadrach!! Just to be clear then, is the last blocker moving the entire to_dict method to pandas.io? |
moving to pandas.in is a follow up are there sufficient tests here? |
As far as I can tell, yes. I commented on this in #46487 (review). Are you looking for something more specific? |
@jreback @rhshadrach the tests I added here, I believe, give us decent coverage across a range of dtypes |
@jreback - friendly ping. |
@rhshadrach anything I can do to help speed this up, I'd really like to get it in and am afraid it's going to go stale and fall by the wayside. |
@jreback - friendly ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine
need to move the note
also prob good idea to factor this code outside of frame.py in a follow up
@rhshadrach pls merge when good by you
doc/source/whatsnew/v1.5.0.rst
Outdated
@@ -960,6 +960,7 @@ Performance improvements | |||
- Performance improvement when setting values in a pyarrow backed string array (:issue:`46400`) | |||
- Performance improvement in :func:`factorize` (:issue:`46109`) | |||
- Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`) | |||
- Performance improvement in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when using any non-object dtypes (:issue:`46470`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to move the note to 2.0
…andas-devgh-46470-slow-performance-of-to-dict
Thanks @jreback I've moved the whatsnew doc to 2.0.0, let me know if there's anything else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @RogerThomas - great work! I've opened #49845 as a followup, would you have any interest tackling this? |
Thanks @rhshadrach, for sure, I'll do that |
Co-authored-by: Roger Thomas <[email protected]> Co-authored-by: RogerThomas <[email protected]>
Improves performance of
to_dict
method for dataframes and seriesFor
orient=index
andorient=list
performance has decreased but this is because, prior to this PR, we were not coercing types to native python types, we now are. (I assume we want this, but maybe not?)to_dict("records")
#46470 (Replace xxxx with the Github issue number)doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.