Skip to content

DOC: .loc doesn't document that it can take sets or dicts as arguments #42825

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Dr-Irv opened this issue Jul 30, 2021 · 19 comments · Fixed by #45052
Closed

DOC: .loc doesn't document that it can take sets or dicts as arguments #42825

Dr-Irv opened this issue Jul 30, 2021 · 19 comments · Fixed by #45052
Labels
Deprecate Functionality to remove in pandas Docs Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jul 30, 2021

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc

Documentation problem

Surprisingly, the following code works:

import pandas as pd

df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]], columns=["let", "num"]).set_index(
    "let"
)

dfl = df.loc[set(["a", "b"]), :]

print(dfl)

dd = {"a": 10, "c": 12}

dfd = df.loc[dd, "num"]

print(dfd)

In the documentation, we don't say that you can pass a set or a dict as one of the arguments to .loc .

So either we update the documentation, or maybe we should be raising an Exception?

@Dr-Irv Dr-Irv added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 30, 2021
@attack68
Copy link
Contributor

+1 raising and documenting. also restricting .loc to lists only and not sequence inputs: see #42329 (comment)

@Delengowski
Copy link
Contributor

I'm assuming for speed the implementation of .loc[] tries to iterate rather doing an isinstance check. This is unfortunate because isinstance(dict, typing.Sequence) == False as does isinstance(set, typing.Sequence) iirc.

@rhshadrach rhshadrach added the Indexing Related to indexing on series/frames, not to indexes themselves label Aug 3, 2021
@mroeschke mroeschke added Deprecate Functionality to remove in pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021
@phofl
Copy link
Member

phofl commented Dec 22, 2021

cc @jbrockmendel Should we decide for 1.4 here?

@jbrockmendel
Copy link
Member

it'd be nice to get in, yah

@phofl
Copy link
Member

phofl commented Dec 23, 2021

Will look into this the coming days

@nsfinkelstein
Copy link

nsfinkelstein commented Dec 1, 2022

Can I ask what was the reasoning behind deprecating the ability to use a set as an indexer? I've followed the discussion and links above, but cannot discern why the decision was made to deprecate the use of sets as indexers. In my experience, it's incredibly helpful to be able to use sets as indexers, and I'd be happy to contribute code towards un-deprecating this option if possible.

Looking through the code a bit, my understanding is that the index will be converted into a list in any case here: https://github.com/phofl/pandas/blob/f04ec539eaf4ab1d03c1654e6f90b32c6b56f078/pandas/core/frame.py#L3505-L3507.

@phofl
Copy link
Member

phofl commented Dec 1, 2022

-1 on undeprecating, sets only work in some cases, in others they naturally raise an error. iloc is one of those cases.

@nsfinkelstein
Copy link

nsfinkelstein commented Dec 1, 2022

Thanks for the response. I meant to ask about why it was deprecated in .loc[], not .iloc[].

Is the issue that if it is deprecated for one, it must be deprecated for both?

@phofl
Copy link
Member

phofl commented Dec 1, 2022

Loc is already to flexible, so we restricted this since it was not documented anyway

@nsfinkelstein
Copy link

I see, thanks. Would you be open to a pull request that adds documentation rather than alters behavior?

I have a lot of code that relies on this behavior.

@phofl
Copy link
Member

phofl commented Dec 1, 2022

Again, I am -1 on undeprecating

@nsfinkelstein
Copy link

Can I ask why?

With such an established and widely used library, I would think removing functionality that people rely on should only happen if there's a substantive reason it shouldn't be maintained. I'm trying to understand whether that's the case here. So far I can't see any reason for this functionality to have been removed.

@phofl
Copy link
Member

phofl commented Dec 1, 2022

Set has random order, e.g. the order of your DataFrame is random as well, which leads to all sorts of weird behavior, you will probably end up implicitly aligning your DataFrame in follow up operations, which costs performance and ist just counterintuitive. A set as an indexer is in general not a good idea

@nsfinkelstein
Copy link

nsfinkelstein commented Dec 1, 2022

I see, thanks for explaining your reasoning.

I disagree with your assessment for the following reasons. There are many cases in which dataframe order does not matter, and there are also many cases in which dataframe order should be determined by elements of the dataframe, such that it must be sorted after selection in any case.

If the indexer is already a set, as it often is, than in either of these cases, converting it to a list just to satisfy an apparently arbitrary requirement of pandas is inefficient in terms of performance, and also adds unnecessary code complexity.

If the reasoning for the deprectaion is just what's contained in the above comment, I'll go ahead and submit a pull request to un-deprecate the use of sets as indexers for .loc[]. Please let me know if there were additional reasons it was deprecated.

@attack68
Copy link
Contributor

attack68 commented Dec 1, 2022

I have been a supporter of formalising and reducing the number of potential input combinations to loc. It is a complex part of the code base especially when considering hierarchical indexes. I believe there are various scattered historical discussions around this.

You are welcome to submit to a PR but my instinct tells me it will be difficult for you to garner the support to get it approved.

Are there any alternatives you can use to set, i.e. is there a sensible workaround to your problems?

@phofl
Copy link
Member

phofl commented Dec 1, 2022

Loc is complicated enough, inconsistency with iloc is another reason. Behavior is just counterintuitive. It is literally not covered at all in our testing suite. No clue which corner cases work and which don’t. It was never documented, so people shouldn’t be relying on it anyway. We deprecated this a year ago and nobody complained till now.

@nsfinkelstein
Copy link

Thanks to you both for explaining more of your reasoning.

Set operations are very well suited to creating indexers in a wide variety of settings, and so I've used sets for that purpose extensively over the course of many years, both for hierarchical and non-hierarchical indices. I could go through quite a lot of code and wrap the indexer in all my calls to .loc[] in list(), before I update to the most recent version of pandas, in all of the various places the code is deployed. But again, I would be surprised at the need to do this with such a core part of a very established library.

My understanding is that this conversion to a list is exactly what is currently happening anyway in the following lines: https://github.com/phofl/pandas/blob/f04ec539eaf4ab1d03c1654e6f90b32c6b56f078/pandas/core/frame.py#L3505-L3507. Is this understanding incorrect?

After exploring the code a bit more, my understanding is that the issue with sets with hierarchical indices arise not when the indexer is a set, but when the indexer contains sets. I.e. if the collection of indices is a set, there is no problem -- it just gets turned into a list in the lines linked above. If the collection contains sets, then there's a problem because the level of the index that each element in the indexer refers to is ambiguous. Does this match your understanding of the issues you are referencing above?

My suggestion is that we just de-deprecate using sets as indexers, not using indexers that contain sets. I'd be happy to add the appropriate documentation and tests. Given that set indexers are just converted to list indexers, unless I'm mistaken, I'm curious to know what about the behavior you find counter-intuitive. Please let me know your thoughts.

@jbrockmendel
Copy link
Member

If updating your code in too many places is the issue, then consider patching Loc.__getitem__/__setitem__ to convert lists in just one place.

I agree with @phofl and @attack68; maintaining Loc is a beast and simplifying the behavior where feasible is a positive.

@nsfinkelstein
Copy link

nsfinkelstein commented Dec 2, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Docs Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants