ENH: Making Pandas dataframe slicing syntax match R dataframe syntax. #35659

Closed
jgbradley1 opened this issue Aug 10, 2020 · 3 comments
Labels
Enhancement, Indexing (related to indexing on series/frames, not to indexes themselves)

Comments

@jgbradley1

jgbradley1 commented Aug 10, 2020

I imagine the following request is popular enough that it has probably been asked before. I just couldn't find it. Feel free to close if the discussion has already occurred.

Consider the following R example:

# R code
> df <- data.frame(A=c(0:4), B=c("str0","str1","str2","str3","str4"), C=(1:5))
> df
  A    B C
1 0 str0 1
2 1 str1 2
3 2 str2 3
4 3 str3 4
5 4 str4 5

The equivalent Pandas syntax would be:

# python code
> import pandas as pd
> df = pd.DataFrame({"A": range(0, 5), "B": ["str" + str(x) for x in range(0, 5)], "C": range(1, 6)})
> df
   A     B  C
0  0  str0  1
1  1  str1  2
2  2  str2  3
3  3  str3  4
4  4  str4  5

Now, when slicing an R dataframe, the syntax allows direct access with the brackets operator:

# R code
> df[1,]
  A    B C
1 0 str0 1

The pandas equivalent would be:

# python code
> df.iloc[0,]
A       0
B    str0
C       1
Name: 0, dtype: object
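
One difference is visible in the two outputs above: R's df[1,] returns a one-row data frame, while df.iloc[0,] returns a Series. A minimal sketch of how to keep the DataFrame shape in pandas, by passing a list of positions instead of a scalar (the variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"A": range(0, 5), "B": ["str" + str(x) for x in range(0, 5)], "C": range(1, 6)})

row_as_series = df.iloc[0]    # scalar position: the row comes back as a Series
row_as_frame = df.iloc[[0]]   # list of positions: the row stays a one-row DataFrame

print(type(row_as_series).__name__)  # Series
print(row_as_frame.shape)            # (1, 3)
```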

Disregarding the 0-index vs 1-index difference between the languages, I would like to propose adding a new slicing operation to the Pandas dataframe __getitem__() that overloads the brackets operator and matches the behavior of R-style syntax. It would allow for cleaner and less verbose code (especially when chaining multiple slice operations).

API breaking implications

The API change would be a positive addition. No current slicing operations (iloc, loc, etc.) would be removed.

Additional context

The following slicing examples will error out in Pandas, whereas they would be valid slicing in R:

# python code
> df[0]
> df[0, ]
> df[0,0]
> df[0, 1:2]

Note that all of the above commands succeed when called on iloc. What I'm suggesting is that the logic for iloc be copied or moved into the Pandas dataframe __getitem__() function.
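
For reference, a small sketch of the same four keys passed through iloc, using the example frame from this issue. The variable names are illustrative, and only the df[0] failure is checked, since the exact exception for the other bracket forms may vary between versions:

```python
import pandas as pd

df = pd.DataFrame({"A": range(0, 5), "B": ["str" + str(x) for x in range(0, 5)], "C": range(1, 6)})

first_row = df.iloc[0]        # first row as a Series
first_row_too = df.iloc[0, ]  # trailing comma: same row
cell = df.iloc[0, 0]          # scalar at row 0, column 0
part = df.iloc[0, 1:2]        # row 0, column position 1 only (column "B")

# Plain brackets treat 0 as a column label, which this frame does not have.
try:
    df[0]
    outcome = "no error"
except KeyError:
    outcome = "KeyError"

print(cell, list(part.index), outcome)
```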

jgbradley1 added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Aug 10, 2020
@jreback
Contributor

jreback commented Aug 10, 2020

@jgbradley1 this is impossible to actually do (as would break the world), nor likely wanted.

__getitem__ is currently a multi-beast that attempts to figure out what you want. See #9595 for extensive discussions.
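
To illustrate the kind of dispatch described here, a short sketch of how the type of the key changes what [] does on the example frame from this issue (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": range(0, 5), "B": ["str" + str(x) for x in range(0, 5)], "C": range(1, 6)})

col = df["A"]               # string key: selects a column
rows = df[1:3]              # slice key: selects rows (here the rows at positions 1 and 2)
filtered = df[df["C"] > 3]  # boolean Series key: filters rows

print(list(col))             # [0, 1, 2, 3, 4]
print(list(rows.index))      # [1, 2]
print(list(filtered.index))  # [3, 4]
```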

We already have .iloc for pure positional indexing and .loc for pure label based indexing. These are extensively documented and should be clear.
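
A brief sketch of the two accessors on the example frame from this issue. Because the default index labels are 0 to 4, the slice end point is where the difference shows: iloc excludes it, loc includes it.

```python
import pandas as pd

df = pd.DataFrame({"A": range(0, 5), "B": ["str" + str(x) for x in range(0, 5)], "C": range(1, 6)})

by_position = df.iloc[0:2]  # positions 0 and 1 (end exclusive)
by_label = df.loc[0:2]      # labels 0, 1 and 2 (end inclusive)

cell_by_position = df.iloc[0, 1]  # row position 0, column position 1
cell_by_label = df.loc[0, "B"]    # row label 0, column label "B"

print(len(by_position), len(by_label))  # 2 3
print(cell_by_position, cell_by_label)  # str0 str0
```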

[] is most effective for column selection & boolean indexing; any other uses should be deprecated (if you want to contribute to that, it would certainly be great).

jreback closed this as completed on Aug 10, 2020
jreback added the Indexing label and removed the Needs Triage label on Aug 10, 2020
jreback added this to the No action milestone on Aug 10, 2020
@jgbradley1
Author

@jgbradley1 this is impossible to actually do (as would break the world), nor likely wanted.

@jreback Thanks for the quick follow-up. I'm not sure what you mean. If you're referring to the syntax, I wrote a very basic __getitem__ example that accepts the examples I wrote before.

class PseudoDataFrame:
    """Toy class that only reports how a bracket key would be interpreted."""

    def __getitem__(self, pos):
        if isinstance(pos, (int, str, slice)):
            # single argument passed with no comma, e.g. m[1], m[1:2]
            print("column selection (1 argument) - fetching %s" % (pos,))
        elif len(pos) == 1:
            # single argument passed with a comma, e.g. m['A',] arrives as a 1-tuple
            print("column selection (empty 2nd argument) - fetching %s" % (pos[0],))
        elif len(pos) == 2:
            # two arguments passed, e.g. m[0, 0] arrives as a 2-tuple
            x, y = pos
            print("col and row selection - fetching %s, %s" % (x, y))
        else:
            print("ERROR - selection syntax not recognized")
        return None

m = PseudoDataFrame()
m[1]
m['A',]
m[1:2]
m[0, 0]
m[0, 0:2]
m['A':'C', 0:2]

From that example, it wouldn't be too difficult to extend it to allow mixed use of integer-based and label-based indexing all within the bracket notation, thus not forcing users to pick between .iloc and .loc.
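
As a rough sketch of how far this can go without touching DataFrame.__getitem__, a thin wrapper can forward the bracket key straight to .iloc. PositionalView is a hypothetical name and not part of pandas, and this version handles positional keys only, not the mixed integer and label case:

```python
import pandas as pd


class PositionalView:
    """Hypothetical helper (not a pandas API): bracket access that delegates to .iloc."""

    def __init__(self, frame):
        self._frame = frame

    def __getitem__(self, key):
        # A key such as [0, 1:2] arrives as the tuple (0, slice(1, 2)),
        # which .iloc already understands as (rows, columns).
        return self._frame.iloc[key]


df = pd.DataFrame({"A": range(0, 5), "B": ["str" + str(x) for x in range(0, 5)], "C": range(1, 6)})
view = PositionalView(df)

cell = view[0, 0]      # scalar at row 0, column 0
part = view[0, 1:2]    # row 0, column position 1
block = view[1:3]      # rows at positions 1 and 2

print(cell, list(part.index), block.shape)
```

Keeping this in a separate object leaves the existing meaning of df[...] untouched.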

Thanks for linking to the other issue. Seems like the discussion has been going for a long time but I did want to post a working example all the same to show it's not impossible.

@jreback
Contributor

jreback commented Aug 11, 2020

@jgbradley1 it's certainly possible, and we used to have the .ix syntax to do exactly this. It's a bad idea and was removed several versions ago after a long deprecation period. Differentiating between positional and label-based indexing in the real world is hard.
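
A small sketch of where the ambiguity comes from, assuming a frame whose index labels are integers in a different order than their positions (the frame and names here are made up for illustration). The same key 0 refers to different rows depending on whether it is read as a label or as a position:

```python
import pandas as pd

# Integer labels that do not match row positions.
frame = pd.DataFrame({"value": ["a", "b", "c"]}, index=[2, 1, 0])

by_label = frame.loc[0, "value"]   # row whose label is 0 (the last row)
by_position = frame.iloc[0, 0]     # row at position 0 (the first row)

print(by_label, by_position)  # c a
```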

It is impossible to change the default, as something would likely break the world without an easy deprecation. Extensions to [] are not being considered, only deprecations.
