Alternative syntax for record positional field getters - viability query #2726

Closed
lrhn opened this issue Dec 15, 2022 · 12 comments
Labels
feature Proposed language feature that solves one or more problems records Issues related to records.

Comments

@lrhn
Member

lrhn commented Dec 15, 2022

The current records proposal uses the name $0 to access the first positional field of a record.

It's a simple approach which requires no new syntax, because it's just a named member access like any other. It works with dynamic invocations. There is a risk of a named field's name conflicting with a positional field's getter, but the $ prefix should make it unlikely (less likely than, say, .item1, but probably not by much).

We are now considering a Swift/Rust-like syntax of record.0 instead of record.$0. That has some benefits, but also possibly some drawbacks, mainly around parsing.

Basically, we'd add '.' <DECIMAL_NUMERAL> (where <DECIMAL_NUMERAL> ::= <DIGIT>+) as a selector, similar to an identifier-named selector. It should be usable as .2, ?.2, ..2 and ?..2 (so ..2 is a cascade selector). It's not (currently) an assignable selector, since it will only apply to record fields, and those are final.
It will act just like a member access for an integer-named member, and so far only records will be able to have such, and they're all getters.
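
For comparison, here is a minimal sketch of the existing `$`-prefixed spelling next to the proposed one. It uses the one-based getters (`$1`, `$2`) that Dart 3 ended up shipping; the `.0` form is the hypothetical syntax from this issue and appears only in comments, since no Dart compiler accepts it. The function names are made up for illustration.

```dart
// Positional record fields via the `$`-prefixed getters that shipped in
// Dart 3 (one-based). The `.0` / `.1` spelling is hypothetical.
String describe((int, String) r) {
  // Proposed alternative spelling: r.0 and r.1
  return '${r.$1}: ${r.$2}';
}

// Null-aware access works like any other named getter.
// Proposed alternative spelling: r?.0
int? firstOrNull((int, String)? r) => r?.$1;

void main() {
  print(describe((1, 'one'))); // prints "1: one"
  print(firstOrNull(null)); // prints "null"
}
```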

The advantages are that it's shorter, some think it's prettier (because the $ is noise), and it removes the risk of name collision. It's also (arguably) more reasonable to start counting from zero with this syntax than it is with more name-like getters.

**We are interested in understanding the viability of using this syntax before going any further.**

Choosing such a syntax is mainly expected to affect the front-ends, and mostly the parsers. After that, it's expected that we can treat the integer selector as a named selector, with a unique unspeakable name for each number, for most purposes. We may want to retain the integer value if the back-ends can use it.

Possible parsing issue:

Tokenization becomes ambiguous. A .2 can be either a double literal or a . followed by a decimal selector 2, as r.2.

That's definitely an issue that needs to be resolved. It may be somewhat similar to how we handle >>>, which is tokenized into a single "triple shift" operator, but may be split into individual > s again if parsing needs to end a type argument list.

It may be possible to similarly split a double literal like .2 or 2.2 into individual decimal numerals and dots if it occurs in a selector or selector-name position. We do believe that those positions can never validly contain a double literal, so there will not be ambiguity between valid programs, other than a leading double numeral like 2.2; itself, which should still be tokenized as a number. It's only when a double literal occurs after another expression, or after another expression followed by ./?./../?.., that it may need reinterpreting as a selector. (And we treat ?. as a single token, different from ? ., and we don't try to split that, so {e?.2:0} is a map literal.)
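
The splitting idea can be sketched as a toy tokenizer: tokenize greedily first, then re-split a double literal only when it sits in selector position. This is a rough illustration under heavy simplifying assumptions (no exponents, no strings or comments, keywords treated as identifiers), not the real Dart scanner; every name in it is hypothetical.

```dart
// Toy maximal-munch tokenizer plus a re-splitting pass. Hypothetical sketch.
final _token =
    RegExp(r'\?\.\.|\?\.|\.\.|\d*\.\d+|\d+|[A-Za-z_$][A-Za-z0-9_$]*|\S');
final _doubleLiteral = RegExp(r'^(\d*)\.(\d+)$');
final _endsExpression = RegExp(r'^([A-Za-z_$][A-Za-z0-9_$]*|\d+|[)\]])$');
const _selectorOps = {'.', '?.', '..', '?..'};

/// Greedy tokenization: `.2` and `2.2` come out as single double literals.
List<String> tokenize(String source) =>
    _token.allMatches(source).map((m) => m[0]!).toList();

/// Re-splits double literals that directly follow an expression (`r.2`)
/// or a selector operator (`r?.2.2`); all other tokens are kept as-is.
List<String> resplit(List<String> tokens) {
  final out = <String>[];
  for (final token in tokens) {
    final match = _doubleLiteral.firstMatch(token);
    final previous = out.isEmpty ? '' : out.last;
    if (match == null) {
      out.add(token);
      continue;
    }
    final whole = match[1]!;
    final fraction = match[2]!;
    if (whole.isEmpty && _endsExpression.hasMatch(previous)) {
      out..add('.')..add(fraction); // `.2` after an expression
    } else if (whole.isNotEmpty && _selectorOps.contains(previous)) {
      out..add(whole)..add('.')..add(fraction); // `2.2` after `?.` etc.
    } else {
      out.add(token); // a genuine double literal
    }
  }
  return out;
}

void main() {
  print(resplit(tokenize('r.2')).join(' ')); // r . 2
  print(resplit(tokenize('r?.2.2')).join(' ')); // r ?. 2 . 2
  print(resplit(tokenize('x = 2.2;')).join(' ')); // x = 2.2 ;
  print(resplit(tokenize('{e?.2:0}')).join(' ')); // { e ?. 2 : 0 }
}
```

A real parser would decide whether to split from its grammar state rather than from the previous token alone, which is part of why error recovery is a concern.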

Even if we can parse such valid programs, it may still negatively affect parser recovery for almost-correct programs.

Open design choices

There are a few ways we can vary the syntax, which could make it easier or harder to parse, but won't necessarily make any difference.

  • Allow leading zeros. Should we allow record.01 to mean the same as record.1? (No strong preference. People might want to align things, but it's otherwise unnecessary.)
  • Hex literals. Should we allow record.0xA to mean the same as record.10. (Probably not. We don't expect so many fields that it'll make much of a difference.)
  • Start at .0 or .1. Should not affect parsing.
  • Dynamic invocations. Should not affect parsing. Doing dynamicValue.2 should work on records with at least three positional fields, and fail on non-records or shorter records, which won't have a getter named 2. When it fails, should it call noSuchMethod of the object? If so, what should the memberName symbol be? (Likely no to calling noSuchMethod, but if yes, const Symbol("2") is valid. Don't want to add a #2 symbol literal.)
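
The dynamic-invocation case can be tried today with the `$`-prefixed getters that Dart 3 shipped. The sketch below assumes that spelling rather than the hypothetical `dynamicValue.2`, and the helper name `hasThirdField` is invented for the example.

```dart
// Dynamic lookup of a third positional field. The getter is resolved at run
// time, so a missing field surfaces as a NoSuchMethodError.
bool hasThirdField(dynamic value) {
  try {
    value.$3; // Proposed alternative spelling: value.2
    return true;
  } on NoSuchMethodError {
    return false;
  }
}

void main() {
  print(hasThirdField((1, 2, 3))); // record with three positional fields
  print(hasThirdField((1, 2))); // shorter record
  print(hasThirdField('not a record')); // non-record
}
```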

Developers, front-end ones first, WDYT - viable or near impossible?
@johnniwinther @jensjoha

@lrhn lrhn added feature Proposed language feature that solves one or more problems records Issues related to records. labels Dec 15, 2022
@eernstg
Member

eernstg commented Dec 15, 2022

We should also consider the generality of this mechanism. It would have been nice if we could support this notion of integer numerals as selectors as more than a very narrow special case, but that doesn't seem to work out so smoothly:

extension on Object {
  void operator 0() => print('Hello, $this!');
}

void main() {
  2.0; // Print 'Hello, 2!', or evaluate a `double` value and discard it?
}

@lrhn
Member Author

lrhn commented Dec 15, 2022

It's true that if we later introduce integer getters (or general members, wohoo!) on arbitrary types, then 2.0 becomes both grammatically and syntactically valid for both possible tokenizations. So we have to choose, and choosing the double literal is the only rational choice. So it evaluates to the double, discards it, and gets an analyzer warning for useless code.

Doing maximal tokenization, as usual, and then splitting the tokens if necessary, will give that effect.

I don't think there are other cases where both tokenizations of '.' DIGIT can lead to grammatically valid programs.

@dart-lang dart-lang deleted a comment from johnniwinther Dec 16, 2022
@jensjoha

If I can get some sample test cases I can try to spend a little time hacking on it and see how far I get. That should probably give us an idea.

@lrhn
Member Author

lrhn commented Dec 19, 2022

Here is a CL with some hypothetical tests. Since there is no syntax support, I probably made plenty of mistakes, so take it with a grain of salt.

https://dart-review.googlesource.com/c/sdk/+/276522

@munificent
Member

My opinions:

  • Allow leading zeros. Should we allow record.01 to mean the same as record.1? (No strong preference. People might want to align things, but it's otherwise unnecessary.)

No. It's a label whose entire lexeme is meaningful, not an integer whose value is the identifier.

Rust doesn't allow leading zeroes.

  • Hex literals. Should we allow record.0xA to mean the same as record.10. (Probably not. We don't expect so many fields that it'll make much of a difference.)

Again, no.

  • Start at .0 or .1. Should not affect parsing.

Zero. Rust and Swift both use this syntax and start at zero.

All of the languages I found that use integer literals in some form (as opposed to some prefixed identifier-like thing like #1, Item1, or _1) start at zero:

  • Crystal, D: tuple[0], tuple[1], etc.
  • Scala 3: tuple(0), tuple(1), etc.

  • Dynamic invocations. Should not affect parsing. Doing dynamicValue.2 should work on records with at least three positional fields, and fail on non-records or shorter records, which won't have a getter named 2. When it fails, should it call noSuchMethod of the object? If so, what should the memberName symbol be? (Likely no to calling noSuchMethod, but if yes, const Symbol("2") is valid. Don't want to add a #2 symbol literal.)

I'm fine either way. It works since you can construct a symbol whose name is 2. Agreed that it's definitely not worth adding symbol literal support for this.

@lrhn
Member Author

lrhn commented Dec 20, 2022

ACK on no leading zeros. Updated hypothetical test files.

Going for "cannot be intercepted by noSuchMethod" for now, which means all we have to worry about is how to create a NoSuchMethodError, not necessarily an Invocation with a memberName.

(I have an even more hypothetical document theorizing how we could possibly make a grammar for this: https://gist.github.com/lrhn/f06ba8300def9cc4bfe84869c3d78229. May have no bearing on the reality of parsing.)

@jensjoha

jensjoha commented Jan 9, 2023

It seems to work @ https://dart-review.googlesource.com/c/sdk/+/278506

@lrhn
Member Author

lrhn commented Jan 9, 2023

That's darn impressive!

@munificent
Member

munificent commented Jan 12, 2023

I wanted to get some actual data about whether users prefer numbered lists of things in their code to be zero-based or one-based. I did some scraping. My script looks at type parameter lists and parameters. For each one, it collects all of the identifiers that have the same name with numeric suffixes. For each of those sequences, it sorts the numbers and looks at the starting one.

After looking at 14,826,488 lines in 90,919 files across a large collection of Pub packages, Flutter widgets, and Flutter apps, I see:

-- Start (2740 total) --
   1544 ( 56.350%): 1     ===============================
   1114 ( 40.657%): 0     ======================
     59 (  2.153%): 2     ==
      6 (  0.219%): 30    =
      4 (  0.146%): 8     =
      3 (  0.109%): 11    =
      2 (  0.073%): 32    =
      2 (  0.073%): 5     =
      2 (  0.073%): 6391  =
      1 (  0.036%): 3     =
      1 (  0.036%): 91    =
      1 (  0.036%): 37    =
      1 (  0.036%): 24    =

So there's a slight preference for 1-based, but not huge. Looking at parameter lists and type parameter lists separately:

-- Parameters start (2618 total) --
   1435 ( 54.813%): 1     ==============================
   1105 ( 42.208%): 0     =======================
     55 (  2.101%): 2     ==
      6 (  0.229%): 30    =
      4 (  0.153%): 8     =
      3 (  0.115%): 11    =
      2 (  0.076%): 32    =
      2 (  0.076%): 5     =
      2 (  0.076%): 6391  =
      1 (  0.038%): 3     =
      1 (  0.038%): 91    =
      1 (  0.038%): 37    =
      1 (  0.038%): 24    =

-- Type parameters start (122 total) --
    109 ( 89.344%): 1  ===================================================
      9 (  7.377%): 0  =====
      4 (  3.279%): 2  ==

The stark difference here suggests that there may be some outlier code defining a ton of type parameter lists with a certain style. Indeed, if we look at the number of sequences in each package:

-- Package (6089 total) --
   1344 ( 22.073%): ffigen-6.1.2
    500 (  8.212%): realm-0.4.0+beta
    440 (  7.226%): artemis_cupps-0.0.76
    308 (  5.058%): _fe_analyzer_shared-46.0.0
    277 (  4.549%): tencent_im_base-0.0.33
    250 (  4.106%): realm_dart-0.4.0+beta
    172 (  2.825%): flutter-flutter
    167 (  2.743%): invoiceninja-admin-portal
    167 (  2.743%): invoiceninja-flutter-mobile
    111 (  1.823%): statistics-1.0.23
     71 (  1.166%): dart_native-0.7.4
     59 (  0.969%): sass-1.54.5
     56 (  0.920%): fpdt-0.0.63
     53 (  0.870%): objectbox-1.6.2
     49 (  0.805%): medea_flutter_webrtc-0.8.0-dev+rev.fe4d3b9cd21a390870d5390393300371fe5f1bb2
     46 (  0.755%): linter-1.27.0

So ffigen (whose name suggests it contains a ton of generated code) heavily skews the data.

Really, what we want to know is not what each sequence prefers, but what each user prefers. If only one user prefers starting at zero and everyone else prefers starting at one, but that user authors thousands of parameter lists, that doesn't mean they get their way.

To approximate per-user preference, I treated each top level directory as a separate "author". For each one, I looked at all of the sequences in it to see if they start at one, zero, or both:

-- By package/author (338 total) --
    305 ( 90.237%): Only one-based                 ===========================
     22 (  6.509%): Only zero-based                ==
     11 (  3.254%): Both zero-based and one-based  =
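
The per-package vote can be sketched roughly as follows. This is a hypothetical reconstruction, not the script that produced the numbers above: the function names, the sample package names, and the "Neither" bucket for packages whose sequences start at other numbers are all assumptions.

```dart
// Classifies one package by the starting numbers of its sequences.
String classify(Iterable<int> starts) {
  final zero = starts.contains(0);
  final one = starts.contains(1);
  if (zero && one) return 'Both zero-based and one-based';
  if (zero) return 'Only zero-based';
  if (one) return 'Only one-based';
  return 'Neither'; // Assumed bucket; not shown in the table above.
}

// Counts one vote per package, however many sequences it contains.
Map<String, int> tally(Map<String, List<int>> startsByPackage) {
  final counts = <String, int>{};
  for (final starts in startsByPackage.values) {
    final label = classify(starts);
    counts[label] = (counts[label] ?? 0) + 1;
  }
  return counts;
}

void main() {
  // Invented sample data: a generated-code package with many zero-based
  // sequences still counts as a single vote.
  final sample = {
    'generated_bindings': List.filled(1000, 0),
    'small_app_a': [1, 1],
    'small_app_b': [1],
    'mixed_pkg': [0, 1],
  };
  print(tally(sample));
}
```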

While there are many sequences that start with zero, they are heavily concentrated in a few packages like ffigen and realm. When you consider each package as a single vote for a given style, then there is a much larger number of packages that contain one-based sequences. If you look at them, each one-based package only has a fairly small number of sequences. But there are many of these packages. That suggests that most users hand-authoring type parameter and parameter sequences prefer starting them at one.

Based on that, I think we should start positional record field getters at one too.

@jensjoha

I wanted to get some actual data about whether users prefer numbered lists of things in their code to be zero-based or one-based. I did some scraping. My script looks at type parameter lists and parameters. For each one, it collects all of the identifiers that have the same name with numeric suffixes.

So what you're saying is that you're looking for "foo1" and "foo2" for instance (vs "foo0" and "foo1"?)

(My 2 cents would be that with lists starting at 0 it would add confusion that these don't.)

@munificent
Member

munificent commented Jan 12, 2023

So what you're saying is that you're looking for "foo1" and "foo2" for instance (vs "foo0" and "foo1"?)

Yes, exactly. It looks for parameters whose identifier is [alpha][number] and groups them by their shared prefix. Then for each of those groups with more than one entry, it looks at the lowest number in the range. The code for the script is here.
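
A minimal sketch of that grouping step, written from the description above rather than from the actual script; the regular expression and the function name `sequenceStarts` are assumptions.

```dart
// Matches an alphabetic prefix followed by a numeric suffix, e.g. `foo12`.
final _numbered = RegExp(r'^([A-Za-z_$]+)(\d+)$');

/// Groups parameter names by shared prefix and returns the lowest numeric
/// suffix for every prefix that has more than one entry.
Map<String, int> sequenceStarts(List<String> parameterNames) {
  final groups = <String, List<int>>{};
  for (final name in parameterNames) {
    final match = _numbered.firstMatch(name);
    if (match == null) continue; // Not of the form [alpha][number].
    groups.putIfAbsent(match[1]!, () => []).add(int.parse(match[2]!));
  }
  return {
    for (final entry in groups.entries)
      if (entry.value.length > 1)
        entry.key: entry.value.reduce((a, b) => a < b ? a : b),
  };
}

void main() {
  // `baz3` is dropped because its group has a single entry.
  print(sequenceStarts(['foo1', 'foo2', 'bar0', 'bar1', 'baz3', 'count']));
  // prints "{foo: 1, bar: 0}"
}
```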

(My 2 cents would be that with lists starting at 0 it would add confusion that these don't.)

That was my intuition too, which is why the proposal initially had them start at zero. But from looking at the data, it seems pretty clear that when users number sequences of identifiers, they mostly start them at 1. See, for example, Object.hash().

@munificent
Member

munificent commented Jan 12, 2023

It seems to work @ https://dart-review.googlesource.com/c/sdk/+/278506

Incredible!

We spent a bunch of time discussing this in the language meeting. While it appears to be technically feasible, from looking through the tests we concluded that it's just too weird and brittle. Given that Dart already has floating point literals that don't require a leading digit, null-aware operators, and cascades, it will be very hard (but apparently not impossible!) for tools to parse it correctly. While we might be able to get our compiler to handle it, all of the various syntax highlighters, static analyzers, IDE integrations, tree-sitters, etc. might not be so lucky.

It's just a bridge too far. I don't think anyone loves the $1 syntax, but it's simple and safe. For a feature that we don't anticipate being used heavily—users should prefer destructuring—that's the right trade-off.
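
To illustrate why destructuring is expected to be the common path, here is a short sketch in Dart 3 syntax showing the same computation with `$` getters and with a pattern. The function names are invented for the example.

```dart
// Returns the smallest and largest value as a positional record.
(double, double) minMax(List<double> values) {
  var lo = values.first, hi = values.first;
  for (final v in values) {
    if (v < lo) lo = v;
    if (v > hi) hi = v;
  }
  return (lo, hi);
}

// Positional getters: works, but the names carry no meaning.
double rangeViaGetters(List<double> values) {
  final result = minMax(values);
  return result.$2 - result.$1;
}

// Destructuring: each field gets a descriptive local name.
double rangeViaDestructuring(List<double> values) {
  final (lo, hi) = minMax(values);
  return hi - lo;
}

void main() {
  print(rangeViaGetters([3.0, 1.0, 4.0])); // prints "3.0"
  print(rangeViaDestructuring([3.0, 1.0, 4.0])); // prints "3.0"
}
```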

@jensjoha, thank you for working through an experimental implementation. I really appreciate it. In particular, the thorough tests are good for giving us a sense of what we'd be getting ourselves (and our users) into if we did this syntax.

We've decided to just stick with the current proposal and use $.

4 participants