Skip to content

Comments on tagged strings #1983

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
lrhn opened this issue Nov 23, 2021 · 14 comments
Open

Comments on tagged strings #1983

lrhn opened this issue Nov 23, 2021 · 14 comments

Comments

@lrhn
Copy link
Member

lrhn commented Nov 23, 2021

Re. https://github.com/dart-lang/language/blob/master/working/tagged-strings/feature-specification.md

  • Implicitly appending "StringLiteral" to the name is a hack. Not sure what is better, but I'd go with
    extension WhatNot on String {
      R operator foo(List<String> s, List<T> i) { ... }
    }
    to declare a prefix operator on strings instead. Then the operation is not looking up things in the lexical scope
    at all, but on the string. Not hacking the lexical scope is better.
  • Grammar needs to say that identifier cannot be r, which is taken for raw strings.
  • The grammar only allows a single identifier as the tag. It should at least be qualified, so you can use a prefixed import (if not going the extension operator way, which is resolved wrt. the string, not the scope).
  • Possibly even two-dotted qualified name, so you can refer to a static function with a prefix.
  • (Alternatively, allow any expression to be followed by a string literal, without name manglig.)
  • The interpolated expressions should be typed, not just as Object?. That's too JavaScript-y.
    • A string tag of type R Function(List<String>, List<Y Function()>) should only allow expressions with type Y in interpolations.
    • A generic R Function<X>(List<String>, List<X Function()>) tag should infer the X from the interpolation expressions (and R can then depend on it).
  • Not sure thunking interpolations is worth it all of the time. Maybe you can opt in to it.
    • If your tag has type R Function(List<String>, List<T Function()> for some type T, then an interpolation expression of type T is auto-thunked for you (and an interpolation expression of type T Function() is not).
    • Or maybe you have to provide a syntactic marker, like using $`expression` instead of ${expression} to have it thunked.
  • It's also dangerous to move code into a different function for a number of reasons, current and potential future:
    • As you noticed, it doesn't play well with async. There was a reason we stopped desugaring into lambda expressions in the spec, it was just wrong most of the time after introducing async.
    • We don't know if it will interact similary with future features. It's a design smell to move code from the context where it's written.
  • In general, I think I'd just not do implicit thunking, the interpolation expressions are always evaluated eagerly, and the second parameter has type List<Y>, not List<Y Function()>.
  • About async.
    • I think not making interpolations thunked at all by default makes it trivial. So do that.
    • If not, it's going to need a truck-load of heuristics to get it right in most of the cases.
    • If you have to ask for thunked expressions (in any way), and you do so inside an async function, then:
      • either your second parameter accepts a future, List<Future<T> Function()> or List<FutureOr<T> Function()>, and then the function is thunked using an async method if necessary (has type T, or has type Future<T> and contains await),
      • Or it doesn't, and then the expressions are thunked using a non-async function, which makes it an error to contain await.
@jakemac53
Copy link
Contributor

  • Implicitly appending "StringLiteral" to the name is a hack.

+1, fwiw I am not personally concerned about polluting the scope with simple identifiers. We have solutions for that already. Appending StringLiteral is a major code smell to me.

The interpolated expressions should be typed, not just as Object?. That's too JavaScript-y.

Big +1 from me on this as well, it seems like we should support generics basically as you have suggested.

Not sure thunking interpolations is worth it all of the time. Maybe you can opt in to it.

I also agree with this. The user can already opt in by simply passing a function into the template, and its a lot more explicit/obvious (granted the template has to support it). This is what the logging framework today does - you can pass a function for the message which will be lazily invoked if the log is actually converted to a string.

Existing string interpolations are immediately evaluated, and I think it would be unexpected (and error prone) to do something different for tagged templates.

@rakudrama
Copy link
Member

dart2js uses a JavaScript AST library that supports 'compiling' the template to a instantiator function.
This is my test case to see if the proposal significantly improves the usability of this library, example.

If I have the tagged string expr '2 * $a + $b', some of that is available at compile time.
I'd like to be able to push the early binding as far as possible.
If part of the answer is that I must look up the list of strings to find the 'processed skeleton', then the repeatedly allocated list of strings would need to be replaced with something with less overhead and is an efficient map key. Perhaps a const list is good enough.

I'd like to see a better example than the implementation of exprStringLiteral.
I would expect the processor to avoid converting everything back to a string.
A safe-html tag or a safe-sql tag would want to avoid re-parsing to avoid injection attacks.

Is there a nice way to use a class name as a tag? I think

int '12345'

is confusing, since it is not an int, but if class names could be used as tags then we would be able to write

BigInt '12345'
DateTime '10:35:37'

This would be especially nice if tagged strings could be constants somehow.

@munificent
Copy link
Member

This feedback is great!

  • I like the idea of making it an extension operator on String. Using extension method lookup seems like a good idea. Name mangling is hacky. (For what it's worth, this is what C# does for attribute classes.) This also solves the need for dotted names.
  • I don't think we should allow arbitrary expressions before a string literal. That sounds like grammar hell. I would want this to be a simple static syntactic feature.
  • Typing the interpolated expressions sounds good to me too.

The hard question is about thunking the interpolated expressions. If we don't support this at all, it significantly limits the feature. I think there are valuable use cases around conditionally executing some expressions, catching exceptions, etc. On the other hand, it would be a little odd to support it here but not for other kinds of parameters. Maybe the right answer is to have a general lazy call-by-name feature at some point in the future which could then be used here too.

For using this syntax in the macro authoring API, eager evaluation is actually fine, which is an indication that maybe thunking these was too adventurous on my part. On the other hand, the ancient Dash string template feature (remember that?) did make them closures...

I actually spent quite a lot of time trying to figure out a design for this that made the tagged strings be expression macros that are processed at compile time. This would let a particular tag processor decide whether to evaluate the expressions eagerly or lazily. But we don't currently have a plan for expression-level macros and I wasn't able to come up with a simple enough proposal to add them. In a perfect world, that would still be my preferred choice. But without expression-level macros, let expressions, etc. it's really hard to make tagged strings do anything useful at compile-time. And adding all of those is a lot of additional surface area for an already very large feature.

@munificent
Copy link
Member

dart2js uses a JavaScript AST library that supports 'compiling' the template to a instantiator function. This is my test case to see if the proposal significantly improves the usability of this library, example.

Agreed. That's an excellent example of an API that should be able to use this. That would let you write:

        body = jsStatement 'return ${_emitter.prototypeAccess(superClass)}.$methodName.call(this, $targetArguments);');

If I have the tagged string expr '2 * $a + $b', some of that is available at compile time. I'd like to be able to push the early binding as far as possible. If part of the answer is that I must look up the list of strings to find the 'processed skeleton', then the repeatedly allocated list of strings would need to be replaced with something with less overhead and is an efficient map key. Perhaps a const list is good enough.

I'm sorry, I didn't follow all of this.

@munificent
Copy link
Member

OK, I've updated the proposal (22ef7f2) to make it eager and to allow more tightly typed interpolated expressions. The remaining issues is:

  • Implicitly appending "StringLiteral" to the name is a hack.

+1, fwiw I am not personally concerned about polluting the scope with simple identifiers. We have solutions for that already. Appending StringLiteral is a major code smell to me.

The proposal still has that. I am concerned about polluting the scope with simple identifier. I definitely do not want the tag processor function to be in the top level lexical scope with a name like expr, html, etc. That's just begging for trouble.

Appending StringLiteral is... crude... but it's simple and works. It's how C# handles Attribute classes. We may even want to consider something similar for macros since data is a pretty common word to stick in the top level scope.

  • Not sure what is better, but I'd go with
    extension WhatNot on String {
      R operator foo(List<String> s, List<T> i) { ... }
    }

I liked this idea at first, but then I realized it doesn't actually make sense. The problem is that there is no string instance to call this on. There's no this since you don't have a string. You just have the two lists. It's not a method you call on a string because it exists prior to any string ever being created.

We could require:

extension WhatNot on String {
  static R operator foo(List<String> s, List<T> i) { ... }
}

But now we're inventing some completely new "static operator" concept which feels pretty weird. I also think it's somewhat dubious to allow any user-defined identifier after operator. Doing so might get in the way of us adding actual operators for certain identifiers in the future.

We can't make an extension constructor on String because (1) extensions can't have constructors and (2) it may not construct a string.

I don't love suffixing the name, but I don't have any better ideas either.

@leafpetersen
Copy link
Member

I liked this idea at first, but then I realized it doesn't actually make sense. The problem is that there is no string instance to call this on. There's no this since you don't have a string. You just have the two lists. It's not a method you call on a string because it exists prior to any string ever being created.

Why not make it an extension on a class named TaggedStringLiteral which contains the two lists?

@lrhn
Copy link
Member Author

lrhn commented Nov 30, 2021

I'd actually prefer to not allocate two lists (and a TaggedStringLiteral wrapper on top). I guess we can optimize it to have immutable lists (the String list can be const and the values list can be backed by the stack if we know it doesn't escape).

The #1478 proposal instead uses a builder pattern where the "tag" has addString(String) and addValue(T) methods that are called for each part of the interpolation, whether string or interpolation expression of type T. The biggest issue with that is that it needs a final close()/toValue() method to be called after the last part.
(With something #1479, which allows collection for/if elements in interpolations, it makes even more sense to be allowed to do repeated addValue calls with no string between them).

@jakemac53
Copy link
Contributor

I'd actually prefer to not allocate two lists (and a TaggedStringLiteral wrapper on top). I guess we can optimize it to have immutable lists (the String list can be const and the values list can be backed by the stack if we know it doesn't escape).

What if we just made them Iterables instead? Then they are already immutable (statically too), and they can be lazy, and avoid the list allocations. The main issue I see with that is it can be pretty awkward to iterate over all the elements in order (alternating between the two iterables). You end up needing to use the iterators directly, or just converting them to lists.

@lrhn
Copy link
Member Author

lrhn commented Nov 30, 2021

Using iterables is an interesting idea. It could even lazily evaluate the expressions when you read current (but that still has some the dangers of thunking, in case some expressions are never evaluated).

I generally don't like the two-seperate-iterables/lists approach.
It could also be an Iterable<StringOrValue<T>> where each element is either a string slice or the value of an interpolation expression. Probably requires one extra object allocation per element, unless we are very clever at optimizing and inlining.

Compare this to C#'s collection initializers: new HashSet<int> { e1, e2, e3, e4 }, which is equivalent to new HashSet<int>()..Add(e1)..Add(e2)..Add(e3)..Add(e4) (if they had cascades).
That's built by evaluating the expressions, then calling something on the receiver. They can also call setters of objects, like new Cat { Age = 10, Name = "Fluffy" }.
It's like a builder, only without the final close/build call, because the collections are mutable.

So, in short, not happy about the two separate lists, not overly fond of a List<OneOf<String,T>>, and generally preferring a builder-approach (but want a real builder approach with a final "close"/"build" call, and it's annoying to have to choose a name for that, not something that only works on mutable objects).

@jakemac53
Copy link
Contributor

jakemac53 commented Nov 30, 2021

Good point re: the same problems as thunking potentially. I also agree that I dislike the two lists/iterables approach, it is just very awkward to deal with even when they are lists (and it requires useless empty strings in places etc if tagged string starts or ends with a thunk).

I think I agree with you that the builder approach is better.

  • Its efficient
  • It is ultimately less clunky and error prone (no weird indexing issues, having to know that the list of strings is one larger, etc)
  • The api can still be strongly typed (you don't have to do is checks on elements to figure out if its a string or thunk)
  • We can ensure all the thunks are actually evaluated, and exactly once as well

@leafpetersen
Copy link
Member

Using iterables is an interesting idea. It could even lazily evaluate the expressions when you read current (but that still has some the dangers of thunking, in case some expressions are never evaluated).

I generally don't like the two-seperate-iterables/lists approach. It could also be an Iterable<StringOrValue<T>> where each element is either a string slice or the value of an interpolation expression. Probably requires one extra object allocation per element, unless we are very clever at optimizing and inlining.

If this discussion is driven by performance concerns (which it seems like it started as?) I'm fairly skeptical that using iterables and builders is going to be more performance than simply allocating a list and calling a statically known function. You're adding a lot of code indirections that the compiler has to figure out, instead of just inlining a function and deconstructing a list literal. Maybe you have concrete strategy in mind that is optimizable, but I'd like to see some evidence.

@jakemac53
Copy link
Contributor

You're adding a lot of code indirections that the compiler has to figure out, instead of just inlining a function and deconstructing a list literal. Maybe you have concrete strategy in mind that is optimizable, but I'd like to see some evidence.

There is some evidence from the previous JSON explorations, which is a similar situation. Avoiding the intermediate lists/maps during parsing and going directly to objects there was significantly faster in my testing, although I don't remember the exact numbers.

But it shouldn't be that difficult to write a benchmark for this specific case as well to see how things compare and we should definitely do that :).

@munificent
Copy link
Member

I liked this idea at first, but then I realized it doesn't actually make sense. The problem is that there is no string instance to call this on. There's no this since you don't have a string. You just have the two lists. It's not a method you call on a string because it exists prior to any string ever being created.

Why not make it an extension on a class named TaggedStringLiteral which contains the two lists?

That's a neat idea. I'm not crazy about requiring allocating an instance of that wrapper class, but maybe it could get inlined and eliminated. Does that feel significantly less magical or hacky to you?

What if we just made them Iterables instead? Then they are already immutable (statically too), and they can be lazy, and avoid the list allocations.

I did consider that. In fact, because Iterables are implicitly lazy, that would even potentially give us a way to evaluated the interpolated expressions lazily—we could specify that each interpolated expression is only evaluated when the iterator reaches that position.

But, in practice, I think the code the compiler would have to generate to implement these iterables would be quite complex and likely slow. I don't want to rely on a Sufficiently Smart to get tolerable performance.

The #1478 proposal instead uses a builder pattern where the "tag" has addString(String) and addValue(T) methods that are called for each part of the interpolation, whether string or interpolation expression of type T. The biggest issue with that is that it needs a final close()/toValue() method to be called after the last part.
(With something #1479, which allows collection for/if elements in interpolations, it makes even more sense to be allowed to do repeated addValue calls with no string between them).

Going with a builder style is definitely interesting. I look at this as mainly a question of internal versus external iteration. If we use a builder protocol, then the string literal drives the tag processor. It decides which add() calls are invoked when. If we use lists like my proposal, then the tag process gets to decide when and how to iterate over the values.

There's no free lunch here. Whichever way we choose makes it easier to write code for one side at the expense of the other. If we use a builder API, it gets harder to a tag processor to do things like:

  • Perform some work between each each call.
  • Pre-allocate structures based on how many interpolated expressions will be processed.
  • Skip over or discard some values.

Since the compiler will be writing the code to call the processor, I think it makes sense to put the hard work there and make it easier for a human to implement the tag processor. Giving them two complete lists is pretty simple and easy to work with.

Also, as noted, the string list can be const in this case. Heck, the compiler might even be able to make interpolated value list const in some cases. In general, doing a builder pattern will make it really hard to play nice with immutability.

Compare this to C#'s collection initializers: new HashSet<int> { e1, e2, e3, e4 }, which is equivalent to new HashSet<int>()..Add(e1)..Add(e2)..Add(e3)..Add(e4) (if they had cascades).
That's built by evaluating the expressions, then calling something on the receiver. They can also call setters of objects, like new Cat { Age = 10, Name = "Fluffy" }.

That made sense for C# at the time because immutability wasn't really a thing and its existing data structures were always created empty and them imperatively filled up. But going in this direction for Dart feels like a step backwards to me. Users prefer objects that are created in a fully initialized state and are rarely mutated afterwards. Dart has const and final fields and encourages use of both. We're looking to add immutable tuples and records. We're exploring using immutability to enable sharing objects across isolates.

Adding syntactic sugar that explicitly desugars to mutation seems like it runs counter to all that.

@lrhn
Copy link
Member Author

lrhn commented Nov 30, 2021

I guess the TaggedStringLiteral class could be something you can query, rather than just containing two lists.
Maybe:

class TaggedStringLiteral<T> {
  int get valueCount;
  int get stringCount;
  int get length;
  bool isValue(int index);
  T valueAt(int index);
  String stringAt(int index);
}

which hides the implementation and doesn't require strict interleaving of strings and values.
You can still tell how many strings and values occur, and iterate them in order, you just have to ask for each one whether it's a value or string (and the valueAt and stringAt will throw if you use the wrong one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants