Generalized string interpolation #1479

Open
lrhn opened this issue Feb 26, 2021 · 22 comments
Labels
feature Proposed language feature that solves one or more problems

Comments

@lrhn
Member

lrhn commented Feb 26, 2021

Currently string interpolation can only create strings. It's a powerful template mechanism, but it's restricted to creating a string at the end.

If we instead allowed the string parts and values to be collected into a general interface, instead of just something like a StringBuffer, then other kinds of "literals" could use the feature.

Let's say we defined:

abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

and then allowed the syntax <postfixExpression> <stringLiteral> to be used to provide an Interpolator which is called with the parts, returning a new (or the same) interpolator with the updated state, instead of just turning it all into a string.

Example:

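import 'dart:convert'; // for jsonEncode

class JsonInterpolator implements Interpolator<String, Object?> {
  final StringBuffer _buffer = StringBuffer();
  // Literal parts of the template are copied through unchanged.
  addString(string) {
    _buffer.write(string);
    return this;
  }
  // Interpolated values are JSON-encoded before being appended.
  addValue(value) {
    _buffer.write(jsonEncode(value));
    return this;
  }
  String toValue() => _buffer.toString();
}
JsonInterpolator get jsn => JsonInterpolator();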

This class would allow you to write:

var myJson = jsn"""
{ $name: $value,
  "other": [$v1, $v2],
  "all": [
  ${for (var i = 0; i < values.length; i++) ...[if (i > 0) ",", values[i]] /* using #1478 */}
  ]
}
""";

and have all the values which are plugged into the string be JSON encoded first.

An expression of the form e stringLiteral would be a compile-time error if e was not assignable to Interpolator<X, Y> for some X and Y. It is a compile-time error if the elements of the interpolation are not assignable to Y. The static type of the expression is X.

The default interpolation is just an implicit Interpolator<String, Object?> which does .toString() on all values and concatenates the strings.
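As a rough illustration of that last point, the default behavior could be written against the proposed interface roughly as follows. This is only a sketch: `DefaultInterpolator` is a hypothetical name, and because the prefix syntax does not exist, `main` spells out the call chain by hand.

```dart
/// The interface proposed above, repeated so the sketch is self-contained.
abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

/// Hypothetical stand-in for the implicit default interpolator.
class DefaultInterpolator implements Interpolator<String, Object?> {
  final StringBuffer _buffer = StringBuffer();

  @override
  DefaultInterpolator addString(String string) {
    _buffer.write(string);
    return this;
  }

  @override
  DefaultInterpolator addValue(Object? value) {
    // StringBuffer.write converts its argument with toString().
    _buffer.write(value);
    return this;
  }

  @override
  String toValue() => _buffer.toString();
}

void main() {
  final name = 'world';
  final count = 3;
  // Hand-written equivalent of "Hello $name, $count times".
  final result = DefaultInterpolator()
      .addString('Hello ')
      .addValue(name)
      .addString(', ')
      .addValue(count)
      .addString(' times')
      .toValue();
  // Compare against the built-in interpolation.
  print(result == 'Hello $name, $count times');
}
```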

Grammar-wise this conflicts with r"string" if r refers to an Interpolator. Maybe we need to put a symbol between the two, but all the good symbols are taken. Maybe we could introduce a new syntax instead: <e>"string", it just looks a little too much like a type. We could make it a suffix instead, so "{$x:$y}"jsn, looking more like a RegExp flag. I think it's better for readability to be in front.

It would change auto-concatenation of string literals, because that only works for actual strings.
A string literal with a non-default interpolator is not concatenated with any preceding string literals. It may apply to all of the following adjacent string literals.

It might even be possible to extend this to map and collection literals:

collectionBuilder {e1, e2, e3 }
mapBuilder {k1: v1, k2: v2, k3: v3}

with APIs like abstract class CollectionBuilder<T> { void add(T element); } and abstract class MapBuilder<K, V> { void operator []=(K key, V value); }. (It's not absolutely clear whether they need a toValue method as well, which would prevent using the current Set/List/Map APIs.)

@lrhn lrhn added the feature Proposed language feature that solves one or more problems label Feb 26, 2021
@passsy

passsy commented Feb 26, 2021

Comparison to extensions

At first glance, there is not much difference in syntax compared to extensions:

final interpolated = json"{$name: $value}";
final withExtension = "{$name: $value}".json;

The real benefit is that with the Interpolator API it would be possible to map the interpolated values automatically with the correct encoding. Each value might get urlEncoded or jsonEncoded which is a manual step with the current string interpolation.

Once correct encoding is required, the interpolator version is the simpler one, because the extension version has to encode each value by hand:

final interpolated = json"{$name: $value}";
final withExtension = "{$name: ${jsonEncode(value)}}".json;

All tokens at once

The proposed API splits the handling of strings and interpolated values into two methods, which are called in the order the "tokens" appear in the interpolated string.

abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

Maybe it would be easier for the implementer to get the full list of "tokens" at once. This would allow looking ahead to decide on the correct encoding for each value. That's also possible with the proposed API, by building the list and parsing it when toValue() is called. But the list probably already exists, so why not expose it?

abstract class Interpolator<R, T> {
  R toValue(List<Token<T>> tokens);
}

abstract class Token<T> {
  bool get isString;
  bool get isValue;
  String get string;
  T get value;
}
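For comparison, here is a sketch of how a token list could be layered on top of the push API, as mentioned above. `TokenInterpolator`, `BracketInterpolator`, and the concrete `Token` class are hypothetical names used only for illustration; the call chain in `main` is written out by hand because the prefix syntax does not exist.

```dart
/// The push interface from the proposal, repeated for self-containment.
abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

/// Hypothetical token: either a literal string part or an interpolated value.
class Token<T> {
  final bool isString;
  final String? string;
  final T? value;
  Token.string(String s) : isString = true, string = s, value = null;
  Token.value(T v) : isString = false, string = null, value = v;
}

/// Buffers every part and hands the complete list to [build] at the end.
abstract class TokenInterpolator<R, T> implements Interpolator<R, T> {
  final List<Token<T>> _tokens = [];

  @override
  TokenInterpolator<R, T> addString(String string) {
    _tokens.add(Token<T>.string(string));
    return this;
  }

  @override
  TokenInterpolator<R, T> addValue(T value) {
    _tokens.add(Token<T>.value(value));
    return this;
  }

  @override
  R toValue() => build(List<Token<T>>.unmodifiable(_tokens));

  /// Subclasses see all tokens at once and can look ahead or behind.
  R build(List<Token<T>> tokens);
}

/// Toy subclass: wraps every interpolated value in angle brackets.
class BracketInterpolator extends TokenInterpolator<String, Object?> {
  @override
  String build(List<Token<Object?>> tokens) =>
      tokens.map((t) => t.isString ? t.string! : '<${t.value}>').join();
}

void main() {
  final out =
      BracketInterpolator().addString('a=').addValue(1).addString(';').toValue();
  print(out);
}
```

The cost is one `Token` allocation per part, which is the overhead the push API avoids when the full list is not needed.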

@lrhn
Member Author

lrhn commented Feb 26, 2021

I thought about an API like the Token thing, but decided against it.

It's an extra overhead. If you want it, you can define your own Token class and collect the values internally, so it doesn't provide more power, only less flexibility and more allocations. I'd like to not have more allocations than necessary.

If you want to throw in one of the add calls, then you can do so immediately, and not wait for the rest of the expressions to be evaluated.

It's a "push" API, so the caller knows whether it's a string or a value, the compiler can literally turn

var myJson = jsn"{ $name: $value }";

into

var myJson = jsn.addString("{ ").addValue(name).addString(": ").addValue(value).addString(" }").toValue();
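The desugared form can be run today as ordinary Dart. The sketch below repeats the proposed interface and a JSON interpolator (using `StringBuffer.write`) so that it is self-contained; the sample values for `name` and `value` are made up for illustration.

```dart
import 'dart:convert';

abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

class JsonInterpolator implements Interpolator<String, Object?> {
  final StringBuffer _buffer = StringBuffer();

  @override
  JsonInterpolator addString(String string) {
    _buffer.write(string); // literal template text, copied as-is
    return this;
  }

  @override
  JsonInterpolator addValue(Object? value) {
    _buffer.write(jsonEncode(value)); // values are JSON-encoded
    return this;
  }

  @override
  String toValue() => _buffer.toString();
}

JsonInterpolator get jsn => JsonInterpolator();

void main() {
  final name = 'key "with" quotes'; // illustrative sample data
  final value = [1, 'two', null];

  // What the compiler would generate for jsn"{ $name: $value }".
  final myJson = jsn
      .addString('{ ')
      .addValue(name)
      .addString(': ')
      .addValue(value)
      .addString(' }')
      .toValue();
  print(myJson);

  // The encoded result parses back to the original data.
  final decoded = jsonDecode(myJson) as Map<String, dynamic>;
  print(decoded[name]);
}
```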

(There is an issue with my approach, though. The spread ...[if (i > 0) ",", something, ": "...] seems to assume that the strings in there are emitted as strings, but they will be values, so it's not possible to programmatically insert strings into the result without going through addValue. That is annoying.
Maybe we should allow an interpolation element, because interpolations should be elements anyway, of ..."something${foo}other" (with a leading ... like a spread, but it's a string expression, not an iterable) to add directly to the surrounding collector.)

@Cat-sushi

Cat-sushi commented Feb 27, 2021

I believe the remaining parts of an interpolated string, outside of $variable and ${expression}, would also be passed through addString().
Then, class g implements Interpolator<Characters, Object?> could be used to make grapheme cluster literals. #1432
But the results can't be constants.
Is that correct?

@lrhn
Member Author

lrhn commented Mar 1, 2021

You can create a gc"string" prefix which is just the normal behavior for string interpolations, except that toValue returns the .characters of the string.

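import 'package:characters/characters.dart'; // provides the .characters extension

class _GraphemeClusterInterpolator implements Interpolator<Characters, Object?> {
  final StringBuffer buffer = StringBuffer();
  // Literal parts and values are concatenated exactly as in a normal interpolation.
  addString(String value) {
    buffer.write(value);
    return this;
  }
  addValue(Object? value) {
    buffer.write(value.toString());
    return this;
  }
  // Only the final result differs: it is exposed as grapheme clusters.
  toValue() => buffer.toString().characters;
}

Interpolator<Characters, Object?> get gc => _GraphemeClusterInterpolator();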

You can also do things like xml "<html> ... </html>" which parses the string and returns a non-string value (like the JSON example). You can use a progressive/chunked parser and do things incrementally, without building the entire source first (and not need to convert $value to a string first, and then parse it back later, if the value must be a valid XMLNode).

It does mean that the result can't be const, not unless we increase the capability of constant computation significantly.

@passsy

passsy commented Mar 1, 2021

Possible use case DateFormat

While it isn't shorter, it could be a typesafe way to construct a DateFormat.

final DateFormat dateFormat = DateFormat("h:mm a");
final DateFormat dateFormatInterpolated = 
    df"${DfSymbol.hourInAmPm}:${DfSymbol.minuteInHour} ${DfSymbol.amPmMarker}";
class DateFormatInterpolator implements Interpolator<DateFormat, DfSymbol> { }

enum DfSymbol {
  // h
  hourInAmPm,

  // mm
  minuteInHour,

  // a
  amPmMarker
}

Multiple interpolated types

The question I was asking myself is whether it would also be possible to inject normal Strings via interpolation into such a date format interpolated string.

final String username = account.userName;
final DateFormat dateFormatInterpolated = 
    df"It's ${DfSymbol.hourInAmPm}:${DfSymbol.minuteInHour} ${DfSymbol.amPmMarker} for $username";

To make it work with the current API we need sum types #83. Then one could write

class DateFormatInterpolator implements Interpolator<DateFormat, DfSymbol|String> { }
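Until union types exist, one possible workaround is to declare the value type as `Object` and check it at run time, at the cost of static safety. The sketch below is an assumption-laden illustration: to stay dependency-free it returns a plain formatting function (`TimeFormatter`, a hypothetical typedef) rather than an intl `DateFormat`, and `_render` is a hand-written helper, not part of any library.

```dart
abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

enum DfSymbol { hourInAmPm, minuteInHour, amPmMarker }

/// Hypothetical result type: a function that formats a point in time.
typedef TimeFormatter = String Function(DateTime time);

/// Renders one symbol for the given time (12-hour clock, zero-padded minutes).
String _render(DfSymbol symbol, DateTime time) {
  if (symbol == DfSymbol.hourInAmPm) {
    final hour = time.hour % 12;
    return '${hour == 0 ? 12 : hour}';
  }
  if (symbol == DfSymbol.minuteInHour) {
    return time.minute.toString().padLeft(2, '0');
  }
  return time.hour < 12 ? 'AM' : 'PM';
}

/// Accepts DfSymbol or String values; the check happens at run time only.
class TimeFormatInterpolator implements Interpolator<TimeFormatter, Object> {
  final List<Object> _parts = [];

  @override
  TimeFormatInterpolator addString(String string) {
    _parts.add(string);
    return this;
  }

  @override
  TimeFormatInterpolator addValue(Object value) {
    if (value is! DfSymbol && value is! String) {
      throw ArgumentError.value(value, 'value', 'expected a DfSymbol or a String');
    }
    _parts.add(value);
    return this;
  }

  @override
  TimeFormatter toValue() {
    final parts = List<Object>.of(_parts);
    return (time) =>
        parts.map((p) => p is DfSymbol ? _render(p, time) : p as String).join();
  }
}

void main() {
  final username = 'bob'; // illustrative sample data
  // Hand-written equivalent of the df"It's ... for $username" example.
  final format = TimeFormatInterpolator()
      .addString("It's ")
      .addValue(DfSymbol.hourInAmPm)
      .addString(':')
      .addValue(DfSymbol.minuteInHour)
      .addString(' ')
      .addValue(DfSymbol.amPmMarker)
      .addString(' for ')
      .addValue(username)
      .toValue();
  print(format(DateTime(2021, 2, 26, 15, 7)));
}
```

Passing any other type, such as an `int`, compiles but throws an `ArgumentError`, which is exactly the gap a union type would close at compile time.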

@Cat-sushi

@lrhn

> You can create a gc"string" prefix which is just the normal behavior for string interpolations, except that toValue returns the .characters of the string.
>
> It does mean that the result can't be const, not unless we increase the capability of constant computation significantly.

That sounds better to me, but not ideal.

@lrhn
Member Author

lrhn commented Mar 1, 2021

I agree that Dart doesn't need JSON-string-literals.
A more realistic example would be XML-string-literals, where the values can be XML Nodes, or perhaps strings which are then properly escaped. Or SQL literals where again the values are escaped for you.

Or some kind of template system (like Dart code generation). It's not perfect for that because it doesn't nest well. If I want to inline a list, I can't just do ${for (var x in list) ...[x, ", "]} because the comma string will be treated as a value, not a string.
Maybe I just need a more comprehensive syntax, Scheme's quote/unquote 😁 .
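The escaping use case mentioned above can be sketched with the proposed interface. The example uses HTML escaping from `dart:convert` as a stand-in for XML or SQL escaping; `HtmlInterpolator` is a hypothetical name, and the call chain is written by hand because the prefix syntax does not exist.

```dart
import 'dart:convert';

abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

/// Template text is trusted; interpolated values are escaped.
class HtmlInterpolator implements Interpolator<String, Object?> {
  final StringBuffer _buffer = StringBuffer();

  @override
  HtmlInterpolator addString(String string) {
    _buffer.write(string); // written by the programmer, passed through as-is
    return this;
  }

  @override
  HtmlInterpolator addValue(Object? value) {
    // htmlEscape is the default HtmlEscape converter from dart:convert.
    _buffer.write(htmlEscape.convert(value.toString()));
    return this;
  }

  @override
  String toValue() => _buffer.toString();
}

void main() {
  final userInput = '<b>'; // illustrative untrusted input
  // Hand-written equivalent of a hypothetical html"<p>$userInput</p>".
  final page = HtmlInterpolator()
      .addString('<p>')
      .addValue(userInput)
      .addString('</p>')
      .toValue();
  print(page);
}
```

This also shows the nesting limitation: a string passed through `addValue` is always escaped, so there is no way to inject trusted markup from inside an interpolation.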

@munificent
Member

I've been wanting named string templates for a long time. I think I have an internal doc from 2011 proposing it. :)

With the static metaprogramming stuff, @jakemac53 and I are considering using them to also provide a nicer syntax for constructing pieces of Dart syntax, like:

var e = expr"{}";
var s = stmt"{}";

Using bare strings has some problems because, as in the examples above, the language is ambiguous if you don't know what grammar production you are trying to parse. A {} is a block in a statement context and an empty map in an expression context. Using named string templates with different templates (here, expr and stmt) would provide the API enough context to know how to parse the string.

Personally, I'm not crazy about the push API you defined. I think it will give templates more flexibility to have a pull API. In particular, I'd rather the interpolated expressions be thunks so that the template handler can choose when/if to evaluate them, handle exceptions coming from them, etc.

Of course, if you start talking about wanting to give user code the ability to not evaluate some subexpressions, that starts to look a lot like a macro... So Jake and I have discussed a little about whether some kind of named string template thing should be a compile-time API that gets expanded using static metaprogramming. We don't have anything at all coherent for that yet, though.

But, overall, yes, I would love named string templates like this.
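To make the thunk idea concrete, here is one possible shape for a pull-style API. Everything in it is hypothetical (`TemplateHandler`, `SafeStringHandler`, and the `build` signature are invented for illustration); it only shows that a handler receiving thunks can decide when to evaluate each expression and can catch exceptions from it.

```dart
/// Hypothetical pull-style handler: it receives the literal parts and the
/// unevaluated interpolated expressions, and decides when to run them.
abstract class TemplateHandler<R, T> {
  R build(List<String> strings, List<T Function()> thunks);
}

/// Evaluates each thunk lazily and replaces failures with a marker.
class SafeStringHandler implements TemplateHandler<String, Object?> {
  @override
  String build(List<String> strings, List<Object? Function()> thunks) {
    final buffer = StringBuffer();
    for (var i = 0; i < strings.length; i++) {
      buffer.write(strings[i]);
      if (i < thunks.length) {
        try {
          buffer.write(thunks[i]()); // the expression runs only here
        } catch (e) {
          buffer.write('<error: $e>');
        }
      }
    }
    return buffer.toString();
  }
}

void main() {
  final values = <int>[];
  // Hand-written equivalent of a hypothetical
  // safe"first: ${values.first}, count: ${values.length}".
  final result = SafeStringHandler().build(
    ['first: ', ', count: ', ''],
    [() => values.first, () => values.length],
  );
  // values.first throws on an empty list; the handler catches it.
  print(result);
}
```

With the push API, the exception from `values.first` would propagate before the interpolator ever saw the value.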

@ykmnkmi

ykmnkmi commented Aug 10, 2021

@lrhn could we use encoding constants as a prefix?

const List<int> hello = utf8'Hello, 世界';

@Cat-sushi

String literals are already utf-8 without prefix.
And, I think String literals with prefix in this proposal can't be constants.

@Cat-sushi

By the way, there is no definition of the encoding of source code in the spec.
What should define the encoding of source code?

@lrhn
Member Author

lrhn commented Aug 11, 2021

String values are encoded as UTF-16 (or rather, they are sequences of UTF-16 code units, not necessarily valid UTF-16), not UTF-8. The proposed idea here should be able to create a Uint8List from utf8'some text'. It will do so at run-time, from the string values.
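A run-time version of this could look like the sketch below. The getter is named `utf8Tag` (a made-up name) only to avoid clashing with the `utf8` codec from `dart:convert`, and the call is written in desugared form because the prefix syntax does not exist.

```dart
import 'dart:convert';
import 'dart:typed_data';

abstract class Interpolator<R, T> {
  Interpolator<R, T> addString(String string);
  Interpolator<R, T> addValue(T value);
  R toValue();
}

/// Builds the string as usual, then encodes it to UTF-8 bytes at run time.
class Utf8Interpolator implements Interpolator<Uint8List, Object?> {
  final StringBuffer _buffer = StringBuffer();

  @override
  Utf8Interpolator addString(String string) {
    _buffer.write(string);
    return this;
  }

  @override
  Utf8Interpolator addValue(Object? value) {
    _buffer.write(value);
    return this;
  }

  @override
  Uint8List toValue() =>
      // Copy into a Uint8List explicitly so the static type is guaranteed.
      Uint8List.fromList(utf8.encode(_buffer.toString()));
}

Utf8Interpolator get utf8Tag => Utf8Interpolator();

void main() {
  // Hand-written equivalent of utf8'Hello, 世界'.
  final bytes = utf8Tag.addString('Hello, 世界').toValue();
  print(bytes.length);
  // Decoding the bytes gives back the original text.
  print(utf8.decode(bytes));
}
```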

I'd prefer if it was possible to create the UTF-8 bytes at compile-time instead, but that's probably a job for macros (@jakemac53 - expression macros which expand to something else, yay or nay?)

Dart source text is represented as a sequence of Unicode code points.

That's all the spec says, but in practice the compiler only accepts UTF-8. (Just tried with UTF-16 LE/BE with BOM, and no success, it must be UTF-8).

@ykmnkmi

ykmnkmi commented Aug 11, 2021

My idea is to use other encodings as well: json, ascii, ... , myCustomCodec.

@jakemac53
Contributor

> I'd prefer if it was possible to create the UTF-8 bytes at compile-time instead, but that's probably a job for macros (@jakemac53 - expression macros which expand to something else, yay or nay?)

Yes I think expression level macros would be well suited for this (but we haven't attempted to specify them yet)

@Levi-Lesches

There is a small discussion of something similar here. A strawman could be:

@b
static const _HTTP = 'HTTP';

// generates:
static const HTTP = [72, 84, 84, 80];

@Cat-sushi

Sorry, I've misunderstood.
Yes, source code is utf-8 but 'Hello, 世界' is compiled to utf-16.

I'm not sure what specifies the character encoding of source code, though.

@ykmnkmi

ykmnkmi commented Aug 12, 2021

@Levi-Lesches Dart has the external keyword for internal implementations:

@b('HTTP')
external List<int> get HTTP;

// generates:
const List<int> HTTP = <int>[72, 84, 84, 80];

@Cat-sushi

FYI, C# supports UTF-8 string literals.
What's new in C# 11 - C# Guide | Microsoft Learn

@lrhn
Member Author

lrhn commented Sep 15, 2024

Another option for parametrization is to allow controlling escapes.

Consider adding a member

  int addEscape(Iterable<int> charCodes);

which gets called when the parser sees a \.
(You can't intercept the escape of the current quote, like \" for a " or """ string, because the parser needs to know where the string literal ends.)

Maybe the tag can implement one of RawTag for no escapes (or interpolations, but what's the point then?), or CustomEscapeTag for intercepting escapes, so the parser knows how to treat the coming string literal.
Maybe it can also choose whether to apply to only one string literal, or all following string literals, to allow adjacent string literals to combine. But maybe that should be the default. (Otherwise a tag "string" "string" where the first tag "string" would evaluate to a valid tag, would be ambiguous.)

An easier approach would be to just recognize normal escapes or not, but allow "invalid escapes" and keep them in the strings passed to the tag, rather than remove them. Then the tag processor can interpret them as it wants.

@ghost

ghost commented Sep 19, 2024

This format (aesthetically unattractive as it is) also has an unfortunate property: it makes it difficult to control whitespace and indentation. Here's how the program will look in real life:

class A {
  String generateJson() {
    //...  
    return jsn"""
{ $name: $value,
  "other": [$v1, $v2],
  "all": [
    ${for (var i = 0; i < values.length; i++) ...[if (i > 0) ",", values[i]]}
  ]
}""";
  }
}

Any attempt to fix the formatting will lead to unwanted whitespace in the generated json.
In other words, either the source formatting will be off, or the output formatting will be off, or both.
None of these problems occur in the ~ variant discussed in a competing thread.

@lrhn
Member Author

lrhn commented Sep 19, 2024

It should give you full control over indentation and whitespace. It might not be convenient to use that control, but whitespace inside interpolations is ignored, and whitespace outside is not.

  return jsn"""
{ $name: $value,
 "other": [$v1, $v2],
 "all": [${ for (var i = 0; i < values.length; i++) 
        ...'\n    ${values[i]}${if (i < values.length - 1) ...','}'
  }]
}
""";

(where ... stringExpression emits that content into the outer template)

@ghost
Copy link

ghost commented Sep 19, 2024

The outside whitespace is a problem. If you want to fix the source formatting in my previous example, you have to write

class A {
  String generateJson() {
    //...  
    return jsn"""
      { $name: $value,
        "other": [$v1, $v2],
        "all": [
           ${for (var i = 0; i < values.length; i++) ...[if (i > 0) ",", values[i]]}
        ]
      }""";
  }
}

But then, you have extra spaces at the beginning of each output line.
String literals of the form """...""" do not play well with source formatting. That's why some languages allow a symbol like | to mark the actual starting position of each line:

"""
          |first line
          |second line
""";

which removes all whitespace before | and produces the output with no leading spaces:

first line
second line
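Dart has no such feature built in, but the margin rule described above is easy to sketch as a helper function. `stripMargin` is a hypothetical name (borrowed from the convention in other languages), not an existing Dart API.

```dart
/// Removes leading whitespace up to and including [marker] on every line
/// that starts with it (after indentation). Other lines are left unchanged.
String stripMargin(String text, {String marker = '|'}) {
  return text.split('\n').map((line) {
    final trimmed = line.trimLeft();
    return trimmed.startsWith(marker)
        ? trimmed.substring(marker.length)
        : line;
  }).join('\n');
}

void main() {
  // The newline directly after the opening quotes is not part of the string.
  final text = stripMargin('''
          |first line
          |second line
''');
  print(text);
}
```

This runs at run time on the finished string, so it cannot tell template whitespace from whitespace that came from an interpolated value; an interpolator-based version could apply the rule in `addString` only.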

But even then, it's not immediately clear how to produce the output where each value in the list is placed on a separate line, with a correct offset, like

"all": [
  1,
  2,
  3
]


7 participants