[SUGGESTION] Simpler interpolated raw string literals #302

Closed
@msadeqhe

Preface

I think I should explain in detail which options are possible for string literals.

If you require " to be written as an escape sequence in character literals (i.e. programmers have to write '\"' instead of '"'), and additionally disallow empty literals made of multiple single quotes ('...', ''...'', '''...''' and so on), then all of the following quotes are available as string literal syntaxes, without any conflict or ambiguity for either the C++2 compiler or programmers:

1)  'text'
2)  ''text''
3)  '''text'''
4)  ''''...

5)  "text"

6)  '"text"'

7)  '?"text"?'
8)  '?!"text"!?'
9)  '?!@"text"@!?'
10) '?!@#...

The ?, !, @ and # in the last four lines can be any character except ', ", \ and new-line: allowing ' would conflict with ''...'', allowing " would conflict with '"..."', and allowing \ would be ambiguous with \" in character literals. If a character is an opening bracket in the opening quote, it must be the corresponding closing bracket in the closing quote, and the order of the characters has to be reversed in the closing quote, e.g. 'x["text"]x'.

If you disallow placing an empty string literal side by side with another string literal, and additionally disallow empty literals made of at least triple double quotes ("""...""", """"..."""", """""...""""" and so on), then the following quotes are also available as string literal syntaxes, without any conflict or ambiguity for either the C++2 compiler or programmers:

11) """text"""
12) """"text""""
13) """""text"""""
14) """"""...

What is the current status of the above quotes?

  • Number (1) is already used for character literals, so I leave it alone.
  • Number (5) is already used for interpolated non-raw string literals, so I leave it alone too.
  • The other numbers, (2) to (4) and (6) to (14), are not yet taken by any language feature.

First let's consider numbers (2) to (4) (e.g. ''...'') and (11) to (14) (e.g. """..."""):

  • They cannot be empty, e.g. '''' is not an empty ''...'', but it's just an opening quote with four 's.
  • They cannot contain a single character of the same quote, e.g. ''''' doesn't contain ' inside ''...'', but it's just an opening quote with five 's.
  • They cannot contain a single character of the same quote at the beginning or ending in their content, e.g. '''text''' doesn't contain 'text' inside ''...'', but it contains text inside '''...'''.

To work around these limitations, we could allow optional whitespace around the content of string literals. Alternatively, we can explore other quotes.

That leaves number (6) (e.g. '"..."') and numbers (7) to (10) (e.g. '?"..."?'). The good news about these quotes is that their opening syntax (e.g. '"...) is different from their closing syntax (e.g. ..."'), so:

  • They can be empty, e.g. '""' is an empty '"..."'.
  • They can contain a single character of the same quote, e.g. '"""' contains " inside '"..."', and '"'"' contains ' inside '"..."'.
  • They can contain a single character of the same quote at the beginning or end of their content, e.g. '""text""' contains "text" inside '"..."', and '"'text'"' contains 'text' inside '"..."'.
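
The three properties above can be checked with a small scanner. This is only a sketch in C++17, not part of cppfront: the function name raw_content is hypothetical, and it assumes (the proposal does not say so explicitly) that a literal ends at the first occurrence of the closing sequence "' after the opening '".

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: returns the content of a literal written as '"..."',
// or std::nullopt if `src` does not start with '" or has no closing "'.
// Assumption: the literal ends at the FIRST closing sequence "' found
// after the opening sequence '".
std::optional<std::string> raw_content(std::string_view src) {
    constexpr std::string_view open = "'\"";   // opening quote: ' then "
    constexpr std::string_view close = "\"'";  // closing quote: " then '
    if (src.substr(0, open.size()) != open) return std::nullopt;
    const auto end = src.find(close, open.size());
    if (end == std::string_view::npos) return std::nullopt;
    return std::string(src.substr(open.size(), end - open.size()));
}
```

Because the opening and closing sequences differ, the scanner never confuses a lone ' or " in the content with the end of the literal, which is why '""' can be empty and '"""' can hold a single ".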

NOTE 1

Numbers (7) to (10) (e.g. '?"..."?') can have additional characters in the opening and closing quotes, other than that they are similar to number (6) (e.g. '"..."'). These additional characters are similar to R"?(...)?" in C++1, except:

  • The order of characters in the opening quote has to be reversed in the closing quote, e.g. 'abc"text"cba' contains text inside 'abc"..."cba'. Alternatively, identifiers and numbers within the closing quote may keep their order as described in this comment (recommended, as it improves readability).
  • If the character is an opening bracket at the opening quote, it should be the corresponding closing bracket at the closing quote, e.g. '[(<{"text"}>)]' contains text inside '[(<{"..."}>)]'.
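
As a rough illustration of the two rules in NOTE 1, the closing quote can be derived mechanically from the extra characters of the opening quote. The names mirror and closing_quote are hypothetical, and this sketch implements only the strict "reverse everything" rule, not the alternative where identifiers and numbers keep their order.

```cpp
#include <string>

// Maps an opening bracket to its closing counterpart;
// every other character maps to itself.
char mirror(char c) {
    switch (c) {
        case '(': return ')';
        case '[': return ']';
        case '{': return '}';
        case '<': return '>';
        default:  return c;
    }
}

// Hypothetical helper: given the extra characters that follow ' in the
// opening quote (e.g. x[ in 'x["), builds the matching closing quote
// (here "]x'): a double quote, the extras reversed and mirrored, then '.
std::string closing_quote(const std::string& extra) {
    std::string close = "\"";
    for (auto it = extra.rbegin(); it != extra.rend(); ++it) {
        close += mirror(*it);
    }
    close += '\'';
    return close;
}
```

For example, closing_quote("[(<{") yields "}>)]' and closing_quote("abc") yields "cba', matching the two examples in the list above.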

Suggestion Detail

This is not a new issue from me. I raised a similar suggestion before, but it got lost among the many replies in this issue, so I felt I should summarize my suggestion here.

I have to mention that my suggestion ...

  • doesn't introduce any new string format.
  • doesn't introduce any new keyword or new symbol.
  • doesn't introduce any new semantics.
  • introduces a new syntax for string literals.

Currently the $R prefix is used to quote interpolated raw string literals in C++2, e.g. $R"?(text)?", while the R prefix is used to quote non-interpolated raw string literals, e.g. R"?(text)?". I suggest removing the prefixes completely, e.g. '?"text"?'.

$R"?(...)?" is a powerful way to write interpolated raw string literals, but it's possible to go further and make the syntax simpler and smaller. The whole purpose of my suggestion is to transform $R"?(...)?" into '?"..."?' without any additional changes (see NOTE 1):

// := $R"(Username: (user)$)";
x0 := '"Username: (user)$"';

// := $R"x[(It's the message: "(message)$")x[";
x1 := 'x["It's the message: "(message)$""]x';

Why do I suggest this change?

$R"(...)" is a little verbose, given that most of the time we just want to disable escape sequences and be able to write single quotes ' and double quotes " inside string literals. Using '"..."' to start an interpolated raw string literal is more readable and more convenient than $R"(...)", with less typing.

I should mention that programmers are familiar with writing strings in quotes such as '"..."', but $R"(...)" goes a little further than that: they must learn why the parentheses are not part of the content, what a prefix is, and how it can be combined with Unicode prefixes.

Is there any experience, data or working implementation available?

My suggestion is a small change. It is almost $R"?(...)?" without $R prefix (see NOTE 1).

Is there any additional suggestion?

I additionally suggest unifying interpolated and non-interpolated string literals. Instead of introducing a different string literal for each of them, I suggest having a way to disable captures within a string literal. The pattern (...)$ captures a variable in string literals. It is complex enough that we don't often need to disable it, so we don't need to devote a different string literal to that.

To disable the capture pattern (expr)$, I introduce a new False Capture pattern (expr)...\$ that doesn't capture anything. We can add a back-slash before the dollar sign, so the value of "(...)\$" is (...)$. We can also add more back-slashes before the dollar sign, so the value of "(...)\\$" is (...)\$. Each back-slash we add beyond the first one appears in the value. Programmers are already familiar with escape sequences, and this is similar to an escape sequence \$. I should mention, though, that the escape sequence \\ (like other escape sequences) has no meaning inside the false capture pattern "(...)\\$", so each additional back-slash is added to the value exactly as written.

a := 0;

// The value is 0
x0 := "(a)$";

// The value is (a)$
x1 := "(a)\$";

// The value is (a)\$
x2 := "(a)\\$";

// The value is (a)\\$
x3 := "(a)\\\$";

In a nutshell, C++2 will have the following patterns in string literals:

  • Capture: (expr)$ is equal to the value of expr.
  • False Capture: (expr)...\$ is equal to the value (expr)...$. Only back-slash is allowed in place of ... after ) and before $. If you add any other character except back-slash in place of ..., then the whole pattern is violated, and it will not be a capture or false capture.
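
The two patterns can be sketched as a single left-to-right pass over the content of a string literal. This is an illustration in C++ under stated assumptions, not how cppfront implements captures: the function expand is hypothetical, and the lookup callback stands in for evaluating expr, which a real compiler would do in the enclosing scope.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: rewrites capture and false capture patterns in `text`.
// `lookup` stands in for evaluating the captured expression (an assumption).
std::string expand(const std::string& text,
                   const std::function<std::string(const std::string&)>& lookup) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != '(') { out += text[i++]; continue; }

        // Find the ')' that matches the '(' at position i.
        std::size_t j = i + 1;
        int depth = 1;
        while (j < text.size() && depth > 0) {
            if (text[j] == '(') ++depth;
            else if (text[j] == ')') --depth;
            if (depth > 0) ++j;
        }
        if (depth != 0) { out += text[i++]; continue; }  // unbalanced: plain text

        // Count the back-slashes between ')' and a possible '$'.
        std::size_t k = j + 1;
        while (k < text.size() && text[k] == '\\') ++k;
        const std::size_t slashes = k - (j + 1);
        if (k >= text.size() || text[k] != '$') {
            out += text[i++];  // neither capture nor false capture
            continue;
        }

        if (slashes == 0) {
            out += lookup(text.substr(i + 1, j - i - 1));  // capture: value of expr
        } else {
            out += text.substr(i, j - i + 1);  // false capture: keep "(expr)" verbatim
            out.append(slashes - 1, '\\');     // the first back-slash is consumed
            out += '$';
        }
        i = k + 1;
    }
    return out;
}
```

With a lookup that maps a to 0, this reproduces the x0 to x3 examples above: (a)$ becomes 0, (a)\$ becomes (a)$, and each further back-slash is carried into the value unchanged. Any other character between ) and $ breaks the pattern, so the text is copied as is.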

Finally there will be two string literals in C++2:

  • "..." for non-raw string literals. It supports escape sequences.
  • '"..."' for raw string literals, and also '?"..."?', '?!"..."!?', '?!@"..."@!?' and so on, where ?, !, @ and ... can be any character except ', ", \ and new-line. It doesn't support escape sequences; on the other hand, its content can span multiple lines (see NOTE 1).

And we can choose to capture or not to capture within the same string literal:

// := $R"((user)$ is a capture, but )" + R"((user)$ is not a capture.)";
x0 := '"(user)$ is a capture, but (user)\$ is not a capture."';

In the above example, a programmer first has to determine whether a string literal is interpolated or non-interpolated (as you see in the first line), and only then can they work out whether (user)$ is a capture or not. Using false captures (as you see in the second line) makes it obvious that (user)\$ is not a capture.

This is a regular expression example:

// := R"(^("hi"|"hey"|"hello")$)";
x1 := '"^("hi"|"hey"|"hello")\$"';

As you see in the above example, without a back-slash before the dollar sign (e.g. ("hello")$), a programmer may think it's a capture in C++2, but in fact it's a capture in the regular expression. Therefore using false captures (e.g. ("hello")\$) helps programmers easily distinguish captures in C++2 from captures in regular expressions, and it makes code that deals with regular expressions more readable.

I mean that completely disabling captures via non-interpolated string literals may lead to less readable code, because a programmer first has to determine whether a string literal is interpolated or non-interpolated, and only then can they work out how to read the content of the string literal.

In this way, C++2 has only two string literals, and we can control captures at any point within a single string literal.

Edits

  1. English is not my native language. Sorry if False Capture is not a proper name for it.
  2. I've added a regular expression example and the reason why false captures are more readable and more obvious than non-interpolated string literals.
