[SUGGESTION] Quote string literals with backticks #289

Closed
msadeqhe opened this issue Mar 22, 2023 · 4 comments

msadeqhe commented Mar 22, 2023

This suggestion is a rework of this issue. The exact syntax of my suggestion is not important; it can be anything that fits into C++2. I know I must keep my suggestion simple and obvious, so I try to minimize concepts and keep the syntax as familiar to programmers as I can. We can have the following string literals:

  • String literals
  • Interpolated string literals
  • Raw string literals
  • Raw interpolated string literals (soon)
  • A new format for string literals (maybe in the future)

In the future, C++2 may introduce more string literals as well. Does it resemble this video about C++1 initialization? Maybe a little. But one kind of string literal is enough, because it can fundamentally be the building block for producing the others. In other words, C++2 can internally join multiple raw string literals, escape sequences, captures and other language expressions to produce the string value.

I want to suggest a radical change to string literals by starting from the beginning: how to write string literals at all. Three common symbols are suitable for quoting string literals: double quote ", single quote ' and backtick `. If we look at how sentences are written in English, it is obvious that double quote " and single quote ' are used more often than backtick `; an analysis available here is interesting because double quote " is used more frequently than single quote '. Therefore backtick ` is an appropriate symbol to be the only escape character in string literals, because it is not a common punctuation mark in English or in most other languages. It was mainly designed for typewriters, as described here, which may be why markup languages such as Markdown use backtick ` to create inline code inside normal text. Note also that JavaScript uses backtick ` for template literals, and D and Go use it for raw string literals. So backtick ` should be the only character that has special behaviour in string literals.

String literals will be quoted inside backticks `, and they don't interpret escape sequences or captures until we put them inside a nested backtick `. Captures may need extra parentheses for expressions, or when escape sequences are beside them. For example:

// "text"
   `text`

// "first\nsecond\nlast"
   `first`\n`second`\n`last`

// "You bought this (object)$ yesterday."
   `You bought this `object$` yesterday.`

// "I know 2 * 2 is (2 * 2)$."
   `I know 2 * 2 is `(2 * 2)$`.`

// "Name: (user)$, Age: (age)$"
   `Name: `user$`, Age: `age$``

// "Name: (user)$\nAge: (age)$"
   `Name: `(user)$\n`Age: `age$``

// "Name: \t(name)$\nAge: \t(age)$"
   `Name: `\t(name)$\n`Age: `\t(age)$``

To write a backtick ` inside a string literal, we write double backticks ``. String literals placed side by side are concatenated, but there must be whitespace between them; otherwise they are treated as a single string literal that contains double backticks ``. For example:

// "This is a backtick `"
   `This is a backtick ```

// "User-name"
   `User-name`

// "User`-`name"
   `User``-``name`   //--> White-space is not between them.

// "User""-""name"
   `User` `-` `name` //--> White-space is between them.

// "User"    "-"    "name"
   `User`    `-`    `name`

In a nutshell, `User``-``name` is not equal to `User` `-` `name`.
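
As an illustration of the doubled-backtick rule, here is a minimal C++ sketch of how a scanner might decode plain backtick-quoted literals and concatenate whitespace-separated ones. The names ScannedLiteral, scan_backtick_literal and concat_adjacent_literals are hypothetical and not part of cppfront, and captures and escape sequences are deliberately left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical result type: decoded text plus the index just past the closing backtick.
struct ScannedLiteral {
    std::string text;
    std::size_t next;
};

// Scans one plain literal starting at src[pos]. Inside the literal, two
// consecutive backticks stand for one backtick; a lone backtick closes it.
std::optional<ScannedLiteral> scan_backtick_literal(std::string_view src, std::size_t pos) {
    if (pos >= src.size() || src[pos] != '`') return std::nullopt;
    std::string text;
    std::size_t i = pos + 1;
    while (i < src.size()) {
        if (src[i] != '`') { text += src[i]; ++i; continue; }
        if (i + 1 < src.size() && src[i + 1] == '`') { text += '`'; i += 2; continue; }
        return ScannedLiteral{std::move(text), i + 1};
    }
    return std::nullopt;  // no closing backtick found
}

// Concatenates literals that are separated by whitespace, as the proposal describes.
std::optional<std::string> concat_adjacent_literals(std::string_view src) {
    std::string result;
    std::size_t i = 0;
    bool any = false;
    for (;;) {
        while (i < src.size() && std::isspace(static_cast<unsigned char>(src[i]))) ++i;
        if (i == src.size()) break;
        auto lit = scan_backtick_literal(src, i);
        if (!lit) return std::nullopt;
        assert(lit->next > i);  // every literal consumes at least its two quotes
        result += lit->text;
        i = lit->next;
        any = true;
    }
    if (!any) return std::nullopt;
    return result;
}
```

Under these assumptions, `User``-``name` is read as one literal that contains two backticks, while `User` `-` `name` is read as three literals that concatenate to User-name.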

The goal of my suggestion is to keep it simple to teach and familiar to programmers. That's why I keep the symbol \ for escape sequences such as \n, even though I could remove or change it in my suggestion.

String Expression

As you can see, the syntax is similar to current C++2. Programmers put nested backtick expressions inside string literals, although this can be viewed a little differently, as I'll explain in the next paragraph.

Consider the string literal `Name: `(user)$\n`Age: `age$``. Let's call it a string expression: it is a sequence of the following elements, in order, and it has to both start and end with a string literal:

  • string literal `Name: `
  • capture (user)$
  • escape sequence \n
  • string literal `Age: `
  • capture age$
  • an empty string literal ``
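
The decomposition above can be sketched as a small C++ splitter. This is only an illustration under stated assumptions: Kind, Element and split_string_expression are hypothetical names, escape sequences are limited to a backslash plus one character (so \x{6e} is not handled), and a capture is either identifier$ or a balanced (expression)$.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

enum class Kind { Literal, Capture, Escape };

struct Element {
    Kind kind;
    std::string text;  // literal contents, capture expression, or raw escape
};

// Splits a string expression into its elements, or returns nullopt if the
// input does not start and end with a backtick-quoted literal.
std::optional<std::vector<Element>> split_string_expression(std::string_view src) {
    std::vector<Element> out;
    std::size_t i = 0;
    auto is_ident = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    };
    for (;;) {
        // 1. A literal: `...`, where `` inside stands for one backtick.
        if (i >= src.size() || src[i] != '`') return std::nullopt;
        ++i;
        std::string text;
        bool closed = false;
        while (i < src.size()) {
            if (src[i] != '`') { text += src[i++]; continue; }
            if (i + 1 < src.size() && src[i + 1] == '`') { text += '`'; i += 2; continue; }
            ++i;
            closed = true;
            break;
        }
        if (!closed) return std::nullopt;
        assert(i <= src.size());
        out.push_back({Kind::Literal, text});
        if (i == src.size()) return out;  // ended on a literal, as required

        // 2. The nested region: escapes and captures until the next literal opens.
        while (i < src.size() && src[i] != '`') {
            if (src[i] == '\\') {
                if (i + 1 >= src.size()) return std::nullopt;
                out.push_back({Kind::Escape, std::string(src.substr(i, 2))});
                i += 2;
            } else if (src[i] == '(') {
                std::size_t start = ++i;
                int depth = 1;
                while (i < src.size() && depth > 0) {
                    if (src[i] == '(') ++depth;
                    else if (src[i] == ')') --depth;
                    ++i;
                }
                if (depth != 0 || i >= src.size() || src[i] != '$') return std::nullopt;
                out.push_back({Kind::Capture, std::string(src.substr(start, i - 1 - start))});
                ++i;  // skip '$'
            } else if (is_ident(src[i])) {
                std::size_t start = i;
                while (i < src.size() && is_ident(src[i])) ++i;
                if (i >= src.size() || src[i] != '$') return std::nullopt;
                out.push_back({Kind::Capture, std::string(src.substr(start, i - start))});
                ++i;  // skip '$'
            } else {
                return std::nullopt;
            }
        }
    }
}
```

For the example expression, this sketch yields six elements in the same order as the list: two non-empty literals, two captures, one escape and the trailing empty literal.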

String expressions can have one of the encoding prefixes L, u8, u or U, and they can have suffixes:

// u8 is the prefix and s is the suffix
// u8"Name: (user)$\nAge: (age)$"s
   u8`Name: `(user)$\n`Age: `age$``s

But that's not enough without character literals.

A string literal is a sequence of character literals, so I also have to consider character literals. As before, character literals can have escape sequences, but the notation is c`...`. For example:

  • 'n' becomes c`n`
  • '\n' becomes c`\n`
  • '\x{6e}' becomes c`\x{6e}`
  • '' doesn't have any meaning in C++2, because character literals cannot be empty, and c`` is the backtick ` itself.

Character literals placed side by side are not concatenated. Multi-character literals must have the prefix b, which means 'ABCD' becomes b`ABCD`: because multi-character literals have a different underlying type, they should be visually different. For example:

x1 := c`A` c`B` c`C` c`D`; // ERROR!
x2 := c`A`  `B`  `C`  `D`; // ERROR!
x3 := c`ABCD`; // ERROR!
x4 := b`ABCD`; // OK.
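
A minimal C++ sketch of how the c`...` notation might be decoded, assuming a hypothetical helper decode_char_literal that supports only the escapes \n, \t, \\ and \x{..} for ASCII values. It is an illustration of the rules above, not cppfront behaviour.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Decodes one c`...` character literal, or returns nullopt if it is not a
// single character (for example c`ABCD`, which the proposal rejects).
std::optional<char> decode_char_literal(std::string_view src) {
    if (src == "c``") return '`';  // c`` is the backtick itself
    if (src.size() < 4 || src.substr(0, 2) != "c`" || src.back() != '`') return std::nullopt;
    std::string_view body = src.substr(2, src.size() - 3);
    assert(!body.empty());
    if (body.size() == 1 && body[0] != '\\' && body[0] != '`') return body[0];
    if (body == "\\n") return '\n';
    if (body == "\\t") return '\t';
    if (body == "\\\\") return '\\';
    if (body.size() > 4 && body.substr(0, 3) == "\\x{" && body.back() == '}') {
        std::string_view digits = body.substr(3, body.size() - 4);
        unsigned value = 0;
        auto [ptr, ec] = std::from_chars(digits.data(), digits.data() + digits.size(), value, 16);
        if (ec == std::errc{} && ptr == digits.data() + digits.size() && value <= 0x7F)
            return static_cast<char>(value);
    }
    return std::nullopt;  // empty, multi-character, or unsupported escape
}
```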

We could use other notations for character literals, but my recommended notation c`...` has two benefits:

  • It's not possible to have an empty character literal, c`` is simply the backtick ` itself (similar to double backticks inside string literals).
  • Backtick ` alone is enough for both string literals and character literals. If C++2 also uses underscore _ (or backtick `) instead of single quote ' as the digit separator, e.g. 1'500'444 becomes 1_500_444 (similar to Python) (or 1`500`444), then it is possible to reserve double quotes " and single quotes ' for future use, either as new operators or new literals.

Will your feature suggestion eliminate X% of security vulnerabilities of a given kind in current C++ code?

No.

Will your feature suggestion automate or eliminate X% of current C++ guidance literature?

Yes. It will do so in the following ways:

  • Unifying
    • If we only have one string literal, all questions about when to use which kind of string literal will be gone.
  • Simplicity
    • Having only one string literal is simpler and easier to teach than explaining to students the differences between string literals and why we have n of them.
  • Integration
    • In this way, interpolated string literals will be integrated into the language. This allows new features to be added without introducing a new escape character (such as \ or ()$) for each feature in string literals, because in addition to escape sequences and captures, other new expressions can be added later. The point is that all of them are available with just a single backtick `. A single backtick ` may end the string literal, may be a backtick itself (with double backticks ``), or may introduce an escape sequence (`\...`), a capture (`...$`) or a combination of them.

Will your feature suggestion remove unnecessary syntax or concepts?

Yes, although my suggestion is a little verbose. Backtick ` will be used for quoting both string literals and character literals. Also, if we use underscore _ or backtick ` as the digit separator (e.g. 1_500_444 or 1`500`444), then C++2 can use both double quotes " and single quotes ' either as new operators or new literals. For example:

x := n' * m";

Also, the escape sequences \' and \" for quotes are not needed anymore, and the escape sequence \` is not needed for backtick.

msadeqhe (Author) commented Mar 22, 2023

Here is an example of a multi-line string literal:

element := `code`;
formula := `2 + 2`;
x := `<div class="user-`element$`">
  The result of ``value`` is:
  <`element$`>
    formula = `\t(formula)$`
    value = `\t(2 +2)$`
    type  = `\t`integer
  </`element$`>
</div>`;

// This is the produced string:
/**
<div class="user-code">
  The result of `value` is:
  <code>
    formula =   2 + 2
    value =     4
    type  =     integer
  </code>
</div>
 */

gregmarr (Contributor) commented

So this code:

`<div class="user-`element$`">`

is equivalent to cpp1:

"<div class=\"user-" + element + "\">"

and this code:

`The result of ``value`` is:`

is equivalent to cpp1:

"The result of `value` is:"

So the parsing is that two consecutive backticks inside a backtick string are a literal backtick instead of stopping one string and starting another?

msadeqhe (Author) commented

Yes, that's it.

hsutter (Owner) commented Mar 23, 2023

I want to suggest a radical change to string literals ...

Thanks, I do appreciate the interest and the thoughtful ideas here.

Sorry to decline, but I'm not going to pursue this direction for now. Three things:

  • I just merged @filipsajdak's raw string literal PR, which covers some of the same use cases such as convenient HTML generation.
  • For now I'm planning to stick to the experiment of the general capture syntax (thing)$ everywhere in the language (not just string literals, but also contracts and lambdas) for the reasons given in Design Note: Capture. I get the argument for consistency among string literals, but I'm currently putting heavier weight on seeing if a consistency of all capture across the language pans out.
  • If in the future I were interested in something like this that improves string/character literals, for a proposed syntax that's as novel (to C++ers) as this one is, I'd want to know about implementations and usage experience of similar designs. In other words, is it a novel paper design that might or might not work (I note it is still evolving when comparing it to the earlier issue this was derived from)? The more unfamiliar the design/syntax, and the more it's still evolving, the more I'd want to see data like that to measure bakedness. That doesn't mean the idea is bad, not at all, it could be awesome; it just means there's more risk, and more data required.

Sorry to say no (or at least "not yet") to this, but again thank you!
