Restrict source files to ASCII, only allowing unicode via pragma. #10607

ekpyron · 2020-12-15T13:50:48Z

I just want to bring this up for debate (for a future breaking release).

I'm not entirely convinced and sure that it's feasible to actually safely implement all unicode quirks with mechanisms like #10326 (@cameel seems similarly sceptical, if I understood the comments correctly - please correct me if I'm wrong).

Based on that, I'd propose the following:
Any non-ASCII character in any source file is an error by default.
However, it's possible to add pragma source-encoding utf-8; (or something similar - only supported values would be ascii and utf-8) to allow unicode characters (after that pragma).

The advantage of this is that this pragma would be a very clear hint to any auditor that they need to look out for unicode attacks.

This wouldn't mean that we should not still go for trying to extend unicode support, e.g. like in #10326, which we definitely should do for inclusiveness reasons alone, but it would decrease the danger in all of this.

However, one can easily argue against this:

The fact that not using the pragma and restricting to ASCII might be preferred by auditors, would work against inclusiveness.
Still supporting utf-8 sources with the pragma does not diminish our need of implementing proper support.

Still, I don't think it wise to try implementing "proper" unicode support ourselves. If we really want it, I think we should fall back on an external implementation like libicu (even though this is one of the largest most annoying dependencies I have ever seen projects depending on - but it's not without reason that it is - proper unicode support is insanely complex).

EDIT: also, even though not relevant to this issue, note that if we used a complete external unicode implementation, we might also be able to safely allow unicode identifiers again, following e.g. Unicode Standard Annex 31 like C++ in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1949r5.html

The text was updated successfully, but these errors were encountered:

cameel · 2020-12-15T15:02:40Z

I think #10326 is good enough but that's mostly because after doing some research I'm convinced that these directional marks are not very widely used, especially in source code, so any negative impact should be limited. It does have some potential issues - for example it ignores PDI which I think can terminate LRO/RLO too - but we decided to keep it simple to get it into 0.8.0 on time so I'm ignoring that for now.

The pragma proposal sounds interesting. I don't know if you're aware but Python does almost exactly what you're proposing (see PEP-263). It assumes ASCII by default and to use any other encoding you have define it by inserting a magic comment at the beginning of the file:

# -*- coding: utf-8 -*-

And even if you don't use it you can still include any characters you want in strings - you just have to spell out their codes explicitly using the \u0000 and \x00 escapes.

The problem I see with a pragma is that it might just become a part of the boilerplate and be included by default by everyone. I have a different idea. How about a pair of comments that you put around just the part you want to allow weird characters in?

/// @push-encoding utf-8
/// повертає результат
/// @pop-encoding
contract C {
    function delete() public returns (string memory) {
        /// @push-encoding utf-8
        return "削除しました。";
        /// @pop-encoding
    }
}

It is a bit verbose but I think it would be fine for a language like Solidity where you're not likely to have a lot of strings and contracts are almost always public and therefore mostly documented in English anyway.

Overall I think that allowing you to use anything but only if you flag it might be a better solution than banning specific characters or validating them. Unicode is too big for that and it's a moving target. Rolling out your own Unicode support is almost like rolling out your own crypto :)

axic · 2020-12-15T15:09:47Z

I think we are slowly moving towards this direction, and perhaps even discussed in on some calls. Having the explicit unicode"" literal and rejecting anything unicode (except in comments) was a similar, but less harsh change.

ekpyron · 2020-12-15T15:31:26Z

Rolling out your own Unicode support is almost like rolling out your own crypto :)

Yep, that's pretty much my impression about it, too, that's why I thought a more general discussion about it might be beneficial :-).

axic added breaking change ⚠️ language design Any changes to the language, e.g. new features labels Dec 15, 2020

cameel mentioned this issue Dec 15, 2020

Scanner: Generate error on inbalanced RLO/LRO/PDF override markers. #10326

Merged

5 tasks

cameel mentioned this issue Feb 17, 2021

Re-evaluate unicode tricks in comments #10254

Closed

cburgdorf mentioned this issue Aug 3, 2021

Properly validate static strings ethereum/fe#329

Closed

cameel added low effort There is not much implementation work to be done. The task is very easy or tiny. medium impact Default level of impact labels Sep 14, 2022

ekpyron added this to the 0.9.0 milestone Feb 6, 2023

cameel mentioned this issue Feb 6, 2023

Disallow isolate Unicode characters in comments and strings #13936

Open

ekpyron removed this from the 0.9.0 milestone Jan 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restrict source files to ASCII, only allowing unicode via pragma. #10607

Restrict source files to ASCII, only allowing unicode via pragma. #10607

ekpyron commented Dec 15, 2020 •

edited

Loading

cameel commented Dec 15, 2020 •

edited

Loading

axic commented Dec 15, 2020

ekpyron commented Dec 15, 2020

Restrict source files to ASCII, only allowing unicode via pragma. #10607

Restrict source files to ASCII, only allowing unicode via pragma. #10607

Comments

ekpyron commented Dec 15, 2020 • edited Loading

cameel commented Dec 15, 2020 • edited Loading

axic commented Dec 15, 2020

ekpyron commented Dec 15, 2020

ekpyron commented Dec 15, 2020 •

edited

Loading

cameel commented Dec 15, 2020 •

edited

Loading