Skip to content

Restrict source files to ASCII, only allowing unicode via pragma. #10607

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ekpyron opened this issue Dec 15, 2020 · 3 comments
Open

Restrict source files to ASCII, only allowing unicode via pragma. #10607

ekpyron opened this issue Dec 15, 2020 · 3 comments
Labels
breaking change ⚠️ language design :rage4: Any changes to the language, e.g. new features low effort There is not much implementation work to be done. The task is very easy or tiny. medium impact Default level of impact

Comments

@ekpyron
Copy link
Member

ekpyron commented Dec 15, 2020

I just want to bring this up for debate (for a future breaking release).

I'm not entirely convinced and sure that it's feasible to actually safely implement all unicode quirks with mechanisms like #10326 (@cameel seems similarly sceptical, if I understood the comments correctly - please correct me if I'm wrong).

Based on that, I'd propose the following:
Any non-ASCII character in any source file is an error by default.
However, it's possible to add pragma source-encoding utf-8; (or something similar - only supported values would be ascii and utf-8) to allow unicode characters (after that pragma).

The advantage of this is that this pragma would be a very clear hint to any auditor that they need to look out for unicode attacks.

This wouldn't mean that we should not still go for trying to extend unicode support, e.g. like in #10326, which we definitely should do for inclusiveness reasons alone, but it would decrease the danger in all of this.

However, one can easily argue against this:

  • The fact that not using the pragma and restricting to ASCII might be preferred by auditors, would work against inclusiveness.
  • Still supporting utf-8 sources with the pragma does not diminish our need of implementing proper support.

Still, I don't think it wise to try implementing "proper" unicode support ourselves. If we really want it, I think we should fall back on an external implementation like libicu (even though this is one of the largest most annoying dependencies I have ever seen projects depending on - but it's not without reason that it is - proper unicode support is insanely complex).

EDIT: also, even though not relevant to this issue, note that if we used a complete external unicode implementation, we might also be able to safely allow unicode identifiers again, following e.g. Unicode Standard Annex 31 like C++ in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1949r5.html

@cameel
Copy link
Member

cameel commented Dec 15, 2020

I think #10326 is good enough but that's mostly because after doing some research I'm convinced that these directional marks are not very widely used, especially in source code, so any negative impact should be limited. It does have some potential issues - for example it ignores PDI which I think can terminate LRO/RLO too - but we decided to keep it simple to get it into 0.8.0 on time so I'm ignoring that for now.

The pragma proposal sounds interesting. I don't know if you're aware but Python does almost exactly what you're proposing (see PEP-263). It assumes ASCII by default and to use any other encoding you have define it by inserting a magic comment at the beginning of the file:

# -*- coding: utf-8 -*-

And even if you don't use it you can still include any characters you want in strings - you just have to spell out their codes explicitly using the \u0000 and \x00 escapes.

The problem I see with a pragma is that it might just become a part of the boilerplate and be included by default by everyone. I have a different idea. How about a pair of comments that you put around just the part you want to allow weird characters in?

/// @push-encoding utf-8
/// повертає результат
/// @pop-encoding
contract C {
    function delete() public returns (string memory) {
        /// @push-encoding utf-8
        return "削除しました。";
        /// @pop-encoding
    }
}

It is a bit verbose but I think it would be fine for a language like Solidity where you're not likely to have a lot of strings and contracts are almost always public and therefore mostly documented in English anyway.

Overall I think that allowing you to use anything but only if you flag it might be a better solution than banning specific characters or validating them. Unicode is too big for that and it's a moving target. Rolling out your own Unicode support is almost like rolling out your own crypto :)

@axic
Copy link
Member

axic commented Dec 15, 2020

I think we are slowly moving towards this direction, and perhaps even discussed in on some calls. Having the explicit unicode"" literal and rejecting anything unicode (except in comments) was a similar, but less harsh change.

@axic axic added breaking change ⚠️ language design :rage4: Any changes to the language, e.g. new features labels Dec 15, 2020
@ekpyron
Copy link
Member Author

ekpyron commented Dec 15, 2020

Rolling out your own Unicode support is almost like rolling out your own crypto :)

Yep, that's pretty much my impression about it, too, that's why I thought a more general discussion about it might be beneficial :-).

@cameel cameel added low effort There is not much implementation work to be done. The task is very easy or tiny. medium impact Default level of impact labels Sep 14, 2022
@ekpyron ekpyron added this to the 0.9.0 milestone Feb 6, 2023
@ekpyron ekpyron removed this from the 0.9.0 milestone Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change ⚠️ language design :rage4: Any changes to the language, e.g. new features low effort There is not much implementation work to be done. The task is very easy or tiny. medium impact Default level of impact
Projects
None yet
Development

No branches or pull requests

3 participants