Restrict source files to ASCII, only allowing unicode via pragma. #10607

Comments
I think #10326 is good enough but that's mostly because after doing some research I'm convinced that these directional marks are not very widely used, especially in source code, so any negative impact should be limited. It does have some potential issues - for example it ignores PDI which I think can terminate LRO/RLO too - but we decided to keep it simple to get it into 0.8.0 on time so I'm ignoring that for now.

The pragma proposal sounds interesting. I don't know if you're aware but Python 2 does almost exactly what you're proposing (see PEP-263; Python 3 later switched the default to UTF-8). It assumes ASCII by default and to use any other encoding you have to define it by inserting a magic comment at the beginning of the file:

    # -*- coding: utf-8 -*-

And even if you don't use it you can still include any characters you want in strings - you just have to spell out their codes explicitly using the

The problem I see with a pragma is that it might just become a part of the boilerplate and be included by default by everyone. I have a different idea. How about a pair of comments that you put around just the part you want to allow weird characters in?

    /// @push-encoding utf-8
    /// повертає результат
    /// @pop-encoding
    contract C {
        function delete() public returns (string memory) {
            /// @push-encoding utf-8
            return "削除しました。";
            /// @pop-encoding
        }
    }

It is a bit verbose but I think it would be fine for a language like Solidity where you're not likely to have a lot of strings and contracts are almost always public and therefore mostly documented in English anyway.

Overall I think that allowing you to use anything, but only if you flag it, might be a better solution than banning specific characters or validating them. Unicode is too big for that and it's a moving target. Rolling your own Unicode support is almost like rolling your own crypto :)
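To make the push/pop idea a bit more concrete, here is a rough sketch in Python of how a checker for it could work. Everything in it is an assumption for illustration: the `@push-encoding` / `@pop-encoding` markers are only a proposal in this thread, and the function name `check_encoding_regions` and its line-based matching are made up here, not how the compiler does or would do it.

```python
import re

# Hypothetical marker comments from the proposal above (not a Solidity feature).
PUSH_RE = re.compile(r"^\s*///\s*@push-encoding\s+(\S+)\s*$")
POP_RE = re.compile(r"^\s*///\s*@pop-encoding\s*$")


def check_encoding_regions(source):
    """Return a list of (line_number, message) problems.

    A line may contain non-ASCII characters only while the innermost
    pushed encoding is utf-8. Marker lines themselves are not checked.
    """
    stack = []     # (encoding, line number of the push), innermost last
    problems = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        push = PUSH_RE.match(line)
        if push:
            stack.append((push.group(1).lower(), line_no))
            continue
        if POP_RE.match(line):
            if stack:
                stack.pop()
            else:
                problems.append((line_no, "@pop-encoding without matching @push-encoding"))
            continue
        inside_utf8 = bool(stack) and stack[-1][0] == "utf-8"
        if not line.isascii() and not inside_utf8:
            problems.append((line_no, "non-ASCII character outside a utf-8 region"))
    # Anything still on the stack was never closed.
    for _encoding, opened_at in stack:
        problems.append((opened_at, "unclosed @push-encoding"))
    return problems
```

Run on the contract above this reports nothing, while the same string literal without the surrounding markers gets flagged on its line. A real implementation would have to live in the scanner rather than work line by line, so treat this only as a statement of the intended rule.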
I think we are slowly moving in this direction, and perhaps even discussed it on some calls. Having the explicit

Yep, that's pretty much my impression about it, too, that's why I thought a more general discussion about it might be beneficial :-).

I just want to bring this up for debate (for a future breaking release).
I'm not entirely convinced that it's feasible to safely implement all unicode quirks with mechanisms like #10326 (@cameel seems similarly sceptical, if I understood the comments correctly - please correct me if I'm wrong).
Based on that, I'd propose the following:
Any non-ASCII character in any source file is an error by default.
However, it's possible to add

    pragma source-encoding utf-8;

(or something similar - the only supported values would be ascii and utf-8) to allow unicode characters (after that pragma).

The advantage of this is that this pragma would be a very clear hint to any auditor that they need to look out for unicode attacks.
This doesn't mean we shouldn't still try to extend unicode support, e.g. as in #10326, which we definitely should do for inclusiveness reasons alone, but it would reduce the danger in all of this.
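As a rough illustration of the rule proposed here, the following Python sketch rejects non-ASCII characters until a `pragma source-encoding utf-8;` line has been seen. The pragma name and its two values come from the proposal; the function name `find_disallowed_characters`, the line-based matching, and the choice that the pragma only affects the lines after it are assumptions made for this example.

```python
import re
from typing import List, Tuple

# Hypothetical pragma from the proposal; it does not exist in Solidity.
PRAGMA_RE = re.compile(r"^\s*pragma\s+source-encoding\s+(ascii|utf-8)\s*;")


def find_disallowed_characters(source: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, character) for every non-ASCII character that
    appears while the active source encoding is still ASCII.

    Lines and columns are 1-based; columns count code points.
    """
    allow_unicode = False  # ASCII-only is the default
    errors = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if not allow_unicode:
            for column, char in enumerate(line, start=1):
                if ord(char) > 0x7F:
                    errors.append((line_no, column, char))
        match = PRAGMA_RE.match(line)
        if match:
            # Assumption: the pragma takes effect from the next line onward.
            allow_unicode = match.group(1) == "utf-8"
    return errors
```

An auditor-facing tool could use the same scan in reverse: instead of reporting errors, list every file in which the pragma switches the encoding to utf-8, since those are the files that need a closer look for unicode attacks.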
However, one can easily argue against this:
Still, I don't think it wise to try implementing "proper" unicode support ourselves. If we really want it, I think we should fall back on an external implementation like libicu (even though this is one of the largest, most annoying dependencies I have ever seen projects depend on - but there is a reason for that: proper unicode support is insanely complex).
EDIT: also, even though not relevant to this issue, note that if we used a complete external unicode implementation, we might also be able to safely allow unicode identifiers again, following e.g. Unicode Standard Annex 31 like C++ in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1949r5.html