Closed
Description
The byte sequences "0x27 0x0a 0x27", "0x27 0x0d 0x27", and "0x27 0x27 0x27" (newline, carriage return, and single-quote, respectively, sandwiched between single quotes) are accepted as character literals. The first two are, as far as I can tell, allowed per the manual's description of the language, but would not feature in any sane language; I assume this is merely an oversight. The third is rejected by the grammar described in the manual but accepted by the compiler. Presumably the manual is the authority and the compiler is wrong to accept '''.
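For reference, a minimal Rust sketch of the three byte sequences above, built from the escaped spellings of each character. It assumes a present-day compiler, which to my knowledge rejects all three unescaped forms; the helper name `quoted_bytes` is made up for illustration.

```rust
/// Hypothetical helper: the UTF-8 bytes of `c` wrapped in single quotes
/// (0x27), i.e. the raw source bytes of the unescaped literal.
fn quoted_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    let mut out = vec![0x27];
    out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    out.push(0x27);
    out
}

fn main() {
    // Escaped spellings of the three characters from the report.
    for c in ['\n', '\r', '\''] {
        // Each is a single code point that encodes to one byte in UTF-8.
        println!("{:?} ({} byte) -> {:02x?}", c, c.len_utf8(), quoted_bytes(c));
    }
}
```

Writing the literals as `'\n'`, `'\r'`, and `'\''` keeps the source unambiguous whatever the compiler does with the raw forms.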
Activity
Kimundi commented on Jul 22, 2013
"Presumably the manual is the authority and the compiler is wrong to accept" - HA! (Hint: Don't trust either)
In all seriousness... I don't think any of those examples can be seen as illegal. Confusing yes, but not wrong: You write a ', followed by one unicode codepoint, either directly embedded as utf8 or as escaped string, and close with another '.
Because it has to be exactly one codepoint, the parser has no problem with it being the same character used to delimit it, and because ASCII is a subset of utf8, both \n and \r are valid character values.
sp3d commented on Jul 22, 2013
The ''' is a less clear case; I'm fine with whichever behavior as long as we make docs and compiler agree.
On the other hand, anyone using a literal newline like that deserves to be shot. Various transports used in the real world for code don't preserve line-endings: ftp, web services like pastebins, IRC, and anything using a "text" rather than "binary" mode in its I/O libs will corrupt literals of the 0x27 0x0a 0x27 or 0x27 0x0d 0x27 format, in some cases resulting in code that won't compile anymore (if it turns the sequence into 0x27 0x0d 0x0a 0x27), but in other cases silently changing semantics by converting the 0x0d into 0x0a. Literals of that form also result in indentation violations, so naïve auto-indent will also break said literals. Therefore, even though they are technically legal at present, it seems insane to leave it that way. Languages like C, Java, etc. similarly disallow unescaped 0x0a and 0x0d char literals.
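To make the corruption scenario concrete, here is a rough Rust sketch of two line-ending rewrites that a "text mode" transport might apply. Both functions (`lf_to_crlf`, `cr_to_lf`) are hypothetical stand-ins written for this example, not the behavior of any particular tool.

```rust
/// Simulates a transfer that rewrites every bare LF as CR LF.
fn lf_to_crlf(src: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    let mut prev = 0u8;
    for &b in src {
        // Only add a CR when the LF is not already preceded by one.
        if b == b'\n' && prev != b'\r' {
            out.push(b'\r');
        }
        out.push(b);
        prev = b;
    }
    out
}

/// Simulates a transfer that normalizes CR and CR LF line endings to LF.
fn cr_to_lf(src: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        if src[i] == b'\r' {
            out.push(b'\n');
            // Treat CR LF as one line ending and skip the LF.
            if src.get(i + 1) == Some(&b'\n') {
                i += 1;
            }
        } else {
            out.push(src[i]);
        }
        i += 1;
    }
    out
}

fn main() {
    let newline_literal: [u8; 3] = [0x27, 0x0a, 0x27];
    let cr_literal: [u8; 3] = [0x27, 0x0d, 0x27];
    // Three bytes become four: two code points between the quotes.
    println!("{:02x?}", lf_to_crlf(&newline_literal));
    // Still three bytes, but the literal now denotes a different character.
    println!("{:02x?}", cr_to_lf(&cr_literal));
}
```

The first rewrite yields a literal that no longer holds exactly one codepoint; the second yields one that still compiles but means something else.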
bstrie commented on Jul 22, 2013
Nominating for Well-Defined.
Kimundi commented on Jul 23, 2013
One thing first: All the things you talked about are also true for string literals, so we need to think about them too.
So, you're right, no one should actually do this, but I don't see that as a reason to only forbid those two. Rust source is utf8, so you will have those problems with other byte sequences too.
If you are in a situation where \n and \r cause trouble with external tools: Well, don't use them in your source or change the tools.
But even if it's better to forbid them, it seems arbitrary to only exclude those two codepoints in a literal. What about the other ascii ctrl characters? All the other utf8 sequences that might trip up external tools? A rule like "All non-printable codepoints in the ascii range need to be annotated in escaped form" would at least be better in that case.
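A rule of that shape could be sketched in Rust roughly as follows. This is only an illustration: it assumes "non-printable codepoints in the ascii range" means what `char::is_ascii_control` tests (U+0000 through U+001F plus U+007F), and the names `must_be_escaped` and `escaped_form` are invented here.

```rust
/// Sketch of the suggested rule: an ASCII control character may not
/// appear unescaped in a literal.
/// Assumption: "non-printable" is read as `char::is_ascii_control`.
fn must_be_escaped(c: char) -> bool {
    c.is_ascii_control()
}

/// One possible escaped spelling, taken from the standard library's
/// `char::escape_default`.
fn escaped_form(c: char) -> String {
    c.escape_default().collect()
}

fn main() {
    for c in ['\n', '\r', '\t', '\u{1}', '\u{7f}', 'a'] {
        if must_be_escaped(c) {
            println!("U+{:04X} must be written as {}", c as u32, escaped_form(c));
        } else {
            println!("U+{:04X} may appear unescaped", c as u32);
        }
    }
}
```

This covers only the ASCII range, so it leaves the question about other utf8 sequences open.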
bstrie commented on Jul 23, 2013
If there's precedent in Java and C disallowing certain character literals, then that's a reasonable argument for us to disallow them as well. But the only reason I say this is because we can cite precedent, because it does seem somewhat arbitrary.
sp3d commented on Jul 24, 2013
@Kimundi: I agree that we should probably give string literals some related scrutiny. I believe the primary reason they are forbidden in other languages is that character literals are only allowed to span a single line, and these characters are those which terminate lines.
@bstrie: None of C, C++, or Java allows unescaped \r or \n in character literals (in C and C++ the interpretation of what constitutes newlines is up to compilers to an extent but gcc and clang behave as described):
From §2.14.3 of the latest C++ draft:
"character-literal:
’ c-char-sequence ’
u’ c-char-sequence ’
U’ c-char-sequence ’
L’ c-char-sequence ’
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except the single-quote ’, backslash \, or new-line character
escape-sequence
universal-character-name"
And in the Java SE 7 language spec, §3.10.4:
"CharacterLiteral:
' SingleCharacter '
' EscapeSequence '
SingleCharacter:
InputCharacter but not ' or \"
where (§ 3.4)
"InputCharacter:
UnicodeInputCharacter but not CR or LF"
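As a rough Rust sketch, the two grammar rules for an unescaped character can be read as predicates like the following. The function names are invented for this example, and it models only the single-character productions (c-char without escapes, and SingleCharacter), not the full grammars.

```rust
/// C++ `c-char`, unescaped case: any source character except the
/// single quote, the backslash, and the new-line character.
/// Note: whether a bare CR counts as a new-line is left to the
/// implementation; this sketch treats only LF as new-line.
fn is_cpp_c_char(c: char) -> bool {
    !matches!(c, '\'' | '\\' | '\n')
}

/// Java `SingleCharacter`: an InputCharacter (so neither CR nor LF)
/// that is also not a single quote or a backslash.
fn is_java_single_character(c: char) -> bool {
    !matches!(c, '\'' | '\\' | '\r' | '\n')
}

fn main() {
    for c in ['a', '\'', '\\', '\n', '\r'] {
        println!(
            "{:?}: c-char = {}, SingleCharacter = {}",
            c,
            is_cpp_c_char(c),
            is_java_single_character(c)
        );
    }
}
```

Under either rule, the single quote and the line terminators it names have to be written as escape sequences.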
catamorphism commented on Sep 12, 2013
Accepted for well-defined
pnkfelix commented on Sep 12, 2013
cc me
Disallow char literals which should be escaped
auto merge of #9335 : alexcrichton/rust/issue-7945, r=thestinger
Auto merge of rust-lang#7945 - Serial-ATA:issue-7934, r=flip1995