spec: clarify tokenization of literals #28253
For what it's worth, this isn't limited to octal. I find this equally confusing:
If `abcd1234` is one token, it's quite surprising for `1234abcd` to be two tokens. I agree that `012389` being two tokens is more surprising, but only a tiny bit more. (If it ever makes a difference - that is, if we have a valid program one way but not another - that would be seriously problematic. Hopefully it does not, in which case this probably doesn't matter much.)
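A minimal sketch using the standard go/scanner package can make the split visible (the helper `printTokens`, its toy inputs, and the expected output comments are illustrative assumptions about current scanner behavior, not part of the report):

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// printTokens prints every token the standard scanner produces for src,
// skipping the semicolons the scanner inserts automatically at line ends.
func printTokens(src string) {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil /* no error handler */, 0)
	fmt.Printf("%q:", src)
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.SEMICOLON && lit == "\n" {
			continue // automatically inserted semicolon
		}
		fmt.Printf(" %s(%q)", tok, lit)
	}
	fmt.Println()
}

func main() {
	printTokens("abcd1234") // expected: one identifier
	printTokens("1234abcd") // expected: an integer literal followed by an identifier
}
```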
To address this issue generally we could modify the primary tokenization rule. Currently we have:
If we change this to something like:
(When tokenizing char and string literals, we may want to look for the closing quote before deciding if the literal's interior is valid; we don't want to stop in the middle because of an invalid escape sequence. If we want to be absolutely precise in the spec, the interior syntax could be separated from the general char or string literal syntax.) This would eliminate the need for any special cases for octals, hexadecimal floats, incomplete exponents, etc. in the spec. On the implementation side it would eliminate the need for any backtracking; a simple and consistent 1-char look-ahead would be sufficient. This would simplify lexing, and perhaps even speed it up a tiny bit (though this is unlikely to have any impact on compiler speed).
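As a rough sketch of what such a prefix rule would mean for number scanning (hypothetical code, not the actual scanner; hex and octal validation are left out for brevity): the scanner commits after each character using only the next character, and when the consumed prefix is not a complete literal it reports an error right there instead of backing up.

```go
package main

import "fmt"

// scanNumber is a hypothetical sketch of prefix-based number scanning:
// it consumes the longest prefix of src[i:] that can still belong to a
// decimal number literal, looking at most one character ahead, and it
// reports an error in place instead of backtracking when the consumed
// prefix is not a complete literal.
func scanNumber(src string, i int) (lit string, next int, err error) {
	start := i
	isDec := func(c byte) bool { return '0' <= c && c <= '9' }
	digits := func() {
		for i < len(src) && isDec(src[i]) {
			i++
		}
	}

	digits() // integer part
	if i < len(src) && src[i] == '.' {
		i++
		digits() // fractional part
	}
	if i < len(src) && (src[i] == 'e' || src[i] == 'E') {
		i++
		if i < len(src) && (src[i] == '+' || src[i] == '-') {
			i++
		}
		if i == len(src) || !isDec(src[i]) {
			// The consumed prefix (e.g. "1.2e-") is not a valid token.
			// Under the prefix rule we stop and report that here rather
			// than backing up to re-tokenize it as "1.2", "e", "-".
			return src[start:i], i, fmt.Errorf("exponent has no digits")
		}
		digits() // exponent digits
	}
	return src[start:i], i, nil
}

func main() {
	for _, s := range []string{"1.2e-f", "1.25e+3", "012345678"} {
		lit, _, err := scanNumber(s, 0)
		fmt.Printf("%-12q -> literal %-10q error: %v\n", s, lit, err)
	}
}
```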
This behavior is essentially what we are doing now anyway (though gccgo tokenizes …). There is one place where we currently (Go 1.12) use 2-char lookahead, and that is for …; however, your (@rsc) example above …. Alternatively, instead of changing the tokenization rule, it might be better to make the suggested change an implementation restriction. After all, the existing tokenization rule describes correct programs.
Change https://golang.org/cl/161417 mentions this issue:
Based on feedback on https://golang.org/cl/161417 and the latest scanner implementations for Go 2 number literals, I am going to close this issue. First, I agree that the spec describes correct programs and that we shouldn't expand it to describe compiler behavior in the presence of incorrect programs. Second, the scanners for the Go 2 number literals now accept incorrect literals liberally and in return can provide more informative error messages than if they stuck literally to the basic tokenization rule, or even to the relaxed (suggested) prefix tokenization rule. Finally, all the std library scanners now behave essentially the same when it comes to Go 2 number scanning, as they all use the same code outline.
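The "accept liberally, then report a precise error" idea can be sketched as follows (a hypothetical illustration, not the actual std library code; `scanOctal` and the error wording are assumptions): consume every decimal digit first and only afterwards complain about digits that are not valid in an octal literal, so the bad literal stays one token with one informative error.

```go
package main

import "fmt"

// scanOctal is a hypothetical sketch of "liberal" octal scanning:
// it consumes all decimal digits after the leading 0 and only then
// validates them, so a bad digit produces one informative error for
// one literal instead of the literal being split into two tokens.
func scanOctal(src string) (lit string, err error) {
	i := 0
	invalid := -1 // index of first digit >= 8, if any
	for i < len(src) && '0' <= src[i] && src[i] <= '9' {
		if src[i] >= '8' && invalid < 0 {
			invalid = i
		}
		i++
	}
	lit = src[:i]
	if invalid >= 0 {
		err = fmt.Errorf("invalid digit %q in octal literal", src[invalid])
	}
	return lit, err
}

func main() {
	for _, s := range []string{"012345678", "01234567"} {
		lit, err := scanOctal(s)
		fmt.Printf("%-12q -> literal %q, error: %v\n", s, lit, err)
	}
}
```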
Per the spec section on Tokens (https://golang.org/ref/spec#Tokens):

> While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.
For instance, the source
should be tokenized into the integer literal `0` and the identifier `y` according to this rule. All the compilers agree with this, as can be deduced from the error messages for this program. Here are the errors reported by cmd/compile, gccgo, and gotype:
However, the rule is not strictly followed for the source
cmd/compile and gotype report:
Only gccgo produces an error consistent with the previous example and the spec:
cmd/compile and gotype both assume that `0x` is the beginning of a hexadecimal number and both report an error when that number doesn't materialize; yet the longest sequences of characters that form valid tokens here are (as before) `0` and `x`.

Finally, for the source
all compilers deviate from the spec rule:
as all assume this to be an octal constant. Yet, per the spec, this should be tokenized into two integer literals `01234567` and `89`.

The implementation problem here is that we don't know if a sequence `012345678` is simply the octal literal `01234567` followed by the integer literal `8`, or whether it turns out to be a valid floating-point constant `0123456789.0` had we kept reading. To make the right decision, a tokenizer must keep reading until there's a definitive answer. If the "longest-possible" tokenization fails, per the spec, the correct answer would require the tokenizer to go back, which may not be easily possible (say, if the implementation reads the source via an io.Reader). Worse, because there's virtually no size limit for octal literals, if backtracking is not an option, arbitrarily long look-ahead would be needed.

A similar problem arises for floating-point numbers. For instance,
`1.2e-f` should be tokenized as `1.2`, `e`, `-`, `f`, but cmd/compile and gotype complain about an invalid floating-point number; only gccgo appears to be doing the right thing. The problem is not as bad here because the required look-ahead is limited (3 characters at most).

Octal constants pose unique tokenization requirements, non-existent for other tokens, if we want to strictly stick to the tokenization rule provided by the spec.
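To make that look-ahead cost concrete, here is a naive, hypothetical rendering of the spec's longest-match rule for these inputs (`validNumber` and `longestToken` are illustrative helpers, not anything in the std library; hex literals and exponents are omitted): it must examine ever longer prefixes before it can commit to a token, which only works if the whole remaining number is buffered.

```go
package main

import "fmt"

// validNumber reports whether s is, on its own, a valid integer or
// floating-point literal in this simplified grammar: a decimal or octal
// integer, or digits followed by "." and optional fraction digits.
func validNumber(s string) bool {
	i := 0
	for i < len(s) && '0' <= s[i] && s[i] <= '9' {
		i++
	}
	if i == 0 {
		return false // must start with a digit in this sketch
	}
	if i == len(s) {
		// plain integer: with a leading 0 it must be a valid octal literal
		if s[0] == '0' {
			for j := 1; j < i; j++ {
				if s[j] == '8' || s[j] == '9' {
					return false
				}
			}
		}
		return true
	}
	// otherwise require a fraction: "." followed by zero or more digits
	if s[i] != '.' {
		return false
	}
	for i++; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return false
		}
	}
	return true
}

// longestToken applies the longest-match rule naively: it tries every
// prefix and keeps the longest valid one. For "012345678..." it cannot
// commit until it has seen the rest of the number, which is why a
// streaming scanner would need backtracking or unbounded look-ahead.
func longestToken(src string) string {
	best := ""
	for n := 1; n <= len(src); n++ {
		if validNumber(src[:n]) {
			best = src[:n]
		}
	}
	return best
}

func main() {
	fmt.Println(longestToken("012345678"))    // "01234567": the 8 cannot extend the octal literal
	fmt.Println(longestToken("0123456789.0")) // "0123456789.0": the same digits now start a float
}
```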
The simplest "fix" is to adjust the spec such that it permits implementations to deviate from the "longest sequence" requirement for numeric literals, and perhaps for octal literals only (the latter is what gccgo appears to be doing).
cc: @ianlancetaylor @mdempsky for commentary, if any.