Basic validation for character literals #184
Agree with both points. Current SyntaxError was not really designed, it was the simplest thing that worked.
There definitely is a way: you can add arbitrary stuff to
We do have a dedicated
So, I think we need to:
Good catch! Looks like we drop thread panics on the floor here: https://github.com/rust-analyzer/rust-analyzer/blob/cca5f862de8a4eb4a8990fdca95a4a7686937789/crates/ra_lsp_server/src/main_loop/mod.rs#L411-L436
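To illustrate what "dropping a panic on the floor" means in general, here is a minimal sketch using only the standard library. It is not the rust-analyzer main loop; `run_worker` is a hypothetical helper invented for this example. The point is that a panic in a spawned thread only becomes visible if someone inspects the result of `JoinHandle::join`.

```rust
use std::thread;

/// Runs `job` on a separate thread and reports a panic as an `Err`
/// instead of silently discarding it.
/// NOTE: `run_worker` is a hypothetical helper, not rust-analyzer code.
fn run_worker<F>(job: F) -> Result<(), String>
where
    F: FnOnce() + Send + 'static,
{
    let handle = thread::spawn(job);
    // `join` returns `Err(payload)` if the thread panicked.
    // Ignoring this result is how a panic gets lost.
    handle.join().map_err(|payload| {
        // Panic payloads are usually a `&'static str` or a `String`.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "worker panicked with a non-string payload".to_string()
        }
    })
}
```

A caller could log the returned error or forward it to the client, so that a crashed worker does not simply look like "no syntax errors".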
@matklad thanks for your feedback! I managed to validate empty and unclosed character literals. Next in line is checking the length of the character literals, which can get quite tricky because of the large number of ways to escape characters (see the reference for details). I think I will leave that one out for this PR. By the way, I am using indexing based on bytes in some places and am not sure whether this is a good idea. I don't know that much about Unicode, but in theory it would be possible to have a single So the next steps seem to be:
Any comments welcome, especially regarding the code itself and the last two questions.
👍
I think it makes sense to start with unit tests. Can we extract the bulk of char validation into a function which operates on
crates/ra_syntax/src/ast/mod.rs
Outdated
// A char always starts with an opening `'`
let text = &self.syntax().leaf_text().unwrap()[1..];

let has_closing_quote = text.len() > 0 && text.as_bytes()[text.len() - 1] == b'\'';
I think you can use `.ends_with` here
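A small sketch of that suggestion, assuming `text` is the literal without its opening quote as in the snippet above. The free function `has_closing_quote` is introduced here for illustration only.

```rust
/// Returns true if `text` (the literal without its opening quote)
/// ends with a `'`. Like the byte-based check it replaces, this does
/// not account for escapes such as `\'`.
fn has_closing_quote(text: &str) -> bool {
    // `ends_with` already handles the empty string, so no separate
    // length check or byte indexing is needed.
    text.ends_with('\'')
}
```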
crates/ra_syntax/src/ast/mod.rs
Outdated
// The text between the literal's opening and closing `'`
pub fn text(&self) -> &str {
// A char always starts with an opening `'`
let text = &self.syntax().leaf_text().unwrap()[1..];
Let's call `self.syntax().leaf_text().unwrap()` `fn text` (as we do for other nodes), and rename this method to something else. `fn value` might be a good option?
crates/ra_syntax/src/ast/mod.rs
Outdated
impl<'a> Char<'a> {
// The text between the literal's opening and closing `'`
pub fn text(&self) -> &str {
// A char always starts with an opening `'`
Might be a good idea to change comment to
assert!(text.starts_with('\''), "A char always starts with an opening `'`")
crates/ra_syntax/src/ast/mod.rs
Outdated
(_, '\\') => text,
(_, _) => remove_closing_quote(text),
}
}
The logic here is pretty tricky, I think it might be a good idea to write tests for this.
crates/ra_syntax/src/ast/mod.rs
Outdated
remove_closing_quote(text)
}
_ if has_closing_quote => {
let mut last_chars = text.chars().skip(text.len() - 3);
I think this might break for non-ASCII chars. Perhaps you can use `text.chars().rev()`? There should be a reverse char iter somewhere...
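The concern is that `text.len()` counts bytes while `chars().skip(n)` counts characters, so the two disagree as soon as the literal contains a multi-byte character. A minimal sketch of the reverse-iterator alternative follows; `last_two_chars` is a hypothetical helper, not code from the PR.

```rust
/// Returns the last two characters of `text` in reading order, if present.
/// Walking from the end avoids mixing byte lengths with char counts.
fn last_two_chars(text: &str) -> (Option<char>, Option<char>) {
    let mut rev = text.chars().rev();
    let last = rev.next();
    let second_last = rev.next();
    (second_last, last)
}
```

For example, `"日本'"` is 7 bytes but only 3 characters, so `text.chars().skip(text.len() - 3)` skips 4 characters and yields nothing, whereas iterating in reverse still finds the closing quote.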
crates/ra_syntax/src/ast/mod.rs
Outdated
let text = &self.syntax().leaf_text().unwrap()[1..];

let has_closing_quote = text.len() > 0 && text.as_bytes()[text.len() - 1] == b'\'';
fn remove_closing_quote(t: &str) -> &str { &t[..t.len() - 1] }
Here, we have a subtle invariant that `remove_closing_quote` can be called only if `has_closing_quote` is true. I think we can make this invariant explicit by doing something like
let without_quote = if text.ends_with('\'') { Some(&text[..text.len() - "'".len()]) } else { None }
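Wrapped in a function, the suggestion could look like the sketch below. The name `without_closing_quote` is made up for this example. Returning an `Option` means a caller cannot obtain the trimmed text without first handling the "no closing quote" case.

```rust
/// Returns `text` without its closing quote, or `None` if it has none.
fn without_closing_quote(text: &str) -> Option<&str> {
    if text.ends_with('\'') {
        // `"'".len()` is 1; spelling it this way documents what is cut off.
        Some(&text[..text.len() - "'".len()])
    } else {
        None
    }
}
```

On Rust 1.45 and later, `text.strip_suffix('\'')` from the standard library expresses the same thing in a single call.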
crates/ra_syntax/src/ast/mod.rs
Outdated
}
}

pub fn is_closed(&self) -> bool {
This should be `pub(crate)`, I think.
Since my previous commit I have been thinking about the logic to check whether the literal has a closing quote. I realized the complexity stems from the fact that we are dealing with escape characters, even though my intention was to postpone that to another PR. I think the proper solution would be to write a parser that takes a
The same parser could then trivially check at the end whether there is a closing, non-escaped quote. Furthermore, this logic can be reused by string literals to check that they are closed as well, since strings support the same escape codes. This also has the nice side effect of allowing you to validate the ASCII and Unicode escape codes, generating errors only for them if they are invalid. What do you think? What is the right place for such a parser?
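The idea can be sketched roughly as follows. This is a deliberately simplified illustration, not the parser that was later pushed to the PR: the `CharScan` type and `scan_char_literal` function are hypothetical, and a backslash is assumed to escape exactly one following character, so multi-character escapes such as `\x41` or `\u{1F600}` are not handled.

```rust
/// Result of scanning a char literal. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
struct CharScan {
    /// Number of logical characters (an escape counts as one).
    components: usize,
    /// Whether an unescaped closing `'` was found.
    closed: bool,
}

/// Scans `src`, the full literal text including the opening `'`.
/// Returns `None` if `src` does not start with a quote.
/// Simplification: a backslash escapes exactly the next character.
fn scan_char_literal(src: &str) -> Option<CharScan> {
    let mut chars = src.strip_prefix('\'')?.chars();
    let mut components = 0;
    while let Some(c) = chars.next() {
        match c {
            // Unescaped quote: the literal ends here.
            '\'' => return Some(CharScan { components, closed: true }),
            // Escape: consume the escaped character as part of this component.
            '\\' => {
                chars.next();
                components += 1;
            }
            _ => components += 1,
        }
    }
    // Ran out of input without seeing an unescaped closing quote.
    Some(CharScan { components, closed: false })
}
```

With this shape, "empty", "too long" and "unclosed" become simple checks on the result: `components == 0`, `components > 1`, and `!closed`. The tricky input `'\'` is reported as unclosed because its only quote is escaped.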
Thinking more about it, we don't really need a
Thought I'd mention that this isn't possible so you don't need to worry about it. UTF-8 has a great design where all bytes in a multiple-byte sequence have the highest bit set so they don't overlap with ASCII at all.
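This property is easy to check with the standard library. The sketch below encodes each character and verifies that every byte of a multi-byte sequence has the top bit (`0x80`) set; the function name is invented for this example.

```rust
/// Returns true if every byte of every multi-byte character in `s`
/// has its highest bit set, i.e. cannot be mistaken for an ASCII byte
/// such as `'` (0x27) or `\` (0x5C).
fn multibyte_bytes_are_non_ascii(s: &str) -> bool {
    s.chars().all(|c| {
        let mut buf = [0u8; 4];
        let encoded = c.encode_utf8(&mut buf);
        // Single-byte sequences are plain ASCII; only check longer ones.
        encoded.len() == 1 || encoded.bytes().all(|b| (b & 0x80) != 0)
    })
}
```

As a consequence, searching the raw bytes for `b'\''` can only ever match a real quote character, never the middle of another character.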
We could, but the original intention was exactly to define a separate lexer for characters/strings. This is useful to enable lazy lexing of literals, to save memory by not storing escaped string representation, and because handling error recovery is easy when you don't handle escapes. I think it makes sense to add a new module,
I just (force) pushed a new commit that introduces a parser for characters, based on the Rust reference. I also added a bunch of tests to ensure it works properly. Does it look like what you had in mind?
Yeah, it does look reasonable. There are only two things which can be improved:
I think that if the Instead of
Just pushed a new version that uses
If you think the current approach is sound, then I'll go ahead and finally improve the
SGTM: this is indeed unfortunate, but I also don't see a better way, and it should not matter much, because the logic is isolated.
I think
Let's merge this PR first though: it already contains a bunch of useful stuff. Can you run
bors r+
184: Basic validation for character literals r=aochagavia a=aochagavia As part of #27 I would like to add a validator for characters that detects missing quotes and too long characters. I set up a dummy implementation to get my feet wet, which generates errors whenever it finds a character. Right now I have the following questions: 1. The `SyntaxError` type seems too basic to me. I think it would make sense to have a `SyntaxErrorKind` instead of a `msg` field (we can implement `Display` for it so you can generate the string if desired). It should also have a `TextRange` instead of a `TextUnit`, so you can support errors that are longer than one character. Do you agree? 1. I am manually checking whether the literal is a character (see the `is_char` method). Ideally, I would like to have a `LiteralKind` enum with variants like `Int`, `Float`, `Char`, `String`, etc. but it seems cumbersome to write all that by hand. Is there a way to specify this in `grammar.ron` so that the code is generated (the same way the `Expr` enum is generated)? By the way, there seems to be no error reporting of panics inside the language server. When I was developing this PR I accidentally introduced a panic, which resulted in no syntax errors being shown. I knew something was wrong, because normally the vscode highlights syntax errors, but I didn't know it was caused by a panic. Co-authored-by: Adolfo Ochagavía <[email protected]>
Build succeeded
As part of #27 I would like to add a validator for characters that detects missing quotes and too long characters. I set up a dummy implementation to get my feet wet, which generates errors whenever it finds a character.
Right now I have the following questions:
1. The `SyntaxError` type seems too basic to me. I think it would make sense to have a `SyntaxErrorKind` instead of a `msg` field (we can implement `Display` for it so you can generate the string if desired). It should also have a `TextRange` instead of a `TextUnit`, so you can support errors that are longer than one character. Do you agree?
2. I am manually checking whether the literal is a character (see the `is_char` method). Ideally, I would like to have a `LiteralKind` enum with variants like `Int`, `Float`, `Char`, `String`, etc. but it seems cumbersome to write all that by hand. Is there a way to specify this in `grammar.ron` so that the code is generated (the same way the `Expr` enum is generated)?

By the way, there seems to be no error reporting of panics inside the language server. When I was developing this PR I accidentally introduced a panic, which resulted in no syntax errors being shown. I knew something was wrong, because normally the vscode highlights syntax errors, but I didn't know it was caused by a panic.
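The first question, about the error type, might be sketched like this. Everything here is an assumption made for illustration: the variant names and messages are invented, and the local `TextRange` struct is only a stand-in for the real range type so that the example stays self-contained.

```rust
use std::fmt;

/// Stand-in for a real text range type (hypothetical simplification).
#[derive(Debug, Clone, Copy, PartialEq)]
struct TextRange {
    start: u32,
    end: u32,
}

/// Structured error kinds instead of a free-form `msg` string.
/// Variant names are invented for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum SyntaxErrorKind {
    EmptyChar,
    UnclosedChar,
    LongChar,
}

impl fmt::Display for SyntaxErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The human-readable message is derived from the kind on demand.
        let msg = match self {
            SyntaxErrorKind::EmptyChar => "empty character literal",
            SyntaxErrorKind::UnclosedChar => "unclosed character literal",
            SyntaxErrorKind::LongChar => "character literal is too long",
        };
        f.write_str(msg)
    }
}

/// An error that covers a range of text rather than a single offset.
#[derive(Debug, Clone, PartialEq)]
struct SyntaxError {
    kind: SyntaxErrorKind,
    range: TextRange,
}
```

Consumers that only need a string can call `to_string()` on the kind, while tools that want to react to specific errors can match on the enum.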