Basic validation for character literals #184

aochagavia · 2018-11-01T13:37:43Z

As part of #27 I would like to add a validator for characters that detects missing quotes and too long characters. I set up a dummy implementation to get my feet wet, which generates errors whenever it finds a character.

Right now I have the following questions:

The SyntaxError type seems too basic to me. I think it would make sense to have a SyntaxErrorKind instead of a msg field (we can implement Display for it so you can generate the string if desired). It should also have a TextRange instead of a TextUnit, so you can support errors that are longer than one character. Do you agree?
I am manually checking whether the literal is a character (see the is_char method). Ideally, I would like to have a LiteralKind enum with variants like Int, Float, Char, String, etc. but it seems cumbersome to write all that by hand. Is there a way to specify this in grammar.ron so that the code is generated (the same way the Expr enum is generated)?

By the way, there seems to be no error reporting of panics inside the language server. While developing this PR I accidentally introduced a panic, which resulted in no syntax errors being shown. I knew something was wrong, because VS Code normally highlights syntax errors, but I didn't know it was caused by a panic.

matklad · 2018-11-01T14:57:13Z

The SyntaxError type seems too basic to me. I think it would make sense to have a SyntaxErrorKind instead of a msg field (we can implement Display for it so you can generate the string if desired). It should also have a TextRange instead of a TextUnit, so you can support errors that are longer than one character. Do you agree?

Agree with both points. The current SyntaxError was not really designed; it was the simplest thing that worked.
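A minimal sketch of the shape agreed on above: an error kind with a Display impl instead of a msg string, plus a range instead of a single offset. The TextRange struct here is a simplified stand-in for the real type, and the variant names and message strings are hypothetical, not taken from the PR.

```rust
use std::fmt;

/// Simplified stand-in for the real TextRange: byte offsets, end-exclusive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TextRange {
    pub start: u32,
    pub end: u32,
}

/// Hypothetical error kinds; the variant names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxErrorKind {
    EmptyChar,
    UnclosedChar,
    LongChar,
}

/// An error that covers a whole range rather than a single offset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SyntaxError {
    pub kind: SyntaxErrorKind,
    pub range: TextRange,
}

impl fmt::Display for SyntaxErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The message is derived from the kind, so no msg field is stored.
        let msg = match self {
            SyntaxErrorKind::EmptyChar => "empty character literal",
            SyntaxErrorKind::UnclosedChar => "unclosed character literal",
            SyntaxErrorKind::LongChar => "character literal is too long",
        };
        f.write_str(msg)
    }
}

impl fmt::Display for SyntaxError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{}; {}): {}", self.range.start, self.range.end, self.kind)
    }
}
```

Callers that only need the text can call to_string() on the error, while tooling can match on kind directly.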

is there a way to specify this in grammar.ron so that the code is generated (the same way the Expr enum is generated)?

There definitely is a way: you can add arbitrary stuff to grammar.ron, and then handle it in generated.rs.tera. This case is slightly different from Expr, though: we don't have a dedicated EXPR node in the syntax tree:

// tree for `a + b`
        BIN_EXPR@[23; 28)
          PATH_EXPR@[23; 24)
            PATH@[23; 24)
              PATH_SEGMENT@[23; 24)
                NAME_REF@[23; 24)
                  IDENT@[23; 24) "a"
          WHITESPACE@[24; 25)
          PLUS@[25; 26)
          WHITESPACE@[26; 27)
          PATH_EXPR@[27; 28)
            PATH@[27; 28)
              PATH_SEGMENT@[27; 28)
                NAME_REF@[27; 28)
                  IDENT@[27; 28) "b"

We do have a dedicated LITERAL node for literals:

// Tree for `92`
        LITERAL@[23; 25)
          INT_NUMBER@[23; 25) "92" <- INT_NUMBER is a child of literal

So, I think we need to:

for each literal token, like INT_NUMBER, define a dummy AST wrapper node, like we do for comments and whitespace: "IntLiteral": ()
define a LiteralValue enum, like for expressions: "LiteralValue": ( enum: ["IntLiteral", ...] )
define options: [ ["value", "LiteralValue"] ] for Literal
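To make the target concrete, here is a hand-written sketch of roughly what the three steps above could produce. It is not the actual output of generated.rs.tera: the SyntaxKind enum is a cut-down stand-in, and the cast function and every variant other than IntLiteral are assumptions made for illustration.

```rust
/// Cut-down stand-in for the token kinds that may sit under a LITERAL node.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxKind {
    INT_NUMBER,
    FLOAT_NUMBER,
    CHAR,
    STRING,
    WHITESPACE,
}

/// Hypothetical enum over the per-token wrapper nodes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LiteralValue {
    IntLiteral,
    FloatLiteral,
    CharLiteral,
    StringLiteral,
}

impl LiteralValue {
    /// Maps the kind of a LITERAL node's child to a variant, or None if the
    /// child is not a literal token.
    pub fn cast(kind: SyntaxKind) -> Option<LiteralValue> {
        match kind {
            SyntaxKind::INT_NUMBER => Some(LiteralValue::IntLiteral),
            SyntaxKind::FLOAT_NUMBER => Some(LiteralValue::FloatLiteral),
            SyntaxKind::CHAR => Some(LiteralValue::CharLiteral),
            SyntaxKind::STRING => Some(LiteralValue::StringLiteral),
            SyntaxKind::WHITESPACE => None,
        }
    }
}
```

With something like this in place, the manual is_char check becomes a match on the enum.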

matklad · 2018-11-01T15:02:40Z

By the way, there seems to be no error reporting of panics inside the language server. When I was developing this PR I accidentally introduced a panic, which resulted in no syntax errors being shown. I knew something was wrong, because normally the vscode highlights syntax errors, but I didn't know it was caused by a panic.

Good catch! Looks like we drop thread panics on the floor here: https://github.com/rust-analyzer/rust-analyzer/blob/cca5f862de8a4eb4a8990fdca95a4a7686937789/crates/ra_lsp_server/src/main_loop/mod.rs#L411-L436
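For illustration, one way to surface a worker panic instead of dropping it, using only the standard library. This is a generic sketch, not the code at the link above; run_reporting_panics is a hypothetical helper name.

```rust
use std::thread;

/// Runs `task` on a new thread and converts a panic into an error message
/// instead of silently discarding it.
pub fn run_reporting_panics<T, F>(task: F) -> Result<T, String>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    // `join` returns Err(payload) if the thread panicked.
    thread::spawn(task).join().map_err(|payload| {
        // Panic payloads are usually `&'static str` or `String`.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            format!("worker panicked: {}", msg)
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            format!("worker panicked: {}", msg)
        } else {
            "worker panicked with a non-string payload".to_string()
        }
    })
}
```

The caller can then log the Err value or forward it to the client, so a missing diagnostic is no longer the only symptom.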

aochagavia · 2018-11-01T16:51:02Z

@matklad thanks for your feedback! I managed to validate empty and unclosed character literals. Next in line is checking the length of character literals, which can get quite tricky because of the large number of ways to escape characters (see the reference for details). I think I will leave that one out for this PR.

By the way, I am using indexing based on bytes in some places and am not sure whether this is a good idea. I don't know that much about Unicode, but in theory it would be possible to have a single char spanning two bytes, where the last byte is equivalent to ', right? In that case I probably need to modify the code to avoid that edge case.
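A note on the byte-indexing question: Rust's str is always valid UTF-8, and UTF-8 encodes every non-ASCII character using only bytes in the range 0x80..=0xFF. The byte 0x27 (') therefore cannot appear inside a multi-byte character, so searching for it by bytes should be safe. The helper names below are hypothetical and exist only to demonstrate this.

```rust
/// Returns the byte offset of the first `'` in `text`, scanning raw bytes.
pub fn find_quote_by_bytes(text: &str) -> Option<usize> {
    text.bytes().position(|b| b == b'\'')
}

/// Same search done char by char, for comparison.
pub fn find_quote_by_chars(text: &str) -> Option<usize> {
    text.char_indices().find(|&(_, c)| c == '\'').map(|(i, _)| i)
}

/// Checks that every byte of every multi-byte char in `text` is >= 0x80,
/// i.e. that no multi-byte char contains an ASCII byte.
pub fn multibyte_chars_avoid_ascii(text: &str) -> bool {
    text.chars().filter(|c| c.len_utf8() > 1).all(|c| {
        let mut buf = [0u8; 4];
        c.encode_utf8(&mut buf).bytes().all(|b| b >= 0x80)
    })
}
```

As an example, U+0127 has 0x27 as the low byte of its code point, but its UTF-8 encoding is C4 A7, so the byte search does not report a false quote there.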

So the next steps seem to be:

Improve the SyntaxError type. This one is easy.
Extract the code to a visitor. Maybe as a submodule in ra_syntax/validate?
Write integration tests for the new syntax errors. Do we already have a place for this?

Any comments welcome, especially regarding the code itself and the last two questions.

matklad · 2018-11-01T18:33:24Z

I think I will leave that one out for this PR.

👍

Write integration tests for the new syntax errors. Do we already have a place for this?

I think it makes sense to start with unit tests. Can we extract the bulk of char validation into a function which operates on &str? Integration tests, which are also needed, could go to https://github.com/rust-analyzer/rust-analyzer/blob/cca5f862de8a4eb4a8990fdca95a4a7686937789/crates/ra_syntax/tests/test.rs
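A rough sketch of what such a &str-based function might look like, covering only the empty and unclosed cases mentioned earlier (no length or escape validation). The names CharError and validate_char_literal are hypothetical, and the function assumes the lexer hands over the token text including its opening quote.

```rust
/// Hypothetical error type for the two cases handled so far.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CharError {
    Empty,
    Unclosed,
}

/// Validates the raw text of a character literal token, quotes included.
/// Assumption: `text` starts with the opening `'`.
pub fn validate_char_literal(text: &str) -> Result<(), CharError> {
    // Drop the opening quote; what remains is the body plus closing quote.
    let rest = text.strip_prefix('\'').unwrap_or(text);

    // Walk the text, tracking backslash escapes, to learn whether the
    // final character is an unescaped closing quote.
    let mut escaped = false;
    let mut closed = false;
    for c in rest.chars() {
        closed = c == '\'' && !escaped;
        escaped = c == '\\' && !escaped;
    }

    if !closed {
        Err(CharError::Unclosed)
    } else if rest == "'" {
        // Nothing between the opening and closing quotes.
        Err(CharError::Empty)
    } else {
        Ok(())
    }
}
```

Because the function takes plain text, each case becomes a one-line unit test, for example asserting that validate_char_literal("''") returns Err(CharError::Empty).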

matklad · 2018-11-01T17:38:07Z