Skip to content

Basic validation for character literals #184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 4, 2018
Merged

Basic validation for character literals #184

merged 1 commit into from
Nov 4, 2018

Conversation

aochagavia
Copy link
Contributor

As part of #27 I would like to add a validator for characters that detects missing quotes and too long characters. I set up a dummy implementation to get my feet wet, which generates errors whenever it finds a character.

Right now I have the following questions:

  1. The SyntaxError type seems too basic to me. I think it would make sense to have a SyntaxErrorKind instead of a msg field (we can implement Display for it so you can generate the string if desired). It should also have a TextRange instead of a TextUnit, so you can support errors that are longer than one character. Do you agree?
  2. I am manually checking whether the literal is a character (see the is_char method). Ideally, I would like to have a LiteralKind enum with variants like Int, Float, Char, String, etc. but it seems cumbersome to write all that by hand. Is there a way to specify this in grammar.ron so that the code is generated (the same way the Expr enum is generated)?

By the way, there seems to be no error reporting of panics inside the language server. When I was developing this PR I accidentally introduced a panic, which resulted in no syntax errors being shown. I knew something was wrong, because normally the vscode highlights syntax errors, but I didn't know it was caused by a panic.

@matklad
Copy link
Member

matklad commented Nov 1, 2018

The SyntaxError type seems too basic to me. I think it would make sense to have a SyntaxErrorKind instead of a msg field (we can implement Display for it so you can generate the string if desired). It should also have a TextRange instead of a TextUnit, so you can support errors that are longer than one character. Do you agree?

Agree with both points. Current SyntaxError was not really designed, it was the simplest thing that worked.

is there a way to specify this in grammar.ron so that the code is generated (the same way the Expr enum is generated)?

There definitely is a way: you can add arbitrary stuff to grammar.ron, and then handle in in generated.rs.tera. This case is slighly different, then expr though: we don't have a dedicated EXPR node in a syntax tree:

// tree for `a + b`
        BIN_EXPR@[23; 28)
          PATH_EXPR@[23; 24)
            PATH@[23; 24)
              PATH_SEGMENT@[23; 24)
                NAME_REF@[23; 24)
                  IDENT@[23; 24) "a"
          WHITESPACE@[24; 25)
          PLUS@[25; 26)
          WHITESPACE@[26; 27)
          PATH_EXPR@[27; 28)
            PATH@[27; 28)
              PATH_SEGMENT@[27; 28)
                NAME_REF@[27; 28)
                  IDENT@[27; 28) "b"

We do have a dedicated LITERAL node for literals:

// Tree for `1`
        LITERAL@[23; 25)
          INT_NUMBER@[23; 25) "92" <- INT_NUMBER is a child of literal

So, I think we need to:

  • for each literal token, like INT_LITERAL, define a dummy ast-wrapper node, like we do for comments and whitepsaces: "IntLiteral": ()
  • define a LiteralValue enum, like for expression: "LiteralValue":( enum: ["IntLiteral", ...])
  • define a options: [ ["value", "LiteralValue"] ], for Literal

@matklad
Copy link
Member

matklad commented Nov 1, 2018

By the way, there seems to be no error reporting of panics inside the language server. When I was developing this PR I accidentally introduced a panic, which resulted in no syntax errors being shown. I knew something was wrong, because normally the vscode highlights syntax errors, but I didn't know it was caused by a panic.

Good catch! Looks like we drop thread panics on the floor here: https://github.com/rust-analyzer/rust-analyzer/blob/cca5f862de8a4eb4a8990fdca95a4a7686937789/crates/ra_lsp_server/src/main_loop/mod.rs#L411-L436

@aochagavia
Copy link
Contributor Author

@matklad thanks for your feedback! I managed to validate empty and unclosed character literals. Next in the line is checking the length of the character literals, which can get quite tricky because of the big amount of ways to escape characters (see the reference for details). I think I will leave that one out for this PR.

By the way, I am using indexing based on bytes in some places and am not sure whether this is a good idea. I don't know that much about unicode, but in theory it would be possible to have a single char spanning two bytes, where the last byte is equivalent to ', right? In that case I probably need to modify the code to avoid that edge case.

So the next steps seem to be:

  1. Improve the SyntaxError type. This one is easy.
  2. Extract the code to a visitor. Maybe as a submodule in ra_syntax/validate?
  3. Write integration tests for the new syntax errors. Do we already have a place for this?

Any comments welcome, especially regarding the code itself and the last two questions.

@matklad
Copy link
Member

matklad commented Nov 1, 2018

I think I will leave that one out for this PR.

👍

Write integration tests for the new syntax errors. Do we already have a place for this?

I think it make sense to start with unit-tests. Can we extract the bulk of char validation into a function which operates on &str? Integration tests, which are also needed, could go to https://github.com/rust-analyzer/rust-analyzer/blob/cca5f862de8a4eb4a8990fdca95a4a7686937789/crates/ra_syntax/tests/test.rs

// A char always starts with an opening `'`
let text = &self.syntax().leaf_text().unwrap()[1..];

let has_closing_quote = text.len() > 0 && text.as_bytes()[text.len() - 1] == b'\'';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use .ends_with here

// The text between the literal's opening and closing `'`
pub fn text(&self) -> &str {
// A char always starts with an opening `'`
let text = &self.syntax().leaf_text().unwrap()[1..];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call self.syntax().leaf_text().unwrap() fn text (as we do for other nodes), and rename this method to something else. fn value might be a good option?

impl<'a> Char<'a> {
// The text between the literal's opening and closing `'`
pub fn text(&self) -> &str {
// A char always starts with an opening `'`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good idea to change comment to

assert!(text.starts_with('\''), "A char always starts with an opening `'`")

(_, '\\') => text,
(_, _) => remove_closing_quote(text),
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here is pretty tricky, I think it might be a good idea to write tests for this.

remove_closing_quote(text)
}
_ if has_closing_quote => {
let mut last_chars = text.chars().skip(text.len() - 3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I think might break for non-ascii chars. Perhpas you can use text.chars().rev()? There should be a reverse char iter somewhere...

let text = &self.syntax().leaf_text().unwrap()[1..];

let has_closing_quote = text.len() > 0 && text.as_bytes()[text.len() - 1] == b'\'';
fn remove_closing_quote(t: &str) -> &str { &t[..t.len() - 1] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we have a subtle invariant that remove_closing_quote can be called only if has_closing_quote is true. I think we can make this invariant explicit by doing something like

let without_quote = if text.ends_with('\'') { Some(&text[..text.len() - "'".len()]) } else { None }

}
}

pub fn is_closed(&self) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be pub(crate) i think

@aochagavia
Copy link
Contributor Author

aochagavia commented Nov 1, 2018

Since my previous comit I have been thinking about the logic to check whether the literal has a closing quote. I realized the complexity stems from the fact that we are dealing with escape characters, even though my intention was to postpone that to another PR.

I think the proper solution would be to write a parser that takes a &str as input and produces a stream of RawChar or something like that, which you could represent as follows:

enum RawChar {
    Char(char),
    QuoteEscape,
    DoubleQuoteEscape,
    AsciiEscape(char),
    UnicodeEscape(String),
}

The same parser could then trivially check at the end whether there is a closing, non-escaped quote. Furthermore, this logic can be reused by string literals to check that they are closed as well, since strings support the same escape codes.

This also has the nice side effect of allowing you to validate the ascii and unicode escape codes, generating errors only for them if they are invalid.

What do you think? What is the right place for such a parser?

@aochagavia
Copy link
Contributor Author

Thinking more about it, we don't really need a RawChar type. We can define SINGLE_CHAR, QUOTE_ESCAPE, etc as children of CHAR, right?

@jminer
Copy link

jminer commented Nov 1, 2018

in theory it would be possible to have a single char spanning two bytes, where the last byte is equivalent to ', right?

Thought I'd mention that this isn't possible so you don't need to worry about it. UTF-8 has a great design where all bytes in a multiple-byte sequence have the highest bit set so they don't overlap with ASCII at all.

@matklad
Copy link
Member

matklad commented Nov 2, 2018

We can define SINGLE_CHAR, QUOTE_ESCAPE, etc as children of CHAR, right?

We could, but the original intention was exactly to define a separate lexer for characters/strings. This is useful to enable lazy lexing of literals, to save memory by not storing escaped string representation, and because handling error recovery is easy when you don't handle escapes.

I think it make sense to add a new module, string_lexing, for dealing with escapes. This module should operate on &str and return an iterator of lexed tokens. Then, in the ast, we wrap that module into a higher-level API. I expect that the module could grow a bit complicated, some of its tasks are:

  • detecting syntax errors due to invalid escapes
  • highlighting valid escape sequneces
  • de-escaping strings

@aochagavia
Copy link
Contributor Author

I just (force) pushed a new commit that introduces a parser for characters, based on the Rust reference. I also added a bunch of tests to ensure it works properly. Does it look like what you had in mind?

@matklad
Copy link
Member

matklad commented Nov 3, 2018

Yeah, it does look reasonable.

There are only two things which can be improved:

  • current code uses Vec and String, which means a heap allocation for each character literal. I think, we should be able to get rid of allocations here.
  • the character components are returned as values, which has a drawback that we won't be able to get ranges back (which might be useful for highlighring, eg, invalid escapes in string literals)

I think that if the CharComponent stored relative TextRange (or a pair of usize), it would provide both the ranges info, and make it easier to deal with lifetimes.

Instead of Vec<CharComponent>, we could return and impl Iterator<Item=CharComponent>. We won't be able to store such iterator inside a struct, but, for this low-level module, we don't actually need a super-encapsulated API anyway.

@aochagavia
Copy link
Contributor Author

Just pushed a new version that uses TextRange and a custom iterator. I like that it now looks much more like the existing parser. The only thing I am not sure about is having the has_closing_quote as a field on the iterator, because it will return unexpected results if you forget to evaluate the iterator until the end. Still, it seems better to me than adding a dedicated CharComponentKind to signal that the closing quote has been found.

@aochagavia
Copy link
Contributor Author

If you think the current approach is sound, then I'll go ahead and finally improve the SyntaxError. Also, I still think we want the validation logic to be somewhere else (instead of lib.rs), but I don't have a clear idea of where to put it. Any suggestions?

@matklad
Copy link
Member

matklad commented Nov 4, 2018

The only thing I am not sure about is having the has_closing_quote as a field on the iterator, because i will return unexpected results if you forget to evaluate the iterator until the end. Still, it seems better to me than adding a dedicated CharComponentKind to signal that the closing quote has been found.

SGTM: this is indeed unfortunate, but I also don't see a better way, and it should not matter much, because the logic is isolated.

, but I don't have a clear idea of where to put it. Any suggestions?

I think validation.rs with pub(crate) fn validate(file: &File) -> Vec<SyntaxError> would be nice.

If you think the current approach is sound, then I'll go ahead and finally improve the SyntaxError

Let's merge this PR first though: it already contains a bunch of useful stuff. Can you run cargo format and then r+?

@aochagavia aochagavia changed the title WIP - Prototype validators Basic validation for character literals Nov 4, 2018
@aochagavia
Copy link
Contributor Author

bors r+

bors bot added a commit that referenced this pull request Nov 4, 2018
184: Basic validation for character literals r=aochagavia a=aochagavia

As part of #27 I would like to add a validator for characters that detects missing quotes and too long characters. I set up a dummy implementation to get my feet wet, which generates errors whenever it finds a character.

Right now I have the following questions:

1. The `SyntaxError` type seems too basic to me. I think it would make sense to have a `SyntaxErrorKind` instead of a `msg` field (we can implement `Display` for it so you can generate the string if desired). It should also have a `TextRange` instead of a `TextUnit`, so you can support errors that are longer than one character. Do you agree?
1. I am manually checking whether the literal is a character (see the `is_char` method). Ideally, I would like to have a `LiteralKind` enum with variants like `Int`, `Float`, `Char`, `String`, etc. but it seems cumbersome to write all that by hand. Is there a way to specify this in `grammar.ron` so that the code is generated (the same way the `Expr` enum is generated)?

By the way, there seems to be no error reporting of panics inside the language server. When I was developing this PR I accidentally introduced a panic, which resulted in no syntax errors being shown. I knew something was wrong, because normally the vscode highlights syntax errors, but I didn't know it was caused by a panic.

Co-authored-by: Adolfo Ochagavía <[email protected]>
@bors
Copy link
Contributor

bors bot commented Nov 4, 2018

Build succeeded

@bors bors bot merged commit 9b5bbab into rust-lang:master Nov 4, 2018
@aochagavia aochagavia deleted the prototype-validators branch November 4, 2018 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants