Skip to content

URI shorthands without internal whitespace. #3983

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lrhn opened this issue Jul 15, 2024 · 6 comments
Closed

URI shorthands without internal whitespace. #3983

lrhn opened this issue Jul 15, 2024 · 6 comments
Labels
feature Proposed language feature that solves one or more problems unquoted-uris The unquoted URI feature

Comments

@lrhn
Copy link
Member

lrhn commented Jul 15, 2024

The proposal for URI shorthands allows sequence of tokens of the form: <dottedIdentifierList> ('/' <dottedIdentifierList>)* as unquoted imports.

It's a syntactic grammar, not a lexical grammar, which means that it does not affect tokenization, which would otherwise need to be context sensitive, and so it can reuse existing grammar productions.

As currently specified, it allows whitespace and comments between the tokens (it doesn't say it doesn't, and syntactic grammar rules do allow whitespace).

I propose to not allow whitespace or comments inside the URI shorthand.
Further, I suggest to require the URI shorthand to be terminated by a whitespace or semicolon. Any other adjacent token will look like it should be part of the URI, but it isn't. It's better to make the URI clearly delimited, by whitespace (with no internal whitespace) or with a final ; which users are used to reading at end of lines.

This change would not affect tokenization, it would just stop parsing the token sequence as an unquoted URI when there is whitespace or comments after a token, and it would make it an error to have any non-allowed token touching the URI.

I believe this to be less error-prone, especially around show/hide and that any actual internal whitespace or comments will hurt readability signficantly.
If someone takes a quoted URI and removes just the quotes, package: and .dart, they may get any number of parsing errors if the URI is not of a form supported by the shorthand. By including all the tokens up to the first whitespace, it's easier to point to the position inside the supposed URI where the unsupported character is, rather than reporting those tokens as "unexpected" after the URI. Including the tokens in the URI is, my hypothesis, more likely to match user intent, and therefore doing so gives more useful error messages.

Not allowing other following tokens should extend to comments. This would even disallow adjacent EOL-comments, so import foo/bar//baz; would be an error, because it's misleading. If the import should end there, adding a space is not a problem. If it shouldn't, it's probably an error.
(This does affect formatting. If someone writes:

import foo/bar// come comment here
  as banana;

and expects the formatter to insert a space, it won't. That is a real issue, and may be enough reason to allow an EOL-comment to follow an unquoted URI anyway. Even if it then won't help with a import foo/bar//baz; mistyping.)

This also includes string literals in the included tokens, which is highly confusing.
It should be an immediate error if a string literal is adjacent to an unquoted URI.
Since a string literal may contain whitespace, it feels weird to continue the unquoted URI after the string.
We may want to consider making string literals terminate unquoted URIs too, and then it's always an error if they're following the unquoted URI, adjacent or not. (In that case, an unquoted URI is terminated by a whitespace, semicolon or string literal).

While the grammar is unambiguous (because there must be a . or / between any two identifiers), it can require some look-ahead to figure out what an error should be. That is, it may make error recovery harder.

Example:

import mypkg/src/show. as hide hide show show banana;

This is invalid because it parses as .../show.as hide hide show show banana; and there is a comma missing before banana.
If the . was the mistake, then removing it would make the import valid as .../show as hide hide show show banana; with all of an as, hide and show clause. (The as identifier is a built-in identifier, hide and show are only contextually keywords.)
Nobody writes both hide and show. Could we make unquoted imports only allow one of show and hide, just to make things easier for ourselves? You never need more than one.

Some error reporting examples:

import mypkg/src/time. show banana; // Forgot to delete the `.` when deleting `.dart`.
import mypkg/src/time show banana; // What was intended.

If internal whitespace is not allowed, the first import would give an error of "missing identifier after ." rather than "unexpected token: banana".

import mypkg/src//time;
import show;

Here the error will be "unexpected token: import" unless we disallow adjacent EOL-comments, then it will be "//time; comment not allowed in unquoted URI.

import mypkg/src/time+date;

would give an error of "+ not allowed in unquoted URI" rather than "unexpected identifier +, expected on of as, show, hide or ;".

This would have the effect of making whitespace significant, and therefore it can affect automatic formatting.
That is another reason I want to include all adjacent tokens into the URI. The code can still be formatted, with its malformed URI.

Grammatically, the grammar of the unquoted URI would be:

-- Grammar rules
<uri> ::= <stringLiteral> | <unquotedUri> 
<unquotedUri> ::= <unquotedUriToken> (~(WS|';') <unquotedUriToken>)* LOOKAHEAD(WHITESPACE | ';')
<validUnquotedUri> ::= <uriWords> (~WS '/' ~WS <uriWords>)*
<uriWords> ::= (<uriWord> ~WS '.' ~WS)* <uriWord>
<uriWord> ::= <identifier>

<unquotedUriToken> ::= *any* token other than WHITESPACE, COMMENT or ';'.

-- Tokenizer rule
WS := WHITESPACE | COMMENT

where the ~WS is a one-token negative lookahead, and LOOKAHEAD(...) is a positive look-ahead.
The parsing will collect any token up to the next whitespace or semicolon, then check if those tokens can be produced by <validUnquotedUri>, rather than trying to only collect enough to be valid, and then leave the rest.

This also has the advantage that if we extend the grammar of valid unquoted URIs, we don't need to change the code that collects the URIs, only the validation.

TL;DR: The unquoted URI is undelimited by punctuation. I think it's important for readability and error recovery to make it clearly delimited by whitespace, which can be done by not allowing internal whitespace or comments, including tokens until a semicolon or whitespace (or possibly a /*..*/ comment, but not a //comment because it can be a mistyped path), and reporting an error if those tokens do not form a valid unquoted URI, rather than just stopping before the unexpected adjacent token.
I believe this has a higher chance of catching mistakes that assume more characters are allowed in unquoted URIs than actually the case, and therefore giving better error messages.

In practice, all unquoted URIs will be terminated by whitespace or semicolon, so the restriction only really affects error recovery and error reporting (positively).

@lrhn lrhn added feature Proposed language feature that solves one or more problems unquoted-uris The unquoted URI feature labels Jul 15, 2024
@lrhn lrhn changed the title URI shorthands with not internal whitespace. URI shorthands without internal whitespace. Jul 15, 2024
@munificent
Copy link
Member

I have to admit that I don't find this very compelling.

Yes, if a user writes weird and/or incorrect code in an unquoted import, they may get somewhat confusing error messages and error recovery. But that's true pretty much everywhere, isn't it?

You can write:

foo(',','',''',',',''',);

That code is certainly hard to read and error-prone. But we allow it and don't, say, require whitespace after ,.

Or, heck:

print('//'/*'//'/*'/// *//'*/'*/');

I honestly just don't think users will do weird stuff in imports and I do think there is value in keeping the grammar for imports simple. Even if our scanner and parser do the right thing if we make the grammar more sophisticated, it's unlikely that treesitter, syntax highlighters, and all of the other various lightweight tools that process Dart source code will.

This just doesn't seem to me like a corner of the language where there's a lot of positive value in putting much sophistication in the feature. No one really cares about writing imports. They're just a bit of toil required to use other libraries. So I lean towards keeping it as simple as possible.

@lrhn
Copy link
Member Author

lrhn commented Jul 25, 2024

I actually see allowing (and then ignoring) embedded whitespace as an unnecessary sophistication, or complication.
It's an implementation detail leaking through to the user experience.

As a user, I would expect the parser to recognize what I write as a single file reference. If it allows spaces inside, but only at some particular positions that relate to "tokens", a concept entirely unrelated to the file path I'm writing, of expect them to be part of the path, and not just ignored.

Or rather, if we designed this feature "perfectly" (ignoring the current language and implementation it goes into), we would have custom tokenization for unquoted file paths, triggered by the preceding import etc. keyword.

If we had designed that, we would not allow embedded ignored whitespace, or disallow reserved words out or leading digits. We'd just allow alphanumeric sequences separated by . and /, because that'd actually be the simplest tokenization.

We chose to allow parsers to do while-file tokenization first, which prevents, or complicates, having context dependent tokenization.
So we simulate that tokenization by combining existing tokens designed for something else.

And I think we should make that simulation as true to what we would have designed without those constraints, because it's easier to explain to users, IMO easier to read, doesn't have arbitrary restrictions, and it doesn't burden the user unnecessarily with our implementation details.
Especially if it's not even very hard. No embedded whitespace and comments, and allowing reserved words are both trivial. Allowing leading digits is easy, but requires a little care because of the embedded ..

@munificent
Copy link
Member

Or rather, if we designed this feature "perfectly" (ignoring the current language and implementation it goes into), we would have custom tokenization for unquoted file paths, triggered by the preceding import etc. keyword.

For what it's worth, that's not how I would design things even if I could. It just makes life harder for everyone writing an IDE, syntax highlighter, etc. for next to no benefit.

@lrhn
Copy link
Member Author

lrhn commented Aug 13, 2024

I don't see that being much of a problem. It requires contextual matching, but it does anyway.

If we disallow spaces and allow any number/identifier/keyword sequence, then the pattern to recognize
is '\w+(?:[./+\-])\w+)* when following import|export|part. That's fairly easy.

If we allow internal space, maybe even comments, do not allow keywords, but do allow built-in identifiers (which we do), then you suddently have to figure out what import show / show / as as show show as; means. (Not that anyone would write that, but you won't get show/show/as being a single color any more.)

If I were a syntax highlighter will most likely give up on that and hightlight built-in identifiers and contextual keywords, because it's easier. If we allow comments between tokens, it gets even harder.
You could try to recognize \w+(<whitespace>?[./\-]<whitespace>?\w+ anyway, where <whitespace> also includes comments.

I think import etc. being followed by something of the form \w+(?:[./\-+]\w+)* is making things easy for everybody, at least those who isn't aren't working with tokens. For those, we have helper functions.
And if they are working with tokens, then allowing reserved words is not harder than disallowing them.
Disallowing embedded spaces and comments is easy.

@munificent
Copy link
Member

We discussed this in the language meeting this morning and in a follow-up meeting. We've decided that as this issue proposes, we will be restrictive and not allow internal whitespace or comments inside the path part of an unquoted import. That means these are all errors:

import spaces  /  in_path;
import spaces  .  in_component;
import comment /* hi */ / in_path;
import comment /* hi */ . in_component;
import line // hi
    . comment;

An expected consequence of this is that each unquoted import and export directives will fit on a single line since you can't use a newline inside to split it (except for after the import or export keyword). That's OK. We already consider it an allowed exception for import and export URIs to overflow the page width. One goal of this feature is to make imports and exports shorter, so they should need this exception less often.

Whitespace and comments are allowed before and after the path part. So these are all OK:

import /* hi */ some/path;
import


    many.newlines;
import comment/after; // hi
import weird.but/ok /* hi */ ;

In short, path part of an import directive always begins and ends with an identifier (which may be a reserved word according to #3984). Between those identifiers and all of the tokens between them, we don't allow whitespace or comments. Elsewhere we do.

The motivation for this is a mixture of:

  • Whether we like it or not, users do sometimes write simple scripts that use regex to parse import and export directives in Dart files without writing a full parser. Disallowing whitespace makes those scripts less likely to go wrong.

  • We think that many users have a mental model that the path part of an import is conceptually a single monolithic "token" and would be very surprised to see whitespace or comments in there. Disallowing them lets them maintain that mental model even if the grammar itself sees the path as a series of separate tokens.

  • We couldn't come up with any compelling reason to have whitespace or comments in there. So disallowing it for now avoids problems with whitespace or comments appearing in there. If we later discover a good reason to allow it, we can loosen the language then.

  • Even if we did allow this, we'd probably end up writing a lint rule to tell users not to do it (because it's confusing to read). So the implementation cost is probably effectively the same, it's just doing the error in the parser instead of the linter.

In practice, we expect very few users to ever even know about this restriction since almost no one would try to put whitespace in there in the first place.

I'll update the proposal and send out a PR for the change. I'll close this issue when that lands.

@lrhn
Copy link
Member Author

lrhn commented Aug 15, 2024

The parser can choose to collect the list of identifiers, reserved words and separators and keep them all in the AST, or to combine them all into one synthetic token. Whatever works best. (Just accumulating is likely the easiest. It makes it more immediately accessible to find (for some naming, based on the specified grammar) unquotedUrl.pathSegments.pathWords.last, which is the default file name, without having to parse the combined string..

It's still possible to have comments just before and after the unquoted path.

That means that a typo like

import banana/apricot//orange
  as orange;

is syntactically valid.

The formatter will likely format that as

import banana/apricot //orange
  as orange;

and syntax highlighting will show the //orange in a different color than the rest.
It's also unlikely that there is both a lib/apricot/ directory and a lib/apricot.dart file in the banana package, so the user will get an import error.
All in all, it's unlikely that the typo goes unnoticed for long.

The analyzer may still want to eagerly give a warning if a // comment immediately follows an import path with no whitespace between, if its possible to detect that while the author is writing, instead of waiting for the file to be saved and formatted.

munificent added a commit that referenced this issue Aug 21, 2024
)

Allow reserved words in unquoted imports but disallow whitespace.

Fix #3983.
Fix #3984.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Proposed language feature that solves one or more problems unquoted-uris The unquoted URI feature
Projects
None yet
Development

No branches or pull requests

2 participants