Compiler: Modernize the js lexer, now utf-8 aware #1386

Merged
hhugo merged 20 commits into master from modern-js-lexer on Jan 19, 2023

hhugo
Member

@hhugo hhugo commented Jan 16, 2023

In particular, it allows recognizing and emitting UTF-8 identifiers.
The lexer now uses sedlex. The implementation was taken from Flow and cleaned up to remove unused features.

@hhugo
Member Author

hhugo commented Jan 16, 2023

@dbuenzli, this fully fixes #1034.

@dbuenzli
Contributor

I'm not familiar enough with the code base to go through these changes and approve them, and I don't know how many non-US-ASCII entry points there are in the wild JavaScript world, but the fact that needless string munging is avoided is very nice.

@hhugo
Member Author

hhugo commented Jan 17, 2023

Minification and variable renaming are currently broken with this PR.
The optimisation relies on a free-variable analysis, which relies on string equality of idents.

Idents are currently not normalized and even contain escape sequences verbatim.
We could either:

  • normalize idents
  • disable variable renaming for unicode idents (we would still need to resolve escape sequences)
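
To make the failure mode concrete, here is a minimal sketch in Python (not the js_of_ocaml code; `free_variables` and the sample names are hypothetical). It assumes a free-variable analysis that compares identifiers by their raw spelling, as described above. In JavaScript, `\u0061` inside an identifier denotes the same variable as `a`, but a raw string comparison treats the two as distinct:

```python
# Toy free-variable analysis keyed on the raw spelling of each identifier.
# Hypothetical helper for illustration only, not the js_of_ocaml implementation.
def free_variables(declared, used):
    """Return the used names that have no matching declaration."""
    return {name for name in used if name not in declared}

declared = {"a"}          # as in: var a;
used = ["a", r"\u0061"]   # as in: a + \u0061  (one JS variable, two spellings)

# The escaped spelling is wrongly reported as free, so a renamer that
# trusts this result would rename `a` but leave `\u0061` untouched.
free = free_variables(declared, used)
```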

@dbuenzli
Contributor

Did you check what regular JavaScript minifiers do? I suspect they don't do these things.

Note that there are technologies like XML that rely on balancing unicode identifiers and throw normalization out of the window, and no one ever seems to run into the problematic and puzzling cases :-)

@hhugo
Member Author

hhugo commented Jan 17, 2023

> Did you check what regular JavaScript minifiers do? I suspect they don't do these things.
>
> Note that there are technologies like XML that rely on balancing unicode identifiers and throw normalization out of the window, and no one ever seems to run into the problematic and puzzling cases :-)

I had a quick look at terser and Flow. I didn't see any unicode normalization. However, they both decode escape sequences in idents.
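
As an illustration of what "no unicode normalization" means in practice, the following Python sketch (the variable names are made up for the example) builds two spellings of "é" that are canonically equivalent but consist of different code points. A tool that compares identifiers code point by code point, without normalizing, keeps them distinct:

```python
import unicodedata

nfc = "\u00e9"     # "é" as a single code point (U+00E9)
nfd = "e\u0301"    # "é" as "e" followed by a combining acute accent (U+0301)

# Normalizing to NFC maps both spellings to the same string...
same_after_nfc = unicodedata.normalize("NFC", nfd) == nfc

# ...but as written they are different code point sequences, so without
# normalization they would be treated as two different identifiers.
distinct_as_written = nfc != nfd
```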

@vouillon
Member

According to the ECMAScript specification, one should not normalize but just decode escape sequences (and check that this still results in a valid identifier).
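
The "decode, then re-check" rule can be sketched as follows. This is a simplified Python illustration, not the code merged in this PR: `decode_ident` and `is_valid_ident` are hypothetical names, the validity test approximates the ECMAScript ID_Start / ID_Continue classes with Python's `str.isidentifier()` (which is based on XID_Start / XID_Continue), and the reserved-word check is left out.

```python
import re

# Matches \uXXXX (exactly four hex digits) or \u{X...} (code point form).
_ESCAPE = re.compile(r"\\u(?:([0-9a-fA-F]{4})|\{([0-9a-fA-F]+)\})")

def decode_ident(raw):
    """Replace each unicode escape in `raw` with the code point it denotes."""
    def replace(match):
        code_point = int(match.group(1) or match.group(2), 16)
        if code_point > 0x10FFFF:
            raise ValueError("escape is outside the Unicode range")
        return chr(code_point)

    decoded = _ESCAPE.sub(replace, raw)
    if "\\" in decoded:
        # A backslash left over means the escape was malformed.
        raise ValueError("malformed escape sequence in identifier")
    return decoded

def is_valid_ident(name):
    """Approximate check that a decoded name is still a JS identifier."""
    if not name:
        return False

    def start_ok(ch):
        return ch in "$_" or ch.isidentifier()

    def continue_ok(ch):
        # "$", ZWNJ and ZWJ are allowed in continuation position.
        return ch in "$\u200c\u200d" or ("_" + ch).isidentifier()

    return start_ok(name[0]) and all(continue_ok(ch) for ch in name[1:])

# `\u0061bc` and `abc` decode to the same string, so string equality works again.
same = decode_ident(r"\u0061bc") == decode_ident("abc")
# `a\u002db` decodes to "a-b", which is no longer a valid identifier.
rejected = not is_valid_ident(decode_ident(r"a\u002db"))
```

Comparing the decoded strings, with no normalization step, keeps the string-equality test used by the renamer meaningful while still rejecting escapes that would smuggle in characters not allowed in an identifier.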

@hhugo
Member Author

hhugo commented Jan 17, 2023

The last commit decodes escape sequences in idents.

@hhugo hhugo requested a review from dbuenzli January 18, 2023 10:37
@hhugo hhugo mentioned this pull request Jan 18, 2023
Contributor

@dbuenzli dbuenzli left a comment

Just had a look at the last commit. I don't claim I understood everything in it, but the tests look right :-)

@hhugo hhugo merged commit 38bdb92 into master Jan 19, 2023
@hhugo hhugo deleted the modern-js-lexer branch January 19, 2023 14:18
3 participants