Compiler: Modernize the js lexer, now utf-8 aware #1386

hhugo · 2023-01-16T15:02:43Z

In particular, it allows recognizing and emitting UTF-8 identifiers.
The lexer now uses sedlex. The implementation was taken from Flow and cleaned up to remove unused features.

hhugo · 2023-01-16T15:03:39Z

@dbuenzli, this fully fixes #1034.

dbuenzli · 2023-01-17T10:36:01Z

I'm not familiar enough with the code base to go through these changes and approve them, and I don't know how many non US-ASCII entry points there are in the wild JavaScript world, but the fact that needless string munging is avoided is very nice.

hhugo · 2023-01-17T11:24:08Z

Minification and variable renaming are currently broken with this PR.
The optimisation relies on a free variable analysis, which relies on string equality of idents.

Idents are currently not normalized and even contain escape sequences verbatim.
We could either:

- normalize idents
- disable variable renaming for Unicode idents (we would still need to resolve escape sequences)
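
To illustrate why raw string equality is a problem here, the following toy sketch compares identifier spellings as plain strings. It is written in Python purely for illustration and is not js_of_ocaml's actual free variable analysis; the names `free_variables`, `declared`, and `used` are hypothetical. In JavaScript, `a` and `\u0061` name the same variable, but as raw strings they differ, so the escaped use is wrongly reported as free.

```python
def free_variables(declared, used):
    """Return the used names that have no declaration, by plain string equality."""
    known = set(declared)
    return {name for name in used if name not in known}


# Two spellings of the same JavaScript variable: `a` and its escaped form.
declared = ["a"]
used = ["a", r"\u0061"]

free = free_variables(declared, used)
# The escaped spelling does not compare equal to "a", so it looks free.
print(sorted(free))
```

A renaming pass built on such a result could rename `a` while leaving `\u0061` untouched, which would change the meaning of the program.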

dbuenzli · 2023-01-17T11:43:55Z

Did you check what regular JavaScript minifiers do? I suspect they don't do these things.

Note that there are technologies like XML that rely on balancing Unicode identifiers and throw normalization out of the window, and no one ever seems to run into the problematic and puzzling cases :-)

hhugo · 2023-01-17T12:52:39Z

> Did you check what regular JavaScript minifiers do? I suspect they don't do these things.
>
> Note that there are technologies like XML that rely on balancing Unicode identifiers and throw normalization out of the window, and no one ever seems to run into the problematic and puzzling cases :-)

I had a quick look at terser and Flow. I didn't see any Unicode normalization. However, they both decode escape sequences in idents.

vouillon · 2023-01-17T13:09:15Z

According to the ECMAScript specification, one should not normalize but just decode escape sequences (and check that this still results in a valid identifier).
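
A minimal sketch of that rule, in Python for illustration only: decode `\uXXXX` and `\u{...}` escapes, apply no normalization, then check that the result is still a plausible identifier. The helper names `decode_ident` and `is_valid_ident` are hypothetical and this is not the code from the PR. The validity check is an approximation: it uses Python's `str.isidentifier`, which is based on XID_Start/XID_Continue rather than the ID_Start/ID_Continue properties used by ECMAScript, and it ignores reserved words.

```python
import re

# Matches \uXXXX and \u{X...} escapes as they may appear in an identifier.
_UNICODE_ESCAPE = re.compile(r"\\u(?:\{([0-9A-Fa-f]+)\}|([0-9A-Fa-f]{4}))")


def decode_ident(raw):
    """Replace unicode escapes in a raw identifier with the code points they denote."""
    def replace(match):
        code_point = int(match.group(1) or match.group(2), 16)
        if code_point > 0x10FFFF:
            raise ValueError(f"escape out of Unicode range in {raw!r}")
        return chr(code_point)

    # No normalization is applied: the decoded code points are kept as-is.
    return _UNICODE_ESCAPE.sub(replace, raw)


def is_valid_ident(name):
    """Approximate identifier check (assumption: XID_* stands in for ID_*)."""
    if not name:
        return False

    def is_start(ch):
        return ch in "$_" or ch.isidentifier()

    def is_part(ch):
        # ZWNJ (U+200C) and ZWJ (U+200D) are allowed after the first character.
        return ch in "$\u200c\u200d" or ("a" + ch).isidentifier()

    return is_start(name[0]) and all(is_part(ch) for ch in name[1:])


for raw in [r"\u0061bc", r"caf\u{e9}", r"\u0031x"]:
    name = decode_ident(raw)
    print(repr(raw), "->", repr(name), is_valid_ident(name))
```

With decoded names, comparing idents by string equality treats `\u0061bc` and `abc` as the same variable, while an escape that decodes to an invalid identifier, such as `\u0031x` (a leading digit), is rejected.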

hhugo · 2023-01-17T13:47:31Z

The last commit decodes escape sequences in idents.

compiler/lib/flow_lexer.ml

dbuenzli

Just had a look at the last commit. I don't claim I understood all of it, but the tests look right :-)

compiler/lib/flow_lexer.ml

hhugo force-pushed the modern-js-lexer branch from c50608c to dfc45e6 on January 16, 2023 15:52

hhugo force-pushed the modern-js-lexer branch from 2949e31 to cf47b57 on January 17, 2023 14:10

dbuenzli reviewed Jan 17, 2023

compiler/lib/flow_lexer.ml (outdated)

hhugo force-pushed the modern-js-lexer branch from cf47b57 to 1bfeab5 on January 17, 2023 16:12

hhugo requested a review from dbuenzli January 18, 2023 10:37

hhugo mentioned this pull request Jan 18, 2023

ES6 Support #508

Closed

dbuenzli approved these changes Jan 18, 2023

compiler/lib/flow_lexer.ml (outdated)

compiler/lib/flow_lexer.ml (outdated)

hhugo added 16 commits January 19, 2023 10:53

Compiler: flow parser, step 1, update js_token

780aa87

Compiler: fix js number literal

cfd83b5

Compiler: small refactoring

458f60e

Compiler: Abstract Parse_js.Lexer.t

2d00d69

Compiler: cleanup parser

697abee

Compiler: Flow, step 2, lexer is utf8 aware

ce9ecb5

Compiler: recognize unicode ident

b118ea2

OPAM: fix deps

e90dba8

CHANGES

4c007e2

Compiler: small cleanup

3184ca4

Compiler: make sure old and new lexer agree.

e82d66f

Compiler: tune type of labels

a7f1dba

Compiler: remove unused EQuote constructor

f62a07f

Compiler: decode escape sequence in ident

43d3afa

Compiler: remove some deadcode

8ae0ba0

Compiler: fast-path for utf8 ident validation

be83e83

hhugo added 4 commits January 19, 2023 10:53

Compiler: better errors for the lexer

9eb936c

Compiler: remove more deadcode in lexer

d031f97

Compiler: report error from lexer

428befc

Compiler: better handler of unicode ident

d9438e7

hhugo force-pushed the modern-js-lexer branch from fd666e6 to d9438e7 on January 19, 2023 09:53

hhugo merged commit 38bdb92 into master Jan 19, 2023

hhugo deleted the modern-js-lexer branch January 19, 2023 14:18
