Rewritten lexer using Megaparsec supporting hex and bytes literals #1750

pchiusano · 2020-11-19T22:40:41Z

(You can start reviewing, but hold off on merging: I have one todo left, making lexer errors print out nicely.)

This is a drop-in replacement for the old lexer, totally rewritten using Megaparsec. This was done to support new documentation syntax, which will require some lexer fanciness. The code isn't really much shorter, but it's now more straightforward parser combinator code. The layout handling, however, is still pretty finicky.

I added a couple features during the rewrite:

Hexadecimal and octal literals, like 0xdeadbeef or 0o012723, which are parsed as the corresponding Nat.
Bytes literals, which start with 0xs (for instance 0xs9ab38bcf83ffabb9012) and are used by the pretty-printer.
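
To make the literal forms above concrete, here is a minimal Megaparsec sketch of how such literals could be lexed. It is not the code from this PR: the `Literal` type and the `literal` parser are hypothetical names for illustration, and the real `Lexeme` type in `Unison.Lexer` is different. The sketch assumes bytes literals keep their hex digits as text.

```haskell
import Data.Void (Void)
import Numeric.Natural (Natural)
import Text.Megaparsec (Parsec, parseMaybe, some, (<|>))
import Text.Megaparsec.Char (hexDigitChar, string)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Hypothetical token type, for illustration only.
data Literal
  = NatLit Natural   -- value of a decimal, 0x..., or 0o... literal
  | BytesLit String  -- hex digits that follow the 0xs prefix, kept as text
  deriving (Eq, Show)

literal :: Parser Literal
literal =
      -- "0xs" is tried before "0x" because the two prefixes overlap.
      -- `string` does not consume input when it fails, so falling
      -- through to the next alternative is safe here.
      BytesLit <$> (string "0xs" *> some hexDigitChar)
  <|> NatLit <$> (string "0x" *> L.hexadecimal)
  <|> NatLit <$> (string "0o" *> L.octal)
  <|> NatLit <$> L.decimal

main :: IO ()
main = mapM_ (print . parseMaybe literal)
  ["0xdeadbeef", "0o012723", "0xs9ab38bcf83ffabb9012", "42"]
```

The ordering of alternatives is the only subtle part: once a prefix such as `0x` has been consumed, a missing digit is reported as an error at that position rather than silently falling back to a decimal parse.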

Interesting/controversial decisions

I was going to add a syntax for identifiers that include spaces or other special characters, using surrounding double backticks (or any number of enclosing backticks that's greater than 1). The lexer change was very easy, but the pretty-printer change seemed annoying and hard to do properly because of how names are just an opaque text value currently. So I left this feature commented out for now.
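
For reference, a rough sketch of what the lexer side of that feature might look like in Megaparsec. This is an assumption-laden illustration, not the commented-out code from the PR: `quotedName` is a hypothetical name, and the sketch assumes the closing delimiter must repeat exactly the opening run of backticks.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, anySingle, parseMaybe, some, someTill)
import Text.Megaparsec.Char (char, string)

type Parser = Parsec Void String

-- Hypothetical parser: an identifier wrapped in two or more backticks.
quotedName :: Parser String
quotedName = do
  -- One backtick followed by at least one more, so a single backtick
  -- never opens a quoted name.
  ticks <- (:) <$> char '`' <*> some (char '`')
  -- Take at least one character, stopping at the first run of
  -- backticks that matches the opening delimiter.
  someTill anySingle (string ticks)

main :: IO ()
main = mapM_ (print . parseMaybe quotedName)
  ["``hello world``", "```a``b```", "`x`"]
```

Allowing longer delimiters lets a name contain shorter runs of backticks, which is presumably the motivation for permitting any count greater than 1.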

Test coverage

There's good coverage from all the existing tests and transcripts. The new bytes literals feature has some testing from existing transcripts. I added some tests of the hex and octal literals as well to the term printer test suite.

I added a new transcript (error-messages.md) demonstrating some of the new error messages.

Loose ends

I wasn't able to delete some of the old wordy and symboly id parsers: they are still used in path parsing, which I didn't want to mess with for this PR. I'd like to clean that up at some point.

…unction behavior (untested)

Previously, it was popping the layout stack even when the parser was in the 'opening' state. Correct behavior when in 'opening' state is just to push onto the stack and run the parser. Popping and/or emitting virtual semis should only happen when not in the 'opening' state.
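
The rule described above can be modeled in a few lines of plain Haskell. This is a simplified, hypothetical model rather than the code in `Unison.Lexer`: the names `LayoutState`, `Virtual`, and `beforeToken` are invented for the sketch, and it assumes the layout stack holds the indentation column of each open block.

```haskell
type Column = Int

-- Virtual tokens a layout-aware lexer may insert (hypothetical names).
data Virtual = VirtualSemi | VirtualClose
  deriving (Eq, Show)

data LayoutState = LayoutState
  { opening :: Bool      -- True right after a block-opening keyword
  , stack   :: [Column]  -- columns of open blocks, innermost first
  } deriving (Eq, Show)

-- Decide what to emit before a token that starts at column `col`.
beforeToken :: Column -> LayoutState -> ([Virtual], LayoutState)
beforeToken col st
  -- In the 'opening' state: only push; never pop, never emit.
  | opening st = ([], st { opening = False, stack = col : stack st })
  -- Otherwise: pop every block indented deeper than this token, and
  -- emit a virtual semicolon if the token lines up with the block
  -- that remains on top.
  | otherwise =
      let (closed, rest) = span (> col) (stack st)
          semi = [VirtualSemi | take 1 rest == [col]]
      in (map (const VirtualClose) closed ++ semi, st { stack = rest })

main :: IO ()
main = do
  print (beforeToken 3 (LayoutState True [1]))
  print (beforeToken 3 (LayoutState False [3, 1]))
  print (beforeToken 1 (LayoutState False [3, 1]))
```

The bug described above corresponds to running the second branch even when `opening` is set, which would close enclosing blocks before the new one is pushed.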


aryairani · 2020-11-20T16:58:16Z

Sounds good. Agree we should change Name to something structured and then support escaping segments. And then pull out syntax into an interface :)


pchiusano · 2020-11-20T18:19:28Z

Okay, I think this is all set.

I did some work to improve the error reporting and added a transcript to demonstrate (error-messages.md). Feel free to add other failing code to that transcript to check for sensible error messages.

@mitchellwrosen you can take a look if you like, otherwise I will merge in a day or two assuming no objections from anyone else.


mitchellwrosen · 2020-11-20T23:15:41Z

parser-typechecker/src/Unison/Lexer.hs

+
+      num :: Maybe String -> Integer -> Lexeme
+      num sign n = Numeric (fromMaybe "" sign <> show n)
+      sign = P.optional (lit "+" <|> lit "-")


I think the result of this is always fromMaybe ""'d - if so, it could be refactored to

sign = lit "+" <|> lit "-" <|> pure ""
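
A small self-contained check of that equivalence, as a sketch: `lit` here is a stand-in that simply matches a literal string (the PR's own `lit` may do more), and `signedA` / `signedB` are hypothetical names for the two shapes.

```haskell
import Data.Maybe (fromMaybe)
import Data.Void (Void)
import Text.Megaparsec (Parsec, optional, parseMaybe, (<|>))
import Text.Megaparsec.Char (string)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Stand-in for the PR's `lit`; assumed to match a literal string.
lit :: String -> Parser String
lit = string

-- Current shape: the sign is a Maybe, unwrapped at the use site.
signedA :: Parser String
signedA = do
  s <- optional (lit "+" <|> lit "-")
  n <- (L.decimal :: Parser Integer)
  pure (fromMaybe "" s <> show n)

-- Suggested shape: the sign parser itself defaults to "".
signedB :: Parser String
signedB = do
  s <- lit "+" <|> lit "-" <|> pure ""
  n <- (L.decimal :: Parser Integer)
  pure (s <> show n)

main :: IO ()
main = do
  let inputs = ["+12", "-7", "42"]
  print (map (parseMaybe signedA) inputs)
  print (map (parseMaybe signedB) inputs)
```

The two agree as long as no caller needs to distinguish "no sign" from an explicit sign, which is the condition the comment above hinges on.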

mitchellwrosen · 2020-11-20T23:24:59Z

LGTM! My main feedback is that the implementation looks a little bit "stringy" - like if you had to get in there and amend something, you might have to lean on a comprehensive test suite rather than the type checker to make sure you've handled all the cases. But for a lexer, that's probably expected :)

pchiusano added 30 commits November 6, 2020 13:38

wip on megaparsec-based lexer

185bbc4

more wip

c6c136a

filled in rest of main body of lexer (untested)

dc2ec53

note to self

253472a

Finished draft of lexer0 function intended to match existing lexer0 f…

b7a35f4

…unction behavior (untested)

Hooked up new lexer, debugging...

032d468

Fixed a bunch of bugs

796c82b

Fix issue - with keyword closes a handle block or a match block

d2581fd

more debugging

6789ed1

315 tests failing

2fa8780

fix floating point parsing

cf3d7dd

285 failures

ba427c1

263 failures

ad35aeb

down to 8 lexer failures

133ee55

tweaks

a16f4da

6 lexer failures

566afc1

down to 3 lexer failures

c86020a

Down to 1 lexer test failure

7e9aba8

all lexer tests passing!

20d5353

183 total test failures

bdc5471

fix a bunch more failures

e9ea538

down to 22 failures! zing

4229695

down to 14 test failures

3cd148f

12 failures

db9bbc1

down to 1 failure!

abc1982

(old) doc parsing, all tests pass

a606fba

All tests and transcripts passing

0a5c9bb

Delete old lexer

c17ee6d

cleanup

bc7a9ad

pchiusano added 5 commits November 19, 2020 15:54

disabling litSeg for now since pretty-printer isn't really set up to …

bb036cc

…print these properly

prettyprinting of bytes literals

0803c6f

update parser for bytes literals and fix lexer hex and bytes parsing bug

eed0c5a

rerun transcripts

f4a7ebe

Merge remote-tracking branch 'origin/trunk' into topic/lexer2

20e5676

pchiusano marked this pull request as ready for review November 20, 2020 01:34

runarorama approved these changes Nov 20, 2020


pchiusano added 2 commits November 19, 2020 21:46

Add some tests of hex and octal literals

062f4c2

some comments and cleanup of doc parser

be0906b

aryairani requested a review from mitchellwrosen November 20, 2020 16:55

pchiusano added 4 commits November 20, 2020 12:14

working on nicer error messages

aacc460

Get rid of some backtracking that was generating bad error messages, …

b6dd546

…and make messages prettier

working on removing some backtracking

ca5efc3

remove some unused backtracking

b1c83e2