CS2 Discussion: Project: Revisiting implementation of a new parser #77
Hello. I'm a long-time lurker here and would like to (re?)-raise a proposal.
Sorry for the long post 😄
The What:
I'd like to re-raise the issue of implementing a new parser for CoffeeScript.
Some previous related discussions:
- CS2 Discussion: Question: New Compiler Starting Point #25 (comment)
- CS2 Discussion: Project: Project Goals #21 (comment)
- CS2 Discussion: Question: Ensuring long-term maintainability of existing CoffeeScript codebases #33
- Proposal: Ecosystem: Babel plugin? #76
Scope:
The scope is intentionally limited to only creating a new parser.
There is no intent to touch the lexer or the re-writer, nor to modify the code generation parts.
The why:
As previously discussed, the existing CS compiler infrastructure is a limiting
factor for CoffeeScript in the long term:
- String-based code generation
- Incompatibility with Babel AST structures
- Difficulty in expanding Jison's capabilities
  - See pulse of Jison.
  - See commit graph of Jison.
  - This matters particularly for language-services capabilities such as error recovery / partial parsing.
Replacing the whole pipeline at once requires more resources than are available to this project.
And even if those resources were available, it would still be a very risky approach.
Therefore an incremental approach is needed.
Architecture:
I propose to create a separation between the syntactic analysis and the AST creation.
This means that logic that creates the AST must not be embedded inside the parser.
Instead, the parser should create a lower-level structure, a Parse Tree / Concrete Syntax Tree (CST),
which could be transformed afterwards to serve different needs, for example:
- Transformation to create the existing CS AST to support the existing compiler backend.
- Transformation to create a Babel AST to support a new experimental compiler backend.
- Transformation to an enriched AST structure that represents the entire syntactic information to support
  language-services tools such as formatting & refactoring.
This proposed separation of concerns will help to future-proof the CoffeeScript compiler
by enabling future incremental changes such as replacing the compiler backend without
modifying (or diverging from) the compiler frontend (parser).
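The separation described above can be illustrated with a minimal sketch. All names here (`CstNode`, `Token`, `toAst`, `tokenOffsets`) are hypothetical and chosen only for illustration; they are not the existing CoffeeScript AST, the Babel AST, or Chevrotain's CST format. The point is only that one lossless parse tree can feed several independent transformations.

```typescript
// Hypothetical shapes for illustration only.

// A token keeps its text and source offset, so no syntactic detail is lost.
interface Token {
  type: string;
  image: string;
  offset: number;
}

// A CST node records which grammar rule matched and all children in source order.
interface CstNode {
  rule: string;
  children: Array<CstNode | Token>;
}

function isToken(x: CstNode | Token): x is Token {
  return "image" in x;
}

// A CST for the input `x = 1`, written out by hand as a parser might emit it.
const assignCst: CstNode = {
  rule: "assignment",
  children: [
    { type: "Identifier", image: "x", offset: 0 },
    { type: "Equals", image: "=", offset: 2 },
    { rule: "literal", children: [{ type: "Number", image: "1", offset: 4 }] },
  ],
};

// Transformation 1: a compact AST for a compiler backend (punctuation dropped).
type Ast =
  | { kind: "Assign"; name: string; value: Ast }
  | { kind: "NumberLiteral"; value: number };

function toAst(node: CstNode): Ast {
  switch (node.rule) {
    case "assignment": {
      const [target, , value] = node.children;
      if (!isToken(target) || isToken(value)) throw new Error("malformed assignment");
      return { kind: "Assign", name: target.image, value: toAst(value) };
    }
    case "literal": {
      const [tok] = node.children;
      if (!isToken(tok)) throw new Error("malformed literal");
      return { kind: "NumberLiteral", value: Number(tok.image) };
    }
    default:
      throw new Error(`unknown rule: ${node.rule}`);
  }
}

// Transformation 2: a tooling-oriented view that keeps every token and its position.
function tokenOffsets(
  node: CstNode,
  out: Array<[string, number]> = []
): Array<[string, number]> {
  for (const child of node.children) {
    if (isToken(child)) out.push([child.image, child.offset]);
    else tokenOffsets(child, out);
  }
  return out;
}
```

Calling `toAst(assignCst)` and `tokenOffsets(assignCst)` on the same tree shows the idea: a backend could be swapped by writing a new transformation, without changing the parser that produced the CST.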
The How:
Warning: sales pitch incoming.
The standard approach to writing a parser for a compiler is to write it "by hand".
- See this quote from Terence Parr (the creator of ANTLR):
  "In my experience, almost no one uses parser generators to build commercial compilers."
The problem with this approach is that the work can be repetitive and error-prone,
and that implementing more advanced capabilities such as fault tolerance can be complex.
Fortunately, the last time I needed to write a hand-built parser I was too lazy 😸 and instead
created Chevrotain, a library that makes it easier to hand-build parsers in JavaScript
without any code generation.
Relevant Highlights:
- JSON Grammar example in CoffeeScript.
- Performance Benchmark (Jison included).
  - About one order of magnitude faster than Jison on Chrome 57.
- Automatic CST output creation.
  - This is an online playground; the first example also creates a CST output.
- Automatic fault tolerance capabilities.
The proposal is to write the new CoffeeScript parser in CoffeeScript (no code generation),
using the Chevrotain parsing library.
The who:
I can contribute enough time to try implementing this.
I obviously can't make any promises, but this won't be the first parser I've written, so I've got a decent
chance of success.
Risks & Issues:
- Factoring away left recursion (for an LL(k) parser) may result in uglier parse trees.
- Do CoffeeScript's tokens contain full position information?
  - A worst case may require changes to the re-writer or even replacing the whole
    lexer -> re-writer -> parser flow, but that is a less incremental approach.
- My CoffeeScript skills are lacking; I may require assistance in getting the code to a decent quality.
- The contents and structure of error messages for invalid inputs will change.
- Testing that the AST output is the same requires a large amount of valid CS source code.
- The additional abstraction and separation will have a performance overhead.
  - Should be mitigated by the higher base performance of Chevrotain vs Jison.
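To make the left-recursion risk listed above concrete, here is a minimal hand-written sketch. It does not use Chevrotain's API, and the names (`parseSubtraction`, `FlatSubtraction`, `toLeftAssociative`) are hypothetical. It assumes whitespace-separated tokens and a single toy rule. The factored LL-style rule yields a flat operand list rather than a nested tree, and a small post-processing step restores the left-associative shape.

```typescript
// Left-recursive form (not directly usable by an LL(k) parser):
//   subtraction : subtraction "-" Number | Number
// Factored form with iteration instead of recursion:
//   subtraction : Number ("-" Number)*

// The factored rule naturally produces a flat list, not a nested tree.
interface FlatSubtraction {
  operands: number[];
}

// Toy recursive-descent parser for the factored rule.
// Assumption: tokens are separated by whitespace, e.g. "10 - 3 - 2".
function parseSubtraction(input: string): FlatSubtraction {
  const tokens = input.split(/\s+/).filter((t) => t.length > 0);
  let pos = 0;
  const readNumber = (): number => {
    const tok = tokens[pos++];
    if (tok === undefined || !/^\d+$/.test(tok)) {
      throw new Error(`expected a number at token ${pos - 1}`);
    }
    return Number(tok);
  };
  const operands: number[] = [readNumber()];
  while (pos < tokens.length) {
    if (tokens[pos] !== "-") throw new Error(`expected "-" at token ${pos}`);
    pos++;
    operands.push(readNumber());
  }
  return { operands };
}

// The tree shape the left-recursive grammar would have produced directly.
type BinaryTree = number | { op: "-"; left: BinaryTree; right: number };

// Post-processing step: fold the flat list back into a left-associative tree.
function toLeftAssociative(flat: FlatSubtraction): BinaryTree {
  const [first, ...rest] = flat.operands;
  return rest.reduce<BinaryTree>((left, right) => ({ op: "-", left, right }), first);
}

function evaluate(tree: BinaryTree): number {
  return typeof tree === "number" ? tree : evaluate(tree.left) - tree.right;
}
```

For the input `"10 - 3 - 2"`, the parser returns the operands `[10, 3, 2]`, and folding them gives `(10 - 3) - 2`. In the proposed architecture this kind of reshaping would live in the CST-to-AST transformation, so the "uglier" parse tree need not leak into later stages.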
Questions:
- Any feedback / suggestions?
- Am I missing some blocker or potential show stopper here?
- Is this approach acceptable/approved by the project leaders?
- If a POC succeeds will there be assistance in integrating this into the CoffeeScript code base?
- What percentage of the CoffeeScript running time is spent parsing?
  - I'm trying to figure out the potential performance benefits for an E2E compilation flow.