
Parser redesign #880

Closed
wants to merge 27 commits into from

Conversation

o11c (Contributor) commented Sep 24, 2015

Don't merge yet! See below. The basic design is sound though.

This is a massive rewrite of the parser infrastructure to provide more useful information.

Features:

  • Good support for every Python dialect since 2.0 (currently too permissive in some cases)
  • Not tied to the current interpreter version.
  • Cleanly handle mypy-specific extensions
  • Never stops or throws an exception if there is a syntax error - always produces some parse tree (see the sketch below this list).
  • Produce high-quality error messages for all errors (for some value of "high-quality").
  • Full span info, not just lines, efficiently
  • Complete parse tree, not just AST, which can be easily used for pretty-printing.
  • Comments as first-class entities
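
A toy, self-contained sketch of the error-recovery idea in the list above (an invented mini-parser, not this PR's actual API): a statement that fails to parse becomes an explicit error node carrying its position, and parsing simply continues with the next line.

    import re
    from typing import List, NamedTuple, Union

    Assign = NamedTuple('Assign', [('line', int), ('name', str), ('value', str)])
    ErrorStmt = NamedTuple('ErrorStmt', [('line', int), ('column', int), ('message', str)])
    Stmt = Union[Assign, ErrorStmt]

    _ASSIGN_RE = re.compile(r'^(\w+)\s*=\s*(.+)$')

    def parse_lines(source):
        # type: (str) -> List[Stmt]
        """Parse 'name = value' lines; unparsable lines become ErrorStmt nodes."""
        stmts = []  # type: List[Stmt]
        for lineno, text in enumerate(source.splitlines(), 1):
            if not text.strip():
                continue
            m = _ASSIGN_RE.match(text)
            if m:
                stmts.append(Assign(lineno, m.group(1), m.group(2)))
            else:
                # Record the failure and keep going instead of raising.
                stmts.append(ErrorStmt(lineno, 1, 'could not parse statement'))
        return stmts

    print(parse_lines("x = 1\n= oops\ny = 2\n"))
    # [Assign(line=1, ...), ErrorStmt(line=2, ...), Assign(line=3, ...)]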

Todo:

  • Add tests and fix all the bugs (some assertions are still firing on all nontrivial input, but I suspect I know where the error is)
  • Reject code that is invalid for a given python version.
  • Write lowering logic. Requires a rebase onto Fix/Improve Test Driver #721 - stop holding it up already!
  • Optionally parse known uses of docstrings and comments (partially needed for lowering)
  • Option to output pretty XML/HTML, closely tied to Implement XML-based reports #713 (can be done after merge; will likely make the mypy.syntax.parser entry point public)
  • Add a cache and parse deltas when possible (hard, has odd span effects, and parsing isn't the expensive part anyway - requires much thought about the rest of mypy).

o11c force-pushed the syntax branch 3 times, most recently from 8372f6c to 9cb9364 on September 25, 2015 21:19
This was referenced Sep 29, 2015
o11c (Contributor, Author) commented Oct 2, 2015

@JukkaL I tried dropping o11c@569d17e but it looks like it's still necessary; I have no idea what it should actually be doing, though.

o11c (Contributor, Author) commented Oct 3, 2015

Okay, rebased on top of rebased driver.

Haven't worked on lint issues yet - still have testcases to write first.

JukkaL (Collaborator) commented Oct 6, 2015

I still can't get it to work in standalone parser mode:

$ python3 -m mypy.syntax.parser mypy/main.py
Traceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1665, in <module>
    main()
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1657, in main
    rv += _do_input(cm, dialect, a, t)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1641, in _do_input
    file = parse(cm, dialect, name, txt)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1620, in parse
    return Parser(cm, dialect).parse(name, txt)[0]
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 314, in parse
    rv = self.do_parse_file_input(token_file)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 320, in do_parse_file_input
    return self.parse_file_input(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 334, in parse_file_input
    stmts.append(self.parse_stmt(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 382, in parse_stmt
    rvf = self.parse_func_def(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 532, in parse_func_def
    colon, suite = self.parse_colon_suite(pkl)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 716, in parse_colon_suite
    return (colon, self.parse_complex_suite(pkl))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 943, in parse_complex_suite
    stmts = self.do_parse_block(block)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 440, in do_parse_block
    return self.parse_block(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 449, in parse_block
    stmts.append(self.parse_stmt(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 356, in parse_stmt
    return self.parse_try_stmt(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 485, in parse_try_stmt
    except_suites.append(self.parse_except_suite(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 612, in parse_except_suite
    colon, suite = self.parse_colon_suite(pkl)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 716, in parse_colon_suite
    return (colon, self.parse_complex_suite(pkl))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 943, in parse_complex_suite
    stmts = self.do_parse_block(block)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 440, in do_parse_block
    return self.parse_block(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 449, in parse_block
    stmts.append(self.parse_stmt(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 354, in parse_stmt
    return self.parse_for_stmt(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 474, in parse_for_stmt
    for_suite = self.parse_for_suite(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 583, in parse_for_suite
    pkl = pkb.enter_token_line()
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 250, in enter_token_line
    assert isinstance(tl, TokenLine), type(tl).__name__
AssertionError: TokenBlock

Can you repro the issue?

o11c (Contributor, Author) commented Oct 6, 2015

Ah, that's support for async stuff breaking something again. Haven't gotten to unit tests for compound statements yet (though I know that if and def work), just simple statements.

Thinking about merging the trees for testlistnocond, testlist, and sequenceitemlist, maybe slices ... and compfor/compif (just keep the different parsers) ... python3.5 is so inconsistent, e.g. x = 1, *[2, 3] vs x += 1, *[2, 3].
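
For context, the inconsistency mentioned here, as it stood in Python 3.5 (newer CPython grammars have since relaxed the augmented-assignment case):

    # PEP 448 unpacking is accepted on the right of a plain assignment ...
    x = 1, *[2, 3]
    print(x)                                  # (1, 2, 3)

    # ... but the Python 3.5 grammar rejected it in an augmented assignment.
    try:
        compile("x += 1, *[2, 3]", "<test>", "exec")
    except SyntaxError as e:
        print("rejected:", e.msg)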

o11c (Contributor, Author) commented Oct 6, 2015

Am busy until this afternoon/evening. Since I know this one is incomplete, it may be more profitable for you to look at #887 (which this is not yet rebased on top of) and, of course, #721. But mypy.syntax.lexer should be complete.

JukkaL (Collaborator) commented Oct 12, 2015

OK, I gave this a cursory look. Here are just some general high-level observations. As mentioned below, reviewing this will be a huge amount of work, and in order to make my time investment worthwhile this needs to improve things in general for mypy. I can't do a detailed review until it works and it looks likely that it will be a net benefit (the benefit of having the PR minus the cost of reviewing and related overhead). I'm happy to give high-level feedback (similar to this) while you are working on this, however, now that I've finally merged the other PRs.

Performance

I mentioned previously that type checking performance is important. I still have no good idea of how this performs relative to the old parser, as I haven't been able to try it with any non-trivial pieces of code. I'd recommend addressing this first by posting some benchmarks (and instructions for how to run those benchmarks). Currently one of the biggest obstacles for wider use of mypy is type checking speed. If we get new features, it's okay to sacrifice a little speed, but even a 10% performance regression (for the whole type checker, which would mean maybe 30% parser performance regression) would be painful. The original parser wasn't written for efficiency really, so it may be possible to actually improve performance. I hope that's the case!
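
For what it's worth, a minimal sketch of the kind of benchmark being asked for here: time an arbitrary parse callable over a directory of .py files. The old/new entry points to plug in are left as placeholders, since their exact signatures aren't shown in this thread.

    import glob
    import time
    from typing import Callable

    def bench(parse, pattern='mypy/*.py', repeat=5):
        # type: (Callable[[str], object], str, int) -> float
        """Return the best wall-clock time to parse every matching file once."""
        sources = []
        for path in glob.glob(pattern):
            with open(path, encoding='utf-8') as f:
                sources.append(f.read())
        best = float('inf')
        for _ in range(repeat):
            start = time.perf_counter()
            for src in sources:
                parse(src)
            best = min(best, time.perf_counter() - start)
        return best

    # Usage (placeholder entry points, not confirmed APIs):
    #   print('old parser:', bench(old_parse))
    #   print('new parser:', bench(new_parse))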

Size

This thing is frickin' huge! I didn't realize this until now. It's 11 kLOC (including tests) whereas the current mypy master (everything under mypy/) is about 17 kLOC. It really needs to improve things a lot to get merged. Just reviewing it thoroughly would probably take me at least one week working full time. As I'm spending about 4-8 hours a week on mypy, that means that reviewing it would probably take me months even if I didn't do anything else mypy related :-/

Clarity/documentation

I couldn't understand how the implementation works at a high level. Code clarity needs to be at least similar to the rest of the mypy implementation, and it isn't quite there yet. Adding comments and docstrings would probably help, but the most important thing would be to write a top-level summary or design doc of how things work and interact with the rest of mypy, with concrete examples. The current mypy implementation is mostly optimized for ease of modification and understanding, and I want to preserve that.

Scope

This seems to address many separate issues in one go. As I've mentioned before, I'd much prefer having changes done incrementally rather than in a rewrite-the-world fashion. Addressing a single issue at a time makes things much easier to review and makes it easier to evaluate the impact (cost/benefit) to mypy. Maybe it's too late for this PR, but I hope any future changes approach things more incrementally. So in general a PR should be mostly about refactoring something (specific) or mostly about adding a feature, but preferably not both. This is clearly refactoring (it rewrites a ton of stuff) but it also adds features.

Style

I'm worried that the design doesn't seem to quite follow the style used in the rest of mypy. Again, stylistic cohesion is important for the maintainability of mypy. Some individual things:

  • The tests don't use the same approach as the rest of mypy. Is there a good reason for that? The old parser tests use .test files which have worked well for me, and these are used for other tests as well.
  • Code formatting and styling is different from the rest of mypy. Lint will catch many of these issues but not all.
  • The new parser uses from x import * which the rest of mypy does not use (on purpose).
  • Naming of things is different from the current mypy implementation. Often there seems to be no obvious benefit to this (e.g., Subscription vs. IndexExpr). I'd rather see the existing conventions used, as it will make life easier for everybody who's already familiar with the code. I'm not saying that your conventions are any worse (they could well be better) but cohesion and continuity are more important than having any specific terminology.

Costs/benefits

I'd like to see a short summary of the tradeoffs in the parser design. I.e., what it does better than the old parser (and how), with examples.

Concrete next steps

In summary, to avoid a similar bad experience as with the test driver, I suggest these concrete steps (in this order, before further work on this PR):

  1. Respond to the issues I raised above (no need to address them yet) and say what you think about them.
  2. Create issues for things that you'll be working on. I can create a tag for these. This is a sufficiently big PR that it's hard to manage it just as a PR and follow progress. We can move discussion to multiple GitHub issues. Maybe we can use your own repo fork to track progress and file issues? I haven't tried this workflow but it might work here.
  3. Come up together with a plan for how to move this forward in increments, including expectations for code review load. For example, we could agree that I'll try to review piece X of the code by date Y, and I can then merge that part of the code to master, even if it won't be used by default yet.

o11c (Contributor, Author) commented Oct 12, 2015

I'm not as likely to get this done as soon as I was thinking - there are some error cases that don't generate messages nearly as good as they should. That said, one thing I would like to do is replace more of the linear searches with hash lookups.

The three major benefits of the new lexer/parser are:

  • Produce errors with column information (so that your editor can jump right to them instead of to the start of the line)
  • Produce caret diagnostics showing the input line, with a squiggle for the length of the span (makes it much easier to see the error; a sketch follows below)
  • Keep going after errors (you don't have to restart mypy after fixing every syntax error; you can fix them all at once).
    • Note that, to the extent we write lowering to support it, we can even proceed to type checking - most syntax errors will be recovered from very quickly (and my planned work will recover even faster), and we can just say that the type of error expressions is Any and ignore any statements with errors.
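
A toy illustration of the caret/squiggle output described above (the function is invented for this example; the PR renders its diagnostics from the CodeMap):

    def render_diagnostic(filename, line_text, lineno, col, length, message):
        # type: (str, str, int, int, int, str) -> str
        """Show the offending line with a caret and a squiggle under the span."""
        marker = ' ' * col + '^' + '~' * max(length - 1, 0)
        return '{}:{}:{}: error: {}\n{}\n{}'.format(
            filename, lineno, col + 1, message, line_text, marker)

    print(render_diagnostic('example.py', 'x = foo(1,, 2)', 3, 10, 1,
                            'invalid syntax'))
    # example.py:3:11: error: invalid syntax
    # x = foo(1,, 2)
    #           ^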

I'm not sure what problem you have with scope - I hardly see any unrelated things being forced together here, though it would probably be possible to use the new lexer with the old parser if you really wanted to force that. Note that I have some local work in progress to refactor out Bisector and a replacement for Peekable, and I've already refactored out Dialect. But my refactoring in myunit (see #908) has to come first (except the Dialect bit, which is usable right now, see #887 - I will probably modify mypy to use that and fix #732 before I even get back to this).

I am quite aware that I haven't done benchmarking yet, but even with what I have, the (end-user) usability is greatly improved.

The tree names I used follow the ones in the Python documentation; I did not consult mypy's previous names at all. Unfortunately, Python's documentation contradicts itself on the names ...

Let's look at the number of lines of code (numbers from my working copy; they might not line up exactly with the pushed version), by component:

  • Dialect:
  • Span:
    • mypy/syntax/span.py: 401 lines
    • Not trivial, but there's nothing terribly complicated here - just the infrastructure for objects that have rich span information, and what you can do with them. It should be very familiar to anyone who's worked on a real compiler (i.e. one that can continue from errors and give good messages - CPython is not one).
    • The 30-line Bisector class will be refactored out in a bit since it is useful elsewhere.
    • One thing that might be confusing at first is that a Span object is just a pair of integers, and that the human-readable version has to be looked up in CodeMap. This is a significant performance win in other tools because a lot of spans are constructed at parse time, so they need to be cheap unless you need one specifically for an error (see the sketch after this list).
    • I anticipate also using rich spans for .test files.
  • Lexer:
    • mypy/lex.py: 886 lines
    • mypy/syntax/tokens.py: 193 lines
    • mypy/syntax/lexer.py: 661 lines
    • new total: 854 lines
    • New lexer is shorter, produces more information, and is probably faster since regexes run in C.
  • Parser:
    • mypy/nodes.py: 1774 lines
    • mypy/parse.py: 1860 lines
    • mypy/parsetype.py: 250 lines
    • mypy/syntax/tree.py: 2765 lines
    • mypy/syntax/parser.py: 1681 lines
    • The parser itself is quite a bit shorter (including utility classes that I'm trying to refactor out), though admittedly I'm not completely done with it yet. tree.py is longer than nodes.py because it preserves things like commas, but the module itself is trivial except for the TreeError logic (which mypy previously didn't have at all).
    • Also mypy.syntax.lowering hasn't been written yet.
  • Tests:
    • mypy/test/testlex.py: 450 lines
    • mypy/test/testparse.py: 74 lines
    • mypy/test/data/parse-errors.test: 453 lines
    • mypy/test/data/parse-python2.test: 173 lines
    • mypy/test/data/parse.test: 3336 lines
    • old total: 4486 lines
    • mypy/syntax/test/test_lexer.py: 1854 lines
    • mypy/syntax/test/test_parser.py: 4238 lines
    • mypy/syntax/test/test_span.py: 297 lines
    • new total: 6389 lines
    • My tests are far more complete, and check errors much more accurately. And the parser tests aren't even done yet.
    • I didn't use .test files because you invented your own unit test framework, so you shouldn't expect people to know how it works, especially for something complex like "this test should produce different output/errors depending on which dialect of Python we are targeting". By the time I finish Plans for mypy.myunit #908 I should be familiar enough to add the necessary data parser hooks, though.
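
As mentioned in the Span bullet above, here is a minimal sketch, under assumptions, of the "a Span is just a pair of integers" design: spans are cheap offset pairs, and the expensive offset-to-(file, line, column) translation happens only when a message is actually rendered. Class and method names are illustrative, not the PR's real API.

    import bisect
    from typing import List, NamedTuple, Tuple

    Span = NamedTuple('Span', [('begin', int), ('end', int)])  # just two offsets

    class CodeMap:
        """Cheap spans; the human-readable lookup is done only on demand."""

        def __init__(self, name, text):
            # type: (str, str) -> None
            self.name = name
            self.text = text
            self._line_starts = [0]  # type: List[int]
            for i, ch in enumerate(text):
                if ch == '\n':
                    self._line_starts.append(i + 1)

        def describe(self, span):
            # type: (Span) -> Tuple[str, int, int]
            """Translate a span's start offset into (file, 1-based line, column)."""
            line = bisect.bisect_right(self._line_starts, span.begin) - 1
            column = span.begin - self._line_starts[line]
            return self.name, line + 1, column

    cm = CodeMap('example.py', 'x = 1\ny = oops(\n')
    print(cm.describe(Span(10, 14)))   # ('example.py', 2, 4)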

o11c closed this Oct 13, 2015
o11c deleted the syntax branch October 13, 2015 16:43