
Parser redesign #880

Closed
wants to merge 27 commits into from

Conversation

o11c (Contributor) commented Sep 24, 2015

Don't merge yet! See below. The basic design is sound though.

This is a massive rewrite of the parser infrastructure to provide more useful information.

Features:

  • Good support for every Python dialect since 2.0 (currently too permissive in some cases)
  • Not tied to the current interpreter version.
  • Cleanly handle mypy-specific extensions
  • Never stops or throws an exception if there is a syntax error - always produces some parse tree (see the sketch below this list).
  • Produce high-quality error messages for all errors (for some value of "high-quality").
  • Full span info, not just lines, efficiently
  • Complete parse tree, not just AST, which can be easily used for pretty-printing.
  • Comments as first-class entities
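
A toy, self-contained sketch of the error-recovery idea in the list above (an invented mini-parser, not this PR's actual API): a statement that fails to parse becomes an explicit error node carrying its position, and parsing simply continues with the next line.

    import re
    from typing import List, NamedTuple, Union

    Assign = NamedTuple('Assign', [('line', int), ('name', str), ('value', str)])
    ErrorStmt = NamedTuple('ErrorStmt', [('line', int), ('column', int), ('message', str)])
    Stmt = Union[Assign, ErrorStmt]

    _ASSIGN_RE = re.compile(r'^(\w+)\s*=\s*(.+)$')

    def parse_lines(source):
        # type: (str) -> List[Stmt]
        """Parse 'name = value' lines; unparsable lines become ErrorStmt nodes."""
        stmts = []  # type: List[Stmt]
        for lineno, text in enumerate(source.splitlines(), 1):
            if not text.strip():
                continue
            m = _ASSIGN_RE.match(text)
            if m:
                stmts.append(Assign(lineno, m.group(1), m.group(2)))
            else:
                # Record the failure and keep going instead of raising.
                stmts.append(ErrorStmt(lineno, 1, 'could not parse statement'))
        return stmts

    print(parse_lines("x = 1\n= oops\ny = 2\n"))
    # [Assign(line=1, ...), ErrorStmt(line=2, ...), Assign(line=3, ...)]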

Todo:

  • Add tests and fix all the bugs (some assertions are still firing on all nontrivial input, but I suspect I know where the error is)
  • Reject code that is invalid for a given python version.
  • Write lowering logic. Requires a rebase onto Fix/Improve Test Driver #721 - stop holding it up already!
  • Optionally parse known uses of docstrings and comments (partially needed for lowering)
  • Option to output pretty XML/HTML, closely tied to Implement XML-based reports #713 (can be done after merge; will likely make the mypy.syntax.parser entry point public)
  • Add a cache and parse deltas when possible (hard, has odd span effects, and parsing isn't the expensive part anyway - requires much thought about the rest of mypy).

o11c force-pushed the syntax branch 3 times, most recently from 8372f6c to 9cb9364 on September 25, 2015 21:19
This was referenced Sep 29, 2015
o11c (Contributor, Author) commented Oct 2, 2015

@JukkaL I tried dropping o11c@569d17e but it looks like it's still necessary; I have no idea what it should actually be doing, though.

o11c (Contributor, Author) commented Oct 3, 2015

Okay, rebased on top of rebased driver.

Haven't worked on lint issues yet - still have testcases to write first.

JukkaL (Collaborator) commented Oct 6, 2015

I still can't get it to work in standalone parser mode:

$ python3 -m mypy.syntax.parser mypy/main.py
Traceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1665, in <module>
    main()
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1657, in main
    rv += _do_input(cm, dialect, a, t)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1641, in _do_input
    file = parse(cm, dialect, name, txt)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 1620, in parse
    return Parser(cm, dialect).parse(name, txt)[0]
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 314, in parse
    rv = self.do_parse_file_input(token_file)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 320, in do_parse_file_input
    return self.parse_file_input(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 334, in parse_file_input
    stmts.append(self.parse_stmt(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 382, in parse_stmt
    rvf = self.parse_func_def(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 532, in parse_func_def
    colon, suite = self.parse_colon_suite(pkl)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 716, in parse_colon_suite
    return (colon, self.parse_complex_suite(pkl))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 943, in parse_complex_suite
    stmts = self.do_parse_block(block)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 440, in do_parse_block
    return self.parse_block(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 449, in parse_block
    stmts.append(self.parse_stmt(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 356, in parse_stmt
    return self.parse_try_stmt(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 485, in parse_try_stmt
    except_suites.append(self.parse_except_suite(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 612, in parse_except_suite
    colon, suite = self.parse_colon_suite(pkl)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 716, in parse_colon_suite
    return (colon, self.parse_complex_suite(pkl))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 943, in parse_complex_suite
    stmts = self.do_parse_block(block)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 440, in do_parse_block
    return self.parse_block(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 449, in parse_block
    stmts.append(self.parse_stmt(pkb))
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 354, in parse_stmt
    return self.parse_for_stmt(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 474, in parse_for_stmt
    for_suite = self.parse_for_suite(pkb)
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 583, in parse_for_suite
    pkl = pkb.enter_token_line()
  File "/home/jukka/project/mypy/mypy/syntax/parser.py", line 250, in enter_token_line
    assert isinstance(tl, TokenLine), type(tl).__name__
AssertionError: TokenBlock

Can you repro the issue?

o11c (Contributor, Author) commented Oct 6, 2015

Ah, that's support for async stuff breaking something again. Haven't gotten to unit tests for compound statements yet (though I know that if and def work), just simple statements.

Thinking about merging the trees for testlistnocond, testlist, and sequenceitemlist, maybe slices ... and compfor/compif (just keep the different parsers) ... python3.5 is so inconsistent, e.g. x = 1, *[2, 3] vs x += 1, *[2, 3].
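
For context, the inconsistency mentioned here, as it stood in Python 3.5 (newer CPython grammars have since relaxed the augmented-assignment case):

    # PEP 448 unpacking is accepted on the right of a plain assignment ...
    x = 1, *[2, 3]
    print(x)                                  # (1, 2, 3)

    # ... but the Python 3.5 grammar rejected it in an augmented assignment.
    try:
        compile("x += 1, *[2, 3]", "<test>", "exec")
    except SyntaxError as e:
        print("rejected:", e.msg)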

o11c (Contributor, Author) commented Oct 6, 2015

Am busy until this afternoon/evening. Since I know this one is incomplete, it may be more profitable for you to look at #887 (which this is not yet rebased on top of) and, of course, #721. But mypy.syntax.lexer should be complete.

JukkaL (Collaborator) commented Oct 12, 2015

OK, I gave this a cursory look. Here are just some general high-level observations. As mentioned below, reviewing this will be a huge amount of work, and in order to make my time investment worthwhile this needs to improve things in general for mypy. I can't do a detailed review until it works and it looks likely that it will be a net benefit (the benefit of having the PR minus the cost of reviewing and related overhead). I'm happy to give high-level feedback (similar to this) while you are working on this, however, now that I've finally merged the other PRs.

Performance

I mentioned previously that type checking performance is important. I still have no good idea of how this performs relative to the old parser, as I haven't been able to try it with any non-trivial pieces of code. I'd recommend addressing this first by posting some benchmarks (and instructions for how to run those benchmarks). Currently one of the biggest obstacles for wider use of mypy is type checking speed. If we get new features, it's okay to sacrifice a little speed, but even a 10% performance regression (for the whole type checker, which would mean maybe 30% parser performance regression) would be painful. The original parser wasn't written for efficiency really, so it may be possible to actually improve performance. I hope that's the case!
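
For what it's worth, a minimal sketch of the kind of benchmark being asked for here: time an arbitrary parse callable over a directory of .py files. The old/new entry points to plug in are left as placeholders, since their exact signatures aren't shown in this thread.

    import glob
    import time
    from typing import Callable

    def bench(parse, pattern='mypy/*.py', repeat=5):
        # type: (Callable[[str], object], str, int) -> float
        """Return the best wall-clock time to parse every matching file once."""
        sources = []
        for path in glob.glob(pattern):
            with open(path, encoding='utf-8') as f:
                sources.append(f.read())
        best = float('inf')
        for _ in range(repeat):
            start = time.perf_counter()
            for src in sources:
                parse(src)
            best = min(best, time.perf_counter() - start)
        return best

    # Usage (placeholder entry points, not confirmed APIs):
    #   print('old parser:', bench(old_parse))
    #   print('new parser:', bench(new_parse))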

Size

This thing is frickin' huge! I didn't realize this until now. It's 11 kLOC (including tests) whereas the current mypy master (everything under mypy/) is about 17 kLOC. It really needs to improve things a lot to get merged. Just reviewing it thoroughly would probably take me at least one week working full time. As I'm spending about 4-8 hours a week on mypy, that means that reviewing it would probably take me months even if I didn't do anything else mypy related :-/

Clarity/documentation

I couldn't understand how the implementation works at a high level. Code clarity needs to be at least similar to the rest of the mypy implementation, and it isn't quite there yet. Adding comments and docstrings would probably help, but the most important thing would be to write a top-level summary or design doc of how things work and interact with the rest of mypy, with concrete examples. The current mypy implementation is mostly optimized for ease of modification and understanding, and I want to preserve that.

Scope

This seems to address many separate issues in one go. As I've mentioned before, I'd much prefer having changes done incrementally rather than in a rewrite-the-world fashion. Addressing a single issue at a time makes things much easier to review and makes it easier to evaluate the impact (cost/benefit) to mypy. Maybe it's too late for this PR, but I hope any future changes approach things more incrementally. So in general a PR should be mostly about refactoring something (specific) or mostly about adding a feature, but preferably not both. This is clearly refactoring (it rewrites a ton of stuff) but it also adds features.

Style

I'm worried that the design doesn't seem to quite follow the style used in the rest of mypy. Again, stylistic cohesion is important for the maintainability of mypy. Some individual things:

  • The tests don't use the same approach as the rest of mypy. Is there a good reason for that? The old parser tests use .test files which have worked well for me, and these are used for other tests as well.
  • Code formatting and styling is different from the rest of mypy. Lint will catch many of these issues but not all.
  • The new parser uses from x import * which the rest of mypy does not use (on purpose).
  • Naming of things is different from the current mypy implementation. Often there seems to be no obvious benefit to this (e.g., Subscription vs. IndexExpr). I'd rather see the existing conventions used, as it will make life easier for everybody who's already familiar with the code. I'm not saying that your conventions are any worse (they could well be better) but cohesion and continuity are more important than having any specific terminology.

Costs/benefits

I'd like to see a short summary of the tradeoffs in the parser design. I.e., what it does better than the old parser (and how), with examples.

Concrete next steps

In summary, to avoid a similar bad experience as with the test driver, I suggest these concrete steps (in this order, before further work on this PR):

  1. Respond to the issues I raised above (no need to address them yet) and say what you think about them.
  2. Create issues for things that you'll be working on. I can create a tag for these. This is a sufficiently big PR that it's hard to manage it just as a PR and follow progress. We can move discussion to multiple GitHub issues. Maybe we can use your own repo fork to track progress and file issues? I haven't tried this workflow but it might work here.
  3. Come up together with a plan for how to move this forward in increments, including expectations for code review load. For example, we could agree that I'll try to review piece X of the code by date Y, and I can then merge that part of the code to master, even if it won't be used by default yet.

o11c (Contributor, Author) commented Oct 12, 2015

I'm not as likely to get this done as soon as I was thinking - there are some error cases that don't generate messages nearly as good as they should. That said, one thing I would like to do is replace more of the linear searches with hash lookups.

The three major benefits of the new lexer/parser are:

  • Produce errors with column information (so that your editor can jump right to them instead of to the start of the line)
  • Produce caret diagnostics showing the input line, with a squiggle for the length of the span (makes it much easier to see the error; a sketch follows below)
  • Keep going after errors (you don't have to restart mypy after fixing every syntax error; you can fix them all at once).
    • Note that, to the extent we write lowering to support it, we can even proceed to type checking - most syntax errors will be recovered from very quickly (and my planned work will recover even faster), and we can just say that the type of error expressions is Any and ignore any statements with errors.
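
A toy illustration of the caret/squiggle output described above (the function is invented for this example; the PR renders its diagnostics from the CodeMap):

    def render_diagnostic(filename, line_text, lineno, col, length, message):
        # type: (str, str, int, int, int, str) -> str
        """Show the offending line with a caret and a squiggle under the span."""
        marker = ' ' * col + '^' + '~' * max(length - 1, 0)
        return '{}:{}:{}: error: {}\n{}\n{}'.format(
            filename, lineno, col + 1, message, line_text, marker)

    print(render_diagnostic('example.py', 'x = foo(1,, 2)', 3, 10, 1,
                            'invalid syntax'))
    # example.py:3:11: error: invalid syntax
    # x = foo(1,, 2)
    #           ^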

I'm not sure what problem you have with scope - I hardly see any unrelated things being forced together here, though it would probably be possible to use the new lexer with the old parser if you really wanted to force that. Note that I have some local work in progress to refactor out Bisector and a replacement for Peekable, and I've already refactored out Dialect. But my refactoring in myunit (see #908) has to come first (except the Dialect bit, which is usable right now, see #887 - I will probably modify mypy to use that and fix #732 before I even get back to this).

I am quite aware that I haven't done benchmarking yet, but even with what I have, the (end-user) usability is greatly improved.

The tree names I used follow the ones in the Python documentation; I did not consult mypy's previous names at all. Unfortunately, Python's documentation contradicts itself on the names ...

Let's look at the number of lines of code (numbers from my working copy; they might not line up exactly with the pushed version), by component:

  • Dialect:
  • Span:
    • mypy/syntax/span.py: 401 lines
    • Not trivial, but there's nothing terribly complicated here - just the infrastructure for objects that have rich span information, and what you can do with them. It should be very familiar to anyone who's worked on a real compiler (i.e. one that can continue from errors and give good messages - CPython is not one).
    • The 30-line Bisector class will be refactored out in a bit since it is useful elsewhere.
    • One thing that might be confusing at first is that a Span object is just a pair of integers, and that the human-readable version has to be looked up in CodeMap. This is a significant performance win in other tools because a lot of spans are constructed at parse time, so they need to be cheap unless you need one specifically for an error (see the sketch after this list).
    • I anticipate also using rich spans for .test files.
  • Lexer:
    • mypy/lex.py: 886 lines
    • mypy/syntax/tokens.py: 193 lines
    • mypy/syntax/lexer.py: 661 lines
    • new total: 854 lines
    • New lexer is shorter, produces more information, and is probably faster since regexes run in C.
  • Parser:
    • mypy/nodes.py: 1774 lines
    • mypy/parse.py: 1860 lines
    • mypy/parsetype.py: 250 lines
    • mypy/syntax/tree.py: 2765 lines
    • mypy/syntax/parser.py: 1681 lines
    • The parser itself is quite a bit shorter (including utility classes that I'm trying to refactor out), though admittedly I'm not completely done with it yet. tree.py is longer than nodes.py because it preserves things like commas, but the module itself is trivial except for the TreeError logic (which mypy previously didn't have at all).
    • Also mypy.syntax.lowering hasn't been written yet.
  • Tests:
    • mypy/test/testlex.py: 450 lines
    • mypy/test/testparse.py: 74 lines
    • mypy/test/data/parse-errors.test: 453 lines
    • mypy/test/data/parse-python2.test: 173 lines
    • mypy/test/data/parse.test: 3336 lines
    • old total: 4486 lines
    • mypy/syntax/test/test_lexer.py: 1854 lines
    • mypy/syntax/test/test_parser.py: 4238 lines
    • mypy/syntax/test/test_span.py: 297 lines
    • new total: 6389 lines
    • My tests are far more complete, and check errors much more accurately. And the parser tests aren't even done yet.
    • I didn't use .test files because you invented your own unit test framework, so you shouldn't expect people to know how it works, especially for something complex like "this test should produce different output/errors depending on which dialect of Python we are targeting". By the time I finish Plans for mypy.myunit #908 I should be familiar enough to add the necessary data parser hooks, though.
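
As mentioned in the Span bullet above, here is a minimal sketch, under assumptions, of the "a Span is just a pair of integers" design: spans are cheap offset pairs, and the expensive offset-to-(file, line, column) translation happens only when a message is actually rendered. Class and method names are illustrative, not the PR's real API.

    import bisect
    from typing import List, NamedTuple, Tuple

    Span = NamedTuple('Span', [('begin', int), ('end', int)])  # just two offsets

    class CodeMap:
        """Cheap spans; the human-readable lookup is done only on demand."""

        def __init__(self, name, text):
            # type: (str, str) -> None
            self.name = name
            self.text = text
            self._line_starts = [0]  # type: List[int]
            for i, ch in enumerate(text):
                if ch == '\n':
                    self._line_starts.append(i + 1)

        def describe(self, span):
            # type: (Span) -> Tuple[str, int, int]
            """Translate a span's start offset into (file, 1-based line, column)."""
            line = bisect.bisect_right(self._line_starts, span.begin) - 1
            column = span.begin - self._line_starts[line]
            return self.name, line + 1, column

    cm = CodeMap('example.py', 'x = 1\ny = oops(\n')
    print(cm.describe(Span(10, 14)))   # ('example.py', 2, 4)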

o11c closed this Oct 13, 2015
o11c deleted the syntax branch October 13, 2015 16:43