
Ideas for improving typeshed testing and tooling #8774

A few days ago, I had a great conversation with @kkirsche offline about some ways we can improve our communication with contributors. One of the things that came up is that it would be good to have a clear outline of specific areas of tooling and testing where PRs would be welcomed. With that in mind, here are some areas where I think our testing and tooling could be improved, and where I would welcome contributions.

For any of these tools/ideas: if a PR would be enough work that it'd feel wasteful if it were rejected, please make sure to open an issue discussing the specific change first. If an issue gets a few thumbs-up, it's a pretty good sign we like the idea, and you should feel free to proceed :)

(Other maintainers: please feel free to edit this!)

Existing tools

Improving existing tools is the "low-hanging fruit": it's always going to be easier to build on existing infrastructure than to build something entirely new from scratch.

For each of these tools, I've tried to list a "primary maintainer": the person most likely to make decisions about that tool. That doesn't mean they're the only person who will ever chime in on it. For example, @hauntsaninja is the person with the most expertise on stubtest, and people are probably always going to defer to him with regards to that script. But that doesn't mean that other people won't also be interested in reviewing -- @sobolevn and I also have experience writing stubtest improvements, and may well chime in on a stubtest PR.

Stubtest

Primary maintainer: @hauntsaninja

Stubtest compares what exists at runtime to what we've got in the stub, and checks for inconsistencies between the two. It's one of the most useful tools in our CI, and also really easy to contribute to. If you make a PR for a stubtest improvement, it'll probably be reviewed by some combination of @hauntsaninja, @JelleZijlstra, @sobolevn and/or me within a few days :)
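
To give a flavour of the kind of thing stubtest catches, here's a small, purely hypothetical example (the module, file names and signatures are made up, not real typeshed code): a runtime function grows a parameter that the stub doesn't declare, and stubtest reports the mismatch when run via `python -m mypy.stubtest`.

```python
# Hypothetical runtime module, example_pkg/cache.py:
def get(key, default=None):
    """Return the cached value for *key*, or *default* if it's missing."""

# Hypothetical (slightly out-of-date) stub, example_pkg/cache.pyi:
def get(key: str) -> object | None: ...

# Running `python -m mypy.stubtest example_pkg` would flag that the runtime
# function accepts a `default` argument that the stub doesn't declare.
```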

Stubtest improvements generally fall into two categories:

  • New checks for catching where the stubs are inconsistent with the runtime.
  • PRs to reduce false-positive errors. Reducing the number of false positives is very useful, as it makes it easier for us to set --ignore-missing-stub to False for more third-party packages.

What to improve:

  • There are a bunch of TODO comments in the stubtest codebase. Any of those would be good candidates for PRs.
  • With regards to adding new checks: think creatively! Stubtest doesn't have a generalised way of comparing the runtime to the stub; it's just a series of "special cases" where we've hardcoded checks for inconsistencies between the stub and the runtime. If you notice something stubtest is missing in CI, that's probably a good candidate for a stubtest PR. If you're unsure whether an idea is good or not, file a mypy issue first -- if it gets a few thumbs-up, that's a pretty good sign it's worth doing.

Stubgen

Primary maintainer: ?

Stubgen is a tool for autogenerating stub files from source code. It's not a code base I'm particularly familiar with, and I'm not sure it's as actively maintained as some of the other tools on this list -- but here are a few ways I think it could be improved:

  • Make generated stubs comply with flake8-pyi "out of the box". Our "typeshed style" has evolved in recent months and years, but stubgen has been a bit slow to catch up. It would be great if the stubs stubgen produces were automatically compatible with flake8-pyi, so that we didn't have to make so many manual fixes.
  • Fewer re-exports. By default, stubgen almost always assumes that an import from somewhere else in the package should be treated as an explicit re-export. Unless the file is an __init__.py file, however, I think stubgen should generally assume the opposite. This comes up a lot in typeshed code reviews, and fixing it would save a lot of time. (Maybe we can already do this with the --export-less option? Maybe we just need to update scripts/create_baseline_stubs.py, typeshed's wrapper script for running stubgen?) (Fixed by Use stubgen's --export-less option in create_baseline_stubs.py #8843.)
  • Add "stubgen-but-with-inference" (@hauntsaninja's idea in Evaluate ways to determine accuracy and reduce incomplete type stubs #8481 (comment)). Mypy has a wealth of information that it could use to make sophisticated inference about good signatures for generated stubs. But even adding super-dumb rules like "__bool__ methods are generally going to return bool" are going to be correct 99% of the time, and will save time for typeshed reviewers.

flake8-pyi

Primary maintainer: @AlexWaygood

flake8-pyi is a flake8 plugin for checking stub files that works by analysing the Abstract Syntax Tree of the stub. It's a much "dumber" tool than stubtest, since it doesn't have the full power of a type checker behind it and doesn't know anything about the runtime either. There's still a lot you can do with simple AST analysis, however, and it's another code base that's easy to contribute to if you're familiar with the basics of ASTs and the visitor pattern.
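
For illustration, here's a minimal sketch of the kind of ast.NodeVisitor-based check the plugin is built from (this is not flake8-pyi's actual code, and the error message is made up): flag string-literal annotations, which are never needed in stub files because stubs are never executed.

```python
import ast

class QuotedAnnotationVisitor(ast.NodeVisitor):
    """Flag string-literal annotations, which are unnecessary in stub files."""

    def __init__(self) -> None:
        self.errors: list[tuple[int, int, str]] = []

    def visit_AnnAssign(self, node: ast.AnnAssign) -> None:
        self._check(node.annotation)
        self.generic_visit(node)

    def visit_arg(self, node: ast.arg) -> None:
        if node.annotation is not None:
            self._check(node.annotation)
        self.generic_visit(node)

    def _check(self, annotation: ast.expr) -> None:
        if isinstance(annotation, ast.Constant) and isinstance(annotation.value, str):
            self.errors.append(
                (annotation.lineno, annotation.col_offset, "quoted annotation in a stub file")
            )

visitor = QuotedAnnotationVisitor()
visitor.visit(ast.parse('x: "int"\ndef f(y: "str") -> None: ...\n'))
print(visitor.errors)  # two errors: one for `x`, one for `y`
```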

Lots of the checks we've introduced recently have been fairly style-focused, but we've also added some bugbear-esque checks in the past that are aimed at catching common bugs.

We should continue to think creatively about whether there are other common bugs that we could catch using simple AST analysis.

flake8-pyi is maintained by the typeshed team, so if you have an idea for a new check, just file an issue! If it gets a few thumbs-up, it's probably a good idea, and you should feel free to make a PR. If we're sceptical, we'll probably say so :)

There's also a bunch of open issues already with ideas for new checks. If an issue has a few thumbs-up (or even if it's just older than a week or two and hasn't been closed yet), it's probably an idea that we're receptive to!

mypy_primer

Primary maintainer: @hauntsaninja

mypy_primer checks out copies of typeshed from before and after a PR. It then uses "old typeshed" and "new typeshed" to run mypy against millions of lines of open-source code, and reports the difference between the two.

Here's some ways I think we could make mypy_primer even better:

  • Add more projects to mypy_primer (add more projects to mypy_primer hauntsaninja/mypy_primer#42). This one's a no-brainer.
  • Work on the ideas listed at https://github.com/hauntsaninja/mypy_primer#contributing. In particular, having a way to bisect between typeshed commits would be pretty nice.
  • Figure out ways to analyse "mypy_primer coverage":
    • For a given stubs package in typeshed, how many packages included in mypy_primer depend on that stubs package?
    • If we figure out a way of answering the above question: how best to communicate that information? Should we have a website somewhere that auto-updates information about mypy_primer coverage? A markdown file in typeshed that a bot auto-files PRs to update, listing mypy_primer coverage information for each stubs package? Etc. etc.
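
On the coverage question above, here's a very rough sketch of one possible approach (the project list, the stubs-to-runtime-package mapping, and the use of PyPI metadata are all assumptions, not how mypy_primer actually works): for each project mypy_primer exercises, ask PyPI whether it declares the relevant runtime distribution as a dependency.

```python
import json
import re
import urllib.request

# Hypothetical inputs: the list of projects mypy_primer runs on, and the runtime
# distribution corresponding to the stubs package we're interested in
# (e.g. `requests` for `types-requests`).
PROJECTS = ["pandas", "werkzeug", "pytest"]
RUNTIME_DISTRIBUTION = "requests"

def depends_on(project: str, distribution: str) -> bool:
    """Return True if *project*'s PyPI metadata lists *distribution* as a dependency."""
    with urllib.request.urlopen(f"https://pypi.org/pypi/{project}/json") as response:
        requires = json.load(response)["info"].get("requires_dist") or []
    for requirement in requires:
        match = re.match(r"[A-Za-z0-9_.\-]+", requirement)
        if match and match.group(0).lower() == distribution.lower():
            return True
    return False

covered_by = [project for project in PROJECTS if depends_on(project, RUNTIME_DISTRIBUTION)]
print(f"{len(covered_by)}/{len(PROJECTS)} mypy_primer projects depend on {RUNTIME_DISTRIBUTION}")
```

This only sees hard dependencies declared in packaging metadata, so it would undercount projects that use a package via optional extras or unpackaged code -- but it might be a reasonable first approximation.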

Test cases

Primary maintainer: @AlexWaygood

Our test_cases directory (aimed at providing regression tests for places where stubs have caused problems in the past) is new, and still somewhat experimental. The biggest outstanding issue I see was pointed out by @Akuli in #8066 (comment), which is that the test cases are liable to slip out of date with the upstream source code in much the same way that stubs are. How can we ensure that that doesn't happen? Can we add some kind of verification that ensures that the places where we're asserting type checkers shouldn't emit errors do actually pass at runtime?
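
One rough sketch of a possible answer (not an agreed design; the directory layout and the assumption that every test case is a self-contained script may not hold): execute every file under the test_cases directory with a real interpreter, so that code we assert should type-check cleanly is also verified to actually run.

```python
import subprocess
import sys
from pathlib import Path

TEST_CASES = Path("test_cases")  # hypothetical location of the test-case files

def run_test_cases() -> int:
    """Run every test-case file under a real interpreter and report failures."""
    exit_code = 0
    for path in sorted(TEST_CASES.rglob("*.py")):
        # Third-party test cases would need the corresponding runtime
        # packages installed for this to work.
        result = subprocess.run([sys.executable, str(path)], capture_output=True, text=True)
        if result.returncode != 0:
            exit_code = 1
            print(f"{path} failed at runtime:\n{result.stderr}", file=sys.stderr)
    return exit_code

if __name__ == "__main__":
    sys.exit(run_test_cases())
```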

Stubsabot

Primary maintainer: @hauntsaninja

Stubsabot -- a bot which automatically suggests bumping pinned versions for third-party stubs -- is another new addition to our tooling, and there are several ways we could make it even better.

The "stubtest failed" bot

Primary maintainer: @hauntsaninja

Note: future status of this tool is TBD. typeshed-internal/stub_uploader#60 means that we can now pin stubs to more specific versions, which should hopefully reduce the number of daily stubtest failures we get. Maybe we'll be able to do away with the check altogether...

Stubtest is run on the full suite of third-party stubs every night, and a bot automatically opens an issue if stubtest failed on any of them (e.g. #8745). Things that would be great to see improved about this workflow are:

  • Could the bot automatically tell us which stubs stubtest failed on?
  • Could the bot automatically link to the precise stubtest run that failed, instead of us having to click through and navigate to the stubtest run that's red?
  • Could the bot automatically paste the results of the failing stubtest run into the issue description?
  • Can we stop the bot from creating a new issue if there's already a "stubtest failed" issue open?
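
On that last point, here's a rough sketch of one way the workflow could check for an existing open issue before filing a new one. The title prefix and the exact repository setup are assumptions about how the bot currently works; the token would come from the Actions environment.

```python
import json
import os
import urllib.request

REPO = "python/typeshed"
TITLE_PREFIX = "Stubtest failed"  # hypothetical: whatever title the bot actually uses

def stubtest_issue_already_open() -> bool:
    """Return True if an open issue whose title starts with TITLE_PREFIX already exists."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{REPO}/issues?state=open&per_page=100",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
    )
    with urllib.request.urlopen(request) as response:
        issues = json.load(response)
    return any(
        "pull_request" not in issue and issue["title"].startswith(TITLE_PREFIX)
        for issue in issues
    )

if __name__ == "__main__":
    if stubtest_issue_already_open():
        print("An open 'stubtest failed' issue already exists; skipping issue creation.")
```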

Miscellany

These are both pretty minor nits, but it would be great to:

  • Add a test to make sure that the exclude-from-stricter-checks entries in pyrightconfig.stricter.json stay alphabetised, so I don't have to keep annoying contributors who break the alphabetisation ;) tests/check_consistent.py would be a good place for such a check. (Note that most JSON parsers will refuse to parse pyrightconfig.stricter.json, because it contains comments and is therefore strictly speaking invalid JSON.)
  • Add a test that ensures that nobody writes if sys.version_info >= (3, 7) guards anywhere. (3.7 is the minimum Python version typeshed supports, so such checks are redundant.) tests/check_new_syntax.py might be a good place for such a check. A rough sketch of both of these checks follows below.
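
Here's a rough sketch of both checks. The "exclude" key name, the file locations and the comment-stripping approach are assumptions, so treat this as an outline rather than a drop-in test.

```python
import ast
import json
import re
from pathlib import Path

def check_stricter_config_sorted() -> None:
    # pyright allows // comments in its config files, which is why most JSON
    # parsers reject pyrightconfig.stricter.json; strip full-line comments
    # before handing the text to json.loads().
    raw = Path("pyrightconfig.stricter.json").read_text(encoding="UTF-8")
    without_comments = re.sub(r"^\s*//.*$", "", raw, flags=re.MULTILINE)
    excluded = json.loads(without_comments)["exclude"]  # the "exclude" key name is an assumption
    assert excluded == sorted(excluded), "exclude entries in pyrightconfig.stricter.json are not alphabetised"

def check_no_redundant_version_guards(path: Path) -> None:
    # Flag `sys.version_info >= (3, 7)` comparisons, which are redundant because
    # 3.7 is the oldest Python version typeshed supports.
    tree = ast.parse(path.read_text(encoding="UTF-8"))
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Compare)
            and isinstance(node.left, ast.Attribute)
            and node.left.attr == "version_info"
            and len(node.ops) == 1
            and isinstance(node.ops[0], ast.GtE)
            and isinstance(node.comparators[0], ast.Tuple)
            and [getattr(elt, "value", None) for elt in node.comparators[0].elts] == [3, 7]
        ):
            print(f"{path}:{node.lineno}: redundant `sys.version_info >= (3, 7)` check")
```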

Ideas for new tools

These ideas are necessarily going to be more work than the ideas above, as most of them involve writing new scripts from scratch, which would create its own maintenance burden. But here's some pie-in-the-sky ideas for brand new tests/tools we could add.

Note that new tools don't need to be part of typeshed or to run in CI to make a difference. For instance, for a long time, we didn't run stubtest in CI, but during that time we used stubtest to fix many, many errors in typeshed. Keeping a tool separate is a good way to avoid disagreement over things like the balance of false positives vs false negatives, to iterate quickly on the tool, and to prove the tool's value.

  • More automated fixes:
    • Are there any other fast pre-commit hooks already out there that would be useful to add to our CI?
    • I love flake8-pyi, but I feel like it's probably more annoying for contributors than something they actively welcome. It's not something I'm planning on working on myself, because of the amount of work it would take to get to a properly tested tool -- but a bunch of the checks in flake8-pyi feel like they could probably be automated fixes in a pre-commit hook.
    • Something I see very often is contributors trying to import TypedDict from typing (for example) -- which doesn't work, since it needs to be imported from typing_extensions in order to be compatible with Python 3.7. Is there an automated way of fixing that? (Can isort already do that, if we have the right settings?) There's a rough sketch of one possible fixer after this list.
  • Use the CPython test suite. We could set up a mypy_primer-esque bot that runs mypy (using typeshed's stubs) against the CPython test suite before and after a PR, and reports on any differences in mypy's results. I have no idea what the signal-to-noise ratio here would be, and I think it might be a fair amount of work to get this to a script we could use in CI -- but maybe worth exploring?
  • Pyright has a sophisticated type inference engine that can be used to infer types in unannotated code. Can we use that somehow to compare pyright's inferences against our manual stubs? Could we do something similar with mypy using the --check-untyped-defs option?
  • Maybe some things in the space of making it easier to upstream stubs or validate type hints against the original code? Rough thoughts at Evaluate ways to determine accuracy and reduce incomplete type stubs #8481 (comment)
  • Any other ideas?
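
As promised above, here's a very naive sketch of a fixer for the TypedDict-import problem. The set of names to move and the line-based rewriting are simplifications; a real tool would probably want to build on libcst or a similar library.

```python
from pathlib import Path

# Names that have to come from typing_extensions for Python 3.7 compatibility.
# TypedDict is the one mentioned above; others could be added to this set.
MOVE_TO_TYPING_EXTENSIONS = {"TypedDict"}

def fix_typing_imports(path: Path) -> None:
    """Naively rewrite `from typing import ...` lines that import moved names."""
    new_lines = []
    for line in path.read_text(encoding="UTF-8").splitlines():
        stripped = line.strip()
        # This deliberately ignores parenthesised/multi-line imports and
        # indented imports -- it's only a sketch of the idea.
        if stripped.startswith("from typing import "):
            names = [n.strip() for n in stripped.removeprefix("from typing import ").split(",")]
            keep = [n for n in names if n not in MOVE_TO_TYPING_EXTENSIONS]
            move = [n for n in names if n in MOVE_TO_TYPING_EXTENSIONS]
            if move:
                if keep:
                    new_lines.append("from typing import " + ", ".join(keep))
                new_lines.append("from typing_extensions import " + ", ".join(move))
                continue
        new_lines.append(line)
    path.write_text("\n".join(new_lines) + "\n", encoding="UTF-8")
```

In practice, merging with an existing `from typing_extensions import` line and preserving formatting is where most of the real work would be, which is why something CST-based is probably the right long-term shape.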
