Skip to content

Run mypy the correct number of times #19

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 29, 2023
Merged

Conversation

brandtbucher
Copy link
Contributor

Currently, we run the mypy2 benchmark loops times, but report loops - 2 results. This seems to break some of pyperformance's assumptions, and results in weird artifacts in the output. This changes the benchmarking loop to run loops + 1 times and report loops results, as expected.

I've confirmed locally that this seems to fix the weirdness. pyperformance output before:

Run 1: calibrate the number of loops: 4
- calibrate 1: 0.00 ns (loops: 1, raw: 0.00 ns)
- calibrate 2: 0.00 ns (loops: 2, raw: 0.00 ns)
- calibrate 3: 308 ms (loops: 4, raw: 1.23 sec)
- calibrate 4: 301 ms (loops: 4, raw: 1.20 sec)
- calibrate 5: 597 ms (loops: 4, raw: 2.39 sec)
- calibrate 6: 302 ms (loops: 4, raw: 1.21 sec)
Calibration: 1 warmup, 4 loops
Run 2: 1 warmup, 3 values, 4 loops
- warmup 1: 307 ms (-25%)
- value 1: 323 ms (-21%)
- value 2: 313 ms (-24%)
- value 3: 592 ms (+45%)
Run 3: 1 warmup, 3 values, 4 loops
- warmup 1: 304 ms (-27%)
- value 1: 310 ms (-25%)
- value 2: 340 ms (-18%)
- value 3: 593 ms (+43%)
Run 4: 1 warmup, 3 values, 4 loops
- warmup 1: 310 ms (-28%)
- value 1: 324 ms (-24%)
- value 2: 335 ms (-22%)
- value 3: 627 ms (+46%)
Run 5: 1 warmup, 3 values, 4 loops
- warmup 1: 309 ms (-25%)
- value 1: 317 ms (-23%)
- value 2: 326 ms (-21%)
- value 3: 591 ms (+44%)
Run 6: 1 warmup, 3 values, 4 loops
- warmup 1: 324 ms (-24%)
- value 1: 339 ms (-20%)
- value 2: 323 ms (-24%)
- value 3: 613 ms (+44%)
Run 7: 1 warmup, 3 values, 4 loops
- warmup 1: 338 ms (-19%)
- value 1: 318 ms (-24%)
- value 2: 308 ms (-26%)
- value 3: 628 ms (+50%)
Run 8: 1 warmup, 3 values, 4 loops
- warmup 1: 312 ms (-25%)
- value 1: 323 ms (-23%)
- value 2: 323 ms (-23%)
- value 3: 612 ms (+46%)
Run 9: 1 warmup, 3 values, 4 loops
- warmup 1: 324 ms (-20%)
- value 1: 314 ms (-22%)
- value 2: 308 ms (-24%)
- value 3: 593 ms (+46%)
Run 10: 1 warmup, 3 values, 4 loops
- warmup 1: 310 ms (-26%)
- value 1: 312 ms (-26%)
- value 2: 340 ms (-19%)
- value 3: 607 ms (+45%)
Run 11: 1 warmup, 3 values, 4 loops
- warmup 1: 310 ms (-25%)
- value 1: 313 ms (-24%)
- value 2: 309 ms (-25%)
- value 3: 610 ms (+49%)
Run 12: 1 warmup, 3 values, 4 loops
- warmup 1: 325 ms (-20%)
- value 1: 316 ms (-22%)
- value 2: 310 ms (-24%)
- value 3: 594 ms (+46%)
Run 13: 1 warmup, 3 values, 4 loops
- warmup 1: 313 ms (-26%)
- value 1: 329 ms (-22%)
- value 2: 338 ms (-20%)
- value 3: 598 ms (+42%)
Run 14: 1 warmup, 3 values, 4 loops
- warmup 1: 335 ms (-19%)
- value 1: 311 ms (-25%)
- value 2: 330 ms (-20%)
- value 3: 597 ms (+45%)
Run 15: 1 warmup, 3 values, 4 loops
- warmup 1: 313 ms (-24%)
- value 1: 312 ms (-25%)
- value 2: 326 ms (-21%)
- value 3: 601 ms (+46%)
Run 16: 1 warmup, 3 values, 4 loops
- warmup 1: 308 ms (-27%)
- value 1: 328 ms (-22%)
- value 2: 309 ms (-26%)
- value 3: 623 ms (+48%)
Run 17: 1 warmup, 3 values, 4 loops
- warmup 1: 318 ms (-23%)
- value 1: 332 ms (-20%)
- value 2: 314 ms (-24%)
- value 3: 597 ms (+44%)
Run 18: 1 warmup, 3 values, 4 loops
- warmup 1: 318 ms (-23%)
- value 1: 325 ms (-21%)
- value 2: 318 ms (-23%)
- value 3: 596 ms (+44%)
Run 19: 1 warmup, 3 values, 4 loops
- warmup 1: 319 ms (-30%)
- value 1: 358 ms (-21%)
- value 2: 312 ms (-31%)
- value 3: 691 ms (+52%)
Run 20: 1 warmup, 3 values, 4 loops
- warmup 1: 316 ms (-29%)
- value 1: 332 ms (-25%)
- value 2: 317 ms (-29%)
- value 3: 683 ms (+54%)
Run 21: 1 warmup, 3 values, 4 loops
- warmup 1: 338 ms (-22%)
- value 1: 362 ms (-17%)
- value 2: 336 ms (-23%)
- value 3: 604 ms (+39%)

WARNING: the benchmark result may be unstable
* the standard deviation (139 ms) is 33% of the mean (420 ms)
* the maximum (691 ms) is 65% greater than the mean (420 ms)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

mypy2: Mean +- std dev: 420 ms +- 139 ms

After:

Run 1: calibrate the number of loops: 1
- calibrate 1: 598 ms (loops: 1, raw: 598 ms)
- calibrate 2: 617 ms (loops: 1, raw: 617 ms)
- calibrate 3: 590 ms (loops: 1, raw: 590 ms)
- calibrate 4: 675 ms (loops: 1, raw: 675 ms)
Calibration: 1 warmup, 1 loop
Run 2: 1 warmup, 3 values, 1 loop
- warmup 1: 576 ms (-10%)
- value 1: 627 ms
- value 2: 630 ms
- value 3: 664 ms
Run 3: 1 warmup, 3 values, 1 loop
- warmup 1: 611 ms (-6%)
- value 1: 636 ms
- value 2: 636 ms
- value 3: 687 ms (+5%)
Run 4: 1 warmup, 3 values, 1 loop
- warmup 1: 609 ms
- value 1: 635 ms
- value 2: 610 ms
- value 3: 633 ms
Run 5: 1 warmup, 3 values, 1 loop
- warmup 1: 604 ms
- value 1: 661 ms
- value 2: 606 ms
- value 3: 631 ms
Run 6: 1 warmup, 3 values, 1 loop
- warmup 1: 602 ms
- value 1: 631 ms
- value 2: 621 ms
- value 3: 643 ms
Run 7: 1 warmup, 3 values, 1 loop
- warmup 1: 657 ms
- value 1: 625 ms
- value 2: 604 ms
- value 3: 647 ms
Run 8: 1 warmup, 3 values, 1 loop
- warmup 1: 601 ms (-6%)
- value 1: 637 ms
- value 2: 623 ms
- value 3: 656 ms
Run 9: 1 warmup, 3 values, 1 loop
- warmup 1: 612 ms (-6%)
- value 1: 636 ms
- value 2: 631 ms
- value 3: 674 ms
Run 10: 1 warmup, 3 values, 1 loop
- warmup 1: 626 ms
- value 1: 650 ms
- value 2: 605 ms
- value 3: 632 ms
Run 11: 1 warmup, 3 values, 1 loop
- warmup 1: 607 ms (-5%)
- value 1: 647 ms
- value 2: 611 ms
- value 3: 669 ms
Run 12: 1 warmup, 3 values, 1 loop
- warmup 1: 655 ms
- value 1: 640 ms
- value 2: 638 ms
- value 3: 690 ms (+5%)
Run 13: 1 warmup, 3 values, 1 loop
- warmup 1: 662 ms
- value 1: 647 ms
- value 2: 612 ms
- value 3: 633 ms
Run 14: 1 warmup, 3 values, 1 loop
- warmup 1: 621 ms (-7%)
- value 1: 667 ms
- value 2: 651 ms
- value 3: 683 ms
Run 15: 1 warmup, 3 values, 1 loop
- warmup 1: 612 ms (-6%)
- value 1: 641 ms
- value 2: 656 ms
- value 3: 657 ms
Run 16: 1 warmup, 3 values, 1 loop
- warmup 1: 614 ms
- value 1: 635 ms
- value 2: 618 ms
- value 3: 669 ms
Run 17: 1 warmup, 3 values, 1 loop
- warmup 1: 611 ms (-12%)
- value 1: 643 ms (-8%)
- value 2: 760 ms (+9%)
- value 3: 690 ms
Run 18: 1 warmup, 3 values, 1 loop
- warmup 1: 619 ms
- value 1: 650 ms
- value 2: 605 ms (-5%)
- value 3: 656 ms
Run 19: 1 warmup, 3 values, 1 loop
- warmup 1: 608 ms
- value 1: 654 ms
- value 2: 615 ms
- value 3: 647 ms
Run 20: 1 warmup, 3 values, 1 loop
- warmup 1: 627 ms
- value 1: 634 ms
- value 2: 604 ms
- value 3: 646 ms
Run 21: 1 warmup, 3 values, 1 loop
- warmup 1: 636 ms
- value 1: 671 ms
- value 2: 629 ms
- value 3: 639 ms

mypy2: Mean +- std dev: 643 ms +- 27 ms

CC @mdboom (when you return). Does this mean that the benchmark needs to be renamed again?

@brandtbucher brandtbucher changed the title Run mypy the correct number of times Run mypy the correct number of times Aug 8, 2023
@markshannon
Copy link

markshannon commented Aug 9, 2023

My guess is that we need to run the benchmark an extra time in order to stabilize mypy's caches, although maybe we should be passing in some sort of "no cache" parameter instead.
So we probably do want to run the benchmark at least loop+2 times.

I don't see what is wrong with the code as it stands. It is recording loop runs; just range(2,loop+2) instead of range(1, loop+1)
I misread the code.

Maybe keep the if loop > 1 check but change the for loop to for i in range(loops + 2):.

@brandtbucher
Copy link
Contributor Author

brandtbucher commented Aug 9, 2023

The comment says it's only ignoring the first run, but it's really ignoring the first two. I agree that ignoring the first one is desirable.

I think it's just an off-by-one error in the conditional, though. Ignoring two runs seems excessive.

@mdboom
Copy link
Collaborator

mdboom commented Aug 28, 2023

This solution makes sense. I'll run an A:A test on our infrastructure just to confirm it helps (I think it should).

We probably should rename it again.

@brandtbucher
Copy link
Contributor Author

@mdboom, do you think we could get this in?

@mdboom mdboom merged commit ee8adbd into pyston:main Nov 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants