Performance regression in 1.10.3 #2591
Comments
Bisect shows that for 1.10 the regression starts with 9f2a6da. This commit forces the use of clock_gettime() over the rdtsc instruction. IMHO this is a really bad idea: it forces us to call clock_gettime() on every single call to opal_progress(). @bosilca We do not need a monotonic timer for opal_progress(), correct? If we don't, we should force opal_progress() to always use the asm timer and not clock_gettime().
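To make the cost difference concrete, here is a minimal sketch (not Open MPI code) that measures the average cost of one timer read for clock_gettime() versus the rdtsc instruction. The helper names `monotonic_ns`, `cycle_counter`, and `ns_per_call` are made up for illustration, and the rdtsc path assumes an x86 compiler that provides `<x86intrin.h>` (GCC or Clang); on other architectures the sketch falls back to clock_gettime().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std modes */
#endif
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>            /* __rdtsc() on GCC/Clang */
#endif

/* Monotonic wall-clock time in nanoseconds via clock_gettime(). */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Raw cycle counter. The value is in TSC ticks, not nanoseconds, and is
 * not guaranteed to be monotonic across cores on every CPU. */
static uint64_t cycle_counter(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return monotonic_ns();        /* fallback: no rdtsc on this architecture */
#endif
}

/* Average cost in nanoseconds of one call to `timer`, measured with the
 * monotonic clock around a tight loop (hypothetical helper). */
static double ns_per_call(uint64_t (*timer)(void), unsigned long iters)
{
    volatile uint64_t sink = 0;   /* keeps the calls from being optimized away */
    uint64_t start = monotonic_ns();
    for (unsigned long i = 0; i < iters; ++i) {
        sink += timer();
    }
    uint64_t end = monotonic_ns();
    (void)sink;
    return (double)(end - start) / (double)iters;
}
```

Calling `ns_per_call(monotonic_ns, 1000000)` and `ns_per_call(cycle_counter, 1000000)` and comparing the two results gives a rough per-read cost on a given machine. The numbers depend on the kernel, the clock source, and the CPU, so treat them as indicative only.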
@rhc54, @jladd-mlnx: I would think this would affect message rates as well. The pre-1.10.3 behavior can be forced using --mca timer_require_monotonic 0. Might be worth rerunning some of the performance regressions with this option set.
@hjelmn I think we can survive in opal_progress without a monotonic timer, in which case we might run the libevent progress at an erratic rate. However, this will force us to expose two timers, one that is monotonic and one that might not be, and we do not have the infrastructure for that right now.
@bosilca Yeah. Not sure how to handle that. For now I can do something for newer Intel processors. There is a cpuid bit that indicates whether the CPU has a core-invariant TSC. I am adding a function to check that bit. If it is set, the rdtsc instruction will be used as a monotonic timer.
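For reference, a minimal sketch of such a check, assuming GCC or Clang on x86 with `<cpuid.h>` available. The invariant TSC flag is reported in CPUID leaf 0x80000007, EDX bit 8. The function name `has_invariant_tsc` is hypothetical and is not the function added to Open MPI.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>                /* __get_cpuid() on GCC/Clang */
#endif

/* Returns 1 if the CPU advertises an invariant TSC, 0 otherwise.
 * Hypothetical helper for illustration only. */
static int has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid() returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    /* CPUID.80000007H:EDX[8] is the invariant TSC flag. */
    return (int)((edx >> 8) & 1u);
#else
    return 0;                     /* no cpuid/rdtsc on this architecture */
#endif
}
```

A timer component could call something like this once at startup and cache the result, so the check is not repeated on the hot path.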
👍
IIRC, we moved to clock_gettime() because MPI_Wtime() could go back in time if a (thread of a) task migrates within a node.
@ggouaillardet I agree that we should have two timers. Working on that change over the weekend. The tsc core-invariance test is a run-time test (at timer component open) and gets us back the performance on Haswell and Broadwell (and probably others). See #2596.
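One way the two-timer idea could look, as a rough sketch only: select both timers once at open time, based on the result of the invariance test. Everything here (`timer_pair`, `timer_open`, `clock_ns`, `tsc_ticks`) is a hypothetical name, not the interface in #2596, and the sketch ignores that the TSC returns ticks while clock_gettime() returns nanoseconds; a real implementation would need a frequency conversion.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>            /* __rdtsc() on GCC/Clang */
#define HAVE_TSC 1
#else
#define HAVE_TSC 0
#endif

typedef uint64_t (*timer_fn)(void);

/* Two timers: `fast` may be non-monotonic, `monotonic` never goes backwards. */
struct timer_pair {
    timer_fn fast;
    timer_fn monotonic;
};

/* Monotonic time in nanoseconds via clock_gettime(). */
static uint64_t clock_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

#if HAVE_TSC
/* Raw TSC ticks (units differ from clock_ns; conversion omitted here). */
static uint64_t tsc_ticks(void)
{
    return __rdtsc();
}
#endif

/* Chosen once at "open" time. `tsc_is_invariant` is the result of a
 * run-time CPUID check performed by the caller. */
static struct timer_pair timer_open(int tsc_is_invariant)
{
    struct timer_pair t = { clock_ns, clock_ns };
#if HAVE_TSC
    t.fast = tsc_ticks;           /* cheap read, fine for progress loops */
    if (tsc_is_invariant) {
        t.monotonic = tsc_ticks;  /* safe to use as the monotonic timer too */
    }
#else
    (void)tsc_is_invariant;       /* no TSC: both timers use clock_gettime() */
#endif
    return t;
}
```

With this shape, a progress loop would read `fast` and anything user-visible, such as MPI_Wtime(), would read `monotonic`, which only falls back to clock_gettime() on CPUs without an invariant TSC.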
Not sure why this didn't close.
@hjelmn is working on identifying a performance regression that was introduced in 1.10.3. He doesn't have any further info yet beyond the fact that one of their apps runs a bunch slower with 1.10.3 vs 1.10.2.
He thinks this will also affect v2.0.x and v2.x.
This is potentially a blocker for v1.10.5 and v2.0.2.
@hjelmn says he'll have more information shortly.