CPU frequency-dependent timer issues starting with 2.0.2 #3003
That could be a side effect of the timer change. Can you measure the elapsed time of the osu_latency benchmark (ideally only zero-size messages, and a loooot of iterations, so the elapsed time is ~1 minute)?
Funnily enough, the wall time does indeed seem to be roughly equal with the OSU test: ("userspace" at 1.2 GHz)
From a cursory glance at the HPCC source code, the times all seem to be measured using MPI_Wtime as well. It is entirely possible that the wall time never changed during longer tests and we just did not notice! But why would anyone really doubt the notion of time?
My initial intuition was that the new timer is based on clock ticks, and time = ticks / frequency.
In opal/mca/timer/linux/timer_linux_component.c, the frequency seems to be taken from /proc/cpuinfo, which is changed by acpi-cpufreq. The comment was probably true until the timer was changed:
Now that I think about it, this is particularly insidious when using "ondemand" with HPCC, because all the tests are different and the CPU might run them at different frequencies.
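To illustrate the failure mode being described, here is a rough sketch -- not the actual Open MPI code, and `cpuinfo_mhz` / `ticks_to_usec` are made-up helper names -- of reading "cpu MHz" once at startup and converting every later tick count with that single value:

```c
/* Hypothetical sketch (NOT Open MPI code): the frequency is read from
 * /proc/cpuinfo once, at startup, and every later tick count is
 * converted with that single value. If the governor changes the
 * frequency afterwards, all converted times are off by a factor of
 * actual_freq / recorded_freq. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Read the "cpu MHz" field of the first processor entry. */
static double cpuinfo_mhz(void)
{
    char line[256];
    double mhz = 0.0;
    FILE *fp = fopen("/proc/cpuinfo", "r");
    if (NULL == fp)
        return 0.0;
    while (NULL != fgets(line, sizeof(line), fp)) {
        if (0 == strncmp(line, "cpu MHz", 7)) {
            char *colon = strchr(line, ':');
            if (NULL != colon)
                mhz = atof(colon + 1);
            break;
        }
    }
    fclose(fp);
    return mhz;
}

/* usec = ticks / MHz, with the frequency frozen at startup. */
static double ticks_to_usec(uint64_t ticks, double startup_mhz)
{
    return (double)ticks / startup_mhz;
}
```

If the governor later switches the core between 1.2 GHz and 2.4 GHz, every value returned by such a conversion is off by the ratio of the two frequencies.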
My changes to timers aren't in 2.0.x yet, and they shouldn't affect this.
Here is a simple patch to disable the TSC if it is not constant, and to fall back to a more costly but more accurate approach (clock_gettime if supported).

```diff
diff --git a/opal/mca/timer/linux/timer_linux_component.c b/opal/mca/timer/linux/timer_linux_component.c
index 5c16ac9..64a55aa 100644
--- a/opal/mca/timer/linux/timer_linux_component.c
+++ b/opal/mca/timer/linux/timer_linux_component.c
@@ -97,7 +97,7 @@ find_info(FILE* fp, char *str, char *buf, size_t buflen)
     return NULL;
 }
 
-static int opal_timer_linux_find_freq(void)
+static int opal_timer_linux_find_freq(int* constant_tsc)
 {
     FILE *fp;
     char *loc;
@@ -111,11 +111,15 @@ static int opal_timer_linux_find_freq(void)
     }
 
     opal_timer_linux_freq = 0;
-
+    *constant_tsc = 0;
 #if OPAL_ASSEMBLY_ARCH == OPAL_ARM64
-    opal_timer_linux_freq = opal_sys_timer_freq();
+    opal_timer_linux_freq = opal_sys_timer_freq();
 #endif
-
+    loc = find_info(fp, "flags", buf, 1024);
+    if( NULL != loc ) {
+        if( NULL != strstr(loc, "constant_tsc") )
+            *constant_tsc = 1;
+    }
     if (0 == opal_timer_linux_freq) {
         /* first, look for a timebase field. probably only on PPC,
            but one never knows */
@@ -164,10 +168,10 @@ static int opal_timer_linux_find_freq(void)
 
 int opal_timer_linux_open(void)
 {
-    int ret = OPAL_SUCCESS;
+    int ret = OPAL_SUCCESS, constant_tsc = 0;
 
     if (mca_timer_base_monotonic && !opal_sys_timer_is_monotonic ()) {
-#if OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC)
+#if OPAL_HAVE_CLOCK_GETTIME
         struct timespec res;
         if( 0 == clock_getres(CLOCK_MONOTONIC, &res)) {
             opal_timer_linux_freq = 1.e3;
@@ -175,14 +179,28 @@ int opal_timer_linux_open(void)
             opal_timer_base_get_usec = opal_timer_base_get_usec_clock_gettime;
             return ret;
         }
-#else
+#endif /* OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC) */
         /* Monotonic time requested but cannot be found. Complain! */
         opal_show_help("help-opal-timer-linux.txt", "monotonic not supported", true);
-#endif /* OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC) */
     }
-    ret = opal_timer_linux_find_freq();
-    opal_timer_base_get_cycles = opal_timer_base_get_cycles_sys_timer;
-    opal_timer_base_get_usec = opal_timer_base_get_usec_sys_timer;
+    ret = opal_timer_linux_find_freq(&constant_tsc);
+    if( ret >= 0 ) {
+        if( !(mca_timer_base_monotonic && (0 == constant_tsc)) ) {
+            opal_timer_base_get_cycles = opal_timer_base_get_cycles_sys_timer;
+            opal_timer_base_get_usec = opal_timer_base_get_usec_sys_timer;
+            return OPAL_SUCCESS;
+        }
+    }
+#if OPAL_HAVE_CLOCK_GETTIME
+    /* The frequency might not be constant, so we should not use it */
+    struct timespec res;
+    if( 0 == clock_getres(CLOCK_MONOTONIC, &res)) {
+        opal_timer_linux_freq = 1.e3;
+        opal_timer_base_get_cycles = opal_timer_base_get_cycles_clock_gettime;
+        opal_timer_base_get_usec = opal_timer_base_get_usec_clock_gettime;
+        return OPAL_SUCCESS;
+    }
+#endif /* OPAL_HAVE_CLOCK_GETTIME */
 
     return ret;
 }
```
Sorry, I do not have that much time to look into this, so I might get some stuff wrong. First of all, the CPU has the constant_tsc flag set. As far as I can see, the regression was introduced by #2596, because a checkout of b393c3b shows my issue and a5ab6eb does not.

My current understanding is that "constant_tsc" means that all the CPU cores tick at a constant rate (I would assume 2.4 GHz for the E5-2680v4), independent of the frequency they are currently operating at, and the commit referenced above makes use of that. I suspect that the issue arises when converting these CPU ticks to seconds for MPI_Wtime. From the few minutes I could spare looking into the code, I saw that the "cpu MHz" field in /proc/cpuinfo might be used to convert ticks to seconds. However, /proc/cpuinfo would definitely not be constant: it reflects the frequency set using acpi-cpufreq and might even change depending on CPU load when the "ondemand" governor is in use.

If /proc/cpuinfo is used to convert ticks to seconds and the CPU ticks at a constant 2.4 GHz, setting the CPU to 1.2 GHz would therefore make Open MPI count two seconds for every 2.4e9 ticks. This would be consistent with the results of the OSU tests I ran. I do not know if the conversion changes during an MPI run, as the reference value might just get read once during startup. However, what is in /proc/cpuinfo can depend on the current CPU load. As I said, this could be incorrect, as I am not familiar with the Open MPI code.

(The value in /proc/cpuinfo also depends on whether the turbo is on or off. With turbo on, the frequency is written as 2401.000 MHz. I guess that would mean a drift of about 1.5 seconds per hour.)
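As a sanity check of the magnitudes in the comment above, a small standalone program (my own illustration, assuming a TSC that really ticks at a constant 2.4 GHz while /proc/cpuinfo reports the scaled value) reproduces both the factor of two and the ~1.5 s/hour drift:

```c
/* Back-of-the-envelope check of the numbers above (illustration only,
 * not Open MPI code). One real second produces tsc_hz ticks; dividing
 * by the wrong reference frequency yields the reported time. */
#include <stdio.h>

int main(void)
{
    const double tsc_hz    = 2.4e9;   /* assumed constant TSC rate           */
    const double scaled_hz = 1.2e9;   /* "cpu MHz" with governor at 1.2 GHz  */
    const double turbo_hz  = 2.401e9; /* "cpu MHz" with turbo enabled        */

    printf("reported per real second at 1.2 GHz: %.3f s\n",
           tsc_hz / scaled_hz);                  /* 2.000: the factor of two */
    printf("drift with turbo on: %.2f s per hour\n",
           3600.0 * (1.0 - tsc_hz / turbo_hz));  /* about 1.50 s/hour        */
    return 0;
}
```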
@jsquyres do we want a fix for this in 2.1.0?
@hppritcha This is a bug. It looks like we have been using the wrong clock for some time. It is correct in most cases, but if the user boosts the base clock, I think we can get the wrong result. @AndreasKempfNEC Does your BIOS boost the base clock? I don't see any other way the tsc calculation could be wrong.
I understand that "cpu MHz" is a mere supposition. Basically, we read the frequency of the first proc at the moment the process is starting up (not even the core associated with the current process), and we assume this frequency remains unchanged for the rest of the application. The more we dig into this, the more obvious it becomes that we should restrict our usage of the TSC to only the cases where the scaling governor is set to performance and the TSC is constant. In all other cases, we should fall back to a safer, and certainly more expensive, solution (clock_gettime). There is also the case where the governor is set to ondemand, in which case we could use the max frequency and assume that all cores with MPI processes will be escalated to the max frequency. Or just rely on a more expensive, but more accurate, timer.
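A minimal sketch of the governor test described here (a hypothetical helper, not Open MPI code; it reads the standard Linux cpufreq sysfs file, which may be absent on systems without cpufreq support):

```c
/* Hypothetical sketch: decide whether the scaling governor on cpu0 is
 * "performance" by reading the standard cpufreq sysfs file. */
#include <stdio.h>
#include <string.h>

static int governor_is_performance(void)
{
    char buf[64] = "";
    FILE *fp = fopen(
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "r");
    if (NULL == fp)
        return 0;   /* no cpufreq sysfs: be conservative, avoid the TSC */
    if (NULL == fgets(buf, sizeof(buf), fp))
        buf[0] = '\0';
    fclose(fp);
    return 0 == strncmp(buf, "performance", 11);
}
```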
If I understand correctly, with
so could using the 3.60 GHz frequency (instead of 3.591947 GHz) be good enough?
We could, as this value is indeed supposed to be the maximum. But then again, this only works if the governor switches the cores to max perf.
Did you mean "only works if the
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Just as an FYI: there are several installations intending to field dynamic power management in the next few years. These will definitely operate by adjusting frequency on the fly, including adjustments on a per-cpu (or per-group-of-cpus) basis. So I don't think you'll be able to rely on a constant value.
@rhc54 We can rely on a constant value for the TSC clock unless Intel is thinking about abandoning the constant TSC. If they do, we will detect it by the constant tsc bit not being set in the cpuid register. @zzzoom has the right approach for getting the constant tsc. I can take a crack at implementing that unless @zzzoom wants to do it.
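For reference, the CPUID detection mentioned above can be sketched as follows (my sketch, using the GCC/Clang builtins; per the Intel SDM, "Invariant TSC" is bit 8 of EDX in leaf 0x80000007, the architectural counterpart of the constant_tsc flag in /proc/cpuinfo):

```c
/* Sketch of CPUID-based invariant-TSC detection on x86 (GCC/Clang). */
#include <cpuid.h>

static int have_invariant_tsc(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (0 == __get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;            /* extended leaf not supported */
    return (edx >> 8) & 1;   /* CPUID.80000007H:EDX[8] == Invariant TSC */
}
```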
I can give it a shot over the weekend if it isn't urgent.
What about all the other Linux-based machines (i.e., the ones that are not based on IA-32)?
Just for reference, the "cpu MHz" fields in /proc/cpuinfo of a two-socket Xeon CPU E5-2670 (Sandy Bridge) with hyperthreading on, using the "Intel P-State driver" (this one does not report clean values, and is probably what ggouaillardet has active), look like the following
But why did this only become a problem after b393c3b? Did the OPAL_TIMER_MONOTONIC stuff change the default timer? Reading the nominal frequency from a CPU register sounds very much like the correct solution. I will change the title of this report in the meantime.
@hjelmn I wasn't trying to imply anything about future directions. I was only trying to indicate that you can't just take one value from /proc/cpuinfo and assume one can use it for the entire life of the process. I concur with the proposed register-based approach so long as you periodically update from it, at least for IA machines.
Reading the TSC frequency is definitely the best (and only?) way to handle the constant_tsc case. However, I've been toying around with CPUID leaf 15h, and this method only works for a subset of processors, while recent ones require a hardcoded table because the data isn't available on the processor to query.
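A sketch of the leaf-15h query being referred to (my illustration; per the Intel SDM, EAX/EBX give the TSC-to-core-crystal-clock ratio and ECX the crystal frequency in Hz). The zeroed-ECX case is exactly why a hardcoded per-model table becomes necessary:

```c
/* Sketch of the CPUID leaf 0x15 TSC-frequency query (GCC/Clang on x86). */
#include <cpuid.h>

/* Returns the TSC frequency in Hz, or 0 if it cannot be determined. */
static unsigned long long tsc_hz_from_cpuid_15h(void)
{
    unsigned int den = 0, num = 0, crystal_hz = 0, edx = 0;
    if (0 == __get_cpuid(0x15, &den, &num, &crystal_hz, &edx))
        return 0;                 /* leaf 15h not supported */
    if (0 == den || 0 == num || 0 == crystal_hz)
        return 0;                 /* data not enumerated: need a table */
    return (unsigned long long)crystal_hz * num / den;
}
```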
Hi there,
This problem is a deal breaker for us, but I need some of the other fixes that were made in 2.0.2. Is it sensible to go with the following revert until 2.0.3 (or 2.1.0) is out with the proper fix, please? I had to hack a couple of files about a bit... (a slow
Thanks, Mark
@bosilca, is that patch considered production-ready or simply a hack?
@yqin the patch above is a fix for CPUs on which the
I think this is what has me so confused. The initial report targeted the 2.x series - yet somehow, that code seems to be the one being pushed here. So is the code in v2.x wrong? If so, why would we want to push it into 1.10? Or did someone modify master and then backport the correct fix to everything but 1.10?
@jsquyres Correct. We cannot use
I see - and the dead code block was left for future developers to consider (as per the comment). I have no objections to releasing a final 1.10.7 for cleanup purposes, if someone wants to assemble the PR.
* See open-mpi#3003 for a discussion about this patch. Once we get a better version in place we can revert this change. Signed-off-by: Joshua Hursey <[email protected]>
better solution needed later; workaround for open-mpi#3003. Signed-off-by: Howard Pritchard <[email protected]>
Assembling the PR, I took 1.10, then added 48d13aa and b933152 (visible at https://github.com/sjeaugey/ompi/tree/wtime-fix-v1.10). @hjelmn I see 4009ba6 is a bug fix, but is it really related to this issue? I can add it too; I'm just trying to avoid changing too many things. And in all cases, I'm not very comfortable with those "workaround patches" (48d13aa and b933152), and I'm wondering if simply reverting 5a44018 wouldn't roll back the situation to what it was in 1.10.5. @hjelmn, did we need that patch to fix something (or for some other code) or was it just meant as an improvement?
Okay, if the latest versions of all release branches are free from this problem, this report no longer needs to stay open. I guess I will close it pretty soon, then. Or somebody else can do so.
@AndreasKempfNEC Probably want to wait for @sjeaugey's PR for the v1.10 branch -- that should close out the issue on all the current release branches.
@jsquyres @hppritcha we see on some systems that MPI_Wtime() reports wrong results (due to cpu frequency scaling). Why was it decided to use clock_gettime() on Linux, and not gettimeofday()?
There were user requests to make MPI_WTIME more accurate. All the branches should now have -- at least temporarily? -- set MPI_WTICK/MPI_WTIME to use
@jsquyres according to https://github.com/open-mpi/ompi/blob/v2.x/ompi/mpi/c/wtime.c#L60, clock_gettime() is used on Linux - what am I missing?
@yosefe My mistake -- I was thinking about the fact that we commented out the
You're saying that
@jsquyres we are running the osu_bandwidth benchmark, which reported unreasonable [too good to be true] results on a particular system (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz), and after changing MPI_Wtime() to gettimeofday() the results are OK.
@bwbarrett @bosilca @hjelmn Have any of you guys heard of
According to what I can find in Intel manuals, clock_gettime(CLOCK_MONOTONIC) is guaranteed to be independent of cpu frequency starting with Nehalem in 2008.
@yosefe I don't doubt your results, but per Intel's documentation, I have to wonder if something else was going on when you ran your tests...? Can you make a small example that shows that
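A minimal standalone check along those lines (my sketch, not from the thread) would be to measure a known wall-clock interval with CLOCK_MONOTONIC and see whether the result diverges once the governor rescales the frequency:

```c
/* Sketch: sleep for 10 s of real time and compare against
 * CLOCK_MONOTONIC. If clock_gettime() depended on the cpu frequency,
 * the measured value would diverge from ~10 s under frequency scaling. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    sleep(10);                                   /* 10 s of wall time */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("CLOCK_MONOTONIC measured %.6f s (expect ~10 s)\n", elapsed);
    return 0;
}
```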
@jsquyres thanks, I'll try to dig some more into this.
@jsquyres we checked it again, and we had an older version of OMPI which measured cycle count and not clock_gettime(). Also, a small clock_gettime() test showed the correct timing. So we are good, sorry for the confusion.
I created a Wiki page https://github.com/open-mpi/ompi/wiki/Timers to revisit this issue in the future.
Dear Open MPI team,
A few days ago, my colleague Daniel Tameling noticed severe performance issues when running the HPCC benchmark with Open MPI. After spending quite some time tracking down the reason, we suspect that a regression was introduced between Open MPI 2.0.1 and 2.0.2. More specifically, the Open MPI 2.0.1 release tarball seems to be okay, while the 2.0.2 release shows issues that persist through the openmpi-v2.0.x-201702170256-5fa504b nightly build.
The issue is that the new releases seem to be severely affected by the CPU frequency set using the acpi-cpufreq driver. When the "ondemand" governor is active and set to allow frequencies between 1.20 GHz and 2.40 GHz, the performance difference between Open MPI 2.0.1 and versions >= 2.0.2 is almost a factor of two. Only when the "userspace" governor is used to pin the frequency to the maximum plus turbo do the two versions show similar performance.
It does not seem to depend on the PML, BTL, MTL or even fabric. We tested FDR, EDR, openib, mxm, ob1, yalla, cm (on IB with Slurm), and cm, psm2 (on OPA with PBS).
The following latencies were measured on 2 nodes with two-socket Intel Xeon E5-2680v4 CPUs, connected using InfiniBand EDR:
ompi-2.0.3-5fa504b, "ondemand" at 1.20 GHz-2.40 GHz:
ompi-2.0.1, "ondemand" at 1.20 GHz-2.40 GHz:
ompi-2.0.3-5fa504b, "userspace" at 1.8 GHz:
ompi-2.0.1, "userspace" at 1.8 GHz:
ompi-2.0.3-5fa504b, "userspace" at 2.4 GHz (turbo on):
ompi-2.0.1, "userspace" at 2.4 GHz (turbo on):
The Open MPI version (2.0.2a pre-release) in the HPC-X toolkit version 1.8.0 shows the same issues. Earlier releases (e.g., 1.10.2) also seem to be unaffected.
We are quite stumped as to what could be going on. (My gut feeling would be to blame the recent timer changes, but I really have no idea.)
In any case, thank you for your work on Open MPI!