CPU frequency-dependent timer issues starting with 2.0.2 #3003

Closed
AndreasKempfNEC opened this issue Feb 17, 2017 · 72 comments

@AndreasKempfNEC

Dear Open MPI team,

A few days ago, my colleague Daniel Tameling noticed severe performance issues when running the HPCC benchmark with Open MPI. After spending quite some time tracking down the cause, we suspect that a regression was introduced between Open MPI 2.0.1 and 2.0.2. More specifically, the Open MPI 2.0.1 release tarball seems to be fine, while the 2.0.2 release shows issues that persist through the openmpi-v2.0.x-201702170256-5fa504b nightly build.

The issue is that the new releases seem to be severely affected by the CPU frequency set using the acpi-cpufreq driver. When the "ondemand" governor is active and allowed to vary the frequency between 1.20 GHz and 2.40 GHz, the performance difference between Open MPI 2.0.1 and versions >= 2.0.2 is almost a factor of two. Only when the "userspace" governor is used to pin the frequency to the maximum (with turbo) do the two versions show similar performance.

It does not seem to depend on the PML, BTL, MTL, or even the fabric. We tested FDR, EDR, openib, mxm, ob1, yalla, cm (on InfiniBand with Slurm), and cm, psm2 (on Omni-Path with PBS).

The following latencies were measured on two dual-socket Intel Xeon E5-2680 v4 nodes connected via InfiniBand EDR:

ompi-2.0.3-5fa504b, "ondemand" at 1.20 GHz-2.40 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING:

You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[hsw006:37946] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:47684] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[1487330514.524672] [hsw006:37959:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
[1487330514.522429] [hsw007:47691:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       2.23
1                       2.23
2                       2.20
4                       2.19
8                       2.19
16                      2.27
32                      2.28
64                      2.37
128                     3.21
256                     3.37
512                     3.61
1024                    3.98
2048                    4.76
4096                    6.41
8192                   10.12
16384                  16.37
32768                  20.39
65536                  26.40
131072                 37.37
262144                 59.33
524288                102.96
1048576               191.14
2097152               364.29
4194304               711.79

ompi-2.0.1, "ondemand" at 1.20 GHz-2.40 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:37973] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:47714] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[1487330533.943496] [hsw006:37990:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
[1487330533.948853] [hsw007:47721:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 2401.00
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.15
1                       1.14
2                       1.13
4                       1.10
8                       1.10
16                      1.13
32                      1.14
64                      1.17
128                     1.58
256                     1.65
512                     1.77
1024                    1.96
2048                    2.35
4096                    3.23
8192                    5.05
16384                   8.17
32768                  10.21
65536                  13.27
131072                 18.74
262144                 29.59
524288                 51.32
1048576                95.44
2097152               182.00
4194304               355.76

ompi-2.0.3-5fa504b, "userspace" at 1.8 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:41654] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:51373] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.74
1                       1.78
2                       1.78
4                       1.78
8                       1.78
16                      1.83
32                      1.84
64                      1.93
128                     2.54
256                     2.70
512                     2.93
1024                    3.29
2048                    4.04
4096                    5.62
8192                    9.20
16384                  12.25
32768                  14.89
65536                  18.98
131072                 26.13
262144                 41.60
524288                 69.95
1048576               128.21
2097152               244.13
4194304               475.81

ompi-2.0.1, "userspace" at 1.8 GHz:

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:41690] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:51407] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.27
1                       1.30
2                       1.30
4                       1.30
8                       1.30
16                      1.35
32                      1.35
64                      1.38
128                     1.86
256                     1.97
512                     2.14
1024                    2.43
2048                    2.99
4096                    4.23
8192                    6.81
16384                   9.12
32768                  11.11
65536                  14.12
131072                 19.51
262144                 30.47
524288                 52.76
1048576                95.91
2097152               182.81
4194304               356.55

ompi-2.0.3-5fa504b, "userspace" at 2.4 GHz (turbo on):

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:45372] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:55141] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.09
1                       1.10
2                       1.10
4                       1.09
8                       1.09
16                      1.14
32                      1.14
64                      1.17
128                     1.60
256                     1.68
512                     1.79
1024                    1.98
2048                    2.37
4096                    3.21
8192                    5.07
16384                   8.20
32768                  10.20
65536                  13.22
131072                 18.69
262144                 29.67
524288                 51.43
1048576                95.50
2097152               182.12
4194304               355.87

ompi-2.0.1, "userspace" at 2.4 GHz (turbo on):

$ mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ~/opt/osu-5.3-ompi2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency'
[snip warning]
[hsw006:45403] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw007:55175] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test v5.3
# Size          Latency (us)
0                       1.10
1                       1.11
2                       1.09
4                       1.09
8                       1.08
16                      1.13
32                      1.13
64                      1.16
128                     1.57
256                     1.65
512                     1.76
1024                    1.96
2048                    2.35
4096                    3.23
8192                    5.04
16384                   8.15
32768                  10.17
65536                  13.21
131072                 18.71
262144                 29.55
524288                 51.31
1048576                95.37
2097152               182.00
4194304               355.76

The Open MPI version (2.0.2a pre-release) in the HPC-X toolkit version 1.8.0 shows the same issues. Earlier releases (e.g., 1.10.2) also seem to be unaffected.

We are quite stumped as to what could be going on. (My gut feeling would be to blame the recent timer changes, but I really have no idea.)

In any case, thank you for your work on Open MPI!

@ggouaillardet
Contributor

That could be a side effect of the timer change, and what you observe could be either a genuinely higher latency, or incorrect timers suggesting a higher latency (!).

Can you measure the elapsed time of the osu_latency benchmark (ideally zero-size messages only, with a lot of iterations, so the elapsed time is ~1 minute), then compare it with the elapsed time under the other ompi release?
If both elapsed times are similar, but the reported performance differs by a factor of two, then the issue is that the timer is incorrect, not that performance has dropped.

@AndreasKempfNEC
Author

Funnily enough, the wall time does, indeed, seem to be roughly equal with the OSU test:

("userspace" at 1.2 GHz)

$ time ~/opt/ompi-2.0.3-5fa504b/bin/mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ./a.out -i 30000000'; time ~/opt/ompi-2.0.1-check/bin/mpirun -n 2 -mca pml yalla -mca rmaps_dist_device mlx5_0:1 -mca coll_hcoll_enable 0 -x MXM_IB_PORTS=mlx5_0:1 -x MXM_TLS=rc,self,shm -mca rmaps_base_mapping_policy dist:span -map-by node --report-bindings bash -c 'ulimit -s 10240; ./a.out -i 30000000';
[snip warning]
[hsw010:17030] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw011:00649] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test
# Size            Latency (us)
0                         3.04

real	1m32.585s
user	1m31.418s
sys	0m0.250s
[snip warning]
[hsw010:17091] MCW rank 0 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
[hsw011:00714] MCW rank 1 bound to socket 1[core 14[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../..]
# OSU MPI Latency Test
# Size            Latency (us)
0                         1.50

real	1m31.565s
user	1m30.389s
sys	0m0.233s

From a cursory glance at the HPCC source code, its times all seem to be measured using MPI_Wtime as well. It could really be that the wall time never changed during the longer tests and we simply did not notice! But then, why would anyone doubt the notion of time?
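
For anyone hitting something similar, a minimal cross-check (a sketch, not taken from OSU or Open MPI) is to compare MPI_Wtime against CLOCK_MONOTONIC over a fixed sleep; a ratio far from 1 means the MPI timer is miscalibrated:

    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double w0 = MPI_Wtime();
        sleep(5);                              /* a known real-time interval */
        double w1 = MPI_Wtime();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double mono = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("MPI_Wtime: %.3f s  CLOCK_MONOTONIC: %.3f s  ratio: %.3f\n",
               w1 - w0, mono, (w1 - w0) / mono);

        MPI_Finalize();
        return 0;
    }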

@ggouaillardet
Contributor

My initial intuition was that the new timer is based on clock ticks, with time = ticks / frequency.
If the CPU clock increases, you get more ticks for the same elapsed time. Since we do not seem to adjust the frequency, MPI_Wtime() runs too fast, so the reported latency is higher than it really is.
For the time being, I suggest you force a timer based on gettimeofday().
(I am AFK and do not know how to do that offhand.)
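
To illustrate the suspected failure mode with made-up numbers (a sketch, not Open MPI code): ticks accumulate at one rate, but the conversion divides by a frequency sampled while the CPU was clocked down.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint64_t tick_hz = 2400000000ULL;   /* rate at which ticks actually accumulate */
        const double calibrated_hz = 1.2e9;       /* frequency read once at startup (1.2 GHz) */

        uint64_t ticks = 24 * tick_hz;            /* ticks accumulated over 24 s of real time */
        double reported = ticks / calibrated_hz;  /* what a tick-based timer would report */

        printf("real: 24 s, reported: %.0f s\n", reported);  /* prints 48 s: 2x too long */
        return 0;
    }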

@hjelmn @bosilca can you please comment on that issue ?

@AndreasKempfNEC
Author

AndreasKempfNEC commented Feb 17, 2017

In opal/mca/timer/linux/timer_linux_component.c, the frequency seems to be taken from /proc/cpuinfo, whose "cpu MHz" field is modified by acpi-cpufreq. The comment in the code was probably true until the timer was changed:

    fp = fopen("/proc/cpuinfo", "r");
    ...
    if (0 == opal_timer_linux_freq) {
        /* find the CPU speed - most timers are 1:1 with CPU speed */
        loc = find_info(fp, "cpu MHz", buf, 1024);
        if (NULL != loc) {
            ret = sscanf(loc, "%f", &cpu_f);
            if (1 == ret) {
                /* number is in MHz - convert to Hz and make an integer */
                opal_timer_linux_freq = (opal_timer_t) (cpu_f * 1000000);
            }
        }
    }

Now that I think about it, this is particularly insidious when using "ondemand" with HPCC because all the tests are different and the CPU might run them at different frequencies.

@zzzoom
Contributor

zzzoom commented Feb 21, 2017

My changes to the timers aren't in 2.0.x yet, and they shouldn't affect this.
Newer processors increment the TSC at the same rate regardless of frequency. @AndreasKempfNEC, does your /proc/cpuinfo list constant_tsc in its flags? If it doesn't, the problem might be that the tick rate is queried at init, before the processor raises its frequency under load.
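
For reference, a quick way to check the flag (assuming the usual Linux /proc layout; no output means it is absent):

$ grep -m1 -o constant_tsc /proc/cpuinfo
constant_tsc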

@bosilca
Member

bosilca commented Feb 21, 2017

Here is a simple patch to disable TSC if it is not constant, and to fall back to a more costly but more accurate approach (clock_gettime if supported).

diff --git a/opal/mca/timer/linux/timer_linux_component.c b/opal/mca/timer/linux/timer_linux_component.c
index 5c16ac9..64a55aa 100644
--- a/opal/mca/timer/linux/timer_linux_component.c
+++ b/opal/mca/timer/linux/timer_linux_component.c
@@ -97,7 +97,7 @@ find_info(FILE* fp, char *str, char *buf, size_t buflen)
     return NULL;
 }

-static int opal_timer_linux_find_freq(void)
+static int opal_timer_linux_find_freq(int* constant_tsc)
 {
     FILE *fp;
     char *loc;
@@ -111,11 +111,15 @@ static int opal_timer_linux_find_freq(void)
     }

     opal_timer_linux_freq = 0;
-
+    *constant_tsc = 0;
 #if OPAL_ASSEMBLY_ARCH == OPAL_ARM64
-       opal_timer_linux_freq = opal_sys_timer_freq();
+    opal_timer_linux_freq = opal_sys_timer_freq();
 #endif
-
+    loc = find_info(fp, "flags", buf, 1024);
+    if( NULL != loc) {
+        if( NULL != strstr(loc, "constant_tsc") )
+            *constant_tsc = 1;
+    }
     if (0 == opal_timer_linux_freq) {
         /* first, look for a timebase field.  probably only on PPC,
            but one never knows */
@@ -164,10 +168,10 @@ static int opal_timer_linux_find_freq(void)

 int opal_timer_linux_open(void)
 {
-    int ret = OPAL_SUCCESS;
+    int ret = OPAL_SUCCESS, constant_tsc = 0;

     if (mca_timer_base_monotonic && !opal_sys_timer_is_monotonic ()) {
-#if OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC)
+#if OPAL_HAVE_CLOCK_GETTIME
         struct timespec res;
         if( 0 == clock_getres(CLOCK_MONOTONIC, &res)) {
             opal_timer_linux_freq = 1.e3;
@@ -175,14 +179,28 @@ int opal_timer_linux_open(void)
             opal_timer_base_get_usec = opal_timer_base_get_usec_clock_gettime;
             return ret;
         }
-#else
+#endif  /* OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC) */
         /* Monotonic time requested but cannot be found. Complain! */
         opal_show_help("help-opal-timer-linux.txt", "monotonic not supported", true);
-#endif  /* OPAL_HAVE_CLOCK_GETTIME && (0 == OPAL_TIMER_MONOTONIC) */
     }
-    ret = opal_timer_linux_find_freq();
-    opal_timer_base_get_cycles = opal_timer_base_get_cycles_sys_timer;
-    opal_timer_base_get_usec = opal_timer_base_get_usec_sys_timer;
+    ret = opal_timer_linux_find_freq(&constant_tsc);
+    if( ret >= 0 ) {
+        if( !(mca_timer_base_monotonic && (0 == constant_tsc)) ) {
+            opal_timer_base_get_cycles = opal_timer_base_get_cycles_sys_timer;
+            opal_timer_base_get_usec = opal_timer_base_get_usec_sys_timer;
+            return OPAL_SUCCESS;
+        }
+    }
+#if OPAL_HAVE_CLOCK_GETTIME
+    /* The frequency might not be constant, so we should not use it */
+    struct timespec res;
+    if( 0 == clock_getres(CLOCK_MONOTONIC, &res)) {
+        opal_timer_linux_freq = 1.e3;
+        opal_timer_base_get_cycles = opal_timer_base_get_cycles_clock_gettime;
+        opal_timer_base_get_usec = opal_timer_base_get_usec_clock_gettime;
+        return OPAL_SUCCESS;
+    }
+#endif  /* OPAL_HAVE_CLOCK_GETTIME */
     return ret;
 }

@AndreasKempfNEC
Author

Sorry, I do not have that much time to look into this, so I might get some stuff wrong.

First of all, the CPU does have the constant_tsc flag set. As far as I can see, the regression was introduced by #2596, because a checkout of b393c3b shows the issue and a5ab6eb does not.

My current understanding is that constant_tsc means all CPU cores tick at a constant rate (I would assume 2.4 GHz for the E5-2680 v4), independent of the frequency they are currently operating at, and that the commit referenced above makes use of this. I suspect the issue arises when converting these CPU ticks to seconds for MPI_Wtime.

From the few minutes I could spare looking at the code, I saw that the "cpu MHz" field in /proc/cpuinfo might be used for that conversion. However, /proc/cpuinfo is definitely not constant: it reflects the frequency set via acpi-cpufreq and can even change with CPU load when the "ondemand" governor is in use. If /proc/cpuinfo is used to convert ticks to seconds while the CPU ticks at a constant 2.4 GHz, setting the CPU to 1.2 GHz would make Open MPI count two seconds for every 2.4e9 ticks. This would be consistent with the results of the OSU tests I ran. I do not know whether the conversion factor changes during an MPI run; the reference value might be read only once during startup, but what is in /proc/cpuinfo depends on the current CPU load.

As I said, this could be incorrect as I am not familiar with the Open MPI code.

(The value in /proc/cpuinfo also depends on whether turbo is on or off. With turbo on, the frequency is written as 2401.000 MHz. I guess that would mean a drift of about 1.5 seconds per hour.)
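
(For the record, the arithmetic behind that drift estimate: if ticks accumulate at 2400 MHz but are divided by 2401 MHz, the timer runs slow by a factor of 2400/2401, losing 3600 s × 1/2401 ≈ 1.5 s per hour.)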

@hppritcha
Member

@jsquyres do we want a fix for this in 2.1.0?

@hjelmn
Member

hjelmn commented Feb 23, 2017

@hppritcha This is a bug. It looks like we have been using the wrong clock for some time. It is correct in most cases but if the user boosts the base clock I think we can get the wrong result.

@AndreasKempfNEC Does your BIOS boost the base clock? I don't see any other way the tsc calculation could be wrong.

@bosilca
Member

bosilca commented Feb 23, 2017

I understand that "cpu MHz" is a mere supposition. Basically, we read the frequency of the first proc at the moment the process starts up (not even of the core associated with the current process), and we assume this frequency remains unchanged for the rest of the application. The more we dig into this, the more obvious it becomes that we should restrict our usage of the TSC to only those cases where the scaling governor is set to performance and the TSC is constant. In all other cases, we should fall back to a safer, and certainly more expensive, solution (clock_gettime).

There is also the case where the governor is set to ondemand, in which case we could use the max frequency and assume that all cores running MPI processes will be escalated to it. Or we could just rely on a more expensive, but more accurate, timer.
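
A sketch of how the governor could be detected, assuming the standard Linux cpufreq sysfs layout (this is illustrative, not Open MPI code):

    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if cpu0's scaling governor is "performance", 0 otherwise
     * (including when cpufreq is not available at all). */
    static int governor_is_performance(void)
    {
        char buf[64] = "";
        FILE *fp = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "r");
        if (NULL == fp) return 0;
        if (NULL == fgets(buf, sizeof(buf), fp)) buf[0] = '\0';
        fclose(fp);
        return 0 == strncmp(buf, "performance", strlen("performance"));
    }

    int main(void)
    {
        printf("performance governor: %s\n", governor_is_performance() ? "yes" : "no");
        return 0;
    }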

@ggouaillardet
Contributor

ggouaillardet commented Feb 23, 2017

If I understand correctly, with the constant_tsc flag the ticks occur at the processor's nominal speed.
Could we try to parse the "model name" value of /proc/cpuinfo?
For example, on my VM:

model name	: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
cpu MHz		: 3591.947

So could using the 3.60 GHz frequency (instead of 3.591947 GHz) be good enough?
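
A sketch of that parsing idea (hypothetical helper; it only works for CPUs whose brand string actually embeds the nominal frequency):

    #include <stdio.h>
    #include <string.h>

    /* Extract the nominal frequency from a "model name" value such as
     * "Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz"; returns 0.0 if absent. */
    static double nominal_ghz(const char *model_name)
    {
        const char *at = strstr(model_name, "@");
        double ghz = 0.0;
        if (NULL != at && 1 == sscanf(at + 1, " %lfGHz", &ghz)) {
            return ghz;
        }
        return 0.0;
    }

    int main(void)
    {
        printf("%.2f GHz\n", nominal_ghz("Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz"));
        return 0;
    }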

@bosilca
Member

bosilca commented Feb 23, 2017

We could, as this value is indeed supposed to be the maximum. But then again, this only works if the governor switches the cores to max performance.

@ggouaillardet
Contributor

ggouaillardet commented Feb 23, 2017

Did you mean "only works if the constant_tsc flag is set" instead?
Or am I missing something?

@zzzoom
Contributor

zzzoom commented Feb 23, 2017

Intel® 64 and IA-32 Architectures Software Developer’s Manual
18.18.3 Determining the Processor Base Frequency
For Intel processors in which the nominal core crystal clock frequency is enumerated in CPUID.15H.ECX and the core crystal clock ratio is encoded in CPUID.15H (see Table 3-8 “Information Returned by CPUID Instruction”), the nominal TSC frequency can be determined by using the following equation:
Nominal TSC frequency = ( CPUID.15H.ECX[31:0] * CPUID.15H.EBX[31:0] ) ÷ CPUID.15H.EAX[31:0]
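
For what it's worth, a minimal sketch of querying that leaf with GCC/Clang's <cpuid.h> (x86-only, and on many parts leaf 15h returns zeros, in which case this method cannot be used):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.15H: EBX/EAX = TSC / crystal-clock ratio, ECX = crystal clock in Hz */
        if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx)
            || 0 == eax || 0 == ebx || 0 == ecx) {
            fprintf(stderr, "CPUID.15H does not enumerate the TSC frequency here\n");
            return 1;
        }
        printf("nominal TSC frequency: %llu Hz\n",
               (unsigned long long)ecx * ebx / eax);
        return 0;
    }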

@rhc54
Contributor

rhc54 commented Feb 23, 2017

Just as an FYI: there are several installations intending to field dynamic power mgmt in the next few years. These will definitely operate by adjusting freq on-the-fly, including adjustments on a per-cpu (or per-group-of-cpus) basis. So I don't think you'll be able to rely on a constant value.

@hjelmn
Member

hjelmn commented Feb 23, 2017

@rhc54 We can rely on a constant value for the TSC clock unless Intel is thinking about abandoning the constant TSC. If they do, we will detect it by the constant_tsc bit not being set in the CPUID register. @zzzoom has the right approach for getting the constant TSC frequency. I can take a crack at implementing that unless @zzzoom wants to do it.

@zzzoom
Contributor

zzzoom commented Feb 23, 2017

I can give it a shot over the weekend if it isn't urgent.

@bosilca
Member

bosilca commented Feb 23, 2017

What about all the other Linux-based machines (i.e., the ones that are not IA-32 based)?

@AndreasKempfNEC
Author

Just for reference, here is what the "cpu MHz" fields in /proc/cpuinfo look like on a dual-socket Xeon E5-2670 (Sandy Bridge) with hyperthreading enabled, using the Intel P-state driver (which does not report clean values, and is probably what ggouaillardet has active):

$ grep 'cpu MHz' /proc/cpuinfo 
cpu MHz		: 1873.523 [<- so this one would be taken for the conversion, I guess?]
cpu MHz		: 1550.757
cpu MHz		: 1898.914
cpu MHz		: 2189.890
cpu MHz		: 1282.226
cpu MHz		: 1212.757
cpu MHz		: 1481.085
cpu MHz		: 1398.921
cpu MHz		: 1641.046
cpu MHz		: 1516.429
cpu MHz		: 2069.437
cpu MHz		: 1900.640
cpu MHz		: 1294.007
cpu MHz		: 1497.234
cpu MHz		: 1255.210
cpu MHz		: 1428.781
cpu MHz		: 1510.539
cpu MHz		: 1220.171
cpu MHz		: 2089.445
cpu MHz		: 1801.007
cpu MHz		: 1221.796
cpu MHz		: 1202.398
cpu MHz		: 1674.257
cpu MHz		: 1301.929
cpu MHz		: 2600.000
cpu MHz		: 2115.242
cpu MHz		: 1320.921
cpu MHz		: 1201.789
cpu MHz		: 1252.773
cpu MHz		: 1199.960
cpu MHz		: 2338.375
cpu MHz		: 1214.789

But why did this only become a problem after b393c3b? Did the OPAL_TIMER_MONOTONIC stuff change the default timer?

Reading the nominal frequency from a CPU register sounds very much like the correct solution.

I will change the title of this report in the meantime.

@AndreasKempfNEC AndreasKempfNEC changed the title CPU frequency-dependent performance issues starting with 2.0.2 CPU frequency-dependent timer issues starting with 2.0.2 Feb 23, 2017
@rhc54
Contributor

rhc54 commented Feb 23, 2017

@hjelmn I wasn't trying to imply anything about future directions. I was only trying to indicate that you can't just take one value from /proc/cpuinfo and assume one can use it for the entire life of the process. I concur with the proposed register-based approach so long as you periodically update from it, at least for IA machines.

@zzzoom
Contributor

zzzoom commented Feb 27, 2017

Reading the TSC frequency is definitely the best (and only?) way to handle the constant_tsc case. However, I've been toying around with CPUID leaf 15h, and this method only works for a subset of processors; recent ones require a hardcoded table because the data isn't available on the processor to query.
A good implementation would need to be kept up to date, because Intel seems to change this every generation or so (e.g., the Linux kernel implementation has a new hardcoded entry for Skylake-X, which hasn't even been released yet), and that doesn't look to me like something Open MPI should need to be concerned about.
So yeah, tl;dr: it can be done, but it would be ugly, and we should probably be prodding Linux kernel/glibc/hardware-library devs to export this information (if it isn't exported somewhere already), because they need to maintain it anyway.

@ccaamad

ccaamad commented Mar 1, 2017

Hi there,

This problem is a deal breaker for us, but I need some of the other fixes that were made in 2.0.2. Is it sensible to go with the following revert until 2.0.3 (or 2.1.0) is out with the proper fix, please? I had to hack a couple of files about a bit...

ccaamad/ompi@0fdc34b

(a slow MPI_Wtime is better than a broken MPI_Wtime!)

Thanks,

Mark

@yqin

yqin commented Mar 15, 2017

@bosilca, is that patch considered production ready or simply a hack?

@ggouaillardet
Contributor

@yqin The patch above is a fix for CPUs on which the constant_tsc flag is not set.
In this case, the flag is set, and the root cause is that the CPU frequency is not nominal (the governor is ondemand or userspace).
So depending on your configuration, the patch might or might not help.

@rhc54
Contributor

rhc54 commented May 3, 2017

I think this is what has me so confused. The initial report targeted the 2.x series - yet somehow, that code seems to be the one being pushed here. So is the code in v2.x wrong? If so, why would we want to push it into 1.10? Or did someone modify master and then backport the correct fix to everything but 1.10?

@hjelmn
Member

hjelmn commented May 3, 2017

@jsquyres Correct. We cannot use clock_gettime() for the internal timer used in opal_progress(). In v1.10.0-v1.10.1 we used the TSC for both opal_progress() and wtime. A user noticed that the clock was non-monotonic on an older x86 system, so we disabled it in v1.10.2. I added a check for the monotonic TSC and re-enabled it for v1.10.3. This issue came up shortly after that, because the way we calculate the TSC frequency is wrong in some cases. In response we made a couple of changes to master (and v2.0.x, v2.x, etc.) to make wtime always use gettimeofday() or clock_gettime(), and to always use a cycle counter (if available) for opal_progress(). This last set of changes is apparently not in v1.10.x, and @sjeaugey is requesting that we make a v1.10.7 with it.

@rhc54
Contributor

rhc54 commented May 3, 2017

I see - and the dead code block was left for future developers to consider (as per the comment).

I have no objections to releasing a final 1.10.7 for cleanup purposes, if someone wants to assemble the PR.

sjeaugey pushed a commit to sjeaugey/ompi that referenced this issue May 4, 2017
 * See open-mpi#3003 for a discussion about
   this patch. Once we get a better version in place we can revert this
   change.

Signed-off-by: Joshua Hursey <[email protected]>
sjeaugey pushed a commit to sjeaugey/ompi that referenced this issue May 4, 2017
better solution needed later
 workaround for open-mpi#3003

Signed-off-by: Howard Pritchard <[email protected]>
@sjeaugey
Member

sjeaugey commented May 4, 2017

Assembling the PR, I took 1.10, then added 48d13aa and b933152 (visible at https://github.com/sjeaugey/ompi/tree/wtime-fix-v1.10).

@hjelmn I see 4009ba6 is a bug fix, but is it really related to this issue? I can add it too; I'm just trying to avoid changing too many things.

In any case, I'm not very comfortable with those "workaround patches" (48d13aa and b933152), and I'm wondering if simply reverting 5a44018 wouldn't roll back the situation to what it was in 1.10.5. @hjelmn, did we need that patch to fix something (or for some other code), or was it just meant as an improvement?

@hjelmn
Member

hjelmn commented May 5, 2017

@sjeaugey We absolutely cannot roll back 5a44018; that would re-introduce the performance regression. The work-around patches are OK and well contained. I don't think we need 4009ba6, but it is a nice-to-have.

@AndreasKempfNEC
Author

Okay, if the latest versions of all release branches are free from this problem, this report no longer needs to stay open.

I guess I will close it pretty soon, then. Or somebody else can do so.

@jsquyres
Member

jsquyres commented May 5, 2017

@AndreasKempfNEC Probably want to wait for @sjeaugey's PR for the v1.10 branch -- that should close out the issue on all the current release branches.

@yosefe
Contributor

yosefe commented May 16, 2017

@jsquyres @hppritcha We see on some systems that MPI_Wtime() reports wrong results (due to CPU frequency scaling). Why was it decided to use clock_gettime() on Linux, and not gettimeofday()?
Can we go with the initial suggestion to "nail down MPI_WTICK/MPI_WTIME to use gettimeofday(3)"?

@jsquyres
Member

There were user requests to make MPI_WTIME more accurate.

All the branches should now have -- at least temporarily? -- set MPI_WTICK/MPI_WTIME to use gettimeofday(3).

@yosefe
Contributor

yosefe commented May 16, 2017

@jsquyres According to https://github.com/open-mpi/ompi/blob/v2.x/ompi/mpi/c/wtime.c#L60, clock_gettime() is used on Linux - what am I missing?

@jsquyres
Member

@yosefe My mistake -- I was thinking about the fact that we commented out the opal_* calls to get the times, which definitely had problems with frequency scaling.

You're saying that clock_gettime() itself is showing frequency scaling issues. That is new information to me. Can you show the code where this is happening?

@yosefe
Contributor

yosefe commented May 18, 2017

@jsquyres We are running the osu_bandwidth benchmark, which reported unreasonable [too good to be true] results on a particular system (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz); after changing MPI_Wtime() to gettimeofday(), the results are OK.
I can't point to the exact code in clock_gettime that is causing this, though...

@jsquyres
Member

@bwbarrett @bosilca @hjelmn Have any of you guys heard of clock_gettime() suffering from frequency scaling issues?

@rhc54
Contributor

rhc54 commented May 18, 2017

According to what I can find in Intel manuals, clock_gettime(CLOCK_MONOTONIC) is guaranteed to be independent of cpu freq starting with Nehalem in 2008.

@jsquyres
Member

@yosefe I don't doubt your results, but per Intel's documentation, I have to wonder if something else was going on when you ran your tests...? Can you make a small example that shows that clock_gettime(CLOCK_MONOTONIC) is dependent on CPU frequency?

@yosefe
Contributor

yosefe commented May 19, 2017

@jsquyres Thanks, I'll try to dig some more into this.

@yosefe
Contributor

yosefe commented May 24, 2017

@jsquyres We checked it again, and we had an older version of OMPI which measured cycle counts rather than using clock_gettime(). Also, a small clock_gettime() test showed the correct timing. So we are good; sorry for the confusion.

@kawashima-fj
Member

I created a Wiki page https://github.com/open-mpi/ompi/wiki/Timers to revisit this issue in the future.

@rivolt

rivolt commented Aug 24, 2020

Yalla
