
Add integer-based access to MPI_Wtime #77


Closed · mahermanns opened this issue Feb 1, 2018 · 17 comments
Labels: scheduled reading (Reading is scheduled for the next meeting), wg-tools (Tools Working Group)

Comments

@mahermanns (Member)

Problem

MPI provides standardized access to a time source through MPI_Wtime(); however, the returned timestamp is a floating-point number of seconds since some time in the past. If that reference point lies far in the past, the floating-point value loses resolution. Furthermore, most common time sources are integer-based, so the time information has to be converted to a floating-point value at additional cost.

Proposal

Provide two additional calls returning integer values for ticks since some time in the past and ticks per second. The time source should be the same as that for MPI_Wtime.
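To make the intent concrete, here is a minimal usage sketch. The MPIX_-prefixed names below are placeholders (the actual names are specified in the corresponding pull request), and the clock_gettime-based bodies are purely illustrative stand-ins for an implementation's time source:

```c
/* Sketch only: the MPIX_ names are placeholders for the two proposed calls,
 * implemented here on top of clock_gettime purely for illustration. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t MPIX_Wtime_ticks(void)          /* ticks since some time in the past */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

static uint64_t MPIX_Wtick_per_second(void)     /* ticks per second of that source */
{
    return UINT64_C(1000000000);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    uint64_t hz = MPIX_Wtick_per_second();      /* query once; constant for the run */
    uint64_t t0 = MPIX_Wtime_ticks();
    /* ... region to be timed ... */
    uint64_t t1 = MPIX_Wtime_ticks();

    /* Convert to seconds only in non-critical code, e.g. when printing. */
    printf("elapsed: %g s\n", (double)(t1 - t0) / (double)hz);

    MPI_Finalize();
    return 0;
}
```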

Changes to the Text

See the corresponding pull request.

Impact on Implementations

Implementations need to support the two additional function calls.

Impact on Users

Users can access integer-based timing information, with potentially lower overhead, while still benefiting from the convenient floating-point interface in parts of the code that are less time- or overhead-critical (e.g., when printing or writing out results).

References

Tools Ticket: mpiwg-tools/tools-issues#8

mahermanns added the "not ready" and "wg-tools (Tools Working Group)" labels on Feb 1, 2018
jeffhammond (Member) commented Feb 1, 2018

I have many objections:

  • I do not see how 64-bit floating-point numbers of seconds can lose precision in the scenario you describe. Can you give an example where this happens? If the difference of two very large numbers is a problem, then the implementation is permitted to (and should!) start the counter at zero when the job starts (see the sketch below this list).

  • Both seconds per tick and ticks per second are meaningless quantities in the context of variable-frequency processors. The duration over which these might be meaningful is less than a millisecond.

  • We already have MPI_Wtick. I see no reason to add a new function that returns the same information, albeit inverted and cast to an integer. And, as noted in the previous bullet, MPI_Wtick does not return a well-defined quantity anyway.

  • Good time interfaces like POSIX clock_gettime return an integer number of seconds and nanoseconds, rather than "ticks", because nanoseconds are a meaningful unit.

  • If programmers want to count "ticks" on a machine where this is meaningful (e.g. Blue Gene), they should use the machine-specific implementation. MPI should not attempt to standardize a feature that is not portable.
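To make the precision question in the first bullet concrete, here is a small illustrative snippet (not part of the proposal) that prints the spacing between adjacent double values at a few timestamp magnitudes; it shows why starting the counter at zero at job start preserves far more resolution than, say, an epoch-based value:

```c
#include <math.h>
#include <stdio.h>

/* Spacing (ulp) of a double timestamp t, in seconds. */
static double spacing(double t)
{
    return nextafter(t, INFINITY) - t;
}

int main(void)
{
    printf("1 hour elapsed   : %.3e s per ulp\n", spacing(3600.0));  /* ~4.5e-13 s */
    printf("1 year elapsed   : %.3e s per ulp\n", spacing(3.15e7));  /* ~3.7e-9 s  */
    printf("Unix-epoch based : %.3e s per ulp\n", spacing(1.5e9));   /* ~2.4e-7 s  */
    return 0;
}
```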

bosilca (Member) commented Feb 2, 2018

Interesting idea. I wondered what accuracy the user can get via the double timer, depending on the OS timer accuracy. For µs timers (1e-6) it will take about 544 years before losing one µs. Going to ns (1e-9) the result is more drastic: it will take 194 days before losing one ns. Thus, counting the time from the MPI job start, as proposed by @jeffhammond, will give us 194 days before losing our first nanosecond, and from there the accuracy will sharply decline.

@mahermanns (Member, Author)

@jeffhammond I agree that counting the time from the MPI job start does take care of the accuracy issue. Regarding the other objections, I did not necessarily mean clock ticks of a variable-frequency processor, but rather the abstract tick of a "clock on the wall", much like MPI_Wtick does not reflect the processor speed at the moment either.

The proposal is not trying to provide a new time source, but rather a complementary interface to the existing time source. As you mention, interfaces such as clock_gettime return time as an integer-based entity, and some performance tools already use this timer as is. The background of this proposal is a larger proposal that I currently only have a tools WG ticket for (mpiwg-tools/tools-issues#11), where the MPI implementation will return time-stamped information to the user/tool. If the tool uses an integer-based time format, it could use this call; if it uses a double-based time format, it could use MPI_Wtime as before. If the MPI implementation uses clock_gettime as its source, the tool would avoid the conversion from integer to double and back to integer.

@jeffhammond (Member)

@bosilca Well, I suppose if MPI fault-tolerance works as designed, people might actually attempt to run jobs for 6 months 😮

@mahermanns Indeed, @jdinan corrected my misunderstanding of "tick".

Since clock_gettime is sufficient, why can't tools just use that? In the past, MPI has added wrappers for a wide range of POSIX and ISO language features because we don't assume these are present, but it has led to quite a bit of ugliness. In many cases, there is a good justification "because (legacy) Fortran", but I recall the MPI_T interface doesn't support Fortran and - for good reason - targets C/C++ code. Are we really so concerned that MPI tool developers can't use clock_gettime that we need to provide another trivial wrapper for a POSIX call?

@mahermanns (Member, Author)

@jeffhammond If you take a look at mpiwg-tools/tools-issues#11 and the corresponding branch https://github.com/mpiwg-tools/mpi-standard/tree/issue_11_mpi_t_events, the plan here is that instead of a tool querying clock_gettime (or whatever time source it uses internally) within a callback function, the tool can use a special call to query the time an event happened. The MPI runtime can then either provide the current time (as the tool would have done with its own timer) or provide a time of the past, if the event was buffered internally, all transparent to the tool querying the event's timestamp.

The desire here was to use an integer-based timestamp (i.e., the same type as the current timing routines in the backend), as that might enable lower overhead when we don't need to convert the time into a double (and potentially back into an integer, if the tool's timestamps are integer-based). To put the event time source in relation to the tool's other time sources, the tool would need to query reference timestamps at some point (e.g., at the beginning and end of the measurement).

jdinan commented Feb 5, 2018

To correct an earlier comment by @jeffhammond --

Both seconds per tick and ticks per second are meaningless quantities in the context of variable-frequency processors. The duration over which these might be meaningful is less than a millisecond.

In current Intel processors, the timestamp counter (accessed via the rdtsc instruction) ticks at the nominal processor frequency, regardless of power-saving or frequency-boosting measures. Other common time sources, e.g. RTC and HPET, are likewise independent of the core frequency.
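For illustration (not part of the proposal), reading the timestamp counter directly with the compiler intrinsic on x86 looks roughly like this:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() with gcc/clang/icc on x86 */

int main(void)
{
    uint64_t t0 = __rdtsc();
    /* ... region to be timed ... */
    uint64_t t1 = __rdtsc();

    /* The counter ticks at the nominal (invariant) TSC frequency; converting
     * to seconds requires knowing that frequency, which is platform-specific. */
    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}
```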

Regarding the suggestion that Linux clock_gettime is sufficient, you might be disappointed if you look at how it's implemented: http://linuxmogeb.blogspot.com/2013/10/how-does-clockgettime-work.html (for the spoiler, skip to the conclusion at the end of the article).

@jeffhammond (Member)

Even though I work for Intel, I do not support making decisions about the MPI standard based upon the fact that Intel got this right starting in 2008.

And now that I've seen exactly how much of a pain it is to implement MPI_Wti** properly in Open MPI, I'm even more reluctant to add new features of a similar nature to the standard.

mahermanns added the "scheduled reading" (Reading is scheduled for the next meeting) label and removed the "not ready" label on Feb 14, 2018
jdinan commented Feb 22, 2018

@mahermanns Is "The number of ticks per second must be constant over the execution of the program" intended to specify that the number of ticks elapsed is monotonic increasing? If so, I'm having trouble convincing myself that this is sufficient. When you make adjustments to the clock to maintain MPI_WTIME_IS_GLOBAL I think you are adjusting the number of seconds that have elapsed, not the number of ticks per second.

A second question -- why tie this new routine to the resolution of MPI_Wtick? If I implement this routine using a different time source (e.g. the processor timestamp counter), I may need to reduce the resolution to match the time source being used by MPI_Wtime.

@mahermanns (Member, Author)

@jdinan The phrase is to ensure that the call always reports the same ticks per second during the run, i.e., a tool can query that at the beginning of a run and does not have to query it again.

Explicit mention of monotonic time is not part of this proposal, as I did not want to overload it (separation of concerns). Of course, bad things happen to a number of tools when time is not monotonically increasing, and as a tools developer I would like a way to ensure monotonicity. I think we talked about this in Aachen, and the idea was that the synchronization is only allowed to re-set the clocks to a future time.

jdinan commented Feb 27, 2018

Could you clarify the problem solved by the proposed API? As @bosilca mentioned earlier, a good implementation of MPI_Wtime will take 194 days to lose a nanosecond. Is the problem that MPI_Wtick is not constant? If so, would this new routine indirectly change the semantics of MPI_Wtick?

@mahermanns (Member, Author)

The API enables tools to obtain low-overhead timestamps without needing the conversion to double and back again.

At the moment, MPI_Wtime is mostly used (in my experience) by users manually instrumenting their main loops, etc., where overhead is negligible.

If tools want to use the MPI timer for event recording (as needed by MPI_T events), calls to the MPI-internal timer will be more frequent. As current tools often use an integer-based timestamp internally, using the current interface to an "MPI time" would imply (1) MPI getting an integer-based time from an interface like clock_gettime (or whatever is available on the platform), (2) MPI converting that to a double, and (3) the tool converting it back to an integer to store internally. The new API provides access to the integer-based timestamp directly, without changing any existing semantics of the timer in general. A sketch of this conversion chain is shown below.
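A minimal sketch of this chain, assuming an MPI implementation that reads clock_gettime internally and a tool that stores integer nanoseconds (all function names here are illustrative, not proposed API):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* (1) What an MPI implementation might do internally for MPI_Wtime:
 *     read an integer time source and convert it to double seconds. */
static double wtime_from_clock_gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;   /* (2) int -> double */
}

/* (3) What an integer-based tool then does with the double it gets back. */
static uint64_t tool_timestamp_ns_today(void)
{
    return (uint64_t)(wtime_from_clock_gettime() * 1e9);    /* double -> int */
}

/* With the proposed integer interface, the tool would read integer ticks
 * directly and skip both conversions (clock_gettime stands in for the call). */
static uint64_t tool_timestamp_ns_proposed(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    printf("today   : %llu ns\n", (unsigned long long)tool_timestamp_ns_today());
    printf("proposed: %llu ns\n", (unsigned long long)tool_timestamp_ns_proposed());
    return 0;
}
```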

jdinan commented Feb 28, 2018

Have you compared the overhead of these two routines? It's not obvious to me that the proposed routine will substantially reduce the overhead relative to MPI_Wtime; most processors will perform these type conversions with a single instruction. It sounds like the right solution for a tool that wants low overhead and high precision would still be to use the processor's timestamp counter. I'm not strongly opposed to the API being proposed, but it sounds to me like it still won't solve the problem.

@dholmes-epcc-ed-ac-uk (Member)

@jdinan The SpiNNaker architecture does not have floating-point capability in hardware - it simulates floating-point operations in software. MPI_WTIME is therefore very expensive, and converting back from double to int64 is also very expensive. These new functions would be orders of magnitude cheaper to implement.

There are efforts to implement MPI on this architecture, e.g. see:
http://ieeexplore.ieee.org/document/8052322/

jdinan commented Feb 28, 2018

@dholmes-epcc-ed-ac-uk If the argument for the new routine is strictly lower overhead, I suggest that someone measure the difference. gcc [1] supports software floating point emulation, which could allow you to measure that scenario as well.

[1] https://stackoverflow.com/questions/13201495/soft-float-on-x86-64

jdinan commented Mar 1, 2018

I wrote a small program to measure the difference between integer and floating point return values for several common Linux timing methods [1]. I measured a difference of 15-19 cycles, which amounted to a roughly 15-19% increase in overhead to return a floating point versus an integer result:

           MPI_Wtime - 396.58 ns 107 cycles
               RDTSC -   7.24 ns  21 cycles
   clock_monotonic_f - 394.39 ns 100 cycles
   clock_monotonic_i - 297.75 ns  81 cycles
    clock_realtime_f - 304.46 ns 100 cycles
    clock_realtime_i - 299.28 ns  85 cycles
      gettimeofday_f - 397.69 ns 110 cycles
      gettimeofday_i - 301.14 ns  91 cycles

This was measured on an Intel(R) Xeon(R) CPU X5570 @ 2.93GHz, CentOS Linux release 7.3.1611, Intel MPI 2017.4.196, and compiled with gcc 4.8.5.

[1] https://gist.github.com/jdinan/227d1777798155b99d0fa995b750247b
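For readers who don't want to open the gist, a simplified sketch of one way to make such a measurement (the actual benchmark may differ) is:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

#define ITERS 1000000

int main(void)
{
    volatile double sink_f = 0.0;      /* keep the conversions from being optimized away */
    volatile uint64_t sink_i = 0;
    struct timespec ts;

    /* Floating-point variant: convert the timespec to double seconds. */
    uint64_t c0 = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
        sink_f = (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
    }
    uint64_t c1 = __rdtsc();

    /* Integer variant: keep the timestamp as integer nanoseconds. */
    for (int i = 0; i < ITERS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
        sink_i = (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
    }
    uint64_t c2 = __rdtsc();

    printf("double  : %.1f cycles/call\n", (double)(c1 - c0) / ITERS);
    printf("integer : %.1f cycles/call\n", (double)(c2 - c1) / ITERS);
    return 0;
}
```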

jdinan commented Mar 1, 2018

I updated the gist so that all timers use nsec and we convert double seconds to integer nsec during the timed portion (the use case that motivated this ticket -- I missed this conversion in the previous measurements):

           MPI_Wtime - 399.77 ns 116 cycles
               RDTSC -   7.24 ns  21 cycles
   clock_monotonic_f - 396.19 ns 105 cycles
   clock_monotonic_i - 297.77 ns  81 cycles
    clock_realtime_f - 397.69 ns 110 cycles
    clock_realtime_i - 299.26 ns  85 cycles
      gettimeofday_f - 400.61 ns 118 cycles
      gettimeofday_i - 304.11 ns  99 cycles

This adds another ~5 ns. There are a total of 6 FP operations (convert/scale sub-sec, convert sec, add sub-sec, scale/convert sec to nsec after wtime returns), which account for ~30 ns total, or roughly 5 ns per operation. If you assume soft FP is 25x slower, that gives you an estimated 750 ns + 80 ns = 830 ns to query time with soft FP (roughly one order of magnitude).

bosilca added a commit to bosilca/ompi that referenced this issue Mar 7, 2018

As discussed on mpi-forum/mpi-issues#77 (comment), the conversion to double in MPI_Wtime decreases the range and accuracy of the resulting timer. By setting the timer to 0 at the first usage, we basically maintain the accuracy for 194 days even for gettimeofday.

Signed-off-by: George Bosilca <[email protected]>
@mahermanns (Member, Author)

Thanks everyone for the discussion on this. Also based on the feedback we got for the MPI_T events proposal, we have now moved in the direction of having separate timing routines for MPI_T that are independent of MPI_Wtime and integrated those directly into #79. I am therefore closing this ticket.
