While implementing the `cdesc` support for the `use mpi_f08` bindings, I noticed some surprising performance numbers.
This can be reproduced with `use mpi` (i.e. no need for a modern Fortran 2008 compiler) using the code below.
TL;DR: do a ping-pong with the `buf(1:1048576:2)` subarray, and compare standard Fortran (which automatically makes a contiguous copy under the hood) with the manual use of a derived datatype (ddt).
To my surprise, copying improves performance by a factor of 2!
I dug into this and found that the bottleneck comes from `opal_generic_simple_[un]pack_function()`, when `[UN]PACK_PREDEFINED_DATATYPE()` calls `MEMCPY_CSUM()` with `size=4`.
If I replace `memcpy(dest, src, 4)` with `*(int *)dest = *(int *)src`, I get much better performance (more than a 4x improvement).
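For illustration only (this is not the patch itself), here is a minimal standalone sketch of the two copy styles being compared. The helper names `pack_strided_memcpy` and `pack_strided_int32` are hypothetical and do not exist in Open MPI. One plausible explanation, which I have not confirmed, is that a `memcpy` whose size is only known at run time cannot be turned into a single load/store by the compiler, whereas a typed assignment can. The typed version assumes both buffers are suitably aligned for `int32_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: gather `count` elements of `elem_size` bytes, taking
 * one element every `stride` elements, with one memcpy call per element.
 * The element size is a run-time value, as in a generic pack routine. */
static void pack_strided_memcpy(void *dst, const void *src, size_t count,
                                size_t stride, size_t elem_size)
{
    char *d = dst;
    const char *s = src;

    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        memcpy(d, s, elem_size);
        d += elem_size;
        s += stride * elem_size;
    }
}

/* Hypothetical helper: same gather, specialized for 4-byte integers and
 * using a plain typed assignment instead of memcpy. Assumes both buffers
 * are aligned for int32_t. */
static void pack_strided_int32(int32_t *dst, const int32_t *src, size_t count,
                               size_t stride)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i * stride];
    }
}
```

Both helpers produce the same packed buffer; only the way each 4-byte element is copied differs, which is the difference measured above.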
The inline patch below was used. Note that although it is big and only supports `OPAL_INT4`, macros could be used to support the other predefined datatypes. It could also be extended to support more datatypes (for example `MPI_Type_vector(..., 2, 4, MPI_INT, ...)`).
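As a rough idea of what the vector case could look like, here is a minimal sketch assuming a block length of 2 and a stride of 4 elements, as in the `MPI_Type_vector` example. The function `pack_vector_int32` is a hypothetical name used for illustration, not part of the patch or of Open MPI, and it assumes `int32_t`-aligned buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: gather `count` blocks of `blocklen` consecutive
 * 4-byte integers, where block i starts at element i * stride of `src`.
 * Uses typed assignments instead of one memcpy per element. */
static void pack_vector_int32(int32_t *dst, const int32_t *src, size_t count,
                              size_t blocklen, size_t stride)
{
    assert(dst != NULL && src != NULL);
    assert(blocklen <= stride); /* blocks must not overlap in this sketch */

    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < blocklen; j++) {
            *dst++ = src[i * stride + j];
        }
    }
}
```

With `blocklen = 2` and `stride = 4`, elements 0, 1, 4, 5, 8, 9, ... of the source end up contiguous in the destination.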
Please let me know if and how I should move forward.
Incidentally, I noticed that `CONVERTOR_WITH_CHECKSUM` is tested but never set, so it looks like dead code to me.
Should I do something about it? If so, which of these do you recommend?
- simply remove the dead code
- `#ifdef` out the dead code (we might need it later)
- add some `OPAL_UNLIKELY()` around the `flags & CONVERTOR_WITH_CHECKSUM` tests
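To make the last option concrete, here is a minimal sketch of a branch hint around a rarely set flag. Everything prefixed with `SKETCH_` or `sketch_` is hypothetical and stands in for the real macro and flag; the hint is built on the GCC/Clang `__builtin_expect` builtin and falls back to a plain test on other compilers. The byte-sum "checksum" is a placeholder, not the real checksum code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a branch-hint macro: tells the compiler the
 * condition is expected to be false. */
#if defined(__GNUC__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

/* Hypothetical flag bit standing in for a "with checksum" convertor flag. */
#define SKETCH_FLAG_WITH_CHECKSUM 0x1u

/* Copy `len` bytes and, only when the flag is set, also return a simple
 * byte-sum of the source. Returns 0 when the flag is not set. */
static uint32_t sketch_copy(void *dst, const void *src, size_t len,
                            uint32_t flags)
{
    uint32_t csum = 0;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);

    if (SKETCH_UNLIKELY(flags & SKETCH_FLAG_WITH_CHECKSUM)) {
        const unsigned char *p = src;
        for (size_t i = 0; i < len; i++) {
            csum += p[i];
        }
    }
    return csum;
}
```

This keeps the checksum path available while signalling that the common case skips it. The hint does not remove the test itself, unlike the first two options.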