UCX (SW) RMA Atomics Performance #6868
hi @devreal, thank you
I don't pass any specific flags to
try to add flags
Thanks, I just ran again: the latency is the same in all cases. Just to make sure I'm actually using UCX, I also ran with
Is the line
it means that. what is your configuration? how many hosts, PPN, what HCA is used?
I'm running on a Haswell-based cluster with dual-socket nodes, one process per node, 2 nodes (so 2 processes), ConnectX-5 HCAs:
could you provide output from
Sure :)
hmmm, but ok, can you add command line parameters
Ahh yes, it's a Bull machine.
does
I do not see a difference in the latencies with
I see, ok. You built UCX on a compute node, right? Could you provide the output from the commands:
and
thank you
I built everything on the login nodes as they have the same CPUs as the compute nodes. Would configuring UCX on the compute nodes impact the result of
and
as far as I can see, all required capabilities are present on the HCA and UCX selects the appropriate transport.
I will put it in my GitHub repo and post a link soon :)
I looked at your logs one more time - it seems your device doesn't support all required atomic bitwise operations, so it seems you are right - UCX uses SW emulation for atomic ops (atomic_add/or/xor/etc.). is it possible to install a newer MOFED and test the app?
Since this is a production system I don't think I can convince the admins to upgrade in the near future (it's also a system at a different site, which doesn't make my case more convincing ^^). I will file an issue and see what they say. What is the minimum required version for HW atomics? And what is the indicator that these operations are not supported in HW? Is the latency I'm seeing what is to be expected for SW atomics? (10us for a single-element
Here is the benchmark: https://github.com/devreal/mpi-progress It's not pretty but does the job ^^ I will probably do a rewrite at some point to get rid of the macros...
we are using MLNX_OFED 4.6, both atomics should be there.
about latency - a fetch-and-op operation requires processing on the remote side to be completed, which means the remote side should be in the UCX stack when the request arrives in order to process it. the issue here could be not the network speed but the benchmark itself: how often does it call worker progress? A HW implementation processes the remote request without involving the CPU at all, so a 4x difference is possible.
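To make the progress dependence concrete, here is a minimal target-side sketch (hypothetical code, not taken from the benchmark in question; all names such as `target_loop` and `compute_chunk` are made up). Under SW-emulated atomics, the origin's fetch-and-op can only complete once the target drops into an MPI call and drives the UCX progress engine, so enlarging the compute chunk should directly inflate the measured latency; with HW atomics, the chunk size should not matter.

```c
/* Hypothetical target-side loop: interleave compute with occasional MPI calls
 * so the UCX worker gets progressed. `compute_chunk` controls how often
 * progress happens; with SW-emulated atomics this bounds the latency the
 * origin observes for its fetch-and-op operations. */
#include <mpi.h>

static void target_loop(long compute_chunk)
{
    int done = 0, flag = 0;
    MPI_Request req;
    volatile double x = 0.0;

    /* assumption: the origin sends a small message once its timing loop ends */
    MPI_Irecv(&done, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    while (!flag) {
        for (long i = 0; i < compute_chunk; i++) {
            x += 1.0;                              /* "application work", no MPI */
        }
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* drives MPI/UCX progress */
    }
}
```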
hi @devreal
and run
and post the output here, or look at whether the bench performance changes? thank you
I can see both flags on this system:
Maybe that just shows the capabilities of the HCA, not the caps supported by the MOFED software stack?
The numbers that I reported are measured with the target immediately entering an
With the
It looks like the atomic operations are now performed in hardware, as they progress even if the target process is not active in MPI. The latency for
yep, will try to reproduce in our environment. it will take some time. thank you
Thank you @hoopoepg for looking into this. I have pushed a small fix this morning, it should be stable now. Let me know if you have any questions.
@hoopoepg fyi, I see the same latencies with the OSU benchmarks, in case that is easier for you to use than my benchmark :)
ok, thank you for the update. will try both benchmarks
@hoopoepg Any update on this issue? In my benchmark, I am observing significantly increasing latencies if multiple processes update the same variables. Is it possible that the atomic updates are emulated using CAS in my case? If so, how can I find out whether that is the case? On another note: with Open MPI's SHMEM (and the UCX spml), latencies are what I would expect on the Bull cluster (~2us for
Ahhh, I should have taken a look at the code earlier... The UCX osc component performs a lock/unlock on the window for each accumulate operation, making it at least 3 atomic operations per MPI accumulate/fetch-op operation. This seems to be the case in both the
Observations/suggestions:
hi @devreal, we reproduced the issue (unfortunately we don't have the same HW and used another HCA). We found the same bottleneck: starting/stopping an atomic session requires a remote compare-and-swap (CAS) operation to obtain an exclusive lock, which adds a performance penalty on small portions of data. About your suggestions: 1 & 2 could work, will look at them later, but I didn't get your point in item 3: there is an opal_common_ucx_ep_flush call after all successful *_put calls which guarantees completion of all operations before end_atomicity finishes. Did I miss something?
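For reference, here is a rough MPI-level model of why each accumulate ends up costing at least three remote atomics when it is guarded by an exclusive lock: one CAS to acquire, the accumulate itself, and another atomic to release. This is only a sketch of the mechanism being discussed, not the actual osc/ucx code; the function name, lock layout, and displacements are made up.

```c
/* Rough model of a lock-guarded accumulate. Assumes a passive-target epoch
 * (e.g. inside MPI_Win_lock_all) and a lock word plus data word exposed in
 * the window at `lock_disp` and `data_disp`. */
#include <mpi.h>
#include <stdint.h>

static void locked_accumulate(MPI_Win win, int target,
                              MPI_Aint lock_disp, MPI_Aint data_disp,
                              const int64_t *origin)
{
    const int64_t unlocked = 0, locked = 1;
    int64_t prev, dummy;

    do {   /* 1) remote CAS until the exclusive lock is acquired */
        MPI_Compare_and_swap(&locked, &unlocked, &prev, MPI_INT64_T,
                             target, lock_disp, win);
        MPI_Win_flush(target, win);
    } while (prev != unlocked);

    /* 2) the actual accumulate */
    MPI_Accumulate(origin, 1, MPI_INT64_T, target, data_disp, 1, MPI_INT64_T,
                   MPI_SUM, win);
    MPI_Win_flush(target, win);

    /* 3) release the lock with another remote atomic */
    MPI_Fetch_and_op(&unlocked, &dummy, MPI_INT64_T, target, lock_disp,
                     MPI_REPLACE, win);
    MPI_Win_flush(target, win);
}
```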
Thanks for the response :) I'm working on 2, will post a PR some time this week if I can manage to get it in good shape. I can confirm that if I avoid the locks the performance gets much better. 1) should be an easy fix as well. Regarding 3): I see that the
it is ok - InfiniBand guarantees ordering of operations, which is used in UCX, so we may release the request even if the operation is not completed (all operations scheduled after the fetch will be completed after the fetch is completed)
I wonder whether that is specified in the UCX standard (UCX is supposed to cover more than just InfiniBand iirc). Looking at the UCX 1.6 document I cannot find anything about operation ordering except for the description of
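If one does not want to rely on transport-level ordering, the request returned by the non-blocking UCP atomic can be completed explicitly before anything else is issued. A minimal sketch of that alternative, assuming `worker`, `ep`, `rkey`, and `remote_addr` were set up elsewhere (hypothetical helper, not code from either project):

```c
/* Sketch: complete a UCP fetch-and-add explicitly instead of relying on
 * transport-level ordering. Requires UCP initialized with AMO support and a
 * remote key for the target memory. */
#include <ucp/api/ucp.h>

static void empty_cb(void *request, ucs_status_t status)
{
    (void)request; (void)status;
}

static ucs_status_t fetch_add_blocking(ucp_worker_h worker, ucp_ep_h ep,
                                       uint64_t add, uint64_t remote_addr,
                                       ucp_rkey_h rkey, uint64_t *result)
{
    ucs_status_ptr_t req = ucp_atomic_fetch_nb(ep, UCP_ATOMIC_FETCH_OP_FADD,
                                               add, result, sizeof(*result),
                                               remote_addr, rkey, empty_cb);
    if (!UCS_PTR_IS_PTR(req)) {
        return UCS_PTR_STATUS(req);   /* completed immediately, or failed */
    }

    ucs_status_t status;
    do {                              /* progress until the fetch has completed */
        ucp_worker_progress(worker);
        status = ucp_request_check_status(req);
    } while (status == UCS_INPROGRESS);

    ucp_request_free(req);
    return status;
}
```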
I'm in the process of benchmarking the latencies of different MPI RMA operations such as `put`, `get`, `fetch-op`, `accumulate`, etc. I'm using the Open MPI 4.0.x branch with the UCX 1.6.x branch on an IB cluster (mlx5). For reference, I'm including two additional runs: MVAPICH 2.3.1 (the same cluster as Open MPI 4.0.x) and Open MPI 3.1.2 (measurements from an older cluster with mlx3 devices using the `openib` btl; I'm unable to use the `openib` btl on the newer system).

The benchmark measures 100k repetitions of the respective operation immediately followed by a flush. The origin and target processes run on different nodes, exclusively. Only one application thread is running per process and MPI is initialized without thread support. I used `-bind-to socket` to make sure there is no interference with any potential progress thread. The used window was locked exclusively at the target by the origin.

It strikes me that the latencies of accumulate operations are significantly higher than with both MVAPICH and the `openib` btl, by a factor of 4 for some operations (for example: an `MPI_Accumulate` takes 10us with UCX but only around 2.5us with MVAPICH and Open MPI 3.1.2).

Hence my questions:
Open UCX was configured using:
Open MPI was configured using:
Anything I might be missing? Any hint would be greatly appreciated.
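The core of the measurement described above is a tight loop of one RMA operation immediately followed by a flush, with the window locked exclusively at the target. A minimal sketch of the fetch-op case (hypothetical code, not the actual benchmark from the linked repo) looks like this:

```c
/* Minimal sketch of the described methodology: 100k repetitions of
 * MPI_Fetch_and_op, each followed by a flush, window locked exclusively at
 * the target. Run with exactly 2 processes, one per node. */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

#define REPS 100000

int main(int argc, char **argv)
{
    int rank;
    int64_t value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&value, sizeof(value), sizeof(value), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {                               /* origin */
        int64_t add = 1, result;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            MPI_Fetch_and_op(&add, &result, MPI_INT64_T, 1, 0, MPI_SUM, win);
            MPI_Win_flush(1, win);                 /* wait for remote completion */
        }
        printf("avg fetch-op latency: %.2f us\n",
               (MPI_Wtime() - t0) * 1e6 / REPS);
        MPI_Win_unlock(1, win);
    }

    /* the target sits in the barrier, so MPI progress is being driven there */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```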