UCX (SW) RMA Atomics Performance #6868

Open

devreal opened this issue Aug 6, 2019 · 33 comments

@devreal
Contributor

devreal commented Aug 6, 2019

I'm in the process of benchmarking the latencies of different MPI RMA operations such as put, get, fetch-op, accumulate, etc. I'm using the Open MPI 4.0.x branch with the UCX 1.6.x branch on an IB cluster (mlx5). For reference, I'm including two additional runs: MVAPICH 2.3.1 (on the same cluster as Open MPI 4.0.x) and Open MPI 3.1.2 (measurements from an older cluster with mlx3 devices using the openib btl; I'm unable to use the openib btl on the newer system).

[Figure: mpi_progress — measured RMA operation latencies for the three configurations]

The benchmark measures 100k repetitions of the respective operation, each immediately followed by a flush. The origin and target processes run exclusively on different nodes. Only one application thread runs per process and MPI is initialized without thread support. I used -bind-to socket to make sure there is no interference with any potential progress thread. The window was locked exclusively at the target by the origin.
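
For illustration, the core of the timing loop looks roughly like this (a simplified sketch using MPI_Fetch_and_op as a stand-in for the operation under test, not the actual benchmark code; the full benchmark is linked further down in this thread):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define REPS 100000

/* run with 2 processes on different nodes: rank 0 is the origin, rank 1 the target */
int main(int argc, char **argv)
{
    int rank;
    uint64_t *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each process exposes a single 64-bit element */
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0;

    if (rank == 0) {
        uint64_t operand = 1, result;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);   /* exclusive lock at the target */
        double start = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            /* operation under test, immediately followed by a flush */
            MPI_Fetch_and_op(&operand, &result, MPI_UINT64_T, 1, 0, MPI_SUM, win);
            MPI_Win_flush(1, win);
        }
        printf("avg latency: %.2f us\n", (MPI_Wtime() - start) / REPS * 1e6);
        MPI_Win_unlock(1, win);
    }

    /* the target waits inside MPI so that any SW emulation can be progressed */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}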

It strikes me that the latencies of accumulate operations are significantly higher than with both MVAPICH and the openib btl, by a factor of 4 for some operations (for example: an MPI_Accumulate takes 10us with UCX but only around 2.5us with MVAPICH and Open MPI 3.1.2).

Hence my questions:

  1. Are these high latencies expected? (My guess is that SW atomics are to blame, as the system likely doesn't run the latest OFED stack, but I'm not sure how to check whether SW atomics are used.)
  2. Are there any knobs to tune in Open MPI / Open UCX to reduce the latencies?
  3. How can I check which atomic implementation is used by UCX?

Open UCX was configured using:

../configure --prefix=$HOME/opt/ucx-1.6.x --enable-optimizations --disable-logging --disable-assertions --disable-params-check

Open MPI was configured using:

../configure CC=gcc CXX=g++ FC=gfortran --with-ucx=$HOME/opt/ucx-1.6.x/ --prefix=$HOME/opt/openmpi-v4.0.x-ucx --without-verbs --with-slurm --with-pmi=/usr --with-pmi-libdir=/usr/lib64

Anything I might be missing? Any hint would be greatly appreciated.

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

hi @devreal
could you paste here the command lines you use to run the benchmarks?

thank you

@devreal
Contributor Author

devreal commented Aug 6, 2019

I don't pass any specific flags to mpirun except for bind-to:

~/opt/openmpi-v4.0.x-ucx/bin/mpirun -n 2 -bind-to socket ./mpi_progress_openmpi-v4.0.x-ucx

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

try adding the flags -mca pml ucx -mca spml ucx -mca osc ucx -mca atomic ucx to force the use of UCX as the transport

@devreal
Contributor Author

devreal commented Aug 6, 2019

Thanks, I just ran again: the latency is the same in all cases. Just to make sure I'm actually using UCX, I also ran with --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 --mca btl_base_verbose 100 --mca osc_base_verbose 100 -mca osc_rdma_verbose 100:

[hostA:07142] mca: base: components_register: registering framework btl components
[hostA:07142] mca: base: components_register: found loaded component self
[hostA:07142] mca: base: components_register: component self register function successful
[hostA:07142] mca: base: components_register: found loaded component openib
[hostA:07142] mca: base: components_register: component openib register function successful
[hostA:07142] mca: base: components_register: found loaded component sm
[hostA:07142] mca: base: components_register: found loaded component tcp
[hostA:07142] mca: base: components_register: component tcp register function successful
[hostA:07142] mca: base: components_register: found loaded component uct
[hostA:07142] mca: base: components_register: component uct register function successful
[hostA:07142] mca: base: components_register: found loaded component vader
[hostA:07142] mca: base: components_register: component vader register function successful
[hostA:07142] mca: base: components_open: opening btl components
[hostA:07142] mca: base: components_open: found loaded component self
[hostA:07142] mca: base: components_open: component self open function successful
[hostA:07142] mca: base: components_open: found loaded component openib
[hostA:07142] mca: base: components_open: component openib open function successful
[hostA:07142] mca: base: components_open: found loaded component tcp
[hostA:07142] mca: base: components_open: component tcp open function successful
[hostA:07142] mca: base: components_open: found loaded component uct
[hostA:07142] mca: base: components_open: component uct open function successful
[hostA:07142] mca: base: components_open: found loaded component vader
[hostA:07142] mca: base: components_open: component vader open function successful
[hostA:07142] select: initializing btl component self
[hostA:07142] select: init of component self returned success
[hostA:07142] select: initializing btl component openib
[hostA:07142] Checking distance from this process to device=mlx5_0
[hostA:07142] Process is not bound: distance to device is 0.000000
[hostA:07142] select: init of component openib returned failure
[hostA:07142] mca: base: close: component openib closed
[hostA:07142] mca: base: close: unloading component openib
[hostA:07142] select: initializing btl component tcp
[hostA:07142] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[hostA:07142] btl: tcp: Found match: 127.0.0.1 (lo)
[hostA:07142] btl:tcp: Attempting to bind to AF_INET port 1024
[hostA:07142] btl:tcp: Successfully bound to AF_INET port 1024
[hostA:07142] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[hostA:07142] btl:tcp: examining interface enp7s0f0
[hostA:07142] btl:tcp: using ipv6 interface enp7s0f0
[hostA:07142] btl:tcp: examining interface ib0
[hostA:07142] btl:tcp: using ipv6 interface ib0
[hostA:07142] btl:tcp: examining interface ib0:0
[hostA:07142] btl:tcp: using ipv6 interface ib0:0
[hostA:07142] btl:tcp: examining interface virbr0
[hostA:07142] btl:tcp: using ipv6 interface virbr0
[hostA:07142] select: init of component tcp returned success
[hostA:07142] select: initializing btl component uct
[hostA:07142] select: init of component uct returned failure
[hostA:07142] mca: base: close: component uct closed
[hostA:07142] mca: base: close: unloading component uct
[hostA:07142] select: initializing btl component vader
[hostA:07142] select: init of component vader returned failure
[hostA:07142] mca: base: close: component vader closed
[hostA:07142] mca: base: close: unloading component vader
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:191 mca_pml_ucx_open
[hostA:07142] mca: base: components_register: registering framework osc components
[hostA:07142] mca: base: components_register: found loaded component ucx
[hostA:07142] mca: base: components_register: component ucx register function successful
[hostA:07142] mca: base: components_open: opening osc components
[hostA:07142] mca: base: components_open: found loaded component ucx
[hostA:07142] mca: base: components_open: component ucx open function successful
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:255 mca_pml_ucx_init
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:111 Pack remote worker address, size 159
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:111 Pack local worker address, size 340
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:310 created ucp context 0x73bfd0, worker 0x2b62f39e2010
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:179 Got proc 0 address, size 340
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:370 connecting to proc. 0
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:179 Got proc 1 address, size 159
[hostA:07142] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:370 connecting to proc. 1

Is the line select: init of component uct returned failure something I should worry about? I'm a bit confused because I also see mca: base: components_open: component ucx open function successful...

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

it means that the btl uct component failed to initialize; that is ok

what is your configuration? how many hosts, what PPN, and which HCA is used?

@devreal
Contributor Author

devreal commented Aug 6, 2019

I'm running on a Haswell-based cluster with dual-socket nodes, one process per node, 2 nodes (so 2 processes), ConnectX-5 HCAs:

$ ~/opt/openmpi-v4.0.x-ucx/bin/mpirun -n 1 lspci | grep Mellanox
04:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
$ ~/opt/openmpi-v4.0.x-ucx/bin/mpirun -n 2 ibv_devices
    device          	   node GUID
    ------          	----------------
    mlx5_0          	08003800013c7507
    device          	   node GUID
    ------          	----------------
    mlx5_0          	08003800013c773b
$ rpm -q libibverbs
libibverbs-41mlnx1-OFED.4.2.0.1.4.42100.x86_64

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

could you provide the output of ibv_devinfo or ibstat on the compute nodes?

@devreal
Contributor Author

devreal commented Aug 6, 2019

Sure :)

$ ~/opt/openmpi-v4.0.x-ucx/bin/mpirun -n 1 -H hostA ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				10.16.1200
	node_guid:			0800:3800:013c:7507
	sys_image_guid:			0800:3800:013c:7507
	vendor_id:			0x119f
	vendor_part_id:			4113
	hw_ver:				0x0
	board_id:			BL_12001158CIBF1
	phys_port_cnt:			1
	Device ports:
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			2580
			port_lid:		311
			port_lmc:		0x00
			link_layer:		InfiniBand
$ ~/opt/openmpi-v4.0.x-ucx/bin/mpirun -n 1 -H hostB ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				10.16.1200
	node_guid:			0800:3800:013c:773b
	sys_image_guid:			0800:3800:013c:773b
	vendor_id:			0x119f
	vendor_part_id:			4113
	hw_ver:				0x0
	board_id:			BL_12001158CIBF1
	phys_port_cnt:			1
	Device ports:
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			2580
			port_lid:		303
			port_lmc:		0x00
			link_layer:		InfiniBand

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

hmmm, vendor_id: 0x119f means that you are using a Bull device, but in lspci we see a Mellanox controller...

but ok, can you add the command line parameters --display-map --map-by node --bind-to core to make sure you are not running on the same node and are binding to cores on different nodes?

@devreal
Contributor Author

devreal commented Aug 6, 2019

Ahh yes, it's a Bull machine.

$ ~/opt/openmpi-v4.0.x-ucx/bin/mpirun --display-map --map-by node --bind-to socket -n 2 -bind-to socket  -mca pml ucx -mca spml ucx -mca osc ucx -mca atomic ucx ./mpi_progress_openmpi-v4.0.x-ucx
  Data for JOB [58021,1] offset 0 Total slots allocated 2

 ========================   JOB MAP   ========================

 Data for node: taurusi6606	Num slots: 1	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [58021,1] App: 0 Process rank: 0 Bound: UNBOUND

 Data for node: taurusi6607	Num slots: 1	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [58021,1] App: 0 Process rank: 1 Bound: N/A

 =============================================================

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

does --bind-to core affect the benchmark results?

@devreal
Contributor Author

devreal commented Aug 6, 2019

I do not see a difference in the latencies with --bind-to core instead of --bind-to socket, and the map display stays the same, too (I would expect a binding to be reported in that case...):

 Data for JOB [57924,1] offset 0 Total slots allocated 2

 ========================   JOB MAP   ========================

 Data for node: taurusi6606	Num slots: 1	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [57924,1] App: 0 Process rank: 0 Bound: UNBOUND

 Data for node: taurusi6607	Num slots: 1	Max slots: 0	Num procs: 1
 	Process OMPI jobid: [57924,1] App: 0 Process rank: 1 Bound: N/A

 =============================================================

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

I see

ok, you built UCX on a compute node, right? could you provide the output of the following commands:

ucx_info -d

and

ucx_info -pwe -D net -u t

thank you

@devreal
Contributor Author

devreal commented Aug 6, 2019

you built UCX on a compute node, right?

I built everything on the login nodes as they have the same CPUs as the compute nodes. Would configuring UCX on the compute nodes impact the result of configure?

$ ~/opt/openmpi-v4.0.x-ucx/bin/mpirun -n 1 ~/opt/ucx-1.6.x/bin/ucx_info -d
#
# Memory domain: posix
#            component: posix
#             allocate: unlimited
#           remote key: 45 bytes
#
#   Transport: mm
#
#   Device: posix
#
#      capabilities:
#            bandwidth: 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: sysv
#            component: sysv
#             allocate: unlimited
#           remote key: 40 bytes
#
#   Transport: mm
#
#   Device: sysv
#
#      capabilities:
#            bandwidth: 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: self
#            component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 16 bytes
#
#   Transport: self
#
#   Device: self
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: tcp
#            component: tcp
#
#   Transport: tcp
#
#   Device: virbr0
#
#      capabilities:
#            bandwidth: 11.32 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#             am_short: <= 8K
#             am_bcopy: <= 8K
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#   Device: ib0
#
#      capabilities:
#            bandwidth: 6661.48 MB/sec
#              latency: 5210 nsec
#             overhead: 50000 nsec
#             am_short: <= 8K
#             am_bcopy: <= 8K
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#   Device: enp7s0f0
#
#      capabilities:
#            bandwidth: 113.16 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#             am_short: <= 8K
#             am_bcopy: <= 8K
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Memory domain: ib/mlx5_0
#            component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 24 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 700 nsec + 1 * N
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8K
#            put_zcopy: <= 1G, up to 8 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8K
#            get_zcopy: 65..1G, up to 8 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#             priority: 0
#       device address: 3 bytes
#           ep address: 4 bytes
#       error handling: peer failure
#
#
#   Transport: ud
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 710 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 7 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3984
#           connection: to ep, to iface
#             priority: 0
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
#   Transport: cm
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 5088.78 MB/sec
#              latency: 700 nsec
#             overhead: 1200 nsec
#             am_bcopy: <= 214
#           connection: to iface
#             priority: 0
#       device address: 3 bytes
#        iface address: 4 bytes
#       error handling: none
#
#
# Memory domain: rdmacm
#            component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: cma
#            component: cma
#             register: unlimited, cost: 9 nsec
#
#   Transport: cma
#
#   Device: cma
#
#      capabilities:
#            bandwidth: 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#
#
# Memory domain: knem
#            component: knem
#             register: unlimited, cost: 90 nsec
#           remote key: 32 bytes
#
#   Transport: knem
#
#   Device: knem
#
#      capabilities:
#            bandwidth: 13862.00 MB/sec
#              latency: 80 nsec
#             overhead: 250 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 0 bytes
#       error handling: none
#

and

$ ~/opt/openmpi-v4.0.x-ucx/bin/mpirun -n 1 ~/opt/ucx-1.6.x/bin/ucx_info -pwe -D net -u t
#
# UCP context
#
#            md 0  :  tcp
#            md 1  :  ib/mlx5_0
#            md 2  :  rdmacm
#
#      resource 0  :  md 0  dev 0  flags -- tcp/virbr0
#      resource 1  :  md 0  dev 1  flags -- tcp/ib0
#      resource 2  :  md 0  dev 2  flags -- tcp/enp7s0f0
#      resource 3  :  md 1  dev 3  flags -- rc/mlx5_0:1
#      resource 4  :  md 1  dev 3  flags -- ud/mlx5_0:1
#      resource 5  :  md 1  dev 3  flags -- cm/mlx5_0:1
#      resource 6  :  md 2  dev 4  flags -s rdmacm/sockaddr
#
# memory: 0.00MB, file descriptors: 2
# create time: 19.571 ms
#
#
# UCP worker 'taurusi6606:10008'
#
#                 address: 159 bytes
#
# memory: 3.54MB, file descriptors: 14
# create time: 33.342 ms
#
#
# UCP endpoint
#
#               peer: <no debug data>
#                 lane[0]:  3:rc/mlx5_0:1 md[1]           -> md[1] rma_bw#0 am am_bw#0 wireup{ud/mlx5_0:1}
#
#                tag_send: 0..<egr/short>..116..<egr/bcopy>..5560..<egr/zcopy>..165440..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..116..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..116..<egr/bcopy>..5560..<egr/zcopy>..165440..<rndv>..(inf)
#
#                  rma_bw: mds [1] rndv_rkey_size 34
#

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

As far as I can see, all the required capabilities are present on the HCA and UCX selects an appropriate transport.
Is it possible to look at your benchmark? We may try to reproduce the issue in our environment.

@devreal
Contributor Author

devreal commented Aug 6, 2019

I will put it in my GitHub repo and post a link soon :)

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

I looked one more time at your logs - it seems your device doesn't support all the required bitwise atomic operations. It seems you are right - UCX uses SW emulation for the atomic ops (atomic_add/or/xor/etc.).

is it possible to install a newer MOFED and test the app?

@devreal
Contributor Author

devreal commented Aug 6, 2019

is it possible to install a newer MOFED and test the app?

Since this is a production system I don't think I can convince the admins to upgrade in the near future (it's also a system at a different site, which doesn't make my case more convincing ^^). I will file an issue and see what they say. What is the minimum version required for HW atomics? And what is the indicator that these operations are not supported in HW?

Is the latency I'm seeing what is to be expected for SW atomics? (10us for a single element accumulate seems pretty high to me...)

@devreal
Contributor Author

devreal commented Aug 6, 2019

Here is the benchmark: https://github.com/devreal/mpi-progress

It's not pretty but does the job ^^ I will probably do a rewrite at some point to get rid of the macros...

@hoopoepg
Contributor

hoopoepg commented Aug 6, 2019

we are using MLNX_OFED 4.6
as I remember, to check whether MOFED supports the required ops, run

ibv_devinfo -v | grep -E 'EXT_ATOMICS|EXP_MASKED_ATOMICS'

both atomics should be there

about latency - a fetch-and-op operation requires processing on the remote side to complete, which means the remote side has to be inside the UCX stack when the request arrives in order to process it. So the issue here may not be the network speed but the benchmark itself: how often it calls worker progress. The HW implementation processes remote requests without involving the CPU at all, so a 4x difference is possible

@hoopoepg
Contributor

hoopoepg commented Aug 7, 2019

hi @devreal
could you set the variable

UCX_IB_DEVICE_SPECS=0x119f:4113:ConnectIB:5

and run

ucx_info -d

and post the output here, or check whether the benchmark performance changes?

thank you

@devreal
Contributor Author

devreal commented Aug 7, 2019

ibv_devinfo -v | grep -E 'EXT_ATOMICS|EXP_MASKED_ATOMICS'

both atomics should be there

I can see both flags on this system:

$ srun -n 1 ibv_devinfo -v
[...]
        device_cap_exp_flags:           0x504840F100000000
                                        EXP_DC_TRANSPORT
                                        EXP_CROSS_CHANNEL
                                        EXP_MR_ALLOCATE
                                        EXT_ATOMICS
                                        EXT_SEND NOP
                                        EXP_UMR
                                        EXP_ODP
                                        EXP_DC_INFO
                                        EXP_MASKED_ATOMICS
                                        EXP_PHYSICAL_RANGE_MR
[...]

Maybe that just shows the capabilities of the HCA, not the caps supported by the MOFED software stack?

about latency - a fetch-and-op operation requires processing on the remote side to complete, which means the remote side has to be inside the UCX stack when the request arrives in order to process it. So the issue here may not be the network speed but the benchmark itself: how often it calls worker progress. The HW implementation processes remote requests without involving the CPU at all, so a 4x difference is possible

The numbers that I reported are measured with the target immediately entering an MPI_Barrier, waiting for the origin to complete its operations. My understanding is that this should constantly call into the UCX stack to trigger progress.

UCX_IB_DEVICE_SPECS=0x119f:4113:ConnectIB:5

$ export UCX_IB_DEVICE_SPECS=0x119f:4113:ConnectIB:5
$ srun -n 1 ucx_info -d
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
#
# Memory domain: tcp
#            component: tcp
#
#   Transport: tcp
#
#   Device: virbr0
#
#      capabilities:
#            bandwidth: 11.32 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#   Device: ib0
#
#      capabilities:
#            bandwidth: 6671.64 MB/sec
#              latency: 5210 nsec
#             overhead: 50000 nsec
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#   Device: enp7s0f0
#
#      capabilities:
#            bandwidth: 113.16 MB/sec
#              latency: 5776 nsec
#             overhead: 50000 nsec
#           connection: to iface
#             priority: 1
#       device address: 4 bytes
#        iface address: 2 bytes
#       error handling: none
#
#
# Memory domain: ib/mlx5_0
#            component: ib
#             register: unlimited, cost: 90 nsec
#           remote key: 16 bytes
#           local memory handle is required for zcopy
#
#   Transport: rc
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 700 nsec + 3 * N
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8k
#            put_zcopy: <= 1g, up to 8 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4k
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4k
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#            get_bcopy: <= 8k
#            get_zcopy: 65..1g, up to 8 iov
#             am_short: <= 123
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 7 iov
#            am header: <= 127
#           atomic_add: 32, 64 bit, device
#          atomic_fadd: 32, 64 bit, device
#          atomic_swap: 32, 64 bit, device
#         atomic_cswap: 32, 64 bit, device
#           connection: to ep
#             priority: 0
#       device address: 3 bytes
#           ep address: 4 bytes
#       error handling: peer failure
#
#
#   Transport: rc_mlx5
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 700 nsec + 3 * N
#             overhead: 40 nsec
#            put_short: <= 220
#            put_bcopy: <= 8k
#            put_zcopy: <= 1g, up to 8 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4k
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4k
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#            get_bcopy: <= 8k
#            get_zcopy: 65..1g, up to 8 iov
#             am_short: <= 235
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 3 iov
#            am header: <= 187
#           atomic_add: 32, 64 bit, device
#          atomic_fadd: 32, 64 bit, device
#          atomic_swap: 32, 64 bit, device
#         atomic_cswap: 32, 64 bit, device
#           connection: to ep
#             priority: 0
#       device address: 3 bytes
#           ep address: 4 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#   Transport: dc
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 760 nsec
#             overhead: 75 nsec
#            put_short: <= 108
#            put_bcopy: <= 8k
#            put_zcopy: <= 1g, up to 7 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4k
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4k
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#            get_bcopy: <= 8k
#            get_zcopy: 65..1g, up to 7 iov
#             am_short: <= 107
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 6 iov
#            am header: <= 127
#           atomic_add: 32, 64 bit, device
#          atomic_fadd: 32, 64 bit, device
#          atomic_swap: 32, 64 bit, device
#         atomic_cswap: 32, 64 bit, device
#           connection: to iface
#             priority: 0
#       device address: 3 bytes
#        iface address: 5 bytes
#       error handling: peer failure
#
#
#   Transport: dc_mlx5
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 760 nsec
#             overhead: 40 nsec
#            put_short: <= 172
#            put_bcopy: <= 8k
#            put_zcopy: <= 1g, up to 8 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4k
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4k
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#            get_bcopy: <= 8k
#            get_zcopy: 65..1g, up to 8 iov
#             am_short: <= 187
#             am_bcopy: <= 8191
#             am_zcopy: <= 8191, up to 3 iov
#            am header: <= 139
#           atomic_add: 32, 64 bit, device
#          atomic_fadd: 32, 64 bit, device
#          atomic_swap: 32, 64 bit, device
#         atomic_cswap: 32, 64 bit, device
#           connection: to iface
#             priority: 0
#       device address: 3 bytes
#        iface address: 4 bytes
#       error handling: buffer (zcopy), remote access, peer failure
#
#
#   Transport: ud
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 700 nsec
#             overhead: 80 nsec
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 7 iov
#            am header: <= 116
#           connection: to ep, to iface
#             priority: 0
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
#   Transport: ud_mlx5
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 6433.22 MB/sec
#              latency: 700 nsec
#             overhead: 80 nsec
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4k
#             am_short: <= 132
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#            am header: <= 132
#           connection: to ep, to iface
#             priority: 0
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure
#
#
#   Transport: cm
#
#   Device: mlx5_0:1
#
#      capabilities:
#            bandwidth: 5088.78 MB/sec
#              latency: 700 nsec
#             overhead: 1200 nsec
#             am_bcopy: <= 214
#           connection: to iface
#             priority: 0
#       device address: 3 bytes
#        iface address: 4 bytes
#       error handling: none
#
#
# Memory domain: rdmacm
#            component: rdmacm
#           supports client-server connection establishment via sockaddr
#   < no supported devices found >
#
# Memory domain: gpu
#            component: gpu
#             register: unlimited, cost: 0 nsec
#
#   Transport: cuda
#
#   Device: gpu0
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 1 nsec
#             overhead: 0 nsec
#           connection: none
#             priority: 0
#       device address: 0 bytes
#       error handling: none
#
#
# Memory domain: sysv
#            component: sysv
#             allocate: unlimited
#           remote key: 32 bytes
#
#   Transport: mm
#
#   Device: sysv
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8k
#           atomic_add: 32, 64 bit, cpu
#          atomic_fadd: 32, 64 bit, cpu
#          atomic_swap: 32, 64 bit, cpu
#         atomic_cswap: 32, 64 bit, cpu
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: posix
#            component: posix
#             allocate: unlimited
#           remote key: 37 bytes
#
#   Transport: mm
#
#   Device: posix
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 92
#             am_bcopy: <= 8k
#           atomic_add: 32, 64 bit, cpu
#          atomic_fadd: 32, 64 bit, cpu
#          atomic_swap: 32, 64 bit, cpu
#         atomic_cswap: 32, 64 bit, cpu
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
#
# Memory domain: cma
#            component: cma
#             register: unlimited, cost: 9 nsec
#
#   Transport: cma
#
#   Device: cma
#
#      capabilities:
#            bandwidth: 12000.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#
#
# Memory domain: knem
#            component: knem
#             register: unlimited, cost: 1200+(0.007*<SIZE>) nsec
#           remote key: 24 bytes
#
#   Transport: knem
#
#   Device: knem
#
#      capabilities:
#            bandwidth: 12000.00 MB/sec
#              latency: 80 nsec
#             overhead: 250 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#           connection: to iface
#             priority: 0
#       device address: 8 bytes
#        iface address: 0 bytes
#       error handling: none
#
#
# Memory domain: self
#            component: self
#             register: unlimited, cost: 0 nsec
#
#   Transport: self
#
#   Device: self
#
#      capabilities:
#            bandwidth: 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8k
#             am_bcopy: <= 8k
#           atomic_add: 32, 64 bit, cpu
#          atomic_fadd: 32, 64 bit, cpu
#          atomic_swap: 32, 64 bit, cpu
#         atomic_cswap: 32, 64 bit, cpu
#           connection: to iface
#             priority: 0
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#

With the UCX_IB_DEVICE_SPECS environment variable set as you suggested the latency looks better:

[Figure: mpi_progress — RMA operation latencies with UCX_IB_DEVICE_SPECS set]

It looks like the atomic operations are now performed in hardware, as they progress even when the target process is not active in MPI. The latency for accumulate and compare_and_swap still appears higher than with the openib btl on the older system, but having hardware support is already valuable.
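
A simple way to double-check this is the following sketch (my own quick test, not part of the benchmark): the target sleeps without making any MPI calls while the origin issues a fetch-and-op and flushes. If the flush returns almost immediately instead of stalling for the whole sleep, the atomic was handled by the HCA without target-side progress (an asynchronous progress thread on the target could mask the difference, though):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    uint64_t *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0;
    MPI_Win_lock_all(0, win);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        sleep(5);                      /* target makes no MPI calls for 5 seconds */
    } else if (rank == 0) {
        uint64_t operand = 1, result;
        double start = MPI_Wtime();
        MPI_Fetch_and_op(&operand, &result, MPI_UINT64_T, 1, 0, MPI_SUM, win);
        MPI_Win_flush(1, win);         /* stalls for ~5s if completion needs target-side progress */
        printf("fetch-op completed after %.3f s\n", MPI_Wtime() - start);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}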

@yosefe
Contributor

yosefe commented Aug 7, 2019

Thanks @devreal
@Marek77 (Bull) - we probably need to add more Bull HCA IDs to the UCX table?
@hoopoepg can we reproduce the accumulate latency locally and check where the CPU time is spent? Maybe it's in the osc/ucx component

@hoopoepg
Contributor

hoopoepg commented Aug 7, 2019

yep, will try to reproduce in our environment. it will take some time
@devreal let me know if you are going to update the benchmark

thank you

@devreal
Contributor Author

devreal commented Aug 7, 2019

Thank you @hoopoepg for looking into this. I have pushed a small fix this morning; it should be stable now. Let me know if you have any questions.

@devreal
Contributor Author

devreal commented Aug 8, 2019

@hoopoepg fyi, I see the same latencies with the OSU benchmarks, in case that is easier for you to use than my benchmark :)

@hoopoepg
Contributor

hoopoepg commented Aug 8, 2019

ok, thank you for the update. will try both benchmarks

@devreal
Contributor Author

devreal commented Aug 23, 2019

@hoopoepg Any update on this issue?

In my benchmark, I am observing significantly increasing latencies if multiple processes update the same variables. Is it possible that the atomic updates are emulated using CAS in my case? If so, how can I find out if that is the case?

On another note: With Open MPI's SHMEM (and UCX spml), latencies are what I would expect on the Bull cluster (~2us for shmem_atomic_fetch_add for example).

@devreal
Contributor Author

devreal commented Aug 26, 2019

Ahhh, I should have taken a look at the code earlier... The UCX osc component performs a lock/unlock on the window for each accumulate operation, making it at least 3 atomic operations per MPI accumulate/fetch-op operation. This seems to be the case in both the v4.0.x branch and master. It is my understanding that this is done because MPI_Accumulate is allowed on user-defined datatypes. Thus, an accumulate in the UCX component requires 4 network operations (lock, get, put, unlock), hence the high latency of MPI_Accumulate.
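
To illustrate the cost, here is a rough MPI-level sketch of the equivalent sequence (not the actual osc/ucx code, which issues UCX atomics internally; the window layout with a lock word at displacement 0 and the value at displacement 1, and an already-open passive-target epoch, are assumptions for the example):

#include <mpi.h>
#include <stdint.h>

/* Emulated 64-bit fetch-and-add, mirroring the lock/get/op/put/unlock sequence.
 * Assumes the window uses a displacement unit of sizeof(uint64_t), element 0 is
 * a lock word, element 1 is the accumulation target, and an epoch (e.g. from
 * MPI_Win_lock_all) is already open. */
static uint64_t emulated_fetch_add(MPI_Win win, int target, uint64_t operand)
{
    const uint64_t unlocked = 0, locked = 1;
    uint64_t prev, value, updated;

    /* 1) acquire the accumulate lock with a remote CAS (one round trip per attempt) */
    do {
        MPI_Compare_and_swap(&locked, &unlocked, &prev, MPI_UINT64_T, target, 0, win);
        MPI_Win_flush(target, win);
    } while (prev != unlocked);

    /* 2) fetch the current value */
    MPI_Get(&value, 1, MPI_UINT64_T, target, 1, 1, MPI_UINT64_T, win);
    MPI_Win_flush(target, win);

    /* 3) apply the operation locally and write the result back */
    updated = value + operand;
    MPI_Put(&updated, 1, MPI_UINT64_T, target, 1, 1, MPI_UINT64_T, win);
    MPI_Win_flush(target, win);

    /* 4) release the lock atomically (others may be spinning on it) */
    MPI_Compare_and_swap(&unlocked, &locked, &prev, MPI_UINT64_T, target, 0, win);
    MPI_Win_flush(target, win);

    return value;
}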

Observations/suggestions:

  1. If the process already owns an exclusive lock on the window at the target, there should be no need for the accumulate lock.
  2. The osc rdma component supports the flag acc_single_intrinsic, which lets the user guarantee that only a single intrinsic datatype is used and that it is thus safe to rely solely on network atomics (potentially using CAS for some operations). The measurements presented for Open MPI 3.1.2 above have this flag enabled. It has proven invaluable for me in the past, so I will look into implementing something similar for the UCX osc component (see the usage sketch after this list).
  3. I don't see how the actual accumulate operation is ordered with respect to the release of the accumulate lock (end_atomicity), as there is no waiting for completion. Is that correct in all cases?
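
For reference, this is roughly how I request that behaviour from osc/rdma today via window info (a sketch; if I remember correctly the acc_single_intrinsic info key is osc/rdma-specific and can also be set globally through the osc_rdma_acc_single_intrinsic MCA parameter, while accumulate_ops is the standard MPI-3 hint):

#include <mpi.h>
#include <stdint.h>

/* Create a window whose accumulates only ever touch a single predefined
 * datatype element, so the osc component may map them directly onto
 * network atomics. */
static MPI_Win create_atomic_window(MPI_Comm comm, uint64_t **base)
{
    MPI_Win win;
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ops", "same_op_no_op");  /* standard MPI-3 info key */
    MPI_Info_set(info, "acc_single_intrinsic", "true");     /* osc/rdma-specific hint */

    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), info, comm, base, &win);
    MPI_Info_free(&info);
    return win;
}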

@hoopoepg
Contributor

hi @devreal
sorry for the late response

we reproduced the issue (unfortunately we don't have the same HW and used another HCA). We found the same bottleneck: starting/stopping an atomic session requires a remote compare-and-swap (CAS) operation to obtain an exclusive lock, which adds a performance penalty for small amounts of data

about your suggestions: 1 & 2 could work, will look at them later, but I didn't get your point in item 3: there is an opal_common_ucx_ep_flush call after all successful *_put calls, which guarantees completion of all operations before end_atomicity finishes. did I miss something?

@devreal
Contributor Author

devreal commented Aug 26, 2019

Thanks for the response :) I'm working on 2), and will post a PR some time this week if I can get it into good shape. I can confirm that the performance gets much better if I avoid the locks. 1) should be an easy fix as well.

Regarding 3): I see that the start_atomicity and end_atomicity calls are implemented using opal_common_ucx_atomic_fetch, which in turn calls ucp_atomic_fetch_nb and opal_common_ucx_wait_request. In ompi_osc_ucx_fetch_and_op, the accumulate lock is taken, an operation is started using ucp_atomic_fetch_nb, the request is released, no flush is called, and the accumulate lock is released. I guess it's not clear to me whether opal_common_ucx_wait_request is guaranteed to also complete the preceding atomic_fetch. I couldn't really figure that out from the UCX documentation...

@hoopoepg
Contributor

it is ok - InfiniBand guarantees ordering of operations, which UCX relies on, and we may release the request even if the operation is not yet completed (all operations scheduled after the fetch will be completed after the fetch is completed)

@devreal
Contributor Author

devreal commented Aug 26, 2019

I wonder whether that is specified in the UCX standard (UCX is supposed to cover more than just InfiniBand, iirc). Looking at the UCX 1.6 documentation, I cannot find anything about operation ordering except for the description of uct_ep_fence, and nothing on the default ordering behavior. Is that an (undocumented) feature of UCX or an assumption that only holds on some architectures?
