Random freezes of Infiniband #10432
Comments
@open-mpi/ucx FYI
@robertsawko Which IB device is present on your system?
Thanks for responding so quickly!
Is this what you are asking?
```
ibstat
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.24.1000
Hardware version: 0
Node GUID: 0x248a07030091fde0
System image GUID: 0x248a07030091fde0
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 53
LMC: 0
SM lid: 17
Capability mask: 0x2651e848
Port GUID: 0x248a07030091fde0
Link layer: InfiniBand
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.24.1000
node_guid: 248a:0703:0091:fde0
sys_image_guid: 248a:0703:0091:fde0
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 17
port_lid: 53
port_lmc: 0x00
link_layer: InfiniBand
```
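The fields in the output above that bear on link health are the port state, the physical state, and the link layer. As a minimal sketch, assuming text in the same `key: value` layout as shown, these can be checked programmatically. The helper names `parse_ibstat_fields` and `port_looks_healthy` are hypothetical, and the sample is an abbreviated copy of the output posted here.

```python
def parse_ibstat_fields(text):
    """Collect 'key: value' pairs from ibstat-style text.

    Lines without a value (e.g. "Port 1:") are skipped. With several
    ports, later values overwrite earlier ones, so this sketch only
    suits single-port adapters like the one in this thread.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def port_looks_healthy(fields):
    """True when the port is active, linked up, and running InfiniBand."""
    return (
        fields.get("State") == "Active"
        and fields.get("Physical state") == "LinkUp"
        and fields.get("Link layer") == "InfiniBand"
    )


# Abbreviated from the ibstat output posted in this thread.
SAMPLE = """CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Link layer: InfiniBand
"""

fields = parse_ibstat_fields(SAMPLE)
print(port_looks_healthy(fields))
```

A healthy port only rules out a link-level fault; it says nothing about whether the MPI library actually selects the device.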
Thanks, what if you specify
What do you mean by random freezes? Does this mean it happens sporadically, or it simply doesn't go past the first send/recv message? If the above runtime parameters don't help, what's the backtrace on both ranks when it freezes?
@janjust is right: according to your logs, the UCX PML disqualified itself because the list of transports was empty.
@robertsawko what is the output of
Hi! Thanks again to everyone for all their commitment and for responding over the weekend too.
ls -l /sys/class/infiniband/mlx5_0/device/driver
lrwxrwxrwx. 1 root root 0 May 8 22:08 /sys/class/infiniband/mlx5_0/device/driver -> ../../../../bus/pci/drivers/mlx5_core
Also, I am using 4.1.1 stable, but I am happy to recompile with the commit you specified.
@janjust you are right, when I specify:
the benchmark runs like a sprint runner on the last 10m of the final day of an Olympic competition with a fighting chance of breaking a world record... Sorry. So is that something that I need to specify? Maybe include it in the Lmod file? Why is that list empty?
@yosefe, I can confirm that the problem is actually fixed with 4.1.x - I no longer need to specify the variable and the
Hello, I would appreciate some advice on the following issue.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI 4.1.1 and UCX 1.12.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
OpenMPI and UCX were installed from sources:
Please describe the system on which you are running
Details of the problem
I am having issues at the MPI initialisation stage. As a sanity check, I started running the Intel MPI Benchmark.
The code simply freezes when we reach the actual benchmark. Forcing TCP makes it work, which makes me think it's either a hardware problem or still some issue in my setup.
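Open MPI reads MCA parameters from environment variables prefixed with `OMPI_MCA_`, so a TCP-only run can be configured in the environment. The sketch below builds such an environment. It assumes that `pml=ob1` with `btl=tcp,self` is the intended way to force TCP, since the post does not say which parameters were actually used. The helper name `mca_env` is hypothetical.

```python
import os


def mca_env(params, base_env=None):
    """Return a copy of the environment with OMPI_MCA_<name> set for each parameter."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env[f"OMPI_MCA_{name}"] = str(value)
    return env


# Assumption: one common way to bypass UCX and run over TCP only.
tcp_only = mca_env({"pml": "ob1", "btl": "tcp,self"}, base_env={})

# The verbosity setting quoted in this issue, expressed the same way.
ucx_debug = mca_env({"pml_ucx_verbose": 100}, base_env={})

print(tcp_only)
print(ucx_debug)
```

Either dictionary could then be passed as `env=` to `subprocess.run` when launching `mpirun`; with `base_env` left as `None`, the current environment is inherited and only the MCA variables are added.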
I've used
OMPI_MCA_pml_ucx_verbose=100
following a similar problem I was also having before, and here is the output for just two processes: