Pcluster 3.1.4 - network congestion for large scale jobs

We started an HPC cluster and placed the HeadNode into a public subnet for easy access from the internet while the compute fleet sits on a private subnet.

While we had a limited number of p4d instances available things were running normal but we recently got new capacity reservation and node count went up to 194, allowing us to try large jobs with over 100 nodes / 800 GPUs with EFA enables, these instances having 4 100Gbps network interfaces each (so > 40000 Gbps per job).

We noticed jobs failure due to network errors both in headnode-computenode operations as well as compute-to-compute comms.

Especially openmpi seems sensitive to this, it fails to launch batch jobs for more than 35 nodes. Intelmpi seems more resilient, we were able to run 80 nodes jobs but nothing above that. We also had an event when slurm master lost connection to the entire fleet, sending all compute down and rescheduling all jobs.

I want to get some network recommendations for this scenario. What I am thinking now is to move the headnode in the same private subnet as compute fleet (and provide SSH connectivity via tunneling).

Is there anything that we could do at the level of VPC/subnet configurations to ease up the congestion?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pcluster 3.1.4 - network congestion for large scale jobs #4179

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Pcluster 3.1.4 - network congestion for large scale jobs #4179

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions