Pcluster 3.1.4 - network congestion for large scale jobs #4179
As most problems happen in the head node to compute fleet communication, I need to mention that we use 3 targeted capacity reservations for p4ds. They are all in us-east-1d, but since they are separate, I wonder whether combining them could be a cause of network congestion. Maybe we need to use separate clusters, one for each capacity reservation?
Hi @rvencu I don't think it's a problem due to the different subnets, but probably related to the instance type you are using for the head node. Can you share your config? I'd like to see which instance type you're using for the head node and the shared storage you have configured. I'd suggest starting by reading the Best Practices; the selection of the instance type really depends on the applications/jobs you're submitting. Do you have EFA enabled for the compute instances? This is important to minimize latencies between instances. You can also verify whether there are dropped packets by executing `ethtool -S eth0 | grep exceeded`. Then, if your application is performing continuous I/O in the shared folders, I'd suggest using FSx storage rather than EBS, because EBS volumes are shared through NFS from the head node.
Hi. Config here: https://github.com/rvencu/1click-hpc/blob/main/parallelcluster/config.us-east-1.sample.yaml
We are using FSx for Lustre, but it is true that some people might launch their workloads from /home, which is the EBS volume. Head node and compute are all in us-east-1d. We experience the problems in the compute-od-gpu partition.
Thanks for the configuration. For compute-to-compute communication I see you're already using a Placement Group and you have EFA enabled, so this should be enough to get good performance. Anyway, the head node looks like the bottleneck here: your instance type has limited network bandwidth for this kind of fan-out, so I'd suggest moving the head node to a larger, network-optimized instance type. I think this change, together with avoiding the EBS-backed /home share for job I/O, will improve the situation. Enrico
Thank you for the suggestions. Right now I cannot deploy a new cluster because of the missing pip dependency in the installation of ParallelCluster 3.1.4, but as soon as this is sorted out I will deploy a new version with these improvements.
For the pip dependency issue I'd strongly suggest using a Python virtual environment for the installation; this way all the deps required by ParallelCluster stay isolated in that virtual environment and there are no conflicts with the Python packages already installed in the instances. Anyway, we can keep that discussion open on the other thread. I'm going to close this issue. Feel free to open a new one or add more comments if other info is needed. Enrico
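For reference, a minimal sketch of such an isolated install (the environment path and pinned version are illustrative):

```bash
# Create and activate an isolated environment for ParallelCluster,
# then install the CLI into it so system Python packages stay untouched.
python3 -m venv ~/pcluster-venv
source ~/pcluster-venv/bin/activate
pip install --upgrade pip
pip install "aws-parallelcluster==3.1.4"
pcluster version
```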
Bumped the head node to the max; hopefully there will be no more congestion, will monitor that. However, openmpi still has issues in compute-to-compute connections when I use more than 36 nodes or so, as in:
The head node upgrade did not help with this. intelmpi, however, seems OK.
Switched to the 100 Gbps card, and running the nccl-tests script from /fsx we still get this:
Hi @rvencu, if you're using FSx let me share some other FSx optimization best practices. The bw_out_allowance_exceeded counter describes the number of packets queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html
Then, you said you're running NCCL tests: did you follow the official instructions?
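As a quick check, all of the ENA allowance-exceeded counters (not just bw_out) can be inspected on every interface; a small sketch:

```bash
# Print every ENA "allowance exceeded" counter (packets queued or dropped
# because an instance-level limit was hit) on each non-loopback interface.
for dev in $(ls /sys/class/net | grep -v '^lo$'); do
  echo "== ${dev} =="
  ethtool -S "${dev}" 2>/dev/null | grep -E 'allowance_exceeded'
done
```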
Regarding your p4d usage, I am not sure that EFA is being used properly for the NCCL test. When running the NCCL test, Open MPI is used only as a launcher, i.e. the actual network traffic does not go through Open MPI; it goes through NCCL. If configured properly, NCCL will use the aws-ofi-nccl plugin, which uses libfabric's EFA provider for data transfer. If not using EFA, NCCL will fall back to its socket transport. Can you share your command to run nccl-tests and the output (preferably with debug logging enabled)?
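For reference, a hedged sketch of how such a run is usually launched and verified; the install paths such as /opt/nccl-tests and /opt/aws-ofi-nccl are assumptions, not taken from this thread:

```bash
# Run all_reduce_perf with NCCL debug output so the selected network
# backend is printed; look for a line like "Selected Provider is efa".
mpirun -n 32 -npernode 8 \
  -x LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
  -x FI_PROVIDER=efa \
  -x FI_EFA_USE_DEVICE_RDMA=1 \
  -x NCCL_DEBUG=INFO \
  /opt/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```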
Thanks. I am not so sure FSx is involved, because we do not even get to the point where the workload script is invoked. Usually things break at the launcher stage, the head node being unable to launch a large number of compute jobs. What I mean by that is:
I do not understand why the openmpi or srun techniques fail where intelmpi succeeds. I noticed intelmpi executes some kind of passwordless ssh to connect to the nodes. Maybe the former ones try some file sharing which congests the network.
We observed libfabric EFA loaded properly on nccl-tests run with 4 nodes only; we get some 40 GB/s average busbw. As an experiment I would try to use intelmpi instead of openmpi for the nccl tests, but I could not make it work. I noticed nccl-tests were built with MPI support; not sure if they took something from openmpi as a dependency. This is how I install the whole NCCL stack including the tests: https://github.com/rvencu/1click-hpc/blob/main/modules/45.install.nccl.compute.sh
I see. Thanks! So the problem is that Open MPI failed to launch the job. When you use Open MPI's mpirun, do you rely on Slurm to launch the tasks?
We rely on Slurm, indeed.
We are investigating the Slurm DB connections, which seem to be in trouble. The RDS instance we used is minimal and apparently the DB is not able to answer many queries at once when we try to launch massive workloads.
Thank you! I think that using Open MPI's hostfile option to launch the tasks directly, instead of relying on Slurm for the launch, may be worth a try.
We made a bigger DB instance and this eliminated the DB errors from the log, but actually not more than this; we are still having the same issues. Can you recommend how to use the hostfile option in the ParallelCluster context? We do not have compute nodes active; they get spun up every time, and hostnames and IPs are unknown at launch time.
We get these kinds of errors now:
We started a more in-depth debugging process and got some partial success by tuning some timeouts in slurm.conf, and potentially other settings.
Thanks @rvencu for sharing all your improvements with the community. According to OpenMPI documentation:
Anyway, if you want to give it a try you can generate the hostfile with a submission script like the one sketched below:
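The original snippet is not captured here; a typical sketch (hypothetical names and counts) that builds the hostfile from the Slurm allocation at job start looks like this:

```bash
#!/bin/bash
#SBATCH --nodes=40
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive

# Expand the compressed Slurm nodelist into one hostname per line
# and hand it straight to mpirun instead of the Slurm launcher.
scontrol show hostnames "${SLURM_JOB_NODELIST}" > hostfile.${SLURM_JOB_ID}

mpirun --hostfile hostfile.${SLURM_JOB_ID} -npernode 8 ./my_app
```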
So the main change in slurm.conf was the timeout tuning mentioned above. To summarize, the 3 changes that improved our installation were: the head node upgrade, the bigger DB instance, and the slurm.conf timeouts.
We will benchmark how far we can scale up at a later time, when more p4d nodes become available.
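The exact slurm.conf lines are not quoted above; as a purely illustrative sketch of the kind of timeout tuning that helps in this situation (values are examples, not our exact settings):

```
# slurm.conf excerpt (illustrative values only)
MessageTimeout=60        # default 10s; more headroom for slow RPC round-trips
TCPTimeout=15            # default 2s; tolerate slow TCP connection setup
SlurmdTimeout=300        # how long slurmctld waits before marking a node down
BatchStartTimeout=300    # allow slow node bring-up before the batch script starts
TreeWidth=128            # widen the fanout of slurmd message forwarding
```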
We noticed a great reduction of head node congestion, but now the larger the scale, the more compute-to-compute connections fail. I believe we are in the phase where the nodes become ready to launch the job and they are losing IP connectivity to one another just to say "hey, I am alive"; I do not believe EFA is even triggered at this stage. Also, Slurm messaging and openmpi messaging are more impacted than intelmpi messaging. We tried some timing debugging and we see that the rate at which nodes become available is slow, too slow to feel normal.
@rvencu could you share how you are submitting the job? When you're talking about the database, do you mean you have Slurm accounting enabled? Could you also check if there are NFS-related errors in the logs? It might be useful to check VPC Flow Logs to verify REJECTED messages, see:
Then, it's good to double check that you're following all the best practices in the official Slurm High Throughput documentation. Enrico
Thanks. Here are my nccl tests that fail if there are too many nodes:
This uses the openmpi mpirun, obviously. With intelmpi we just load the intelmpi module and source the env as instructed; everything else stays the same. We submit with the script above. Yes, we have MySQL on an RDS instance and want to use Slurm accounting, but right now nothing is configured except slurmdbd running as a service. With a small DB we encountered errors consistent with the database not being able to cope with the number of requests, though we had some bug in the slurmdbd service definition that also made the process die, so I would not try to debug this issue just yet.
Tonight we get 200 more p4d instances into a new ODCR; we will restart the scaling tests once we have this capacity. I just activated the VPC flow logs for the entire VPC, filtered to REJECTED. We read the Slurm high throughput official doc, but of course we will double check. Will try to capture and bring more logs from the upcoming tests.
The following is a gist giving an example of a failing Python script using PyTorch DDP:
We had a controller meltdown last night without any traces in the log file. Restarting the slurmctld service restored the controller. The reason I am mentioning this here is that I saw this line in the log when I restarted the service:
Perhaps we will need to revisit the setting we modified above and solve the root cause, not the symptom...
We have rejected connections in the flow logs, like this:
I filtered source and destination to be from our VPC CIDR range. I am not sure if these are related to the failed jobs; I think I need to log the nodes' IPs and then search the logs again.
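Assuming the flow logs are delivered to CloudWatch Logs, a query along these lines can pull the intra-VPC REJECT records (the log group name and CIDR prefix are placeholders):

```bash
# Start a CloudWatch Logs Insights query for intra-VPC REJECT records
# over the last hour, newest first.
aws logs start-query \
  --log-group-name "/vpc/flow-logs" \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, srcAddr, dstAddr, dstPort, action
                  | filter action = "REJECT" and srcAddr like "172.31."
                  | sort @timestamp desc
                  | limit 50'
# Then fetch the results with: aws logs get-query-results --query-id <id>
```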
I tracked down 2 nodes that could not communicate and there were no rejected attempts in the flow logs
I searched for all sources and destinations like 172.31.23?.??? and there were no records. This leads me to think that what is really happening is that node 6 came alive earlier, found the list of all the upcoming other nodes and tried to connect to them; node 33 came up too late, after the connection timeout. We need to understand why the nodes come up so slowly one after another. Might the post-install script be taking too long (though I do not see why the network should not work during that time)? Or does simply spinning up many p4d nodes at once form some kind of slow queue in AWS virtualization? Or is it an artefact of the ODCR mechanism?
Spun up a cluster with 160 p4d nodes permanently up and retried nccl-tests. Same issue: with more than 36 nodes the tests fail the same way. So the time to bring up the nodes and run the post-install scripts does not matter.
Hi @rvencu, I just want to get some clarity on your latest test:
Did all of the compute nodes successfully start? Was this without the post-install script?
I spun up 160 nodes (down from the 174 presented as available, to avoid last-minute insufficient capacity errors). They are with post-install scripts and all. I guess the only script that is not run is the Slurm prolog; I added some custom things in that one, but the problem manifests the same as before. The custom things were done to tame CUDA in the containers: https://github.com/rvencu/1click-hpc/blob/main/scripts/prolog.sh
At least this is my current understanding: when launched, the compute nodes execute the post-install scripts automatically, while the prolog is executed before a job is started.
Actually, the only thing that seems to run better (which does not mean perfectly) across all our tests is:
@rvencu -- did you try specifying the hostfile directly with openmpi (see: #4179 (comment))? Curious if that changes the behavior at all?
This is the result with the explicit hostfile; slightly different but essentially the same (maybe it is a head node to compute problem):
Well, going back to square one and trying super basic things on 42 nodes. debug.sh:
test.sh:
result:
It seems that even 36 nodes barely work:
Hi @rvencu -- this is valuable new information; are the new tests done with a post-install script as well? Out of curiosity, what is the OS you are using? Can you also paste the full cluster config and link/paste the post-install script if you're using one?
It would be great to see the other side of this bisection effort (e.g. a case where the nodes boot successfully). If you haven't already, can you remove the prolog and post-install script to see if the nodes are able to start without any additional customizations?
The task launcher seems to be the key here. "-bootstrap slurm" tells ORTE to use srun to launch tasks in a cascaded way, making it more scalable. Calling srun on the first node to launch all tasks has limited scalability.
We clearly need to learn how to design the cascading scalability and attempt that... Do you have any documentation suggestions?
It's kind of difficult to find a clear document on this. (1) I'd start with https://www.open-mpi.org/faq/?category=large-clusters. (2) Try to use Open MPI's built-in scheduler integration for launching tasks (--mca parameter), as sketched below.
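A minimal sketch of point (2), assuming a recent Open MPI with the Slurm PLM component available (the application name is a placeholder):

```bash
# Force the Slurm process-launch (PLM) component and turn on launcher
# verbosity so the chosen PLM and the srun-based launch are visible.
mpirun --mca plm slurm --mca plm_base_verbose 10 \
  -npernode 8 ./my_app
```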
@rvencu -- I am also curious if it is related to this issue: open-mpi/ompi#4578, and whether adding the option mentioned there changes anything.
I am preparing to do so. In the meantime, Slurm support suggested this:
Will update here when I have results.
As a side note: running so many test jobs with a large number of nodes and watching squeue, I noticed that releasing the nodes is also a long process, and the nodes are released slowly, with similar timing as we got at start. And there is no epilog script in the configuration. Thought this is worth mentioning.
Yes -- I also note that the prolog script makes calls to the AWS API, which risks throttling at higher scale, which is why I wanted to dig into how much of an effect that might be causing versus potentially other issues.
OK, getting the prolog out of the config makes the launch timing fall to 1 second for 40 nodes, so yes, this seems to be the root cause. I will now reintroduce the things that seem safe and time them again to detect where the bottleneck is.
@rvencu -- excellent! Are you able to scale to all 200 nodes without the prolog?
I do not have 200 nodes, but I ran nccl-tests with the 120 nodes I found available and it was successful. The problem is in the prolog: the instance tagging API is slow, so we need to move that out separately and make it asynchronous.
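A hypothetical sketch of that kind of asynchronous tagging inside a Slurm prolog (not the actual prolog.sh linked above; tag keys and region are placeholders):

```bash
#!/bin/bash
# Slurm prolog fragment: tag the instance with the job id without
# blocking job launch on the EC2 API call.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
        http://169.254.169.254/latest/meta-data/instance-id)

# Run the tagging in the background; the job does not wait for it.
nohup aws ec2 create-tags --region us-east-1 \
      --resources "$INSTANCE_ID" \
      --tags Key=SlurmJobId,Value="$SLURM_JOB_ID" \
      >/dev/null 2>&1 &
```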
Okay great -- so it sounds like the root cause has been identified for this particular issue? As the prolog script is not directly a part of this project, I will mark this issue as closed if that is the root cause.
Thanks a lot.
We started an HPC cluster and placed the HeadNode into a public subnet for easy access from the internet while the compute fleet sits on a private subnet.
While we had a limited number of p4d instances available, things were running normally, but we recently got a new capacity reservation and the node count went up to 194, allowing us to try large jobs with over 100 nodes / 800 GPUs with EFA enabled; these instances have 4 x 100 Gbps network interfaces each (so > 40,000 Gbps aggregate per job).
We noticed job failures due to network errors, both in head node to compute node operations and in compute-to-compute communication.
openmpi especially seems sensitive to this: it fails to launch batch jobs with more than 35 nodes. intelmpi seems more resilient; we were able to run 80-node jobs but nothing above that. We also had an event where the Slurm master lost connection to the entire fleet, sending all compute nodes down and rescheduling all jobs.
I want to get some network recommendations for this scenario. What I am thinking now is to move the head node into the same private subnet as the compute fleet (and provide SSH connectivity via tunneling).
Is there anything that we could do at the level of VPC/subnet configurations to ease up the congestion?
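For reference, a minimal sketch of the relevant ParallelCluster 3 settings discussed here (network-optimized head node, private subnets, EFA and a placement group for the compute queue); instance types, counts, and subnet IDs below are illustrative placeholders rather than the actual configuration:

```yaml
# Illustrative excerpt only, not the full cluster config
HeadNode:
  InstanceType: c5n.18xlarge        # network-optimized head node
  Networking:
    SubnetId: subnet-xxxxxxxx       # same AZ as the compute fleet
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute-od-gpu
      Networking:
        SubnetIds:
          - subnet-yyyyyyyy         # private subnet
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: p4d
          InstanceType: p4d.24xlarge
          MaxCount: 194
          Efa:
            Enabled: true
            GdrSupport: true
```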