OMPI 4.0.0rc5 fails to see slots on Cray XC40 #5973
Comments
Try adding
Here is the output from a debug build:
It seems that PBS is chosen over ALPS. The PBS environment:
I also tried increasing the priority of ALPS but to no avail:
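For reference, a hedged sketch of how that priority bump might be attempted; the parameter name `ras_alps_priority` is an assumption about how the ras/alps component registers its priority variable, and `./app` plus the rank count are placeholders:

```
# Check which priority parameters the ras components actually expose
ompi_info --param ras all --level 9 | grep priority

# Assumed parameter name: ras_alps_priority (verify with the command above)
mpirun --mca ras_alps_priority 100 -np 48 ./app
```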
Hmmm...well that is certainly very confusing. It looks like you have both ALPS and PBS on the same system? How would/does that work? You can eliminate the PBS support by simply configuring OMPI without it. For whatever reason, both the ALPS and PBS allocation parsers agree that you were only allocated 1 slot on each of NID 5775 and 5776. The PBS environment confirms that assignment. Afraid I don't know enough about the ALPS parser to understand that output. Guess you could also check that you don't have a default hostfile or -host set in your environment or MCA param file.
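A minimal sketch of those two suggestions, assuming the standard `--without-tm` configure flag for disabling Torque/PBS (tm) support and the usual per-user MCA parameter file location; the prefix path is illustrative:

```
# Rebuild Open MPI without PBS/Torque (tm) support so the tm components cannot be selected
./configure --without-tm --prefix=$HOME/opt/ompi-4.0.0rc5
make -j && make install

# Check for a default hostfile or -host sneaking in via the environment or an MCA param file
env | grep '^OMPI_MCA'
cat $HOME/.openmpi/mca-params.conf 2>/dev/null
```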
In case it's of any use: here is the debug output of the plm component:
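For anyone trying to reproduce this, verbose output like the above can typically be requested via the framework verbosity MCA parameters (`./app` and the rank count are placeholders):

```
# Ask the launcher (plm) and allocation (ras) frameworks to print debug output
mpirun --mca plm_base_verbose 100 --mca ras_base_verbose 100 -np 48 ./app
```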
does the application launch if you use aprun rather than mpirun?
sorry didn't see the note above about working with aprun.
I only have access to a system running PBSpro but will see if I can reproduce this.
@devreal what qsub arguments are you using?
@devreal could you submit a new job and paste the qsub command and PBS nodefile?
qsub command:
PBS Nodefile:
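For context: inside a PBS job the nodefile is available via `$PBS_NODEFILE`, so a quick way to capture what PBS handed to the job (the exact environment variables worth grepping are a guess) is:

```
# Inside the batch job: dump the nodefile and the PBS/ALPS-related environment
cat $PBS_NODEFILE
env | egrep 'PBS|ALPS' | sort
```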
I would expect that ALPS will provide Open MPI with the correct allocation information as that is (from what I understand) the way
FWIW: that tells OMPI to assign 1 slot from each of those nodes. We only auto-detect slots in unmanaged environments. So you would indeed get 1 slot if the ras/tm component is used. I don't know how the info gets filtered thru ALPS and into aprun. Best guess is that ALPS is also seeing only node assignments, but aprun defaults to assuming slots=cores - which is not what OMPI assumes in a managed environment.
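One way to confirm what allocation Open MPI actually detected (i.e., whether it really saw only 1 slot per node) is mpirun's allocation display option; `./app` is a placeholder:

```
# Print the nodes and slot counts that ORTE detected before launching
mpirun --display-allocation -np 2 ./app
```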
could you add a ppn parameter, like
ALPS is being told by PBS that you're only requesting 1 process per node. I believe this may be a site-dependent configuration. That being said, for PBS Pro, I always tell PBS exactly how many slots/node I want, e.g.
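As an illustration (not the commenter's actual command), the standard ways to pin the per-node process count in a PBS request look roughly like this; `job.sh` and the counts are placeholders:

```
# PBS Pro select syntax: 2 nodes, 24 cores and 24 MPI ranks per node
qsub -l select=2:ncpus=24:mpiprocs=24 job.sh

# Torque/classic PBS syntax: 2 nodes, 24 processes per node
qsub -l nodes=2:ppn=24 job.sh
```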
Our PBS does not allow me to define the number of processes per node:
It seems to be a site-specific configuration issue and not a problem of Open MPI then. I guess I will go with the
closing as this does not appear to be an Open MPI or a Cray problem, but something about how the site has configured PBS.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Open MPI 4.0.0rc5 downloaded from GitHub
Please describe the system on which you are running
Cray XC40
Details of the problem
I'm trying to run jobs using Open MPI's `mpirun` (so I can set MCA parameters directly) with multiple processes per node, which fails:

Running a single process per node works as expected:
I can run applications linked against Open MPI using `aprun` (with `PMI_NO_FORK` set) and I can pass `--oversubscribe` to `mpirun` to run with multiple ranks per node.

Is this the expected behavior? Each node has 24 cores / 48 threads, so I would expect Open MPI to see the number of slots available.
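For reference, a hedged sketch of the invocations described above; `./app`, the rank counts, and setting `PMI_NO_FORK=1` specifically are placeholders/assumptions rather than the exact commands from this report:

```
# Fails here: multiple ranks per node through Open MPI's mpirun
mpirun -np 48 ./app

# Works: launching through ALPS directly
export PMI_NO_FORK=1
aprun -n 48 ./app

# Works: telling mpirun to ignore the detected slot count
mpirun --oversubscribe -np 48 ./app
```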
I'm attaching the `config.log`. It seems that support for Cray PMI is properly detected.

config.log
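As a quick sanity check (not part of the original report), one way to see what configure decided about Cray PMI in the attached log:

```
# Look for Cray PMI related results in config.log
grep -i 'pmi' config.log | grep -i 'cray'
```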