-
Notifications
You must be signed in to change notification settings - Fork 920
Update PMIx and PRRTE #7566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update PMIx and PRRTE #7566
Conversation
@artemry-mlnx I fully expect this to fail Mellanox CI due to the changes in cmd line parsing rules when it comes to "--tune", "--amca", "--mca", and "-x" options. I'll need your help to work thru the updates to your CI |
@rhc54 sure, I'll deal with the CI script update. |
Well, host |
No, it isn't anything to do with IBM CI - the MCA params need to be properly prefixed in the OMPI case and they aren't. I'm fixing it |
Just to follow up on this question: Yeah the 172.18.0.x network is the overlay inside the virtual cluster. It is private and routable between only the "virtual nodes" in that cluster. It shouldn't be getting in the way. It sounds like Ralph is working the real issue. |
Inside an salloc job, I see that mpirun of a non-MPI job works fine, but mpirun of an MPI job fails:
|
Not sure I can understand that one - did the daemon on the remote host segfault? Otherwise, it's hard to understand why the PRRTE infrastructure would care about MPI vs non-MPI apps |
I am now unable to reproduce the error. Weird, because it was 100% reproducible before. 🤷♂ Sooo... let's ignore my report, I guess! |
What's the failure in the Mellanox CI? Is that some in-flight CLI change that the Mellanox CI needs to be updated for? |
Yes and no. Yes, the specific test should return an error because they included the same envar in two different tune files, and assigned it a different value. This is no longer permitted. No in that there are still some bugs in the parser as it didn't error out as it should. I'm working on that one, and then we'll have to address the test itself. |
@artemry-mlnx I believe I have this working now and reporting cmd line errors as it should. There are two tests that are going to fail:
These will fail because they "reset" values, which is no longer allowed. You can override values from a file by putting a new value on the cmd line itself, but values from within two files (or even within a single file) are not allowed to override each other. I'm not sure how you would like to fix those - we could just turn them off, I suppose, or you could fix the override. Up to you. |
Deprecate --am and --amca options Avoid default param files on backend nodes Any parameters in the PRRTE default or user param files will have been picked up by prte and included in the environment sent to the prted, so don't open those files on the backend. Avoid picking up MCA param file info on backend Avoid the scaling problem at PRRTE startup by only reading the system and user param files on the frontend. Complete revisions to cmd line parser for OMPI Per specification, enforce following precedence order: 1. system-level default parameter file 1. user-level default parameter file 1. Anything found in the environment 1. "--tune" files. Note that "--amca" goes away and becomes equivalent to "--tune". Okay if it is provided more than once on a cmd line (we will aggregate the list of files, retaining order), but an error if a parameter is referenced in more than one file with a different value 1. "--mca" options. Again, error if the same option appears more than once with a different value. Allowed to override a parameter referenced in a "tune" file 1. "-x" options. Allowed to overwrite options given in a "tune" file, but cannot conflict with an explicit "--mca" option 1. all other options Fix special handling of "-np" Get agreement on jobid across the layers Need all three pieces (PRRTE, PMIx, and OPAL) to agree on the nspace conversion to jobid method Ensure prte show_help messages get output Print abnormal termination messages Cleanup error reporting in persistent operations Signed-off-by: Ralph Castain <[email protected]> dd Signed-off-by: Ralph Castain <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are working for me.
@artemry-mlnx Any estimate on when you will update the tests? We are hitting a bit of a problem in that PRRTE is now ahead of OMPI and cannot launch OMPI jobs without this PR being committed. |
Properly mark/detect that a daemon sourced the event broadcast to avoid reinjecting it into the PMIx server library. Correct the source field for the event notify call on launcher ready. Update event notification for tool support Deal with a variety of race conditions related to tool reconnection to a different server. Signed-off-by: Ralph Castain <[email protected]>
@rhc54 working on the script update. |
libevent, hwloc, and pmix can be external and may require that their libs be explicitly linked into the PRRTE binaries Signed-off-by: Ralph Castain <[email protected]>
@artemry-mlnx Perhaps the easiest approach is for me to give you a pull request with the required changes? That will give me a chance to manually check each test and determine the required changes |
@artemry-mlnx I believe mellanox-hpc/jenkins_scripts#96 will fix the CI tests. Can you please review and we can recheck here? |
Signed-off-by: Ralph Castain <[email protected]>
@rhc54 |
@artemry-mlnx I ran the updated tests by hand and they all worked, so I believe it should pass. If not, I can deal with any failures once I see them. |
@rhc54 ompi/.ci/mellanox/azure-pipelines.yml Line 13 in 5dcd1f4
ompi/.ci/mellanox/azure-pipelines.yml Line 14 in 5dcd1f4
|
@artemry-mlnx Done - let's see how it does! |
@rhc54 it's passed, please revert back Azure Pipeline and I'll proceed with the merge for jenkins_scripts. |
@artemry-mlnx Done! |
@rhc54 |
@artemry-mlnx Yes - though it isn't clear that this PR is going to pass without my local CI change unless you commit the PR, true? |
That's true - merged first. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@artemry-mlnx Thanks! |
Deprecate --am and --amca options
Avoid default param files on backend nodes
Any parameters in the PRRTE default or user param files will have been
picked up by prte and included in the environment sent to the prted, so
don't open those files on the backend.
Avoid picking up MCA param file info on backend
Avoid the scaling problem at PRRTE startup by only reading the system
and user param files on the frontend.
Complete revisions to cmd line parser for OMPI
Per specification, enforce following precedence order:
Fix special handling of "-np"
Fixes #7565
Get agreement on jobid across the layers
Need all three pieces (PRRTE, PMIx, and OPAL) to agree on the nspace
conversion to jobid method
Ensure prte show_help messages get output
Print abnormal termination messages
Fixes #7564
Add extra libs to PRRTE binaries for external deps
libevent, hwloc, and pmix can be external and may require that their
libs be explicitly linked into the PRRTE binaries
Fix scalable launch with rsh - resolve several issues in tree-spawn
Signed-off-by: Ralph Castain [email protected]