master/v5.0.x: missing documentation on MPI_Comm_join/Publish usage (client/server pattern) #10222
Comments
Good point. I just added this to the docs to-do list in #10256. If you'd like to add this to the upcoming v5.x docs, please feel free to open a PR!
Thank you @jsquyres ! Before opening a PR for this, I would like to understand whether this is actually the expected approach. In case it is, and it is the only way, it would also be very important in my opinion to give some kind of guidance on its usage in relation to the classic ompi-server flow. Thank you again!
That is an excellent question. I'm afraid I don't know the answer. @bosilca @abouteiller This user is volunteering to write / amend some docs. Can you help answer the questions above?
ULFM does not join worlds; instead, we spawn processes from an existing job, so things are slightly simpler there (not that they consistently work, but at least we do not have the issue with exchanging the modex). Any documentation, especially in such a rarely used area as connect/accept, will be of tremendous help.
Split out from #10480 |
Downgraded from "blocker" to "critical". |
I did some investigation on this issue, and here is the precise list of steps required to make connect/accept, join, and publish/unpublish work with ompi v5.0. The key hints are provided by @klaa97 above; I am just trying to document the sequence of steps here.
A few more things:
Errr... that will sometimes work, but some steps aren't actually required and won't work in some situations. I'll try to provide a more generic set of steps in a bit.
ok, thank you, any details and additional information would be appreciated |
There are two cases to consider:

If you own all of the nodes involved in the session (i.e., you are not sharing nodes), then you can start PRRTE with

If you are sharing nodes, then you cannot use the system-server method, as there can be only one system server at a time. Instead, you start PRRTE with

You can substitute

@naughtont3 Note that the
@rhc54 thank you very much, I will try to put this information into the docs. One follow-up question: would this also work for direct launch (e.g., srun), or is it bound to using PRRTE/mpirun?
I'm afraid Slurm does not include support for these operations, so it is constrained to PRRTE.
Completed in #11776 |
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc4
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed Open MPI from distribution tarball of v5.0.0rc5 from https://www.open-mpi.org/software/ompi/v5.0/
Please describe the system on which you are running
Details of the problem
I am trying to replicate a simple client/server MPI application using MPI_Comm_accept and MPI_Comm_connect, together with MPI_Publish_name / MPI_Lookup_name. Before version 5.0.x, I used the ompi-server command to allow communication between different executions, but since ORTE is deprecated as the runtime, the previous method does not work anymore and, as expected, the processes do not have any shared registry where the information is published or where they can connect. A minimal example below.
server.c
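The original server.c attachment was not preserved in this extraction. A minimal sketch of such a server (the service name "my-server" is an assumption for illustration; the name-server behind MPI_Publish_name is provided by the runtime, e.g. a PRRTE DVM):

```c
/* server.c -- minimal sketch; the original attachment was not preserved.
 * Opens a port, publishes it under the (assumed) name "my-server",
 * accepts one client connection, and receives a single integer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm client;
    int msg = 0;

    MPI_Init(&argc, &argv);

    /* Obtain a port and make it discoverable by name. */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    MPI_Publish_name("my-server", MPI_INFO_NULL, port_name);
    printf("server: published port %s\n", port_name);

    /* Block until a client connects, then receive one message. */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, client, MPI_STATUS_IGNORE);
    printf("server: received %d\n", msg);

    /* Clean up: disconnect, unpublish, and close the port. */
    MPI_Comm_disconnect(&client);
    MPI_Unpublish_name("my-server", MPI_INFO_NULL, port_name);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}
```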
client.c
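The matching client.c was likewise not preserved; a minimal sketch, looking up the same assumed service name "my-server" used in the server sketch above:

```c
/* client.c -- minimal sketch; the original attachment was not preserved.
 * Looks up the port published under the (assumed) name "my-server",
 * connects to it, and sends a single integer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm server;
    int msg = 42;

    MPI_Init(&argc, &argv);

    /* Resolve the published service name to a port string. */
    MPI_Lookup_name("my-server", MPI_INFO_NULL, port_name);

    /* Connect to the server and send one message. */
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    MPI_Send(&msg, 1, MPI_INT, 0, 0, server);

    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}
```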
Moreover, even if I communicate the server port to the client in other ways (such as writing it to a file), the two processes hang (I am considering mpirun as the runtime).

Possible solution
The current solution that I am employing, following the PRRTE model, is the following:

1. Start prte (as a system server for simplicity)
2. Launch the executions with prun --system-server-only
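The two steps above can be sketched as shell commands. The prte --system-server flag is my assumption for "as a system server" (check your PRRTE man pages); prun --system-server-only is taken from the text:

```shell
# Terminal 1: start a persistent DVM acting as the system server
# (--system-server is assumed; it asks prte to be the single
# system-level server on this node)
prte --system-server &

# Terminal 2: launch the server job against that DVM
prun --system-server-only -n 1 ./server &

# Terminal 3: launch the client job against the same DVM
prun --system-server-only -n 1 ./client
```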
Is this the expected and correct solution? Is there any other way to connect different MPI executions (other than MPI_Comm_spawn)?
I also suggest a sort of migration guide for people that were used to the ompi-server flow, maybe directly in the documentation of these MPI directives that need a DVM to work (MPI_Comm_join, Lookup/Publish, Accept/Connect come to my mind, but I may be missing some 🤔). Thank you very much!