Skip to content

orted/pmix: fix spawn in singleton mode #2084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 20, 2016

Conversation

ggouaillardet
Copy link
Contributor

invoke orte_pre_condition_transports() in order to set the
ORTE_JOB_TRANSPORT_KEY attribute.

@ggouaillardet
Copy link
Contributor Author

@rhc54 could you please review this ?

currently, spawn is broken on master with singleton.

that can be evidenced by running ./spawn from the ibm test suite.
v2.x is fine.

once the patch is applied, it is then also possible to run
OMPI_MCA_rmaps_base_oversubscribe=true OMPI_MCA_hwloc_base_binding_policy=none ./spawn_multiple

unfortunatly, spawn_multiple in singleton mode does not work on v2.x ...

@rhc54
Copy link
Contributor

rhc54 commented Sep 15, 2016

I don't think that is actually going to do what it is supposed to do. The purpose of the transport key attribute is to record the key used by the original parent job (in this case, the singleton) so the child can be given the same key. This allows PSM2 to enable communication between the parent and child jobs.

What this patch does is assign a new key to the parent job, and then pass that key to the child. It isn't the same key used by the parent singleton, and so communication won't work.

What we need is for the singleton to pass the transport key to the HNP in the spawn request. We can add a PMIx attribute for this purpose. Then the patch would be to detect that PMIx attribute and set the transport key attribute in the parent job so that the PLM can do the right thing.

Make sense?

@ggouaillardet
Copy link
Contributor Author

It does make sense

Would it be easier to pass the key on the orted command line ?
Or have orted return it in its very first reply to the singleton (e.g. extend the [PMIX_...=...,...] reply) ?

@rhc54
Copy link
Contributor

rhc54 commented Sep 15, 2016

Actually, I was thinking that the singleton can just add the transport key to the spawn command as another "info" key. So in the ompi/dpm, we would lookup and add the transport key to the pmix.spawn attribute list

@ggouaillardet
Copy link
Contributor Author

@rhc54 i gave it some more thoughts and came with a fix i think is both correct and simpler.
(e.g. singleton only is fixed, so fix only this)

unless i am missing a broader rationale, adding yet an other info key to spawn in order to only fix the singleton case looks like an overkill to me.
also, with this commit, the singleton is assigned a random key, and not the (a bit more predictable) the jobid.

could you please double check it ?

@rhc54
Copy link
Contributor

rhc54 commented Sep 19, 2016

That should be okay so long as the ess/singleton code is adjusted to fit. You need to ensure that the ess code doesn't call precondition_transport, and that it properly handles another envar.

@ggouaillardet
Copy link
Contributor Author

i updated the patch
ess/singleton cannot call orte_pre_condition_transports_print and i added an assert to triple check we would have not done it.
master does not check the number of env variables returned by orted, only v2.x does. so i guess this patch is ok for master, but extra care will have to be taken care of for the PR

@rhc54
Copy link
Contributor

rhc54 commented Sep 20, 2016

Yeah, I think it's okay too - might need to watch out to see if we have to increase the buffer size for the return string from the orted as we are adding things to it.

@ggouaillardet
Copy link
Contributor Author

do you mean ORTE_URI_MSG_LGTH ?
my understanding is that the receive buffer is realloc'ed when needed, and on the sender size, asprintf is used, so current code is ok. increasing that value would simply avoid a realloc.
i just noted orted does not check the result of write (so it only hopes everything was written at once) and ess/singleton assumes read will never fail. i will push an other commit for that.

in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports()
and send the transport key back to the singleton
@ggouaillardet ggouaillardet merged commit 83399ad into open-mpi:master Sep 20, 2016
@rhc54
Copy link
Contributor

rhc54 commented Sep 20, 2016

You are correct - one thing I hate about this interface is that you don't see the changes in the full context of the file without lots of "clicks". We should be good to go.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants