Skip to content

orted: fix tree-spawn when the node regex is too long #4637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ggouaillardet
Copy link
Contributor

When the node regex is too long to be sent on the command line,
retrieve it first from the parent, and then spawn the remote orted

Signed-off-by: Gilles Gouaillardet [email protected]

@ggouaillardet
Copy link
Contributor Author

note to myself:

  • do we also need to retrieve the prefix when receiving the regex ?
  • it seems remote_spawn() is always invoked with a NULL parameter, is there some dead code to be removed ?

@rhc54
Copy link
Contributor

rhc54 commented Dec 19, 2017

Looks okay to me - thanks!

You shouldn't need to retrieve the prefix because you were already given it on the cmd line that started the orted.

There is indeed some dead code to remove. Now that the daemon calls remote_spawn itself, there is no longer a need for the "tree_spawn" cmd (odls/odls_types.h) or the associated cmd processing code in orted/orted_comm.c as the HNP is no longer sending a tree-spawn message to the orted.

@ggouaillardet ggouaillardet force-pushed the topic/tree_spawn_no_regex branch 3 times, most recently from 33d706b to 3cbf39b Compare December 24, 2017 14:06
@jsquyres
Copy link
Member

jsquyres commented Jan 3, 2018

@ggouaillardet @rhc54 Is this one ready to merge?

@@ -44,7 +46,7 @@ typedef uint8_t orte_daemon_cmd_flag_t;
#define ORTE_DAEMON_KILL_LOCAL_PROCS (orte_daemon_cmd_flag_t) 2
#define ORTE_DAEMON_SIGNAL_LOCAL_PROCS (orte_daemon_cmd_flag_t) 3
#define ORTE_DAEMON_ADD_LOCAL_PROCS (orte_daemon_cmd_flag_t) 4
#define ORTE_DAEMON_TREE_SPAWN (orte_daemon_cmd_flag_t) 5
#define ORTE_DAEMON_TREE_SPAWN_UNUSED (orte_daemon_cmd_flag_t) 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just remove this cmd - no need to renumber the rest.

@rhc54
Copy link
Contributor

rhc54 commented Jan 3, 2018

I think it is, yes - but I defer to @ggouaillardet

since open-mpi/ompi@8f496b0
rsh_wait_daemon is invoked with an orte_wait_tracker_t *,
that must be used to reach the orte_plm_rsh_caddy_t *.

Signed-off-by: Gilles Gouaillardet <[email protected]>
…aitpid_cb()

since open-mpi/ompi@8f496b0
sstore_stage_local_compress_waitpid_cb is invoked with an orte_wait_tracker_t *,
that must be used to reach the orte_sstore_stage_local_app_snapshot_info_t *.

Signed-off-by: Gilles Gouaillardet <[email protected]>
This parameter can be used to set the node regex max length that can
be passed to the orted command line.
For testing purpose, it can be set to zero in order to force the node regex
being retrieved by orted from its parent.

Signed-off-by: Gilles Gouaillardet <[email protected]>
When the node regex is too long to be sent on the command line,
retrieve  it first from the parent, and then spawn the remote orted

Signed-off-by: Gilles Gouaillardet <[email protected]>
Now that the daemon calls remote_spawn itself, there is no longer
a need for the "tree_spawn" command nor the associated command
processing code since the HNP is no longer sending a tree-spawn
message to the orted.

Thanks Ralph for the guidance !

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet ggouaillardet force-pushed the topic/tree_spawn_no_regex branch from 3cbf39b to 03da521 Compare January 4, 2018 00:35
@ggouaillardet
Copy link
Contributor Author

ready for primetime, merging now

@ggouaillardet ggouaillardet merged commit 56fe714 into open-mpi:master Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants