Skip to content

oob/tcp: add cross version compatibility support #6157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ggouaillardet
Copy link
Contributor

Since we intend to provide cross version compatibility
between versions with the same major and minor, use
MAJOR.MINOR.0 instead of orte_version_string
(e.g. MAJOR.MINOR.RELEASEGREEK).

Open MPI 4.0.0 has already been released, so in order to make
it compatible with future 4.0.x releases, we have to use 4.0.0
as the version string, that is why we use MAJOR.MINOR.0 instead
of MAJOR.MINOR

Signed-off-by: Gilles Gouaillardet [email protected]

Since we intend to provide cross version compatibility
between versions with the same major and minor, use
MAJOR.MINOR.0 instead of orte_version_string
(e.g. MAJOR.MINOR.RELEASEGREEK).

Open MPI 4.0.0 has already been released, so in order to make
it compatible with future 4.0.x releases, we have to use 4.0.0
as the version string, that is why we use MAJOR.MINOR.0 instead
of MAJOR.MINOR

Signed-off-by: Gilles Gouaillardet <[email protected]>
Copy link
Member

@jjhursey jjhursey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good.
Does the same change need to be made for the 'alps' component?

I think this change is fine. My concern is the buy in from the OMPI community to maintain this cross version commitment. I don't think that it would be difficult within a major+minor release series, but I would like to see some testing to make sure nothing sneaks in.

@rhc54
Copy link
Contributor

rhc54 commented Dec 7, 2018

Just a suggestion. You have the actual version string there, so why not just check for it starting with "4.0", and then know that you have to use the entire string for that one case? You could check for major number -> if it is 4, then check for minor -> if it is 0, then check the release for 0.

@ggouaillardet
Copy link
Contributor Author

@jjhursey I could not find any version string being used in the oob/alps component.
Could you raise the commitment/testing issue at the telcon ?

I am not sure of what vendor MPI are sending through the wire
(e.g. is this a "parseable" version or a custom string), could you double check that too ?
(just check the value of OPAL_IDENT_STRING in opal/include/opal_config.h)

@rhc54 my understanding is that both ends of an oob/tcp connection check each other version.
4.0.0 has already been release, and

  • sends 4.0.0 (we can only check for 4.0 in the next releases)
  • expects 4.0.0 (which means future release have to send 4.0.0, unless mpirun is made aware (manually or automatically) it is talking to orted 4.0.0)
    Regardless of vendors using a custom version string, and unless I am missing something, I do not see how your suggestion can work if we intend to have mpirun 4.0.1 work with orted 4.0.0.

@rhc54
Copy link
Contributor

rhc54 commented Dec 11, 2018

@ggouaillardet You missed my point, but I likely didn't clearly explain it. Check out the handshake procedure. The orted sends its connection info first to mpirun. Thus, mpirun has the opportunity to fully parse the orted's version string to determine its exact major, minor, and release number. As a result, it can use that knowledge to return a version string that will allow the remote end to make the desired decision.

This, of course, only solves the problem where mpirun is at a release level equal to or higher than the orted - yet that is the scenario you will almost always encounter. The case where mpirun is older than the orted would be rare - however, you could work around that as well by having the newer mpirun add a flag to the orted cmd line indicating it's "compliant". If that flag isn't there, then the orted knows it is an older mpirun launching it and can again send an appropriate version string (in this case, we would send 4.0.0 since we know we don't support any other combination).

Up to you - just pointing out that you have adequate info and controls to solve it.

@ggouaillardet
Copy link
Contributor Author

@rhc54 thanks ! I obviously missed the part when orted "speaks first".

@jjhursey did you get a chance to discuss this topic at the telcon ?
If we plan to commit on supporting this, what is the exact commitment ?
mpirun and orted must have the same major and version. do we also require mpirun >= orted ?

@jjhursey
Copy link
Member

I had to miss the telecon this week. I can bring it up next week.
@gpaulsen Can you add this PR to the agenda?

@ggouaillardet
Copy link
Contributor Author

@hppritcha I marked this as a blocker since 4.0.1 is going to be released soon.

Bottom line, if we want cross-compatibility between 4.0.0 and 4.0.1 in a docker environment, something has to be done before 4.0.1 is released.
This PR is simple enough and can be pushed as a one-off commit in the v4.0.x branch
(and I will implement the better solution proposed by Ralph for master and 4.0.2 later).

If we do not care about cross-compatibility, then please remove the blocker label.

@hppritcha
Copy link
Member

For purposes of clarification, the alps oob is essentially a do nothing for the case of native launch with alps aprun or slurm srun, so no impact of this PR on alps.

@hppritcha
Copy link
Member

closing because we're looking for a better long term fix on master.

@hppritcha hppritcha closed this Feb 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants