@bcumming bcumming commented Aug 11, 2025

The method for adding MPI and tuning the networking library stack has been simplified.

There are two components that define the network stack.
The first is the new network field in environments.yaml that replaces the old mpi field.
The user provides a spec for the MPI distribution they want (currently cray-mpich and openmpi are supported), and an optional list of specs for dependencies (e.g. libfabric):

network:
  mpi: <spec>
  specs: [<spec>, ...]
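As a rough illustration of the shape of this field, here is a minimal sketch of a validator for an already-parsed `network` mapping. The function name `validate_network`, the name-extraction regex, and the version strings are assumptions made for this example; they are not taken from stackinator's code or schema.

```python
import re

# MPI distributions named as supported in the description above.
SUPPORTED_MPI = ("cray-mpich", "openmpi")

# Assumption: the package name is the leading run of name characters in a spec.
_NAME = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]*")


def validate_network(network: dict) -> dict:
    """Check a parsed `network` field and return it with `specs` filled in.

    Hypothetical helper: it only mirrors the two keys described above.
    """
    mpi = network.get("mpi")
    if not isinstance(mpi, str) or not mpi.strip():
        raise ValueError("network.mpi must be a non-empty spec string")
    mpi = mpi.strip()

    match = _NAME.match(mpi)
    name = match.group(0) if match else ""
    if name not in SUPPORTED_MPI:
        raise ValueError(f"unsupported MPI distribution: {name!r}")

    # `specs` is optional; treat a missing or empty value as an empty list.
    specs = network.get("specs") or []
    if not isinstance(specs, list) or not all(isinstance(s, str) for s in specs):
        raise ValueError("network.specs must be a list of spec strings")

    return {"mpi": mpi, "specs": list(specs)}
```

For example, `validate_network({"mpi": "openmpi", "specs": ["libfabric@main"]})` returns the same two keys unchanged, while a mapping without `specs` comes back with an empty list.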

The other half is the new network.yaml file in the system configuration, which provides a default list of specs for each MPI distribution and a packages.yaml description of the default / required variants and options for cray-mpich, openmpi, libfabric, etc.

For example, using this approach, the +cuda variant can be set as a default on a GH200 cluster, and disabled on a CPU-only system.
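One way to picture how the two halves could combine is sketched below. It assumes that user-provided specs take precedence over the system defaults for the same package and that the defaults fill in everything else; the PR description does not state the actual merge rule, so treat this as an assumption. The function `resolve_specs`, the shape of the defaults mapping, and all version strings are hypothetical.

```python
import re

# Assumption: the package name is the leading run of name characters in a spec.
_NAME = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]*")


def package_name(spec: str) -> str:
    """Return the package name at the start of a spec string."""
    match = _NAME.match(spec.strip())
    if match is None:
        raise ValueError(f"cannot find a package name in {spec!r}")
    return match.group(0)


def resolve_specs(mpi_spec: str, user_specs: list, system_defaults: dict) -> list:
    """Combine per-MPI default specs with the user's optional specs.

    Hypothetical rule: a user spec replaces the default for the same
    package; any other defaults are kept in their original order.
    """
    defaults = system_defaults.get(package_name(mpi_spec), [])
    chosen = {package_name(s): s for s in defaults}
    chosen.update({package_name(s): s for s in user_specs})
    return list(chosen.values())


# Illustrative defaults for a GPU system and a CPU-only system.
gpu_defaults = {"cray-mpich": ["libfabric@main +cuda"]}
cpu_defaults = {"cray-mpich": ["libfabric@main ~cuda"]}
```

With these illustrative defaults, the same `environments.yaml` entry resolves to a CUDA-enabled libfabric on the GPU system and a CUDA-free one on the CPU-only system, without the user changing anything.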

Based off #210.

@@ -87,7 +87,7 @@ def main():
         )
         return 0
     except Exception as e:
-        root_logger.debug(traceback.format_exc())
+        root_logger.info(traceback.format_exc())
Member Author

Instead of pushing Python stack traces to the logs, just dump them on the screen, given that most users know something about the internals of stackinator.
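To show why switching from `debug` to `info` changes where the traceback appears, here is a minimal sketch. It assumes a logger with a console handler at INFO level and a log-file handler at DEBUG level; that handler layout is an assumption about a typical setup, not a copy of stackinator's logging configuration. In-memory streams stand in for the terminal and the log file.

```python
import io
import logging
import traceback


def make_logger(console, logfile):
    """Build a logger with an INFO console handler and a DEBUG file handler."""
    logger = logging.getLogger("network_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False

    console_handler = logging.StreamHandler(console)
    console_handler.setLevel(logging.INFO)  # the screen only sees INFO and above
    file_handler = logging.StreamHandler(logfile)
    file_handler.setLevel(logging.DEBUG)  # the log receives everything

    logger.addHandler(console_handler)
    logger.addHandler(file_handler)
    return logger


console, logfile = io.StringIO(), io.StringIO()
log = make_logger(console, logfile)

try:
    raise RuntimeError("boom")
except Exception:
    log.debug("debug: " + traceback.format_exc())  # log only
    log.info("info: " + traceback.format_exc())    # screen and log
```

Under this layout the `debug` message reaches only the log stream, while the `info` message reaches both, which matches the intent of the change.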

Co-authored-by: Alberto Invernizzi <[email protected]>
- spec: [email protected] fabrics=cxi,rxm,tcp
prefix: /opt/cray/libfabric/1.22.0/
version: ["git.v2.2.0=main"]
require: fabrics=cxi,rxm,tcp
Member

This is just an example, but do you need lnx for OpenMPI?

Member Author

I don't know - I am copying off what @biddisco provided.

We can update the examples, and the configuration in the network.yaml files, separately from this work (which focuses on the infrastructure).

@bcumming bcumming merged commit 5581430 into eth-cscs:main Aug 21, 2025
2 checks passed
bcumming added a commit to eth-cscs/alps-cluster-config that referenced this pull request Aug 21, 2025
Add `network.yaml` files to the cluster configurations, to support
building OpenMPI and low-level network libraries.

The stackinator work that uses `network.yaml` was merged:
eth-cscs/stackinator#256

Note that the definitions of libfabric, etc., are currently pinned to
commits on `main`/`master` of the source repositories, which isn't
ideal, but with this feature we can start using concrete versions when
possible.