First class network stack support #256
@@ -87,7 +87,7 @@ def main():
         )
         return 0
     except Exception as e:
-        root_logger.debug(traceback.format_exc())
+        root_logger.info(traceback.format_exc())
Instead of pushing Python stack traces to the logs, just dump them on the screen given that most users know something about the internals of stackinator.
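To illustrate the effect of switching from `debug` to `info`, here is a minimal sketch. It assumes the logger has a console handler at `INFO` level and a log-file handler at `DEBUG` level; that handler setup, the logger name, and the `make_logger` helper are assumptions for illustration, not taken from the stackinator source. In-memory streams stand in for the screen and the log file.

```python
import io
import logging
import traceback


def make_logger(console, logfile):
    """Build a logger with an INFO console handler and a DEBUG log handler.

    The handler levels are an assumption made for this sketch.
    """
    logger = logging.getLogger("network_stack_sketch")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False

    console_handler = logging.StreamHandler(console)
    console_handler.setLevel(logging.INFO)  # the screen only shows INFO and above
    logger.addHandler(console_handler)

    file_handler = logging.StreamHandler(logfile)
    file_handler.setLevel(logging.DEBUG)  # the log keeps everything
    logger.addHandler(file_handler)
    return logger


console, logfile = io.StringIO(), io.StringIO()
logger = make_logger(console, logfile)

try:
    raise RuntimeError("recipe failed")
except Exception:
    # Old behaviour: the stack trace only reaches the log.
    logger.debug("OLD-TRACE " + traceback.format_exc())
    # New behaviour: the stack trace reaches both the screen and the log.
    logger.info("NEW-TRACE " + traceback.format_exc())
```

Under these assumptions, `console` receives only the `NEW-TRACE` message, while `logfile` receives both.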
Co-authored-by: Alberto Invernizzi <[email protected]>
- spec: [email protected] fabrics=cxi,rxm,tcp
prefix: /opt/cray/libfabric/1.22.0/
version: ["git.v2.2.0=main"]
require: fabrics=cxi,rxm,tcp
This is just an example, but do you need `lnx` for OpenMPI?
I don't know - I am copying off what @biddisco provided.
We can update the examples, and the configuration in the network.yaml
files, separately from this work (which focuses on the infrastructure).
Co-authored-by: Rocco Meli <[email protected]>
Add `network.yaml` files to the cluster configurations to support building OpenMPI and low-level network libraries. The stackinator work that uses `network.yaml` was merged in eth-cscs/stackinator#256. Note that the definitions of libfabric, etc. are currently pinned to commits on `main`/`master` of the source repositories, which isn't ideal, but with this feature we can start using concrete versions where possible.
The method for adding MPI and tuning the networking library stack has been simplified.
There are two components that define the network stack.
The first is the new `network` field in `environments.yaml` that replaces the old `mpi` field. The user provides a `spec` for the MPI distribution they want (currently cray-mpich and openmpi are supported), and an optional list of specs for dependencies (e.g. libfabric).
The other half is the new `network.yaml` file in the system configuration, which provides a default list of `specs` for each MPI distribution, and a `packages.yaml` description of the default / required variants and options for cray-mpich, openmpi, libfabric, etc.
For example, using this approach, the `+cuda` variant can be set as a default on a GH200 cluster, and disabled on a CPU-only system.
Based off #210.
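The interplay between the recipe's `network` field and the system's `network.yaml` defaults could be sketched roughly as follows. This is not the stackinator implementation: the dictionary keys (`mpi`, `specs`), the helpers `mpi_name` and `resolve_network`, the spec strings, and the rule that user-provided specs replace the defaults wholesale are all assumptions made for illustration.

```python
def mpi_name(spec):
    """Return the package name at the front of a Spack-style spec string."""
    for i, ch in enumerate(spec):
        # A version (@), variant (+, ~), compiler (%) or a space ends the name.
        if ch in "@+~% ":
            return spec[:i]
    return spec


def resolve_network(user_network, system_defaults):
    """Combine the recipe's network field with system defaults (hypothetical).

    user_network:    dict with an 'mpi' spec and an optional 'specs' list
    system_defaults: dict mapping MPI name -> {'specs': [...]}
    """
    mpi_spec = user_network["mpi"]
    name = mpi_name(mpi_spec)
    if name not in system_defaults:
        raise ValueError(f"unsupported MPI distribution: {name}")

    user_specs = user_network.get("specs") or []
    # Assumption: user specs, if given, replace the system default list.
    dep_specs = user_specs if user_specs else system_defaults[name]["specs"]
    return [mpi_spec, *dep_specs]


# Placeholder defaults standing in for a system's network.yaml.
defaults = {
    "cray-mpich": {"specs": ["libfabric@1.22.0"]},
    "openmpi": {"specs": ["libfabric@1.22.0"]},
}

# Recipe that relies on the system defaults for dependencies.
default_stack = resolve_network({"mpi": "openmpi@5 +cuda"}, defaults)

# Recipe that overrides the dependency specs.
custom_stack = resolve_network(
    {"mpi": "cray-mpich +cuda", "specs": ["libfabric@main"]}, defaults
)
```

Keeping the defaults in the system configuration means a recipe only needs to name its MPI spec, and per-cluster choices such as `+cuda` stay out of individual recipes.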