Add more environment variables for NCCL #146
base: main
preview available: https://docs.tds.cscs.ch/146
docs/software/communication/nccl.md
Outdated
export NCCL_CROSS_NIC=1
export NCCL_NET_FORCE_FLUSH=1
export NCCL_NET_GDR_LEVEL=PHB # (2)!
export NCCL_SOCKET_IFNAME=hsn
NCCL_CROSS_NIC and NCCL_SOCKET_IFNAME are set in the CE hook and seem safe to recommend for most users (the latter seems to be just a sanity setting to avoid using the wrong network anyway).
Good catch. I wonder if we can avoid copy-pasting and having to manually keep them synchronized. I like that the PyTorch submission script is standalone. Do you think it would be bad if we just linked to the NCCL page from there? I'd imagine many users would miss copying the NCCL variables in that case...

From a quick search, this also seems to exist: https://squidfunk.github.io/mkdocs-material/reference/code-blocks/#embedding-external-files. That might allow defining these in one place and including them in many. That said, it might be overkill at the moment, so I might just copy them over for now.

Any comments on which variables we can safely recommend now and which we might still want to wait with? Or do we have to add a warning that some variables are only good/useful with NCCL 2.26 and libfabric 1.22?
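One way to picture the "define once, include in many" idea at the script level, separate from the docs-embedding approach linked above, is a small shell helper that submission scripts source. This is only a sketch: the function name `set_nccl_env` is hypothetical, and the variables and values are simply the four lines from the diff excerpt under review, not a confirmed recommendation.

```shell
# Hypothetical helper (name is an assumption). In practice this could live in
# one shared file that each submission script sources, so the variables are
# maintained in a single place instead of being copy-pasted.
set_nccl_env() {
    # Values copied from the diff excerpt in this PR; whether all of them
    # should be recommended is still under discussion.
    export NCCL_CROSS_NIC=1
    export NCCL_NET_FORCE_FLUSH=1
    export NCCL_NET_GDR_LEVEL=PHB
    export NCCL_SOCKET_IFNAME=hsn
}

# A submission script would call the helper before launching the job.
set_nccl_env
```

The trade-off raised in the comment still applies: a sourced helper keeps the values in sync, but the submission script is no longer standalone.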
6314979
to
ef3019e
Compare
The added environment variables are now in ef3019e.
959ec3a
to
ef4ec10
Compare
Draft. Not clear if all are needed.
Do we need to start separating recommended environment variables by version of NCCL, libfabric, etc., or is it sufficient to recommend the best practices for the latest versions?