
Add more environment variables for NCCL #146


Draft
wants to merge 1 commit into
base: main

Conversation

msimberg
Contributor

@msimberg msimberg commented Jun 6, 2025

Draft. It's not clear yet whether all of these are needed.

Do we need to start separating recommended environment variables by NCCL, libfabric, etc. version, or is it sufficient to recommend the best practices for the latest versions?

@msimberg msimberg requested review from Madeeks and boeschf June 6, 2025 12:39

github-actions bot commented Jun 6, 2025

preview available: https://docs.tds.cscs.ch/146


Comment on lines 21 to 24
export NCCL_CROSS_NIC=1
export NCCL_NET_FORCE_FLUSH=1
export NCCL_NET_GDR_LEVEL=PHB # (2)!
export NCCL_SOCKET_IFNAME=hsn
Contributor Author

NCCL_CROSS_NIC and NCCL_SOCKET_IFNAME are set in the CE hook and seem safe to recommend for most users (the latter seems to be just a sanity setting anyway, to avoid using the wrong network).
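
For illustration, here is a minimal sketch of how a job script might set the four variables from the diff above while leaving alone any value that is already in the environment (for example, one set by the CE hook). The values come from the diff. The decision to defer to existing values is an assumption made for this sketch, not something the PR prescribes.

```shell
#!/bin/sh
# Values are taken from the diff lines quoted above.
# Assumption: a value already present in the environment (for example one
# set by the CE hook) should win, so each export only supplies a default.
export NCCL_CROSS_NIC="${NCCL_CROSS_NIC:-1}"
export NCCL_NET_FORCE_FLUSH="${NCCL_NET_FORCE_FLUSH:-1}"
export NCCL_NET_GDR_LEVEL="${NCCL_NET_GDR_LEVEL:-PHB}"
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-hsn}"

# Print the effective settings so they show up in the job log.
env | grep '^NCCL_' | sort
```

Using `${VAR:-default}` keeps the snippet safe to paste into a script that also runs inside an environment where some of the variables are preset. Plain `export VAR=value` lines, as in the diff, would override such presets instead.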

@boeschf
Contributor

boeschf commented Jun 6, 2025

we should also change https://github.com/msimberg/cscs-docs/blob/63149793755baedc9052b2e4aad920f01b266f33/docs/software/ml/pytorch.md?plain=1#L318

@msimberg
Contributor Author

msimberg commented Jun 6, 2025

> we should also change https://github.com/msimberg/cscs-docs/blob/63149793755baedc9052b2e4aad920f01b266f33/docs/software/ml/pytorch.md?plain=1#L318

Good catch. I wonder if we can avoid copy-pasting and having to manually keep them synchronized. I like that the PyTorch submission script is standalone. Do you think it would be bad if we just linked to the NCCL page from there? I'd imagine many users would miss copying the NCCL variables in that case...

From a quick search, this also seems to exist: https://squidfunk.github.io/mkdocs-material/reference/code-blocks/#embedding-external-files. That might allow defining these in one place and including them in many.

That said, it might be overkill at the moment, so I might just copy them over for now.

Any comments on which variables we can actually safely recommend, and which we might still want to wait with? Or whether we have to add a warning that some variables are only good/useful with NCCL 2.26 and libfabric 1.22?
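
If a version caveat does turn out to be needed, one way to express it in a script is a small version comparison. The sketch below is illustrative only: `version_ge` is a hypothetical helper, the version strings are passed in by hand because this thread does not say how they would be discovered, and treating 2.26 and 1.22 as minimum versions is an assumption based on the question above.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is at least version B.
# Relies on `sort -V` (version sort), which is available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical inputs: the installed versions are supplied as arguments,
# with placeholder defaults used only to make the sketch runnable.
nccl_version="${1:-2.26.2}"
libfabric_version="${2:-1.22.0}"

# Assumption: 2.26 and 1.22 are treated as minimum versions.
if version_ge "$nccl_version" 2.26 && version_ge "$libfabric_version" 1.22; then
    echo "version-specific NCCL recommendations apply"
else
    echo "older stack: check each variable before copying it"
fi
```

Which variables would actually sit behind such a check is still the open question in this comment, so the sketch deliberately does not export anything.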


@msimberg
Contributor Author

@RMeli I opened #152 for the snippets idea. I'll reserve this PR to actually add environment variables that haven't been recommended yet (at least not in the docs).

@msimberg msimberg force-pushed the more-nccl-env-vars branch from 6314979 to ef3019e Compare June 11, 2025 15:55
@msimberg
Contributor Author

The added environment variables are now in ef3019e.

@msimberg msimberg force-pushed the more-nccl-env-vars branch from 959ec3a to ef4ec10 Compare June 13, 2025 14:39


2 participants