With most recent Ubuntu packages upgrade, enroot container load fails #232
Comments
FYI. Definitely an issue with one of the recently upgraded packages.
Same situation here: DGX + Slurm + NVIDIA kernel driver + Pyxis/Enroot. The last update broke starting containers. I can confirm that downgrading the container-related packages to 1.17.6-1 works.
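For anyone in the same situation, a minimal sketch of the downgrade and hold; the package list is an assumption based on the usual nvidia-container-toolkit packaging, so verify it against what is actually installed on your nodes:

```
# Check which container-toolkit packages are installed
apt list --installed 2>/dev/null | grep -E 'nvidia-container|libnvidia-container'

# Downgrade to the last known-good release (1.17.6-1) and hold the packages
# so a routine `apt upgrade` does not pull 1.17.7 back in.
sudo apt-get install --allow-downgrades \
    nvidia-container-toolkit=1.17.6-1 \
    nvidia-container-toolkit-base=1.17.6-1 \
    libnvidia-container-tools=1.17.6-1 \
    libnvidia-container1=1.17.6-1
sudo apt-mark hold nvidia-container-toolkit nvidia-container-toolkit-base \
    libnvidia-container-tools libnvidia-container1
```

Remember to `apt-mark unhold` the packages once a fixed release is available.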
The 1.17.7 release of nvidia-container-toolkit contains a regression which breaks running GPU-enabled jobs under enroot in Slurm:
- NVIDIA/nvidia-container-toolkit#1091
- NVIDIA/nvidia-container-toolkit#1093
- NVIDIA/enroot#232

While we wait for an updated package, this configuration will block clusters from installing or upgrading to this package. If it is already installed, this change does nothing. It should be forward-compatible in the sense that it will not block new releases with semantically higher versions. This mitigates GoogleCloudPlatform#4144.
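The referenced change blocks the broken version at the package-manager level. On Debian/Ubuntu, one common way to express such a block is an apt pin with a negative priority; the sketch below is only illustrative and may not match the exact mechanism used by the linked cluster tooling:

```
# /etc/apt/preferences.d/block-nvidia-container-toolkit-1.17.7 (illustrative)
# A negative Pin-Priority prevents this specific version from being installed,
# while newer (semantically higher) releases remain installable.
Package: nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
Pin: version 1.17.7-1
Pin-Priority: -1
```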
There was a regression introduced in v1.17.7. The issue was identified and fixed in NVIDIA/libnvidia-container#310. We will include the fix in our next point release.
We have a DGX H100 system and are running Slurm with the latest Enroot/Pyxis. Since the most recent upgrade of the NVIDIA kernel driver, nvidia-container-toolkit, and other Ubuntu packages, Enroot fails to load containers.
Log from `apt upgrade`:
The error is:
Adding
no_cgroups = true
to /etc/nvidia-container-runtime/config.toml, as described in https://docs.nvidia.com/ai-enterprise/deployment/cpu-only/latest/runtimes.html#rootless-container-setup-optional and NVIDIA/libnvidia-container#154, does not help.
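For reference, a minimal sketch of what that setting looks like in the config file; note that in the nvidia-container-toolkit documentation the option is usually spelled `no-cgroups` (with a hyphen) under the `[nvidia-container-cli]` section, so the exact key may differ from the snippet quoted above:

```
# /etc/nvidia-container-runtime/config.toml (excerpt, illustrative)
[nvidia-container-cli]
# Disable cgroup setup by libnvidia-container; documented for rootless setups.
no-cgroups = true
```

Either way, as noted above, this did not work around the 1.17.7 regression; the actual fix is tracked in NVIDIA/libnvidia-container#310.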