With the most recent Ubuntu package upgrade, enroot container load fails #232

Open

itzsimpl opened this issue May 17, 2025 · 3 comments

@itzsimpl

We have a DGX H100 system, and we're running Slurm with the latest Enroot/Pyxis. Since the most recent upgrade of the NVIDIA kernel, nvidia-container-toolkit, and other Ubuntu packages, Enroot fails to load containers.

Log from apt upgrade:

Start-Date: 2025-05-17  11:06:31
Commandline: apt upgrade -y
Requested-By: ubuntu (1000)
Install: linux-image-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-tools-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-modules-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-nvidia-tools-5.15.0-1078:amd64 (5.15.0-1078.79, automatic), linux-modules-extra-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-headers-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-modules-nvidia-fs-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-nvidia-headers-5.15.0-1078:amd64 (5.15.0-1078.79, automatic)
Upgrade: linux-tools-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), openjdk-11-jre:amd64 (11.0.26+4-1ubuntu1~22.04, 11.0.27+6~us1-0ubuntu1~22.04), linux-image-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), python2.7-minimal:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), linux-tools-common:amd64 (5.15.0-139.149, 5.15.0-140.150), openjdk-11-jre-headless:amd64 (11.0.26+4-1ubuntu1~22.04, 11.0.27+6~us1-0ubuntu1~22.04), libldap-common:amd64 (2.5.18+dfsg-0ubuntu0.22.04.3, 2.5.19+dfsg-0ubuntu0.22.04.1), libnvidia-container1:amd64 (1.17.6-1, 1.17.7-1), libldap-2.5-0:amd64 (2.5.18+dfsg-0ubuntu0.22.04.3, 2.5.19+dfsg-0ubuntu0.22.04.1), linux-crashdump:amd64 (5.15.0.139.135, 5.15.0.140.135), linux-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), linux-headers-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), open-vm-tools:amd64 (2:12.3.5-3~ubuntu0.22.04.1, 2:12.3.5-3~ubuntu0.22.04.2), libnvidia-container-tools:amd64 (1.17.6-1, 1.17.7-1), nvidia-container-toolkit:amd64 (1.17.6-1, 1.17.7-1), nvidia-container-toolkit-base:amd64 (1.17.6-1, 1.17.7-1), python2.7:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), libpython2.7-minimal:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), libpython2.7-stdlib:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), linux-libc-dev:amd64 (5.15.0-139.149, 5.15.0-140.150)
End-Date: 2025-05-17  11:09:05
# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.17.7
commit: bae3e7842ebe26812d8bd6a9be6a14a83dc91d8f

The error is:

# srun -c8 --mem 16G --gpus 1 --container-image nvcr.io/nvidia/nemo:25.04 --pty bash
pyxis: importing docker image: nvcr.io/nvidia/nemo:25.04
pyxis: imported docker image: nvcr.io/nvidia/nemo:25.04
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: gn01: task 0: Exited with exit code 1

Adding no_cgroups = true to /etc/nvidia-container-runtime/config.toml, as suggested in https://docs.nvidia.com/ai-enterprise/deployment/cpu-only/latest/runtimes.html#rootless-container-setup-optional and NVIDIA/libnvidia-container#154, does not help.
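
For reference, the setting those links describe lives under the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml. A minimal sketch of the workaround that was attempted (note that the key is spelled no-cgroups, with a hyphen, in the toolkit's shipped config; shown here only for completeness, since it did not help in this case):

[nvidia-container-cli]
# skip cgroup device-rule setup in nvidia-container-cli (had no effect on this failure)
no-cgroups = true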

@itzsimpl (Author)

FYI: this is definitely an issue with one of libnvidia-container-tools, libnvidia-container1, nvidia-container-toolkit, or nvidia-container-toolkit-base, as downgrading them to 1.17.6-1 resolves the issue.
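
For anyone needing an immediate workaround, a minimal sketch of the downgrade with apt, assuming the 1.17.6-1 packages are still available in the configured repositories:

# roll the container stack back to the known-good 1.17.6-1 builds
sudo apt-get install --allow-downgrades \
    libnvidia-container1=1.17.6-1 \
    libnvidia-container-tools=1.17.6-1 \
    nvidia-container-toolkit-base=1.17.6-1 \
    nvidia-container-toolkit=1.17.6-1

# keep apt upgrade from pulling 1.17.7 back in until a fixed release ships
sudo apt-mark hold libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit-base nvidia-container-toolkit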

@chschulze

Same situation here: DGX + Slurm + NVIDIA kernel + pyxis/enroot. The last update broke starting containers. I can confirm that downgrading the container-related packages to 1.17.6-1 works.

tpdownes added a commit to tpdownes/cluster-toolkit that referenced this issue May 19, 2025
The 1.17.7 release of nvidia-container-toolkit contains a regression
which breaks running GPU-enabled jobs under enroot in Slurm.

- NVIDIA/nvidia-container-toolkit#1091
- NVIDIA/nvidia-container-toolkit#1093
- NVIDIA/enroot#232

While we wait for an updated package, this configuration will block
clusters from installing or upgrading to this package. If it is already
installed, this change does nothing. It should be forward-compatible in
the sense that it will not block new releases with semantically higher
versions.

This mitigates GoogleCloudPlatform#4144
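
The block described in this commit message can be expressed as a negative apt pin; a minimal sketch, assuming an apt_preferences(5) file such as /etc/apt/preferences.d/block-nvidia-container-toolkit (the file name and exact stanza are illustrative, not the actual cluster-toolkit change):

Explanation: Refuse the 1.17.7 builds of the container stack; semantically higher releases remain installable.
Package: nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
Pin: version 1.17.7-*
Pin-Priority: -1
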
tpdownes added a commit to tpdownes/cluster-toolkit that referenced this issue May 20, 2025
@elezar (Member)

elezar commented May 20, 2025

There was a regression introduced in v1.17.7. The issue was identified and fixed in NVIDIA/libnvidia-container#310. We will include the fix in our next point release.

jvilarru pushed a commit to jvilarru/hpc-toolkit that referenced this issue May 22, 2025