With the most recent Ubuntu package upgrade, enroot container load fails #232

Open

itzsimpl opened this issue May 17, 2025 · 3 comments

@itzsimpl

We have a DGX H100 system, and we're running Slurm with the latest Enroot/Pyxis. Since the most recent upgrade of the NVIDIA kernel, nvidia-container-toolkit, and other Ubuntu packages, Enroot fails to load containers.

Log from apt upgrade:

Start-Date: 2025-05-17  11:06:31
Commandline: apt upgrade -y
Requested-By: ubuntu (1000)
Install: linux-image-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-tools-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-modules-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-nvidia-tools-5.15.0-1078:amd64 (5.15.0-1078.79, automatic), linux-modules-extra-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-headers-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-modules-nvidia-fs-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-nvidia-headers-5.15.0-1078:amd64 (5.15.0-1078.79, automatic)
Upgrade: linux-tools-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), openjdk-11-jre:amd64 (11.0.26+4-1ubuntu1~22.04, 11.0.27+6~us1-0ubuntu1~22.04), linux-image-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), python2.7-minimal:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), linux-tools-common:amd64 (5.15.0-139.149, 5.15.0-140.150), openjdk-11-jre-headless:amd64 (11.0.26+4-1ubuntu1~22.04, 11.0.27+6~us1-0ubuntu1~22.04), libldap-common:amd64 (2.5.18+dfsg-0ubuntu0.22.04.3, 2.5.19+dfsg-0ubuntu0.22.04.1), libnvidia-container1:amd64 (1.17.6-1, 1.17.7-1), libldap-2.5-0:amd64 (2.5.18+dfsg-0ubuntu0.22.04.3, 2.5.19+dfsg-0ubuntu0.22.04.1), linux-crashdump:amd64 (5.15.0.139.135, 5.15.0.140.135), linux-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), linux-headers-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), open-vm-tools:amd64 (2:12.3.5-3~ubuntu0.22.04.1, 2:12.3.5-3~ubuntu0.22.04.2), libnvidia-container-tools:amd64 (1.17.6-1, 1.17.7-1), nvidia-container-toolkit:amd64 (1.17.6-1, 1.17.7-1), nvidia-container-toolkit-base:amd64 (1.17.6-1, 1.17.7-1), python2.7:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), libpython2.7-minimal:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), libpython2.7-stdlib:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), linux-libc-dev:amd64 (5.15.0-139.149, 5.15.0-140.150)
End-Date: 2025-05-17  11:09:05
# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.17.7
commit: bae3e7842ebe26812d8bd6a9be6a14a83dc91d8f

The error is:

# srun -c8 --mem 16G --gpus 1 --container-image nvcr.io/nvidia/nemo:25.04 --pty bash
pyxis: importing docker image: nvcr.io/nvidia/nemo:25.04
pyxis: imported docker image: nvcr.io/nvidia/nemo:25.04
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: gn01: task 0: Exited with exit code 1

Adding no_cgroups = true to /etc/nvidia-container-runtime/config.toml, as suggested in https://docs.nvidia.com/ai-enterprise/deployment/cpu-only/latest/runtimes.html#rootless-container-setup-optional and NVIDIA/libnvidia-container#154, does not help.
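
For reference, the setting those links describe lives under the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml. A minimal sketch of the workaround that was attempted (note that the key is spelled no-cgroups, with a hyphen, in the toolkit's shipped config; shown here only for completeness, since it did not help in this case):

[nvidia-container-cli]
# skip cgroup device-rule setup in nvidia-container-cli (had no effect on this failure)
no-cgroups = true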

@itzsimpl (Author)

FYI: this is definitely an issue with one of libnvidia-container-tools, libnvidia-container1, nvidia-container-toolkit, or nvidia-container-toolkit-base, as downgrading them to 1.17.6-1 resolves the issue.
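
For anyone needing an immediate workaround, a minimal sketch of the downgrade with apt, assuming the 1.17.6-1 packages are still available in the configured repositories:

# roll the container stack back to the known-good 1.17.6-1 builds
sudo apt-get install --allow-downgrades \
    libnvidia-container1=1.17.6-1 \
    libnvidia-container-tools=1.17.6-1 \
    nvidia-container-toolkit-base=1.17.6-1 \
    nvidia-container-toolkit=1.17.6-1

# keep apt upgrade from pulling 1.17.7 back in until a fixed release ships
sudo apt-mark hold libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit-base nvidia-container-toolkit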

@chschulze

Same situation here: DGX + Slurm + NVIDIA kernel + pyxis/enroot. The last update broke starting containers. I can confirm that downgrading the container-related packages to 1.17.6-1 works.

tpdownes added a commit to tpdownes/cluster-toolkit that referenced this issue May 19, 2025
The 1.17.7 release of nvidia-container-toolkit contains a regression
which breaks running GPU-enabled jobs under enroot in Slurm.

- NVIDIA/nvidia-container-toolkit#1091
- NVIDIA/nvidia-container-toolkit#1093
- NVIDIA/enroot#232

While we wait for an updated package, this configuration will block
clusters from installing or upgrading to this package. If it is already
installed, this change does nothing. It should be forward-compatible in
the sense that it will not block new releases with semantically higher
versions.

This mitigates GoogleCloudPlatform#4144
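
The block described in this commit message can be expressed as a negative apt pin; a minimal sketch, assuming an apt_preferences(5) file such as /etc/apt/preferences.d/block-nvidia-container-toolkit (the file name and exact stanza are illustrative, not the actual cluster-toolkit change):

Explanation: Refuse the 1.17.7 builds of the container stack; semantically higher releases remain installable.
Package: nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
Pin: version 1.17.7-*
Pin-Priority: -1
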
tpdownes added a commit to tpdownes/cluster-toolkit that referenced this issue May 20, 2025
@elezar (Member)

elezar commented May 20, 2025

There was a regression introduced in v1.17.7. The issue was identified and fixed in NVIDIA/libnvidia-container#310. We will include the fix in our next point release.

jvilarru pushed a commit to jvilarru/hpc-toolkit that referenced this issue May 22, 2025