
[rpm] libnvidia-container-tools should pin to nvidia-container-toolkit version #1091


Open
seemethere opened this issue May 16, 2025 · 2 comments

@seemethere

Requires: libnvidia-container-tools >= %{libnvidia_container_tools_version}, libnvidia-container-tools < 2.0.0
Requires: nvidia-container-toolkit-base == %{version}-%{release}

The latest release of nvidia-container-toolkit bricked a lot of jobs on pytorch's CUDA CI (see example log) because it mistakenly upgraded the following packages:

  • nvidia-container-tools
  • libnvidia-container1
  • nvidia-container-toolkit-base

The issue manifested as containers being unable to access GPU resources, and as a result we silently stopped running CUDA CI altogether (this will be remedied by pytorch/test-infra#6638).

I'm creating this issue more as a discussion point to check in on whether these dependencies can be pinned.

If they can be pinned I'll happily submit a PR, but I wanted to get context on why they were not pinned in the past and whether we should reasonably expect mismatched versions of these packages to work together.
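
For concreteness, one possible shape of the pin, mirroring the strictly-equals requirement already used for nvidia-container-toolkit-base, might look like the following (a sketch only, not a tested spec change; it assumes the existing %{libnvidia_container_tools_version} macro is the right value to pin to):

Requires: libnvidia-container-tools == %{libnvidia_container_tools_version}
Requires: nvidia-container-toolkit-base == %{version}-%{release}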

tpdownes added a commit to tpdownes/cluster-toolkit that referenced this issue May 19, 2025
The 1.17.7 release of nvidia-container-toolkit contains a regression
which breaks running GPU-enabled jobs under enroot in Slurm.

- NVIDIA/nvidia-container-toolkit#1091
- NVIDIA/nvidia-container-toolkit#1093
- NVIDIA/enroot#232

While we wait for an updated package, this configuration will block
clusters from installing or upgrading to this package. If it is already
installed, this change does nothing. It should be forward-compatible in
the sense that it will not block new releases with semantically higher
versions.

This mitigates GoogleCloudPlatform#4144
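
For illustration only (this is not the actual cluster-toolkit change), one way to block exactly the affected version on apt-based nodes while leaving newer releases installable is an apt preferences entry with a negative pin priority, placed in a file under /etc/apt/preferences.d/; the version string 1.17.7-1 below is an assumption about the exact Debian package version:

Package: nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
Pin: version 1.17.7-1
Pin-Priority: -1
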
tpdownes added a commit to tpdownes/cluster-toolkit that referenced this issue May 20, 2025
elezar self-assigned this May 22, 2025
jvilarru pushed a commit to jvilarru/hpc-toolkit that referenced this issue May 22, 2025
@elezar
Member

elezar commented May 26, 2025

Hi @seemethere, thanks for bringing this to our attention.

In the past, the release cycles of the nvidia-container-toolkit and libnvidia-container packages were not as tightly coupled as they are today. The "at least" dependency made it possible for users to update their libnvidia-container packages once new versions were available.

As of the v1.6.0 release (https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.6.0) this is no longer the case, and we effectively release the libnvidia-container packages as part of the NVIDIA Container Toolkit, with the same version used in both cases. Given this, being more restrictive on the version requirements makes sense, and we will follow up on this internally.

@elezar
Member

elezar commented May 27, 2025

I wanted to add a note for completeness regarding the user experience of the proposed change for our Debian packages. Since we already have a strictly-equals dependency between the nvidia-container-toolkit and nvidia-container-toolkit-base packages, we can use this as a demonstrator.

In the case of the rpm packages, the dependency is properly resolved and running:

dnf install nvidia-container-toolkit-1.16.2-1

also installs the nvidia-container-toolkit-base-1.16.2-1 package. (It stands to reason that this can be done for the libnvidia-container-tools and libnvidia-container1 packages as well, so that they do not need to be explicitly pinned to the same version.)
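
As a quick sanity check (a sketch; the exact output depends on the architecture and repository), querying both packages after the install should report the same 1.16.2-1 version:

$ rpm -q nvidia-container-toolkit nvidia-container-toolkit-base
nvidia-container-toolkit-1.16.2-1.x86_64
nvidia-container-toolkit-base-1.16.2-1.x86_64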

In the case of debian packages, however, running:

$ apt-get install nvidia-container-toolkit=1.16.2-1

shows the following output:

Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-container-toolkit : Depends: nvidia-container-toolkit-base (= 1.16.2-1) but 1.17.7-1 is to be installed
E: Unable to correct problems, you have held broken packages.

apt does not resolve the pinned dependency on its own and forces users to specify the version of the nvidia-container-toolkit-base package in addition to the nvidia-container-toolkit package.
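
In practice this means a user who wants to hold back the toolkit has to pin both packages explicitly on the command line, for example (reusing the version from the demonstration above):

$ apt-get install nvidia-container-toolkit=1.16.2-1 nvidia-container-toolkit-base=1.16.2-1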
