[rpm] libnvidia-container-tools should pin to nvidia-container-toolkit version #1091
The 1.17.7 release of nvidia-container-toolkit contains a regression which breaks running GPU-enabled jobs under enroot in Slurm.

- NVIDIA/nvidia-container-toolkit#1091
- NVIDIA/nvidia-container-toolkit#1093
- NVIDIA/enroot#232

While we wait for an updated package, this configuration will block clusters from installing or upgrading to this package. If it is already installed, this change does nothing. It should be forward-compatible in the sense that it will not block new releases with semantically higher versions. This mitigates GoogleCloudPlatform#4144.
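The comment above does not reproduce the configuration itself. As a rough sketch only, assuming the standard apt pinning and dnf `excludepkgs` mechanisms and the usual `-1` package release suffix (both are assumptions, not taken from this thread), blocking only the broken 1.17.7 build while leaving later versions installable could look like this:

```sh
# Sketch: block only the broken 1.17.7 packages; newer releases stay installable.
# Package globs and the "-1" release suffix are assumptions.

# Debian/Ubuntu: pin the bad version to priority -1 so apt never selects it.
sudo tee /etc/apt/preferences.d/block-nvidia-container-toolkit-1-17-7 <<'EOF'
Package: nvidia-container-toolkit* libnvidia-container*
Pin: version 1.17.7-1
Pin-Priority: -1
EOF

# RHEL/Rocky: exclude only the 1.17.7 builds from dnf's view of the repositories.
echo 'excludepkgs=nvidia-container-toolkit*-1.17.7*,libnvidia-container*-1.17.7*' | \
    sudo tee -a /etc/dnf/dnf.conf
```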
Hi @seemethere, thanks for bringing this to our attention. In the past, the release cycles of the nvidia-container-toolkit and libnvidia-container packages were not as tightly coupled as they are today. The "at least" dependency made it possible for users to update their libnvidia-container packages once new versions were available. As of the v1.6.0 release (https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.6.0) this is no longer the case, and we effectively release the libnvidia-container packages as part of the NVIDIA Container Toolkit, with the same version used in both cases. Given this, being more restrictive on the version requirements makes sense, and we will follow up on this internally.
I wanted to add a note for completeness regarding the user experience of the proposed change for our Debian packages. Since we already have a strictly equals dependency between the … packages:

In the case of the rpm packages, the dependency is properly resolved, and running:

…

also installs the …

In the case of debian packages, however, running:

…

shows the following output:

…

Since …
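The exact commands and outputs referenced above were not captured in this page. As an illustrative sketch only (the 1.17.8-1 version string and the package set are placeholders, not taken from this thread), the behavioral difference being described is roughly:

```sh
# rpm-based distros: dnf resolves the strict "=" dependency and pulls in the
# matching libnvidia-container-tools version automatically.
sudo dnf install -y nvidia-container-toolkit-1.17.8-1

# Debian-based distros: apt typically reports unmet dependencies rather than
# switching the dependency's version on its own, so the matching versions are
# usually requested explicitly.
sudo apt-get install -y \
    nvidia-container-toolkit=1.17.8-1 \
    nvidia-container-toolkit-base=1.17.8-1 \
    libnvidia-container-tools=1.17.8-1 \
    libnvidia-container1=1.17.8-1
```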
nvidia-container-toolkit/packaging/rpm/SPECS/nvidia-container-toolkit.spec, lines 24 to 25 at ac8f190
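The spec lines themselves are not reproduced above. Purely as a hedged illustration of what a strictly pinned dependency looks like in RPM spec syntax (this is not the actual content of lines 24 to 25, and the macros and package names shown are assumptions):

```
# Hypothetical sketch only; the upstream spec's macros and package list may differ.
Requires: libnvidia-container-tools = %{version}-%{release}
Requires: libnvidia-container1 = %{version}-%{release}
```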
The latest release of nvidia-container-toolkit bricked a lot of jobs on pytorch's CUDA CI (see example log) because it mistakenly upgraded the following packages:
The issue manifested as containers being unable to access GPU resources, and thus we silently stopped running CUDA CI altogether (this will be remedied by pytorch/test-infra#6638).
I'm creating this issue more as a discussion point, to check in and see whether these dependencies can be pinned.
If they can be pinned, I'll happily submit a PR, but I wanted to get some context on why they were not pinned in the past, and whether we should have a reasonable expectation that mismatched versions of these packages will work.
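As one hedged sketch of the consumer-side pinning being asked about (the version is a placeholder and the package list is taken from the names discussed in this thread), a CI image on a Debian-based runner could install and hold a known-good set so later installs cannot silently upgrade it:

```sh
# Placeholder known-good version; substitute whatever the CI last validated.
KNOWN_GOOD=1.17.6-1

# Install the matching versions explicitly, then hold them so routine
# "apt-get install" runs or unattended upgrades cannot move them.
sudo apt-get install -y \
    nvidia-container-toolkit="${KNOWN_GOOD}" \
    nvidia-container-toolkit-base="${KNOWN_GOOD}" \
    libnvidia-container-tools="${KNOWN_GOOD}" \
    libnvidia-container1="${KNOWN_GOOD}"
sudo apt-mark hold nvidia-container-toolkit nvidia-container-toolkit-base \
    libnvidia-container-tools libnvidia-container1
```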