Create utils function for nvidia-smi check #22

Merged 12 commits on Aug 5, 2025

laraPPr
Contributor

@laraPPr laraPPr commented Jun 24, 2025

No description provided.

@boegel
Contributor

boegel commented Jun 24, 2025

@laraPPr Ignore the failing CI check for now, I'm trying to fix that in #9 ...

@laraPPr laraPPr changed the title from "Create for utils function nvidia-smi check" to "Create utils function for nvidia-smi check" Jun 25, 2025
@ocaisa
Member

ocaisa commented Jul 31, 2025

A sync with main is probably a good idea here

@ocaisa
Member

ocaisa commented Jul 31, 2025

I think you can add a couple of tests for this; there's mocking for nvidia-smi in https://github.com/laraPPr/software-layer-scripts/blob/utils_function_nvidia/.github/workflows/tests_link_nvidia_host_libraries.yml#L37. You could add the test there.

@laraPPr laraPPr mentioned this pull request Aug 1, 2025
    BUILD_STEP_ARGS+=("--nvidia" "all")
elif [ ${ec} -eq 1 ]; then
    BUILD_STEP_ARGS+=("--nvidia" "install")
elif [ ${ec} -eq 2 ]; then
    BUILD_STEP_ARGS+=("--nvidia" "install")
Contributor Author

@laraPPr laraPPr Aug 1, 2025

This now exactly mimics the existing behavior, but I'm not sure that behavior is correct.
`--nvidia install` is now also set in this case: No 'nvidia-smi' found, no available GPU but allowing overriding this check.
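
For reference, the exit-code convention under discussion could be sketched as below. This is a minimal sketch, not the code from this PR: the function names `nvidia_gpu_available` and `nvidia_build_args`, the `NVIDIA_SMI` override variable, the meanings given to exit codes 0 and 1, and the `-eq 0` condition of the first branch (not visible in the quoted diff) are all assumptions. Treating exit code 2 as the "No 'nvidia-smi' found" case is inferred from where this comment is attached.

```shell
#!/usr/bin/env bash
# Hypothetical helper; names and exit-code meanings are assumptions for illustration.
# Returns 0 if nvidia-smi is found and runs successfully,
#         1 if nvidia-smi is found but fails,
#         2 if nvidia-smi is not found on PATH.
nvidia_gpu_available() {
    local smi="${NVIDIA_SMI:-nvidia-smi}"   # overridable so the check can be mocked
    if ! command -v "${smi}" > /dev/null 2>&1; then
        return 2
    fi
    if "${smi}" > /dev/null 2>&1; then
        return 0
    fi
    return 1
}

# Map an exit code to the --nvidia arguments, mirroring the branches quoted above.
# The condition for the "all" branch is assumed to be exit code 0.
nvidia_build_args() {
    local ec="$1"
    if [ "${ec}" -eq 0 ]; then
        echo "--nvidia all"
    elif [ "${ec}" -eq 1 ]; then
        echo "--nvidia install"
    elif [ "${ec}" -eq 2 ]; then
        echo "--nvidia install"
    fi
}
```

With this mapping, exit codes 1 and 2 both yield `--nvidia install`, which is the behavior being questioned here.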

Contributor Author

It is indeed now always set. I do not think this is what we want, right? Excerpt of the logs from EESSI/software-layer#1143:

No 'nvidia-smi' found, no available GPU but allowing overriding this check

Executing command to build software:

/project/60006/SHARED/jobs/2025.08/pr_1107/event_cfecebd0-6eb2-11f0-9bd7-28ae4dcb0ae2/run_000/linux_x86_64_intel_sapphirerapids/eessi.io-2023.06-software/software-layer-scripts/eessi_container.sh 
--verbose --access rw --mode run --container docker://ghcr.io/eessi/build-node:debian12 
--repository eessi.io-2023.06-software --extra-bind-paths /project/60006/SHARED/jobs/2025.08/pr_1107/event_cfecebd0-6eb2-11f0-9bd7-28ae4dcb0ae2/run_000/linux_x86_64_intel_sapphirerapids/eessi.io-2023.06-software/software-layer-scripts,/dev 
--pass-through --contain --save /project/60006/SHARED/jobs/2025.08/pr_1107/event_cfecebd0-6eb2-11f0-9bd7-28ae4dcb0ae2/run_000/linux_x86_64_intel_sapphirerapids/eessi.io-2023.06-software/previous_tmp/build_step 
--storage /tmp/bot/EESSI/eessi_job.N27hQTIAbE --nvidia install 
--host-injections /project/def-users/bot/shared/host-injections

Member

In our scripts this is fine: it is a one-time thing, and once things are installed they are not reinstalled.

Member

@ocaisa ocaisa left a comment

This looks good; a CI check would be great. You can do it in https://github.com/EESSI/software-layer-scripts/blob/main/.github/workflows/tests_link_nvidia_host_libraries.yml: check once, before nvidia-smi is mocked, that it returns non-zero, and once after it has been mocked.
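
A before/after check along these lines might look like the following sketch. The `gpu_check` function is a hypothetical stand-in for the utility added in this PR, and the mock is a trivial script that exits 0. PATH is restricted in the first check only so the sketch behaves the same on hosts that do have a real `nvidia-smi`; the actual workflow may set up its mock differently.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the utility function under test (not the PR's code):
# returns 2 if nvidia-smi is missing, 1 if it fails, 0 if it runs successfully.
gpu_check() {
    command -v nvidia-smi > /dev/null 2>&1 || return 2
    nvidia-smi > /dev/null 2>&1 || return 1
    return 0
}

mock_dir="$(mktemp -d)"

# Before mocking: PATH points only at the (still empty) mock directory,
# so no nvidia-smi can be found and the check should return non-zero.
before=0
( PATH="${mock_dir}"; gpu_check ) || before=$?

# Create a mock nvidia-smi that always succeeds.
printf '#!/bin/sh\nexit 0\n' > "${mock_dir}/nvidia-smi"
chmod +x "${mock_dir}/nvidia-smi"

# After mocking: the mock is first on PATH, so the check should return zero.
after=0
( PATH="${mock_dir}:${PATH}"; gpu_check ) || after=$?

echo "before mock: ${before}, after mock: ${after}"
rm -rf "${mock_dir}"
```

Running the checks in subshells keeps the PATH changes from leaking into the rest of the job.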

@laraPPr
Contributor Author

laraPPr commented Aug 4, 2025

Implemented the CI check, see https://github.com/EESSI/software-layer-scripts/actions/runs/16718197861/job/47316223515?pr=22. I think it does what is expected?

@ocaisa
Member

ocaisa commented Aug 4, 2025

This looks good. Can you sync it with main? I think the broken CI should already have been fixed there since Friday.

Member

@ocaisa ocaisa left a comment

LGTM, thanks @laraPPr

@ocaisa ocaisa merged commit 599e031 into EESSI:main Aug 5, 2025
67 checks passed
3 participants