Rebuild all CUDA software with EB-5.1.1 #1147

Draft: wants to merge 3 commits into main

Collaborator

casparvl commented Aug 5, 2025

There are two reasons for this:

  1. Now that we have a CUDA sanity check, this allows us to see if anything is 'broken'.
  2. The PR that enables CI to check for differences between CUDA stacks (Add CUDA software check to stack comparison CI #1087) shows there are many differences between the architectures. In fact, there are so many holes that a rebuild PR for all architectures is probably the easiest way to fill all the gaps (much easier than figuring out what's missing for which of the 37 combinations of CPU+GPU).
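
To make the "holes" in point 2 concrete, here is a minimal sketch of how the gaps per CPU+GPU combination could be listed. The `stacks` mapping, the function name and the module `ExampleApp/1.0` are made-up illustration data, not output from the CI in #1087.

```python
def missing_per_target(stacks):
    """Return, for each target, the modules present elsewhere but missing there."""
    # The union of all modules seen on any target serves as the reference set.
    all_modules = set().union(*stacks.values()) if stacks else set()
    return {target: sorted(all_modules - modules) for target, modules in stacks.items()}


# Hypothetical example data: target -> set of installed CUDA-dependent modules.
stacks = {
    "x86_64/amd/zen4 + nvidia/cc90": {"CUDA/12.1.1", "ExampleApp/1.0"},
    "x86_64/intel/icelake + nvidia/cc80": {"CUDA/12.1.1"},
}
for target, missing in missing_per_target(stacks).items():
    print(target, "is missing", missing or "nothing")
```

With 37 combinations, a per-target list like this gets long quickly, which is the argument for simply rebuilding everywhere.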

…y check, so we can see if anything is 'broken'. Also, there are so many 'holes' in which software is present for which combination of CPU+GPU, that this is a convenient way to fill the gaps
Collaborator Author

casparvl commented Aug 5, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/intel/icelake accelerator:nvidia/cc80

Collaborator Author

casparvl commented Aug 5, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
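
The build commands in this thread are space-separated `key:value` pairs after a `bot: build` prefix. A small sketch of splitting such a line into its parts, assuming that format holds in general; `parse_bot_command` is a hypothetical helper, not part of the bot.

```python
def parse_bot_command(comment):
    """Split a 'bot: <action> key:value ...' comment into (action, options)."""
    prefix = "bot:"
    if not comment.startswith(prefix):
        raise ValueError("not a bot command")
    action, *pairs = comment[len(prefix):].split()
    # Split on the first ':' only, so values may themselves contain ':'.
    options = dict(pair.split(":", 1) for pair in pairs)
    return action, options


action, options = parse_bot_command(
    "bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf "
    "architecture:x86_64/amd/zen4 accelerator:nvidia/cc90"
)
print(action, options["architecture"], options["accelerator"])
```

Note that the accelerator value in the command is written without a leading `accel/`.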


eessi-bot-surf bot commented Aug 5, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13574555

date | job status | comment
Aug 05 14:33:01 UTC 2025 | submitted | job id 13574555 will be eligible to start in about 20 seconds
Aug 05 14:33:15 UTC 2025 | received | job awaits launch by Slurm scheduler
Aug 05 14:33:28 UTC 2025 | running | job 13574555 is running
Aug 05 14:35:12 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13574555.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 05 14:35:12 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13574555.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
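
The build check list in the report above amounts to scanning the job output for a handful of patterns, some of which must be absent and some present. A rough sketch of that idea, using the pattern labels shown in the report; this is not the bot's actual code, and `check_build_log` and `build_succeeded` are names made up here.

```python
import re

# (pattern, must_be_present) pairs, taken from the labels in the bot report.
BUILD_CHECKS = [
    (r"FATAL:", False),
    (r"ERROR:", False),
    (r"FAILED:", False),
    (r"required modules missing:", False),
    (r"No missing installations", True),
    (r"\.tar\.gz created!", True),
]


def check_build_log(text):
    """Return a list of (pattern, ok) tuples for the given job output."""
    results = []
    for pattern, must_be_present in BUILD_CHECKS:
        found = re.search(pattern, text) is not None
        results.append((pattern, found == must_be_present))
    return results


def build_succeeded(text):
    """A build counts as successful only if every single check passes."""
    return all(ok for _, ok in check_build_log(text))
```

A log that contains `ERROR:` and lacks `No missing installations` fails two of the six checks, which is the situation in the first report.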

Collaborator Author

casparvl commented Aug 5, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90


eessi-bot-surf bot commented Aug 5, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13575145

date | job status | comment
Aug 05 14:50:14 UTC 2025 | submitted | job id 13575145 will be eligible to start in about 20 seconds
Aug 05 14:50:19 UTC 2025 | received | job awaits launch by Slurm scheduler
Aug 05 14:50:43 UTC 2025 | running | job 13575145 is running
Aug 05 14:52:27 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13575145.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 05 14:52:27 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13575145.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
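
The test step is reported as SUCCESS even though the ReFrame summary says 0 of 8 test cases ran. A small sketch of parsing that summary line so that "nothing ran" can be told apart from "everything passed"; the regex is written against the exact line format shown above, and the helper names are hypothetical.

```python
import re

# Matches e.g. "Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)"
SUMMARY_RE = re.compile(
    r"Ran (?P<ran>\d+)/(?P<total>\d+) test case\(s\) from (?P<checks>\d+) check\(s\) "
    r"\((?P<failures>\d+) failure\(s\), (?P<skipped>\d+) skipped, (?P<aborted>\d+) aborted\)"
)


def parse_reframe_summary(line):
    """Return the counters from a ReFrame summary line, or None if it does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def nothing_ran(summary):
    """True when no test case was actually executed."""
    return summary["ran"] == 0


line = "[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)"
summary = parse_reframe_summary(line)
print(summary, nothing_ran(summary))
```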

Collaborator Author

casparvl commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90


eessi-bot-surf bot commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13593270

date | job status | comment
Aug 06 09:07:57 UTC 2025 | submitted | job id 13593270 will be eligible to start in about 20 seconds
Aug 06 09:08:01 UTC 2025 | received | job awaits launch by Slurm scheduler
Aug 06 09:08:25 UTC 2025 | running | job 13593270 is running
Aug 06 09:14:48 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13593270.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17544716010.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Aug 06 09:14:48 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13593270.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
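
The artefact in this report is a 45-byte tarball with 0 entries, so the `.tar.gz created!` check passes without anything having been installed. A minimal sketch of counting the entries of such a tarball with Python's standard `tarfile` module; `count_tarball_entries` is a hypothetical helper, not something the bot provides.

```python
import tarfile


def count_tarball_entries(path):
    """Return the number of members in a (possibly compressed) tarball."""
    # "r:*" lets tarfile detect the compression (gzip here) transparently.
    with tarfile.open(path, "r:*") as tar:
        return sum(1 for _ in tar)
```

A result of 0 would indicate an empty tarball like the one listed above.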

Collaborator Author

casparvl commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90


eessi-bot-surf bot commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13593590

date | job status | comment
Aug 06 09:19:43 UTC 2025 | submitted | job id 13593590 will be eligible to start in about 20 seconds
Aug 06 09:19:54 UTC 2025 | received | job awaits launch by Slurm scheduler
Aug 06 09:20:37 UTC 2025 | running | job 13593590 is running
Aug 06 09:28:46 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13593590.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17544724340.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Aug 06 09:28:46 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13593590.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

Collaborator Author

casparvl commented Aug 6, 2025

Hmmm, CUDA builds fail with:

== sanity checking...
  >> file 'bin/fatbinary' found: FAILED
  >> file 'bin/nvcc' found: FAILED
  >> file 'bin/nvlink' found: FAILED
  >> file 'bin/ptxas' found: FAILED
  >> file 'lib64/libcublas.so' found: OK
  >> file 'lib64/libcudart.so' found: OK
  >> file 'lib64/libcufft.so' found: OK
  >> file 'lib64/libcurand.so' found: OK
  >> file 'lib64/libcusparse.so' found: OK
  >> file 'lib/libcublas.so' found: OK
  >> file 'lib/libcudart.so' found: OK
  >> file 'lib/libcufft.so' found: OK
  >> file 'lib/libcurand.so' found: OK
  >> file 'lib/libcusparse.so' found: OK
  >> file 'extras/CUPTI/lib64/libcupti.so' found: OK
  >> file 'pkgconfig/cublas.pc' found: FAILED
  >> file 'pkgconfig/cudart.pc' found: FAILED
  >> file 'pkgconfig/cuda.pc' found: FAILED
  >> (non-empty) directory 'include' found: OK
  >> (non-empty) directory 'extras/CUPTI/include' found: OK
  >> loading modules: CUDA/12.1.1...

The failing files are probably the ones that are symlinked from host-injections (bin/nvcc certainly is). I guess the symlinks are broken for some reason?
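
One way to check the broken-symlink hypothesis is to walk the installation directory and list links whose target does not exist. A minimal sketch using only the standard library; `find_broken_symlinks` is a hypothetical helper and the directory to scan is up to the reader.

```python
import os


def find_broken_symlinks(root):
    """Return a sorted list of symlinks under root whose target does not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Run against the CUDA installation prefix, this would be expected to report entries such as `bin/nvcc` if the sanity check failures are indeed caused by dangling links.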

Collaborator Author

casparvl commented Aug 6, 2025

Ah, found the issue:

== 2025-08-06 11:23:55,548 eb_hooks.py:1301 DEBUG nvcc is not found in allowlist, so replacing it with symlink: /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc
== 2025-08-06 11:23:55,550 filetools.py:358 INFO Symlinked /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc to /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc

Note that in the host-injections dir, the whole accel/nvidia/cc90 part of the path should be stripped. I.e. it should not symlink to

/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc

but to

/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/software/CUDA/12.1.1/bin/nvcc
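
The intended mapping from the installation path to the host-injections path can be written down as a small function. This is only a sketch of the expected result based on the two paths above, not the code in eb_hooks.py; `host_injections_path` is a hypothetical name, and `accel_subdir` is assumed to have the form `accel/nvidia/cc90`.

```python
def host_injections_path(install_path, accel_subdir):
    """Map a path under versions/ to its expected host_injections counterpart."""
    # Swap the top-level 'versions' directory for 'host_injections' ...
    path = install_path.replace("/versions/", "/host_injections/", 1)
    # ... and drop the accelerator subdirectory (assumed form: 'accel/nvidia/cc90').
    return path.replace("/" + accel_subdir.strip("/") + "/", "/", 1)


source = (
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/"
    "accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc"
)
print(host_injections_path(source, "accel/nvidia/cc90"))
```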

Collaborator Author

casparvl commented Aug 6, 2025

https://github.com/EESSI/software-layer-scripts/blob/41f3775bfe214ecc51af2ea88f914d93414ed87b/eb_hooks.py#L1310 this is the line where it happens. Might actually be an issue with the setting of the EESSI_ACCELERATOR_TARGET. I'm not 100% sure what kind of value is expected there, but looking on our GPU nodes:

EESSI_ACCEL_SUBDIR=accel/nvidia/cc80
EESSI_ACCELERATOR_TARGET=accel/nvidia/cc80

It seems strange that both are identical; I think the code expects nvidia/cc80 instead? I'll need to figure out where this gets set, and whether it changed recently.

Collaborator Author

casparvl commented Aug 6, 2025

I think the bug is here. The "/accel/%s" % accel_subdir will essentially create e.g. /accel/accel/nvidia/cc80, since accel_subdir is something like accel/nvidia/cc80 (i.e. equal to the EESSI_ACCELERATOR_TARGET).
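
A minimal sketch of the effect described here, assuming the hook removes the formatted pattern from the path with a plain string replacement (the actual line in eb_hooks.py is not reproduced in this thread). The path is shortened for illustration and `strip_accel` is a hypothetical name.

```python
def strip_accel(path, accel_subdir):
    """Build the '/accel/<subdir>' pattern and remove it from path."""
    pattern = "/accel/%s" % accel_subdir
    return pattern, path.replace(pattern, "")


path = "/x86_64/amd/zen4/accel/nvidia/cc80/software/CUDA/12.1.1/bin/nvcc"

# Value without the 'accel/' prefix: the pattern matches and is stripped.
print(strip_accel(path, "nvidia/cc80"))

# Value with the prefix (as seen on the GPU nodes): the pattern gets a doubled
# 'accel/', matches nothing, and the path comes back unchanged.
print(strip_accel(path, "accel/nvidia/cc80"))
```

In the second case the symlink target keeps the accelerator subdirectory, which matches the wrong host_injections path seen in the log.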

Member

ocaisa commented Aug 6, 2025

@casparvl you are correct: the bot was previously setting the accelerator override in a way that did not include the accel/ prefix (whereas archdetect does include this top-level directory). It worked because the incorrect value was used consistently. I thought I had fixed it everywhere, but I clearly missed this one.
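
Since two conventions were in use (with and without the leading `accel/`), one defensive option is to normalize the value before using it. The sketch below is purely illustrative and is not what EESSI/software-layer-scripts does; `normalize_accel_target` is a hypothetical helper, and treating the `accel/`-prefixed form as canonical is an assumption based on the archdetect behaviour described above.

```python
def normalize_accel_target(value):
    """Return the accelerator target with exactly one leading 'accel/'."""
    value = value.strip("/")
    prefix = "accel/"
    # Drop any number of leading 'accel/' components, then add a single one back.
    while value.startswith(prefix):
        value = value[len(prefix):]
    return prefix + value


for raw in ("nvidia/cc80", "accel/nvidia/cc80", "/accel/accel/nvidia/cc80/"):
    print(raw, "->", normalize_accel_target(raw))
```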

casparvl marked this pull request as draft August 12, 2025 15:06
@casparvl
Collaborator Author

This PR is on hold until EESSI/software-layer-scripts#59 is merged
