
{2023.06}[2023a,2023b] rebuild CUDA/* module files (take 2) #919



Collaborator

trz42 commented Feb 17, 2025

Renewed version of #918.

After easybuilders/easybuild-easyblocks#3516 was merged, we need to update the module files for CUDA/12.{1.1,4.0}.

We need to do that for the following CPU target and compute capability combinations:

  • zen2 + cc80
  • zen3 + cc80
  • zen4 + cc90

For the first two we use the build cluster on AWS; for the third we use the build cluster on Azure. Because CUDA is just a binary installation, this should be fine.

Note: although only the module files need to be rebuilt, we cannot pass --module-only to EasyBuild, because the rebuild procedure removes the whole installation.
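
To make the scope concrete, here is a minimal Python sketch that enumerates the module files this PR is expected to regenerate. The path layout is copied from the artefact listings the bot posts in this thread; the names `TARGETS`, `CUDA_VERSIONS` and `expected_module_files` are illustrative only and are not part of EasyBuild or the EESSI tooling.

```python
from itertools import product

# CPU target -> NVIDIA compute capability, as listed in the PR description.
TARGETS = {"zen2": "cc80", "zen3": "cc80", "zen4": "cc90"}
CUDA_VERSIONS = ["12.1.1", "12.4.0"]


def expected_module_files(eessi_version="2023.06"):
    """Return the relative paths of the CUDA module files to be rebuilt."""
    paths = []
    for (cpu, cc), version in product(TARGETS.items(), CUDA_VERSIONS):
        # Layout assumed from the tarball listings in this thread.
        prefix = f"{eessi_version}/software/linux/x86_64/amd/{cpu}/accel/nvidia/{cc}"
        paths.append(f"{prefix}/modules/all/CUDA/{version}.lua")
    return paths


if __name__ == "__main__":
    for path in expected_module_files():
        print(path)
```

Three targets times two CUDA versions gives six module files, two per build job.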

trz42 added the 2023.06-software.eessi.io (2023.06 version of software.eessi.io) and accel:nvidia labels Feb 17, 2025

eessi-bot bot commented Feb 17, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@riscv-eessi-io-bot

Instance eessi-bot-riscv is configured to build for:

  • architectures: riscv64/generic
  • repositories: riscv.eessi.io-20240402

eessi-bot bot commented Feb 17, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

gpu-bot-ugent bot commented Feb 17, 2025

Instance eessi-bot-vsc-ugent is configured to build for:

  • architectures: x86_64/amd/zen3
  • repositories: eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-surf

Instance eessi-bot-surf is configured to build for:

  • architectures: x86_64/amd/zen4, x86_64/amd/zen2
  • repositories: eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-toprichard

Instance rt-Grace-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

Collaborator Author

trz42 commented Feb 17, 2025

Just give it a try...

bot: build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
bot: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
bot: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
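
For readers unfamiliar with these commands, the following Python sketch shows one way such a line could be split into key:value filters and matched against a bot instance. It is an illustration under assumptions, not the EESSI bot's actual code: `parse_bot_command` and `instance_should_build` are hypothetical names, and the matching rule (instance name and architecture must both match) is inferred from the bot replies in this thread.

```python
def parse_bot_command(line):
    """Split 'bot: build key:value ...' into (action, filters)."""
    prefix = "bot:"
    if not line.startswith(prefix):
        raise ValueError("not a bot command")
    action, *args = line[len(prefix):].split()
    # Each argument has the form key:value; split on the first colon only,
    # so values such as 'nvidia/cc90' stay intact.
    filters = dict(arg.split(":", 1) for arg in args)
    return action, filters


def instance_should_build(filters, instance_name, architectures):
    """Hypothetical rule: build only if instance and architecture both match."""
    return (
        filters.get("instance") == instance_name
        and filters.get("architecture") in architectures
    )


if __name__ == "__main__":
    cmd = ("bot: build instance:eessi-bot-mc-azure "
           "repository:eessi.io-2023.06-software "
           "architecture:x86_64/amd/zen4 accelerator:nvidia/cc90")
    action, filters = parse_bot_command(cmd)
    print(action, filters)
    print(instance_should_build(filters, "eessi-bot-mc-azure", ["x86_64/amd/zen4"]))
```

Under this rule an instance whose name does not match the `instance:` filter submits nothing for that command, which is consistent with the "no jobs were submitted" replies.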

eessi-bot bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-mc-aws:
  • received bot command build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted
  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:

  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv:
  • account trz42 has NO permission to send commands to the bot

eessi-bot bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-mc-azure:
  • received bot command build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 resulted in:

  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-surf

Updates by the bot instance eessi-bot-surf:
  • account trz42 has NO permission to send commands to the bot

gpu-bot-ugent bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-vsc-ugent:
  • received bot command build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 from trz42

    • expanded format: build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-mc-azure repository:eessi.io-2023.06-software architecture:x86_64/amd/zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted
  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr:
  • account trz42 has NO permission to send commands to the bot

eessi-bot bot commented Feb 17, 2025

New job on instance eessi-bot-mc-azure for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.02/pr_919/1085

date | job status | comment
Feb 17 12:50:41 UTC 2025 | submitted | job id 1085 awaits release by job manager
Feb 17 12:50:51 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 17 12:56:54 UTC 2025 | running | job 1085 is running
Feb 17 13:56:16 UTC 2025 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-1085.out
    ✅ no message matching FATAL:
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-amd-zen4-1739798555.tar.gz
    size: 4373 MiB (4585648527 bytes)
    entries: 11757
    modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
      CUDA/12.1.1.lua
      CUDA/12.4.0.lua
    software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
      CUDA/12.1.1
      CUDA/12.4.0
    other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
      no other files in tarball
Feb 17 13:56:16 UTC 2025 | test result | 😢 FAILURE
  Reason
    EESSI test suite was not run, test step itself failed to execute.
  Details
    ✅ job output file slurm-1085.out
    ❌ found message matching ERROR:
    ✅ no message matching [\s*FAILED\s*].*Ran .* test case
Feb 18 09:18:47 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen4-1739798555.tar.gz to S3 bucket succeeded
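
The build checklist in this job comment amounts to scanning the job output for a few patterns. A rough Python sketch of that idea follows. The pattern strings are copied from the checklist; the function name `check_build_log` and the return format are assumptions, and the real bot may scope or evaluate these checks differently.

```python
import re

# Patterns copied from the checklist in the job comment.
FORBIDDEN = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
REQUIRED = ["No missing installations", ".tar.gz created!"]


def check_build_log(text):
    """Return (ok, report); report maps each pattern to True if its check passed."""
    report = {}
    for pattern in FORBIDDEN:
        # A forbidden pattern passes when it is absent from the log.
        report[pattern] = re.search(re.escape(pattern), text) is None
    for pattern in REQUIRED:
        # A required pattern passes when it is present in the log.
        report[pattern] = re.search(re.escape(pattern), text) is not None
    return all(report.values()), report


if __name__ == "__main__":
    sample = "No missing installations\n/tmp/example.tar.gz created!\n"
    ok, report = check_build_log(sample)
    print("SUCCESS" if ok else "FAILURE")
    for pattern, passed in report.items():
        print("pass" if passed else "fail", pattern)
```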

eessi-bot bot commented Feb 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.02/pr_919/46531

date | job status | comment
Feb 17 12:50:42 UTC 2025 | submitted | job id 46531 awaits release by job manager
Feb 17 12:51:39 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 17 12:59:44 UTC 2025 | running | job 46531 is running
Feb 17 14:01:56 UTC 2025 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-46531.out
    ✅ no message matching FATAL:
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-amd-zen3-1739798896.tar.gz
    size: 4373 MiB (4585650573 bytes)
    entries: 11757
    modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
      CUDA/12.1.1.lua
      CUDA/12.4.0.lua
    software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
      CUDA/12.1.1
      CUDA/12.4.0
    other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
      no other files in tarball
Feb 17 14:01:56 UTC 2025 | test result | 😢 FAILURE
  Reason
    EESSI test suite was not run, test step itself failed to execute.
  Details
    ✅ job output file slurm-46531.out
    ❌ found message matching ERROR:
    ✅ no message matching [\s*FAILED\s*].*Ran .* test case
Feb 18 09:18:35 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen3-1739798896.tar.gz to S3 bucket succeeded
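
The artefact names in these job comments follow a regular pattern. The Python sketch below pulls one apart and cross-checks the reported size. It assumes that the trailing number is a Unix timestamp in seconds (the thread does not state this, but the values fall within each job's run window) and that MiB means 1024 * 1024 bytes; `TARBALL_RE` and `describe_tarball` are hypothetical names.

```python
import re
from datetime import datetime, timezone

# Name pattern inferred from the artefact names in this thread.
TARBALL_RE = re.compile(
    r"^eessi-(?P<version>[\d.]+)-software-linux-"
    r"(?P<arch>.+)-(?P<timestamp>\d+)\.tar\.gz$"
)


def describe_tarball(name, size_bytes):
    """Parse an artefact name and report its creation time and size in MiB."""
    match = TARBALL_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected tarball name: {name}")
    # Assumption: the numeric suffix is seconds since the Unix epoch (UTC).
    created = datetime.fromtimestamp(int(match["timestamp"]), tz=timezone.utc)
    return {
        "version": match["version"],
        "arch": match["arch"],
        "created": created.isoformat(),
        "size_mib": size_bytes // (1024 * 1024),
    }


if __name__ == "__main__":
    info = describe_tarball(
        "eessi-2023.06-software-linux-x86_64-amd-zen3-1739798896.tar.gz",
        4585650573,
    )
    print(info)
```

For the zen3 artefact this yields a size of 4373 MiB, matching the bot's report, and a creation time of 13:28:16 UTC on Feb 17, 2025, shortly before the job finished.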

eessi-bot bot commented Feb 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.02/pr_919/46532

  • test step failed with
    ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen2_nvidia_cc80'
    Log file(s) saved in '/tmp/tmp.skuL75P5X3/rfm-0an7_lvk.log'
    ERROR: Failed to list ReFrame tests with command: reframe --tag CI --tag 1_node  --nocolor -n EESSI_OSU -n EESSI_LAMMPS --list
    
date | job status | comment
Feb 17 12:50:46 UTC 2025 | submitted | job id 46532 awaits release by job manager
Feb 17 12:51:37 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 17 12:59:42 UTC 2025 | running | job 46532 is running
Feb 17 14:11:08 UTC 2025 | finished | 😁 SUCCESS
  Details
    ✅ job output file slurm-46532.out
    ✅ no message matching FATAL:
    ✅ no message matching ERROR:
    ✅ no message matching FAILED:
    ✅ no message matching required modules missing:
    ✅ found message(s) matching No missing installations
    ✅ found message matching .tar.gz created!
  Artefacts
    eessi-2023.06-software-linux-x86_64-amd-zen2-1739799110.tar.gz
    size: 4373 MiB (4585663906 bytes)
    entries: 11757
    modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
      CUDA/12.1.1.lua
      CUDA/12.4.0.lua
    software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
      CUDA/12.1.1
      CUDA/12.4.0
    other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
      no other files in tarball
Feb 17 14:11:08 UTC 2025 | test result | 😢 FAILURE
  Reason
    EESSI test suite was not run, test step itself failed to execute.
  Details
    ✅ job output file slurm-46532.out
    ❌ found message matching ERROR:
    ✅ no message matching [\s*FAILED\s*].*Ran .* test case
Feb 18 09:19:37 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen2-1739799110.tar.gz to S3 bucket succeeded
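
The test-step error at the top of this job comment says that no configuration entry exists for 'BotBuildTests:x86_64_amd_zen2_nvidia_cc80'. The Python sketch below illustrates the kind of lookup that fails. Everything in it is hypothetical: the naming rule (join CPU target and accelerator, replace '/' with '_') is only inferred from the error message, the configured partition set is invented for the example, and none of this is ReFrame's own API.

```python
def partition_name(cpu_target, accelerator=None):
    """Build a partition name from the targets (assumed rule: '/' becomes '_')."""
    parts = [cpu_target] + ([accelerator] if accelerator else [])
    return "_".join(parts).replace("/", "_")


def find_partition(system, partition, configured):
    """Look up 'system:partition' in a {system: set_of_partitions} mapping."""
    if partition not in configured.get(system, set()):
        raise LookupError(
            f"could not find a configuration entry for '{system}:{partition}'"
        )
    return f"{system}:{partition}"


if __name__ == "__main__":
    # Invented configuration that knows the CPU-only partition but not the GPU one.
    configured = {"BotBuildTests": {"x86_64_amd_zen2"}}
    name = partition_name("x86_64/amd/zen2", "nvidia/cc80")
    try:
        find_partition("BotBuildTests", name, configured)
    except LookupError as err:
        print(f"ERROR: {err}")
```

If the naming rule holds, a failure like this points at the test configuration rather than at the rebuilt CUDA installation, which fits the successful build result in the same job.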

Collaborator Author

trz42 commented Feb 17, 2025

All rebuilt successfully.

trz42 added the ready-to-deploy label (Mark a PR as ready to deploy) Feb 17, 2025
bedroge added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) and removed the ready-to-deploy label (Mark a PR as ready to deploy) Feb 18, 2025
@eessi-bot-surf

Label bot:deploy has been set by user bedroge, but this person does not have permission to trigger deployments

@eessi-bot-toprichard

Label bot:deploy has been set by user bedroge, but this person does not have permission to trigger deployments

Collaborator

bedroge commented Feb 18, 2025

Tarballs have been ingested (and I've removed the log files of the old installations).

bedroge merged commit 464dcba into EESSI:2023.06-software.eessi.io Feb 18, 2025
49 checks passed

eessi-bot bot commented Feb 18, 2025

PR merged! Moved ['/project/def-users/SHARED/jobs/2025.02/pr_919/46531', '/project/def-users/SHARED/jobs/2025.02/pr_919/46532'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.02.18

@riscv-eessi-io-bot

PR merged! Moved [] to /home/eessibot/shared/trash_bin/EESSI/software-layer/2025.02.18

eessi-bot bot commented Feb 18, 2025

PR merged! Moved ['/project/def-users/SHARED/jobs/2025.02/pr_919/1085'] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.02.18

gpu-bot-ugent bot commented Feb 18, 2025

PR merged! Moved [] to /scratch/gent/vo/002/gvo00211/SHARED/trash_bin/EESSI/software-layer/2025.02.18

Labels: 2023.06-software.eessi.io (2023.06 version of software.eessi.io), accel:nvidia, bot:deploy (Ask bot to deploy missing software installations to EESSI)

3 participants