
Conversation

trz42
Collaborator

@trz42 trz42 commented Mar 13, 2025

First PR to start the stack for NVIDIA Grace. See #967 for notes & coordination.

@trz42 trz42 added the labels 2023.06-software.eessi.io (2023.06 version of software.eessi.io) and grace (NVIDIA Grace CPU) on Mar 13, 2025

eessi-bot bot commented Mar 13, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Mar 13, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@trz42
Collaborator Author

trz42 commented Mar 13, 2025

First attempt: verify that it actually builds, and then that the upload with signing works.
bot: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
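For illustration, a minimal Python sketch of how a command line like the one above could be split into an action and its `key:value` filters. This is not the bot's actual parser: the function name `parse_bot_command` and the error handling are assumptions made for this example only.

```python
def parse_bot_command(line: str) -> tuple[str, dict[str, str]]:
    """Split 'bot: <action> key:value ...' into the action and its filters.

    Hypothetical helper; the real bot may parse commands differently.
    """
    prefix = "bot:"
    if not line.startswith(prefix):
        raise ValueError("not a bot command")
    words = line[len(prefix):].split()
    if not words:
        raise ValueError("missing action")
    action, args = words[0], words[1:]
    filters = {}
    for arg in args:
        # Split at the first colon only, so values may contain ':' or '/'.
        key, _, value = arg.partition(":")
        if not key or not value:
            raise ValueError(f"malformed filter: {arg!r}")
        filters[key] = value
    return action, filters


command = (
    "bot: build instance:trz42-GH200-jr "
    "repository:eessi.io-2023.06-software "
    "architecture:aarch64/nvidia/grace"
)
action, filters = parse_bot_command(command)
print(action)
for key, value in filters.items():
    print(f"  {key} = {value}")
```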


eessi-bot bot commented Mar 13, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace resulted in:

    • no jobs were submitted


eessi-bot bot commented Mar 13, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace resulted in:

    • no jobs were submitted
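The two replies above are consistent with each instance comparing the command's filters against its own configuration and submitting nothing when they do not match. A minimal sketch of that idea follows, assuming exact string matching; `BotInstance` and `jobs_for` are hypothetical names, and the real bot's matching rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotInstance:
    """Hypothetical view of one bot instance's build configuration."""
    name: str
    architectures: tuple
    repositories: tuple


def jobs_for(instance, filters):
    """Return the (architecture, repository) pairs this instance would build.

    Assumption: a filter that is absent matches everything, and a filter
    that is present must match exactly.
    """
    if filters.get("instance", instance.name) != instance.name:
        return []
    return [
        (arch, repo)
        for arch in instance.architectures
        for repo in instance.repositories
        if filters.get("architecture", arch) == arch
        and filters.get("repository", repo) == repo
    ]


# Configurations as reported by the instances earlier in this conversation.
azure = BotInstance(
    "eessi-bot-mc-azure",
    ("x86_64/amd/zen4",),
    ("eessi.io-2023.06-compat", "eessi.io-2023.06-software"),
)
gh200 = BotInstance(
    "trz42-GH200-jr",
    ("aarch64/nvidia/grace",),
    ("eessi.io-2023.06-software",),
)

filters = {
    "instance": "trz42-GH200-jr",
    "repository": "eessi.io-2023.06-software",
    "architecture": "aarch64/nvidia/grace",
}

azure_jobs = jobs_for(azure, filters)
gh200_jobs = jobs_for(gh200, filters)
print(azure_jobs)  # no jobs: the instance filter names a different instance
print(gh200_jobs)
```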

@eessi-bot-trz42

eessi-bot-trz42 bot commented Mar 13, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace from trz42

    • expanded format: build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace
  • handling command build instance:trz42-GH200-jr repository:eessi.io-2023.06-software architecture:aarch64/nvidia/grace resulted in:

@eessi-bot-trz42

eessi-bot-trz42 bot commented Mar 13, 2025

New job on instance trz42-GH200-jr for CPU micro-architecture aarch64-nvidia-grace for repository eessi.io-2023.06-software in job dir /p/project1/ceasybuilders/bot-trz42/jobs/2025.03/pr_968/13510000

  • test step below failed because ReFrame is not available in the stack for Grace yet
date | job status | comment
Mar 13 19:08:08 UTC 2025 | submitted | job id 13510000 awaits release by job manager
Mar 13 19:08:38 UTC 2025 | released | job awaits launch by Slurm scheduler
Mar 14 12:00:44 UTC 2025 | running | job 13510000 is running
Mar 14 12:14:14 UTC 2025 | finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-13510000.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz
size: 118 MiB (124619768 bytes)
entries: 196847
modules under 2023.06/software/linux/aarch64/nvidia/grace/modules/all
EasyBuild/4.8.2.lua
EasyBuild/4.9.0.lua
EasyBuild/4.9.1.lua
EasyBuild/4.9.2.lua
EasyBuild/4.9.3.lua
EasyBuild/4.9.4.lua
EESSI-extend/2023.06-easybuild.lua
software under 2023.06/software/linux/aarch64/nvidia/grace/software
EasyBuild/4.8.2
EasyBuild/4.9.0
EasyBuild/4.9.1
EasyBuild/4.9.2
EasyBuild/4.9.3
EasyBuild/4.9.4
EESSI-extend/2023.06-easybuild
other under 2023.06/software/linux/aarch64/nvidia/grace
.lmod/lmodrc.lua
.lmod/SitePackage.lua
Mar 14 12:14:14 UTC 2025 | test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-13510000.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Mar 14 13:37:38 UTC 2025 | not uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket failed
Mar 14 14:18:06 UTC 2025 | not uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket failed
Mar 14 14:28:39 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 14 21:46:51 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 14 21:50:58 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 18 05:25:43 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 18 05:48:19 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 18 09:43:24 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 19 11:43:16 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
Mar 21 22:03:12 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz to S3 bucket succeeded
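The SUCCESS verdict in the report above rests on a handful of text checks against the job output file. The sketch below shows how such checks could be expressed in Python. The literal strings are taken from the report; treating them as plain substrings, the function name `check_build_output` and the sample log excerpts are assumptions for this example, not the bot's actual implementation.

```python
import re

# Literal strings taken from the report above. Treating them as plain
# substrings (escaped before searching) is an assumption of this sketch.
MUST_NOT_MATCH = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_MATCH = ["No missing installations", ".tar.gz created!"]


def check_build_output(text: str) -> dict[str, bool]:
    """Map each check description to True (passed) or False (failed)."""
    results = {}
    for literal in MUST_NOT_MATCH:
        found = re.search(re.escape(literal), text) is not None
        results[f"no message matching {literal}"] = not found
    for literal in MUST_MATCH:
        found = re.search(re.escape(literal), text) is not None
        results[f"found message matching {literal}"] = found
    return results


# Hypothetical excerpts of a job output file, for demonstration only.
good_log = "No missing installations\n/tmp/example.tar.gz created!\n"
bad_log = good_log + "ERROR: something went wrong\n"

good_checks = check_build_output(good_log)
bad_checks = check_build_output(bad_log)
for name, checks in (("good", good_checks), ("bad", bad_checks)):
    verdict = "SUCCESS" if all(checks.values()) else "FAILURE"
    print(name, verdict)
```

Under these assumptions, a single line containing `ERROR:` is enough to turn the verdict into a failure, which matches how the test step was reported above.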

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

Build looks OK. Testing signing and upload to a different S3 bucket used for development.

@trz42 trz42 added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-trz42

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments
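The replies above suggest that each bot instance checks the account that set the label against its own list of accounts allowed to deploy. A minimal sketch of such a check follows, assuming a space-separated allow-list. The names `deploy_permission` and `deploy_label_reply` and the account `some-maintainer` are hypothetical; the message text is copied from the replies above.

```python
from typing import Optional


def deploy_label_reply(user: str, deploy_permission: str) -> Optional[str]:
    """Return a refusal message, or None if the user may trigger a deployment.

    `deploy_permission` is assumed to be a space-separated list of GitHub
    accounts configured per bot instance (hypothetical setting name).
    """
    allowed = deploy_permission.split()
    if user in allowed:
        return None
    return (
        f"Label bot:deploy has been set by user {user}, "
        "but this person does not have permission to trigger deployments"
    )


# One instance that does not list trz42, and one that does.
refused = deploy_label_reply("trz42", "some-maintainer")
accepted = deploy_label_reply("trz42", "some-maintainer trz42")
print(refused)
print(accepted)
```

Because the check is made per instance, one instance can refuse a deployment request while another accepts the same label event.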

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

Uploading failed because some Lua initialisation scripts weren't available inside the container. Checking whether the --contain argument prevents these initialisation scripts from being run.

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

--contain seems to help. Now fixing a locale issue and providing the missing S3 access credentials needed to perform the actual uploads.

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 14, 2025

Uploading again with the updated sign script (using namespace).

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 14, 2025
@trz42
Collaborator Author

trz42 commented Mar 18, 2025

Updated the bot instance with code from EESSI/eessi-bot-software-layer#308 and reconfigured it to upload to an S3 bucket on a MinIO server (used for testing). Resetting the deploy label to verify that the updated bot code still works.

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 18, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 18, 2025

Signature already existed. Recreating it.

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 18, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@boegel
Contributor

boegel commented Mar 18, 2025

Signature already existed. Recreating it.

How?

@trz42
Collaborator Author

trz42 commented Mar 18, 2025

Redeploying to S3 test bucket.

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 18, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 19, 2025

Verifying that the updated upload script handles pre-existing signature files (by deleting them before running the sign script). For reference, see EESSI/eessi-bot-software-layer#309

Re-setting deploy label
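The behaviour being verified here, deleting a pre-existing signature file before signing again, can be sketched in Python as follows. This is not the code from the referenced PR: `remove_existing_signature`, the `.sig` suffix handling and the file names are assumptions for illustration, and the actual signing step is left out.

```python
import tempfile
from pathlib import Path


def remove_existing_signature(artefact: Path) -> bool:
    """Delete '<artefact>.sig' if present; return True if a file was removed.

    Hypothetical helper: a sign script that refuses to overwrite an existing
    signature can then be run afterwards without failing.
    """
    signature = artefact.with_name(artefact.name + ".sig")
    if signature.exists():
        signature.unlink()
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    tarball = Path(tmp) / "example.tar.gz"
    tarball.write_bytes(b"placeholder")
    stale = Path(tmp) / "example.tar.gz.sig"
    stale.write_text("stale signature")

    removed_first = remove_existing_signature(tarball)   # stale file deleted
    removed_second = remove_existing_signature(tarball)  # nothing left to delete
    still_there = stale.exists()
    print(removed_first, removed_second, still_there)
```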

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 19, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@trz42
Collaborator Author

trz42 commented Mar 19, 2025

Seems it works...

  • bot log (pyghee.log) contains a line with INFO: removed existing signature file (/.../2025.03/pr_968/13510000/eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.sig)
  • Before the test, the above signature file was manually changed so that it contained only the string foo.
  • The uploaded files (including the signature for the tarball) have the following metadata (note time and size)
    2025-03-19 12:43:11  124619768 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz
    2025-03-19 12:43:16        663 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.meta.txt
    2025-03-19 12:43:15        878 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.meta.txt.sig
    2025-03-19 12:43:11        878 eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz.sig
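To make the "note time and size" remark checkable, the listing above can be parsed and inspected programmatically. The sketch below assumes each line has the form `date time size name`; `Entry` and `parse_listing` are hypothetical names, and the placeholder size is derived from the string foo mentioned above.

```python
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    modified: datetime
    size: int
    name: str


def parse_listing(text: str) -> list:
    """Parse lines of the form 'YYYY-MM-DD HH:MM:SS <size> <name>'."""
    entries = []
    for line in text.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        date, time, size, name = parts
        modified = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        entries.append(Entry(modified, int(size), name))
    return entries


PREFIX = "eessi-2023.06-software-linux-aarch64-nvidia-grace-1741954310.tar.gz"
listing = f"""
2025-03-19 12:43:11  124619768 {PREFIX}
2025-03-19 12:43:16        663 {PREFIX}.meta.txt
2025-03-19 12:43:15        878 {PREFIX}.meta.txt.sig
2025-03-19 12:43:11        878 {PREFIX}.sig
"""

entries = parse_listing(listing)
signatures = [e for e in entries if e.name.endswith(".sig")]

# A signature that still held only "foo" would be a few bytes long.
placeholder_size = len("foo\n")
all_recreated = all(e.size > placeholder_size for e in signatures)
same_day = len({e.modified.date() for e in entries}) == 1
print(len(entries), len(signatures), all_recreated, same_day)
```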
    

@trz42
Collaborator Author

trz42 commented Mar 21, 2025

Ok, let's deploy this to the default S3 to get it ingested into EESSI...

@trz42 trz42 removed and re-added the bot:deploy label (Ask bot to deploy missing software installations to EESSI) on Mar 21, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user trz42, but this person does not have permission to trigger deployments

@bedroge
Collaborator

bedroge commented Mar 21, 2025

Staging PR merged.

@bedroge bedroge merged commit 7250444 into EESSI:2023.06-software.eessi.io Mar 21, 2025
59 checks passed

eessi-bot bot commented Mar 21, 2025

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.03.21


eessi-bot bot commented Mar 21, 2025

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.03.21

@eessi-bot-trz42

PR merged! Moved ['/p/project1/ceasybuilders/bot-trz42/jobs/2025.03/pr_968/13510000'] to /p/project1/ceasybuilders/bot-trz42/trash_bin/EESSI/software-layer/2025.03.21

Labels
2023.06-software.eessi.io (2023.06 version of software.eessi.io); bot:deploy (Ask bot to deploy missing software installations to EESSI); grace (NVIDIA Grace CPU)

4 participants