Adapt subdir for CUDA toolkit in host injections #59
…at it also doesn't include the CPU microarchitecture
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
Hmmm, success, but not what I planned. Installdir for the
I wanted it to be I guess the
The odd thing is that this should have broken the sanity check for installing CUDA in the
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
That's more like it!
Now I still need to carefully check the symlinks for the installations, to make sure they also point here. The old location also still contains CUDA, so a symlink to it wouldn't lead to a broken install, which makes any mistakes harder to spot.
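A check like this can be scripted instead of inspecting links by hand. The following is a minimal sketch, not part of this PR: the function name `find_symlinks_into` and both arguments are hypothetical, and it only reports symlinks whose resolved target lies under a given old prefix.

```python
import os


def find_symlinks_into(root, old_prefix):
    """Return sorted (link, resolved_target) pairs for symlinks under `root`
    whose resolved target is `old_prefix` or lies below it.

    Hypothetical helper for illustration; both paths are supplied by the caller.
    """
    old_prefix = os.path.realpath(old_prefix)
    hits = []
    # followlinks=False: symlinked directories are listed but not descended into
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                target = os.path.realpath(path)
                if target == old_prefix or target.startswith(old_prefix + os.sep):
                    hits.append((path, target))
    return sorted(hits)
```

Calling it with the installation directory as `root` and the old host-injections location as `old_prefix` would list every link that still refers to the old location; an empty list means no such links were found.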
Yep, symlinks are still 'wrong', pointing to the old location:
I'll check further tomorrow. The EB build log will probably show some output from the eb_hooks.
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
That looks better:
…to e.g. /cvmfs/software.eessi.io/host_injections/x86_64, i.e. only include the CPU family in the prefix, not microarchitecture or accelerator architecture. Since these are binary installs, we don't need multiple copies, and requiring site admins to run the install scripts once per micro-architecture is just annoying (and requires more storage)
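To illustrate the idea in this commit message, here is a minimal sketch of deriving a prefix that keeps only the CPU family. It is an assumption-laden illustration, not the code changed in this PR: the function name `host_injections_prefix` and the default `base` value (taken from the example path in the commit message) are hypothetical.

```python
import posixpath


def host_injections_prefix(software_subdir,
                           base="/cvmfs/software.eessi.io/host_injections"):
    """Return `base` joined with only the CPU family of `software_subdir`.

    `software_subdir` is a string such as 'x86_64/amd/zen4'; the vendor and
    microarchitecture components are dropped on purpose.
    Hypothetical helper; `base` is an assumed default for illustration.
    """
    parts = [p for p in software_subdir.strip("/").split("/") if p]
    if not parts:
        raise ValueError("empty software subdir")
    return posixpath.join(base, parts[0])


print(host_injections_prefix("x86_64/amd/zen4"))
print(host_injections_prefix("aarch64/nvidia/grace"))
```

With this scheme, all x86_64 microarchitectures share one binary install, so a site admin would only need to run the install script once per CPU family.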
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
…DNN package was found in the old host-injections location (with micro-arch specific subdir). Also, adapt the path to search for the regular LmodError
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
Hmmm, that's strange. This directory is writeable:
Also:
That's really strange, it looks like the issue I had before when the bind-mounting became the default, except: the repo is really fuse-mounted here:
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
Hm, the issue might have been two bot jobs trying at the same time. I cleaned out the
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90
New job on instance
Stupid, if multiple installations update things in
Also, these builds take forever. Not sure if this is related to the slowness that @ocaisa experienced, but it's... bad.
Edit: might be due to gzip being slow, we should really look into deploying
Edit2: for
Relaunching what has failed so far due to builds encountering a lock-file in host-injections, but no complete CUDA install yet. The rest of the builds should complete successfully, since they were started after another build had already completed its install in host-injections.
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-jsc architecture:aarch64/nvidia/grace accelerator:nvidia/cc80
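For reference, the failure mode described above (a build finds a lock file left by a concurrent build, but no finished install) is typical of lock files that are only checked, not waited on. The sketch below shows one generic way to wait for such a lock with a timeout. It is not the locking mechanism used by EasyBuild or by the install scripts in this repository; `acquire_lock`, `release_lock`, and their parameters are hypothetical names for illustration.

```python
import os
import time


def acquire_lock(lock_path, timeout=60.0, poll=0.5):
    """Try to create `lock_path` atomically; if it already exists, poll until
    it disappears or `timeout` seconds pass. Returns True if the lock was taken.

    Hypothetical helper, a generic pattern rather than any tool's actual API.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails with FileExistsError if the file exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
            continue
        # Record the owner's PID to help debugging stale locks
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True


def release_lock(lock_path):
    """Remove the lock file created by acquire_lock."""
    os.remove(lock_path)
```

A build that gets `False` back could then re-check whether the other build has completed its install before deciding to fail.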
New job on instance
Strange, on
But it's also strange that the bot did not report the start of the job. It made me think it may be an issue with some update of the EESSI module not being deployed for this architecture, but #59 (comment) completed successfully, so that can't be the case. I'll retry once more, it may just be some strange hiccup by the bot (also considering the fact that it didn't report the start of the build).
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:cascadelake accelerator:nvidia/cc90
New job on instance
Strange,
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws arch:icelake accelerator:nvidia/cc90
New job on instance
Same failure for
I'm really not sure what's causing this, as it succeeds for the same CPU arch + different accelerator arch, so it almost can't be a problem with an
Try to change the subdir in which the CUDA toolkit is installed so that it also doesn't include the CPU microarchitecture