Compiling with flags `-fopenmp -g -fopenmp-targets=amd64` on an amd64 system results in the OpenMP runtime reporting 4 available devices, even though the system has a single-socket processor, so I don't understand where that 4 is coming from.
Activity
llvmbot commented on Mar 22, 2025
@llvm/issue-subscribers-openmp
Author: None (KaruroChori)
Outputs: `4 4`
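The reproducer itself is not preserved in this mirror of the report, but a minimal program of this shape, assuming the two numbers come from printing `omp_get_num_devices()` twice, would reproduce the `4 4` output when built with the flags above:

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    // With -fopenmp-targets=amd64 this reports the host plugin's
    // hardcoded device count rather than anything hardware-derived.
    printf("%d ", omp_get_num_devices());

    #pragma omp target
    {}  // empty target region, forces the offload runtime to initialize

    printf("%d\n", omp_get_num_devices());
    return 0;
}
```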
shiltian commented on Mar 22, 2025
That is because we hard-code the number of devices for offloading to the host.
KaruroChori commented on Mar 22, 2025
Sorry, but I am not sure I understand.
Does it mean that if I were to install three GPUs on my system I would not see three devices available for offloading?
How can the number of devices be hardcoded while being useful?
For example, this would prevent splitting load across multiple devices.
shiltian commented on Mar 22, 2025
`-fopenmp-targets=amd64` means offloading to the CPU/host, and we use a very specific (and naive) implementation for that. If you want to offload to a GPU instead, you should use `--offload-arch=gfx942` or `--offload-arch=sm_90`, depending on your actual GPU architecture. In that case, the number of devices is determined by how many GPUs (or other offloading devices) support the given offload image. For example, if you have 8 NVIDIA GPUs but your offload target is `gfx942`, you'll end up with 0 devices. Similarly, even if you have 8 GPUs, regardless of vendor, but your offload target is `amd64` (i.e., CPU), you'll get a fixed number of devices, which is hardcoded in our simple host-offloading implementation.
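As a sketch of what that count means to an application, using only the standard OpenMP API: each reported device id can be targeted explicitly, and `omp_is_initial_device()` tells you whether a region actually offloaded or fell back to the initial (host) device.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    // 0 if no device matches the offload image(s) this binary carries;
    // the matching GPU count for --offload-arch=gfx942 / sm_90;
    // a fixed, hardcoded count for -fopenmp-targets=amd64 (host offloading).
    int ndev = omp_get_num_devices();
    printf("offload devices: %d\n", ndev);

    for (int d = 0; d < ndev; ++d) {
        int fell_back = 1;
        #pragma omp target device(d) map(from: fell_back)
        fell_back = omp_is_initial_device();  // 1 only if the region ran on the initial device

        printf("device %d: %s\n", d, fell_back ? "fell back to the initial device" : "offloaded");
    }
    return 0;
}
```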
KaruroChori commented on Mar 22, 2025
Got it. So the contribution to the count for GPUs is correct, but not for the CPU. And if one tries to mix them it will still be wrong: `-fopenmp-targets=amd64,nvptx64` on my system results in 5 devices found instead of 2 (1 for the NVIDIA card I have, the other 4 because of amd64).

This is a problem for any kind of mixed application that does not want to explicitly split host and target computation. For example, I am trying to split a dataset across multiple devices and run a subtask on each according to a benchmark of their capabilities, but if the CPU takes 4 slots, any performance metric would end up skewed.
At this point I am just curious: why 4? I would assume 1 to be a more reasonable default if one really has to set some hardcoded value.
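For context, a sketch of the benchmark-and-split scheme described two comments up; `benchmark_device` and `split_work` are hypothetical names, and the point is only that every reported device id, including the 4 host slots, receives its own benchmark and its own share of the data:

```c
#include <omp.h>
#include <stdlib.h>

/* Hypothetical per-device micro-benchmark: times a small target region
 * and returns a relative speed score (higher is faster). */
static double benchmark_device(int dev) {
    double t0 = omp_get_wtime();
    #pragma omp target device(dev)
    { /* a small calibration kernel would go here */ }
    return 1.0 / (omp_get_wtime() - t0 + 1e-9);
}

/* Split n elements across all reported devices in proportion to their
 * benchmarked scores. Remainder handling is omitted for brevity. */
static void split_work(const double *data, size_t n) {
    int ndev = omp_get_num_devices();
    if (ndev <= 0) return;

    double *score = malloc((size_t)ndev * sizeof *score);
    double total = 0.0;
    for (int d = 0; d < ndev; ++d)
        total += (score[d] = benchmark_device(d));

    size_t offset = 0;
    for (int d = 0; d < ndev; ++d) {
        /* A host plugin that claims 4 device slots gets benchmarked and
         * loaded 4 times here, which is exactly the skew described above. */
        size_t chunk = (size_t)((double)n * (score[d] / total));
        #pragma omp target device(d) map(to: data[offset:chunk])
        { /* process data[offset .. offset + chunk) on device d */ }
        offset += chunk;
    }
    free(score);
}
```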