[LLVM] Fix offload and update CUDA ABI for all SM values #159354
@llvm/pr-subscribers-offload @llvm/pr-subscribers-llvm-binary-utilities

Author: Joseph Huber (jhuber6)

Summary: Fixes: #159088

Full diff: https://github.com/llvm/llvm-project/pull/159354.diff (4 files affected):
diff --git a/llvm/include/llvm/BinaryFormat/ELF.h b/llvm/include/llvm/BinaryFormat/ELF.h
index 749971e354f66..39c3549697ea3 100644
--- a/llvm/include/llvm/BinaryFormat/ELF.h
+++ b/llvm/include/llvm/BinaryFormat/ELF.h
@@ -931,6 +931,12 @@ enum : unsigned {
// Processor selection mask for EF_CUDA_SM* values prior to blackwell.
EF_CUDA_SM = 0xff,
+ // Processor selection mask for EF_CUDA_SM* values following blackwell.
+ EF_CUDA_SM_MASK = 0xff00,
+
+ // Processor selection mask for EF_CUDA_SM* values following blackwell.
+ EF_CUDA_SM_OFFSET = 8,
+
// SM based processor values.
EF_CUDA_SM20 = 0x14,
EF_CUDA_SM21 = 0x15,
@@ -951,8 +957,13 @@ enum : unsigned {
EF_CUDA_SM86 = 0x56,
EF_CUDA_SM87 = 0x57,
EF_CUDA_SM89 = 0x59,
- // The sm_90a variant uses the same machine flag.
EF_CUDA_SM90 = 0x5a,
+ EF_CUDA_SM100 = 0x64,
+ EF_CUDA_SM101 = 0x65,
+ EF_CUDA_SM103 = 0x67,
+ EF_CUDA_SM120 = 0x78,
+ EF_CUDA_SM121 = 0x79,
+
// Unified texture binding is enabled.
EF_CUDA_TEXMODE_UNIFIED = 0x100,
@@ -968,17 +979,7 @@ enum : unsigned {
// Virtual processor selection mask for EF_CUDA_VIRTUAL_SM* values.
EF_CUDA_VIRTUAL_SM = 0xff0000,
- // Processor selection mask for EF_CUDA_SM* values following blackwell.
- EF_CUDA_SM_MASK = 0xff00,
-
- // SM based processor values.
- EF_CUDA_SM100 = 0x6400,
- EF_CUDA_SM101 = 0x6500,
- EF_CUDA_SM103 = 0x6700,
- EF_CUDA_SM120 = 0x7800,
- EF_CUDA_SM121 = 0x7900,
-
- // Set when using an accelerator variant like sm_100a.
+ // Set when using an accelerator variant like sm_100a in the new ABI.
EF_CUDA_ACCELERATORS = 0x8,
};
diff --git a/llvm/lib/Object/ELFObjectFile.cpp b/llvm/lib/Object/ELFObjectFile.cpp
index aff047c297cc2..92f5cef80ae18 100644
--- a/llvm/lib/Object/ELFObjectFile.cpp
+++ b/llvm/lib/Object/ELFObjectFile.cpp
@@ -622,7 +622,8 @@ StringRef ELFObjectFileBase::getNVPTXCPUName() const {
assert(getEMachine() == ELF::EM_CUDA);
unsigned SM = getEIdentABIVersion() == ELF::ELFABIVERSION_CUDA_V1
? getPlatformFlags() & ELF::EF_CUDA_SM
- : getPlatformFlags() & ELF::EF_CUDA_SM_MASK;
+ : (getPlatformFlags() & ELF::EF_CUDA_SM_MASK) >>
+ ELF::EF_CUDA_SM_OFFSET;
switch (SM) {
// Fermi architecture.
diff --git a/llvm/tools/llvm-readobj/ELFDumper.cpp b/llvm/tools/llvm-readobj/ELFDumper.cpp
index 7d75f29623ea9..2f3909811b010 100644
--- a/llvm/tools/llvm-readobj/ELFDumper.cpp
+++ b/llvm/tools/llvm-readobj/ELFDumper.cpp
@@ -1114,6 +1114,7 @@ const EnumEntry<unsigned> ElfOSABI[] = {
{"FenixOS", "FenixOS", ELF::ELFOSABI_FENIXOS},
{"CloudABI", "CloudABI", ELF::ELFOSABI_CLOUDABI},
{"CUDA", "NVIDIA - CUDA", ELF::ELFOSABI_CUDA},
+ {"CUDA", "NVIDIA - CUDA", ELF::ELFOSABI_CUDA_V2},
{"Standalone", "Standalone App", ELF::ELFOSABI_STANDALONE}};
const EnumEntry<unsigned> AMDGPUElfOSABI[] = {
@@ -1679,19 +1680,56 @@ const EnumEntry<unsigned> ElfHeaderAMDGPUFlagsABIVersion4[] = {
};
const EnumEntry<unsigned> ElfHeaderNVPTXFlags[] = {
- ENUM_ENT(EF_CUDA_SM20, "sm_20"), ENUM_ENT(EF_CUDA_SM21, "sm_21"),
- ENUM_ENT(EF_CUDA_SM30, "sm_30"), ENUM_ENT(EF_CUDA_SM32, "sm_32"),
- ENUM_ENT(EF_CUDA_SM35, "sm_35"), ENUM_ENT(EF_CUDA_SM37, "sm_37"),
- ENUM_ENT(EF_CUDA_SM50, "sm_50"), ENUM_ENT(EF_CUDA_SM52, "sm_52"),
- ENUM_ENT(EF_CUDA_SM53, "sm_53"), ENUM_ENT(EF_CUDA_SM60, "sm_60"),
- ENUM_ENT(EF_CUDA_SM61, "sm_61"), ENUM_ENT(EF_CUDA_SM62, "sm_62"),
- ENUM_ENT(EF_CUDA_SM70, "sm_70"), ENUM_ENT(EF_CUDA_SM72, "sm_72"),
- ENUM_ENT(EF_CUDA_SM75, "sm_75"), ENUM_ENT(EF_CUDA_SM80, "sm_80"),
- ENUM_ENT(EF_CUDA_SM86, "sm_86"), ENUM_ENT(EF_CUDA_SM87, "sm_87"),
- ENUM_ENT(EF_CUDA_SM89, "sm_89"), ENUM_ENT(EF_CUDA_SM90, "sm_90"),
- ENUM_ENT(EF_CUDA_SM100, "sm_100"), ENUM_ENT(EF_CUDA_SM101, "sm_101"),
- ENUM_ENT(EF_CUDA_SM103, "sm_103"), ENUM_ENT(EF_CUDA_SM120, "sm_120"),
+ ENUM_ENT(EF_CUDA_SM20, "sm_20"),
+ ENUM_ENT(EF_CUDA_SM21, "sm_21"),
+ ENUM_ENT(EF_CUDA_SM30, "sm_30"),
+ ENUM_ENT(EF_CUDA_SM32, "sm_32"),
+ ENUM_ENT(EF_CUDA_SM35, "sm_35"),
+ ENUM_ENT(EF_CUDA_SM37, "sm_37"),
+ ENUM_ENT(EF_CUDA_SM50, "sm_50"),
+ ENUM_ENT(EF_CUDA_SM52, "sm_52"),
+ ENUM_ENT(EF_CUDA_SM53, "sm_53"),
+ ENUM_ENT(EF_CUDA_SM60, "sm_60"),
+ ENUM_ENT(EF_CUDA_SM61, "sm_61"),
+ ENUM_ENT(EF_CUDA_SM62, "sm_62"),
+ ENUM_ENT(EF_CUDA_SM70, "sm_70"),
+ ENUM_ENT(EF_CUDA_SM72, "sm_72"),
+ ENUM_ENT(EF_CUDA_SM75, "sm_75"),
+ ENUM_ENT(EF_CUDA_SM80, "sm_80"),
+ ENUM_ENT(EF_CUDA_SM86, "sm_86"),
+ ENUM_ENT(EF_CUDA_SM87, "sm_87"),
+ ENUM_ENT(EF_CUDA_SM89, "sm_89"),
+ ENUM_ENT(EF_CUDA_SM90, "sm_90"),
+ ENUM_ENT(EF_CUDA_SM100, "sm_100"),
+ ENUM_ENT(EF_CUDA_SM101, "sm_101"),
+ ENUM_ENT(EF_CUDA_SM103, "sm_103"),
+ ENUM_ENT(EF_CUDA_SM120, "sm_120"),
ENUM_ENT(EF_CUDA_SM121, "sm_121"),
+ ENUM_ENT(EF_CUDA_SM20 << EF_CUDA_SM_OFFSET, "sm_20"),
+ ENUM_ENT(EF_CUDA_SM21 << EF_CUDA_SM_OFFSET, "sm_21"),
+ ENUM_ENT(EF_CUDA_SM30 << EF_CUDA_SM_OFFSET, "sm_30"),
+ ENUM_ENT(EF_CUDA_SM32 << EF_CUDA_SM_OFFSET, "sm_32"),
+ ENUM_ENT(EF_CUDA_SM35 << EF_CUDA_SM_OFFSET, "sm_35"),
+ ENUM_ENT(EF_CUDA_SM37 << EF_CUDA_SM_OFFSET, "sm_37"),
+ ENUM_ENT(EF_CUDA_SM50 << EF_CUDA_SM_OFFSET, "sm_50"),
+ ENUM_ENT(EF_CUDA_SM52 << EF_CUDA_SM_OFFSET, "sm_52"),
+ ENUM_ENT(EF_CUDA_SM53 << EF_CUDA_SM_OFFSET, "sm_53"),
+ ENUM_ENT(EF_CUDA_SM60 << EF_CUDA_SM_OFFSET, "sm_60"),
+ ENUM_ENT(EF_CUDA_SM61 << EF_CUDA_SM_OFFSET, "sm_61"),
+ ENUM_ENT(EF_CUDA_SM62 << EF_CUDA_SM_OFFSET, "sm_62"),
+ ENUM_ENT(EF_CUDA_SM70 << EF_CUDA_SM_OFFSET, "sm_70"),
+ ENUM_ENT(EF_CUDA_SM72 << EF_CUDA_SM_OFFSET, "sm_72"),
+ ENUM_ENT(EF_CUDA_SM75 << EF_CUDA_SM_OFFSET, "sm_75"),
+ ENUM_ENT(EF_CUDA_SM80 << EF_CUDA_SM_OFFSET, "sm_80"),
+ ENUM_ENT(EF_CUDA_SM86 << EF_CUDA_SM_OFFSET, "sm_86"),
+ ENUM_ENT(EF_CUDA_SM87 << EF_CUDA_SM_OFFSET, "sm_87"),
+ ENUM_ENT(EF_CUDA_SM89 << EF_CUDA_SM_OFFSET, "sm_89"),
+ ENUM_ENT(EF_CUDA_SM90 << EF_CUDA_SM_OFFSET, "sm_90"),
+ ENUM_ENT(EF_CUDA_SM100 << EF_CUDA_SM_OFFSET, "sm_100"),
+ ENUM_ENT(EF_CUDA_SM101 << EF_CUDA_SM_OFFSET, "sm_101"),
+ ENUM_ENT(EF_CUDA_SM103 << EF_CUDA_SM_OFFSET, "sm_103"),
+ ENUM_ENT(EF_CUDA_SM120 << EF_CUDA_SM_OFFSET, "sm_120"),
+ ENUM_ENT(EF_CUDA_SM121 << EF_CUDA_SM_OFFSET, "sm_121"),
};
const EnumEntry<unsigned> ElfHeaderRISCVFlags[] = {
diff --git a/offload/plugins-nextgen/cuda/src/rtl.cpp b/offload/plugins-nextgen/cuda/src/rtl.cpp
index 99195cd8d7c99..d973c2d4dd320 100644
--- a/offload/plugins-nextgen/cuda/src/rtl.cpp
+++ b/offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -1581,7 +1581,7 @@ struct CUDAPluginTy final : public GenericPluginTy {
unsigned SM =
Header.e_ident[ELF::EI_ABIVERSION] == ELF::ELFABIVERSION_CUDA_V1
? Header.e_flags & ELF::EF_CUDA_SM
- : (Header.e_flags & ELF::EF_CUDA_SM_MASK) >> 8;
+ : (Header.e_flags & ELF::EF_CUDA_SM_MASK) >> ELF::EF_CUDA_SM_OFFSET;
CUdevice Device;
CUresult Res = cuDeviceGet(&Device, DeviceId);
Summary: Turns out the new CUDA ABI now applies retroactively to all the other SMs if you upgrade to CUDA 13.0. This patch changes the scheme, keeping all the SM flags consistent but using an offset. Fixes: llvm#159088
> // The sm_90a variant uses the same machine flag.
> EF_CUDA_SM90 = 0x5a,
> EF_CUDA_SM100 = 0x64,
> EF_CUDA_SM101 = 0x65,
About that sm_101.
In CUDA-13 it's been renamed to sm_110.
You may want to check what NVIDIA tools end up generating for the same GPU in CUDA 12.9 (sm_101) and 13.0 (sm_110), and whether NVIDIA kept the same ELF flags for them or changed them.
I don't have easy access to CUDA 13.0 yet, just an ELF someone else gave me which these work on. Will it be sufficient to just handle both cases?
Here's the dump of ELF headers for sm_110/a/f with cuda-13 and 101/a/f with cuda-12.9: https://gist.github.com/Artem-B/1995e3bd80a06b3bee33196e8753d73b
The definitions look fine.
> EF_CUDA_SM75 = 0x4b,
> EF_CUDA_SM80 = 0x50,
> EF_CUDA_SM86 = 0x56,
> EF_CUDA_SM87 = 0x57,
NVIDIA has added sm_88, too. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#changes-in-ptx-isa-version-9-0
Few copy/paste errors, LGTM otherwise.
Summary: We rely on these flags to do things in the runtime and to print the contents of binaries correctly. CUDA updated their ABI encoding recently and we didn't handle that. It's a new ABI entirely, so we just select on it when it shows up. Fixes: llvm#148703