[Clang] Introduce 'clang-nvlink-wrappaer' to work around 'nvlink'

jhuber6 · jhuber6 · commit 849c8dab14c9 · 2024-06-27T16:55:51.000-05:00
Summary:
The `clang-nvlink-wrapper` is a utility that I removed awhile back
during the transition to the new driver. This patch adds back in a new,
upgraded version that does LTO + archive linking. It's not an easy
choice to reintroduce something I happily deleted, but this is the only
way to move forward with improving GPU support in LLVM.

While NVIDIA provides a linker called 'nvlink', its main interface is
very difficult to work with. It does not provide LTO, or static linking,
requires all files to be named a non-standard `.cubin`, and rejects link
jobs that other linkers would be fine with (i.e empty). I have spent a
great deal of time hacking around this in the GPU `libc` implementation,
where I deliberately avoid LTO and static linking and have about 100
lines of hacky CMake dedicated to storing these files in a format that
the clang-linker-wrapper accepts to avoid this limitation.

The main reason I want to re-intorudce this tool is because I am
planning on creating a more standard C/C++ toolchain for GPUs to use.
This will install files like the following.
```
&lt;install&gt;/lib/nvptx64-nvidia-cuda/libc.a
&lt;install&gt;/lib/nvptx64-nvidia-cuda/libc++.a
&lt;install&gt;/lib/nvptx64-nvidia-cuda/libomp.a
&lt;install&gt;/lib/clang/19/lib/nvptx64-nvidia-cuda/libclang_rt.builtins.a
```
Linking in these libraries will then simply require passing `-lc` like
is already done for non-GPU toolchains. However, this doesn't work with
the currently deficient `nvlink` linker, so I consider this a blocking
issue to massively improving the state of building GPU libraries.

In the future we may be able to convince NVIDIA to port their linker to
`ld.lld`, but for now this is the only workable solution that allows us
to hack around the weird behavior of their closed-source software.
diff --git a/clang/docs/ClangNVLinkWrapper.rst b/clang/docs/ClangNVLinkWrapper.rst
@@ -0,0 +1,64 @@
+====================
+Clang nvlink Wrapper
+====================
+
+.. contents::
+   :local:
+
+.. _clang-nvlink-wrapper:
+
+Introduction
+============
+
+This tools works as a wrapper around the NVIDIA ``nvlink`` linker. The purpose 
+of this wrapper is to provide an interface similar to the ``ld.lld`` linker 
+while still relying on NVIDIA's proprietary linker to produce the final output. 
+Features include, static archive (.a) linking, LTO, and accepting files ending 
+in ``.o`` without error.
+
+Usage
+=====
+
+This tool can be used with the following options. Any arguments not intended
+only for the linker wrapper will be forwarded to ``nvlink``.
+
+.. code-block:: console
+
+  OVERVIEW: A utility that wraps around the NVIDIA 'nvlink' linker.
+  This enables static linking and LTO handling for NVPTX targets.
+
+  USAGE: clang-nvlink-wrapper [options] <options to passed to nvlink>
+
+  OPTIONS:
+    --arch <value>       Specify the 'sm_' name of the target architecture.
+    --cuda-path=<dir>    Set the system CUDA path
+    --dry-run            Print generated commands without running.
+    --feature <value>    Specify the '+ptx' freature to use for LTO.
+    -g                   Specify that this was a debug compile.
+    -help-hidden         Display all available options
+    -help                Display available options (--help-hidden for more)
+    -L <dir>             Add <dir> to the library search path
+    -l <libname>         Search for library <libname>
+    -mllvm <arg>         Arguments passed to LLVM, including Clang invocations, for which the '-mllvm' prefix is preserved. Use '-mllvm --help' for a list of options.
+    -o <path>            Path to file to write output
+    --plugin-opt=jobs=<value>
+                         Number of LTO codegen partitions
+    --plugin-opt=lto-partitions=<value>
+                         Number of LTO codegen partitions
+    --plugin-opt=O<O0, O1, O2, or O3>
+                         Optimization level for LTO
+    --plugin-opt=thinlto<value>
+                         Enable the thin-lto backend
+    --plugin-opt=<value> Options passed to LLVM, not including the Clang invocation. Use '--plugin-opt=--help' for a list of options.
+    --save-temps         Save intermediate results
+    --version            Display the version number and exit
+    -v                   Print verbose information
+
+Example
+=======
+
+This tool is intended to be invoked when targeting the NVPTX toolchain directly. 
+
+.. code-block:: console
+
+  clang --target=nvptx64-nvidia-cuda -march=native -flto=full input.c
diff --git a/clang/docs/index.rst b/clang/docs/index.rst
@@ -92,6 +92,7 @@ Using Clang Tools
    ClangFormatStyleOptions
    ClangFormattedStatus
    ClangLinkerWrapper
+   ClangNVLinkWrapper
    ClangOffloadBundler
    ClangOffloadPackager
    ClangRepl
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -461,13 +461,6 @@ void NVPTX::Assembler::ConstructJob(Compilation &C, const JobAction &JA,
   CmdArgs.push_back("--output-file");
   std::string OutputFileName = TC.getInputFilename(Output);
 
-  // If we are invoking `nvlink` internally we need to output a `.cubin` file.
-  // FIXME: This should hopefully be removed if NVIDIA updates their tooling.
-  if (!C.getInputArgs().getLastArg(options::OPT_c)) {
-    SmallString<256> Filename(Output.getFilename());
-    llvm::sys::path::replace_extension(Filename, "cubin");
-    OutputFileName = Filename.str();
-  }
   if (Output.isFilename() && OutputFileName != Output.getFilename())
     C.addTempFile(Args.MakeArgString(OutputFileName));
 
@@ -618,58 +611,24 @@ void NVPTX::Linker::ConstructJob(Compilation &C, const JobAction &JA,
   // Add standard library search paths passed on the command line.
   Args.AddAllArgs(CmdArgs, options::OPT_L);
   getToolChain().AddFilePathLibArgs(Args, CmdArgs);
+  AddLinkerInputs(getToolChain(), Inputs, Args, CmdArgs, JA);
+
+  if (C.getDriver().isUsingLTO())
+    addLTOOptions(getToolChain(), Args, CmdArgs, Output, Inputs[0],
+                  C.getDriver().getLTOMode() == LTOK_Thin);
 
   // Add paths for the default clang library path.
   SmallString<256> DefaultLibPath =
       llvm::sys::path::parent_path(TC.getDriver().Dir);
   llvm::sys::path::append(DefaultLibPath, CLANG_INSTALL_LIBDIR_BASENAME);
   CmdArgs.push_back(Args.MakeArgString(Twine("-L") + DefaultLibPath));
 
-  for (const auto &II : Inputs) {
-    if (II.getType() == types::TY_LLVM_IR || II.getType() == types::TY_LTO_IR ||
-        II.getType() == types::TY_LTO_BC || II.getType() == types::TY_LLVM_BC) {
-      C.getDriver().Diag(diag::err_drv_no_linker_llvm_support)
-          << getToolChain().getTripleString();
-      continue;
-    }
-
-    // The 'nvlink' application performs RDC-mode linking when given a '.o'
-    // file and device linking when given a '.cubin' file. We always want to
-    // perform device linking, so just rename any '.o' files.
-    // FIXME: This should hopefully be removed if NVIDIA updates their tooling.
-    if (II.isFilename()) {
-      auto InputFile = getToolChain().getInputFilename(II);
-      if (llvm::sys::path::extension(InputFile) != ".cubin") {
-        // If there are no actions above this one then this is direct input and
-        // we can copy it. Otherwise the input is internal so a `.cubin` file
-        // should exist.
-        if (II.getAction() && II.getAction()->getInputs().size() == 0) {
-          const char *CubinF =
-              Args.MakeArgString(getToolChain().getDriver().GetTemporaryPath(
-                  llvm::sys::path::stem(InputFile), "cubin"));
-          if (llvm::sys::fs::copy_file(InputFile, C.addTempFile(CubinF)))
-            continue;
-
-          CmdArgs.push_back(CubinF);
-        } else {
-          SmallString<256> Filename(InputFile);
-          llvm::sys::path::replace_extension(Filename, "cubin");
-          CmdArgs.push_back(Args.MakeArgString(Filename));
-        }
-      } else {
-        CmdArgs.push_back(Args.MakeArgString(InputFile));
-      }
-    } else if (!II.isNothing()) {
-      II.getInputArg().renderAsInput(Args, CmdArgs);
-    }
-  }
-
   C.addCommand(std::make_unique<Command>(
       JA, *this,
       ResponseFileSupport{ResponseFileSupport::RF_Full, llvm::sys::WEM_UTF8,
                           "--options-file"},
-      Args.MakeArgString(getToolChain().GetProgramPath("nvlink")), CmdArgs,
-      Inputs, Output));
+      Args.MakeArgString(getToolChain().GetProgramPath("clang-nvlink-wrapper")),
+      CmdArgs, Inputs, Output));
 }
 
 void NVPTX::getNVPTXTargetFeatures(const Driver &D, const llvm::Triple &Triple,
@@ -949,11 +908,7 @@ std::string CudaToolChain::getInputFilename(const InputInfo &Input) const {
   if (Input.getType() != types::TY_Object || getDriver().offloadDeviceOnly())
     return ToolChain::getInputFilename(Input);
 
-  // Replace extension for object files with cubin because nvlink relies on
-  // these particular file names.
-  SmallString<256> Filename(ToolChain::getInputFilename(Input));
-  llvm::sys::path::replace_extension(Filename, "cubin");
-  return std::string(Filename);
+  return ToolChain::getInputFilename(Input);
 }
 
 llvm::opt::DerivedArgList *
diff --git a/clang/lib/Driver/ToolChains/Cuda.h b/clang/lib/Driver/ToolChains/Cuda.h
@@ -155,6 +155,7 @@ class LLVM_LIBRARY_VISIBILITY NVPTXToolChain : public ToolChain {
   bool isPIEDefault(const llvm::opt::ArgList &Args) const override {
     return false;
   }
+  bool HasNativeLLVMSupport() const override { return true; }
   bool isPICDefaultForced() const override { return false; }
   bool SupportsProfiling() const override { return false; }
 
@@ -192,6 +193,8 @@ class LLVM_LIBRARY_VISIBILITY CudaToolChain : public NVPTXToolChain {
     return &HostTC.getTriple();
   }
 
+  bool HasNativeLLVMSupport() const override { return false; }
+
   std::string getInputFilename(const InputInfo &Input) const override;
 
   llvm::opt::DerivedArgList *
diff --git a/clang/test/Driver/cuda-cross-compiling.c b/clang/test/Driver/cuda-cross-compiling.c
@@ -32,8 +32,8 @@
 // RUN:   | FileCheck -check-prefix=ARGS %s
 
 //      ARGS: -cc1" "-triple" "nvptx64-nvidia-cuda" "-S" {{.*}} "-target-cpu" "sm_61" "-target-feature" "+ptx{{[0-9]+}}" {{.*}} "-o" "[[PTX:.+]].s"
-// ARGS-NEXT: ptxas{{.*}}"-m64" "-O0" "--gpu-name" "sm_61" "--output-file" "[[CUBIN:.+]].cubin" "[[PTX]].s" "-c"
-// ARGS-NEXT: nvlink{{.*}}"-o" "a.out" "-arch" "sm_61" {{.*}} "[[CUBIN]].cubin"
+// ARGS-NEXT: ptxas{{.*}}"-m64" "-O0" "--gpu-name" "sm_61" "--output-file" "[[CUBIN:.+]].o" "[[PTX]].s" "-c"
+// ARGS-NEXT: clang-nvlink-wrapper{{.*}}"-o" "a.out" "-arch" "sm_61"{{.*}}"[[CUBIN]].o"
 
 //
 // Test the generated arguments to the CUDA binary utils when targeting NVPTX. 
@@ -55,7 +55,7 @@
 // RUN: %clang -target nvptx64-nvidia-cuda -march=sm_61 -### %t.o 2>&1 \
 // RUN:   | FileCheck -check-prefix=LINK %s
 
-// LINK: nvlink{{.*}}"-o" "a.out" "-arch" "sm_61" {{.*}} "{{.*}}.cubin"
+// LINK: clang-nvlink-wrapper{{.*}}"-o" "a.out" "-arch" "sm_61"{{.*}}[[CUBIN]].o
 
 //
 // Test to ensure that we enable handling global constructors in a freestanding
@@ -72,7 +72,7 @@
 // RUN: %clang -target nvptx64-nvidia-cuda -Wl,-v -Wl,a,b -march=sm_52 -### %s 2>&1 \
 // RUN:   | FileCheck -check-prefix=LINKER-ARGS %s
 
-// LINKER-ARGS: nvlink{{.*}}"-v"{{.*}}"a" "b"
+// LINKER-ARGS: clang-nvlink-wrapper{{.*}}"-v"{{.*}}"a" "b"
 
 // Tests for handling a missing architecture.
 //
diff --git a/clang/test/Driver/nvlink-wrapper.c b/clang/test/Driver/nvlink-wrapper.c
@@ -0,0 +1,65 @@
+// REQUIRES: x86-registered-target
+// REQUIRES: nvptx-registered-target
+
+#if defined(X)
+extern int y;
+int foo() { return y; }
+
+int x = 0;
+#elif defined(Y)
+int y = 42;
+#elif defined(Z)
+int z = 42;
+#elif defined(W)
+int w = 42;
+#elif defined(U)
+extern int x;
+extern int __attribute__((weak)) w;
+
+int bar() {
+  return x + w;
+}
+#else
+extern int y;
+int __attribute__((visibility("hidden"))) x = 999;
+int baz() { return y + x; }
+#endif
+
+// Create various inputs to test basic linking and LTO capabilities. Creating a
+// CUDA binary requires access to the `ptxas` executable, so we just use x64.
+// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -DX -o %t-x.o
+// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -DY -o %t-y.o
+// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -DZ -o %t-z.o
+// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -DW -o %t-w.o
+// RUN: %clang -cc1 %s -triple x86_64-unknown-linux-gnu -emit-obj -DU -o %t-u.o
+// RUN: llvm-ar rcs %t-x.a %t-x.o
+// RUN: llvm-ar rcs %t-y.a %t-y.o
+// RUN: llvm-ar rcs %t-z.a %t-z.o
+// RUN: llvm-ar rcs %t-w.a %t-w.o
+
+//
+// Check that we forward any unrecognized argument to 'nvlink'.
+//
+// RUN: clang-nvlink-wrapper --dry-run -arch sm_52 %t-u.o -foo -o a.out 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARGS
+// ARGS: nvlink{{.*}} -arch sm_52 -foo -o a.out [[INPUT:.+]].cubin
+
+//
+// Check the symbol resolution for static archives. We expect to only link
+// `libx.a` and `liby.a` because extern weak symbols do not extract and `libz.a`
+// is not used at all.
+//
+// RUN: clang-nvlink-wrapper --dry-run %t-x.a %t-u.o %t-y.a %t-z.a %t-w.a \
+// RUN:   -arch sm_52 -o a.out 2>&1 | FileCheck %s --check-prefix=LINK
+// LINK: nvlink{{.*}} -arch sm_52 -o a.out [[INPUT:.+]].cubin {{.*}}-x-{{.*}}.cubin{{.*}}-y-{{.*}}.cubin
+
+// RUN: %clang -cc1 %s -triple nvptx64-nvidia-cuda -emit-llvm-bc -o %t.o
+
+//
+// Check that the LTO interface works and properly preserves symbols used in a
+// regular object file.
+//
+// RUN: clang-nvlink-wrapper --dry-run %t.o %t-u.o %t-y.a \
+// RUN:   -arch sm_52 -o a.out 2>&1 | FileCheck %s --check-prefix=LTO
+// LTO: ptxas{{.*}} -m64 -c [[PTX:.+]].s -O3 -arch sm_52 -o [[CUBIN:.+]].cubin
+// LTO: nvlink{{.*}} -arch sm_52 -o a.out [[CUBIN]].cubin {{.*}}-u-{{.*}}.cubin {{.*}}-y-{{.*}}.cubin
diff --git a/clang/test/lit.cfg.py b/clang/test/lit.cfg.py
@@ -95,6 +95,7 @@
     "llvm-ifs",
     "yaml2obj",
     "clang-linker-wrapper",
+    "clang-nvlink-wrapper",
     "llvm-lto",
     "llvm-lto2",
     "llvm-profdata",
diff --git a/clang/tools/CMakeLists.txt b/clang/tools/CMakeLists.txt
@@ -9,6 +9,7 @@ add_clang_subdirectory(clang-format-vs)
 add_clang_subdirectory(clang-fuzzer)
 add_clang_subdirectory(clang-import-test)
 add_clang_subdirectory(clang-linker-wrapper)
+add_clang_subdirectory(clang-nvlink-wrapper)
 add_clang_subdirectory(clang-offload-packager)
 add_clang_subdirectory(clang-offload-bundler)
 add_clang_subdirectory(clang-scan-deps)
diff --git a/clang/tools/clang-nvlink-wrapper/CMakeLists.txt b/clang/tools/clang-nvlink-wrapper/CMakeLists.txt
@@ -0,0 +1,44 @@
+set(LLVM_LINK_COMPONENTS
+  ${LLVM_TARGETS_TO_BUILD}
+  BitWriter
+  Core
+  BinaryFormat
+  MC
+  Target
+  TransformUtils
+  Analysis
+  Passes
+  IRReader
+  Object
+  Option
+  Support
+  TargetParser
+  CodeGen
+  LTO
+  )
+
+set(LLVM_TARGET_DEFINITIONS NVLinkOpts.td)
+tablegen(LLVM NVLinkOpts.inc -gen-opt-parser-defs)
+add_public_tablegen_target(NVLinkWrapperOpts)
+
+if(NOT CLANG_BUILT_STANDALONE)
+  set(tablegen_deps intrinsics_gen NVLinkWrapperOpts)
+endif()
+
+add_clang_tool(clang-nvlink-wrapper
+  ClangNVLinkWrapper.cpp
+
+  DEPENDS
+  ${tablegen_deps}
+  )
+
+set(CLANG_NVLINK_WRAPPER_LIB_DEPS
+  clangBasic
+  )
+
+target_compile_options(clang-nvlink-wrapper PRIVATE "-g" "-O0")
+
+target_link_libraries(clang-nvlink-wrapper
+  PRIVATE
+  ${CLANG_NVLINK_WRAPPER_LIB_DEPS}
+  )
diff --git a/clang/tools/clang-nvlink-wrapper/ClangNVLinkWrapper.cpp b/clang/tools/clang-nvlink-wrapper/ClangNVLinkWrapper.cpp
diff --git a/clang/tools/clang-nvlink-wrapper/NVLinkOpts.td b/clang/tools/clang-nvlink-wrapper/NVLinkOpts.td

Original file line number	Diff line number	Diff line change
`@@ -155,6 +155,7 @@ class LLVM_LIBRARY_VISIBILITY NVPTXToolChain : public ToolChain {`
`155`	`155`	`bool isPIEDefault(const llvm::opt::ArgList &Args) const override {`
`156`	`156`	`return false;`
`157`	`157`	`}`
	`158`	`+ bool HasNativeLLVMSupport() const override { return true; }`
`158`	`159`	`bool isPICDefaultForced() const override { return false; }`
`159`	`160`	`bool SupportsProfiling() const override { return false; }`
`160`	`161`
`@@ -192,6 +193,8 @@ class LLVM_LIBRARY_VISIBILITY CudaToolChain : public NVPTXToolChain {`
`192`	`193`	`return &HostTC.getTriple();`
`193`	`194`	`}`
`194`	`195`
	`196`	`+ bool HasNativeLLVMSupport() const override { return false; }`
	`197`	`+`
`195`	`198`	`std::string getInputFilename(const InputInfo &Input) const override;`
`196`	`199`
`197`	`200`	`llvm::opt::DerivedArgList *`