Skip to content

[Offload] Add Error Codes to PluginInterface #138258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

RossBrunton
Copy link
Contributor

@RossBrunton RossBrunton commented May 2, 2025

A new ErrorCode enumeration is present in PluginInterface which can
be used when returning an llvm::Error from offload and PluginInterface
functions.

This enum must be kept up to sync with liboffload's ol_errc_t enum, so
both are automatically generated from liboffload's enum definition.

Some error codes have also been shuffled around to allow for future
work. Note that this patch only adds the machinery; actual error codes
will be added in a future patch.

Depends on #137339 , please ignore first commit of this MR. This has been merged.

@llvmbot llvmbot added the offload label May 2, 2025
@RossBrunton
Copy link
Contributor Author

I'm not sure if this is the exact design LLVM is looking for for error codes here, but it's probably worth having something to look at and discuss as a possible solution.

Basically, we have both liboffload and PluginInterface wanting to share the same error codes (to avoid converting to and from them). However, we also want these error codes to be available in the OffloadAPI.h file directly so only one file needs to be shipped to users. Further compounding this is that OffloadAPI.h is a C header, so we can't use any of the std::error_code goodness.

I've added a new tablegen target such that when offload is generating OffloadAPI.h, it also generates a OffloadErrcodes.inc file inside the plugins include dir (which ensures that they always stay in sync), which is checked into the repo like generated files.

This is a bit jank, so if anyone knows of a simpler way, let me know.

@llvmbot
Copy link
Member

llvmbot commented May 2, 2025

@llvm/pr-subscribers-offload

Author: Ross Brunton (RossBrunton)

Changes

A new ErrorCode enumeration is present in PluginInterface which can
be used when returning an llvm::Error from offload and PluginInterface
functions.

This enum must be kept up to sync with liboffload's ol_errc_t enum, so
both are automatically generated from liboffload's enum definition.

Some error codes have also been shuffled around to allow for future
work. Note that this patch only adds the machinery; actual error codes
will be added in a future patch.

Depends on #137339 , please ignore first commit of this MR.


Patch is 26.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138258.diff

17 Files Affected:

  • (modified) offload/liboffload/API/APIDefs.td (+2-1)
  • (modified) offload/liboffload/API/CMakeLists.txt (+7-3)
  • (modified) offload/liboffload/API/Common.td (+13-9)
  • (modified) offload/liboffload/CMakeLists.txt (+2)
  • (modified) offload/liboffload/include/OffloadImpl.hpp (+22)
  • (modified) offload/liboffload/include/generated/OffloadAPI.h (+41-42)
  • (modified) offload/liboffload/include/generated/OffloadPrint.hpp (+12-15)
  • (modified) offload/liboffload/src/OffloadImpl.cpp (+18-15)
  • (added) offload/plugins-nextgen/common/include/OffloadErrcodes.inc (+37)
  • (modified) offload/plugins-nextgen/common/include/PluginInterface.h (+37-1)
  • (modified) offload/tools/offload-tblgen/APIGen.cpp (+3)
  • (modified) offload/tools/offload-tblgen/CMakeLists.txt (+1-2)
  • (modified) offload/tools/offload-tblgen/Generators.hpp (+2)
  • (renamed) offload/tools/offload-tblgen/MiscGen.cpp (+24)
  • (modified) offload/tools/offload-tblgen/RecordTypes.hpp (+7)
  • (modified) offload/tools/offload-tblgen/offload-tblgen.cpp (+7-2)
  • (modified) offload/unittests/OffloadAPI/kernel/olGetKernel.cpp (+7)
diff --git a/offload/liboffload/API/APIDefs.td b/offload/liboffload/API/APIDefs.td
index 640932dcf8464..38508525f1d26 100644
--- a/offload/liboffload/API/APIDefs.td
+++ b/offload/liboffload/API/APIDefs.td
@@ -152,9 +152,10 @@ class Function : APIObject {
         AddHandleChecksToReturns<params, returns_with_def>.returns_out>.returns_out;
 }
 
-class Etor<string Name, string Desc> {
+class Etor<string Name, string Desc, int Value=-1> {
   string name = Name;
   string desc = Desc;
+  int value = Value;
   string tagged_type;
 }
 
diff --git a/offload/liboffload/API/CMakeLists.txt b/offload/liboffload/API/CMakeLists.txt
index 8fd6cb539374a..62b8803ca2a13 100644
--- a/offload/liboffload/API/CMakeLists.txt
+++ b/offload/liboffload/API/CMakeLists.txt
@@ -11,13 +11,17 @@ if (CLANG_FORMAT)
     tablegen(OFFLOAD OffloadFuncs.inc -gen-func-names)
     tablegen(OFFLOAD OffloadImplFuncDecls.inc -gen-impl-func-decls)
     tablegen(OFFLOAD OffloadPrint.hpp -gen-print-header)
+    tablegen(OFFLOAD OffloadErrcodes.inc -gen-errcodes)
 
-    set(OFFLOAD_GENERATED_FILES ${TABLEGEN_OUTPUT})
+    set(FILES_TO_COPY "OffloadAPI.h;OffloadEntryPoints.inc;OffloadFuncs.inc;OffloadImplFuncDecls.inc;OffloadPrint.hpp")
+    set(GEN_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../include/generated)
     add_public_tablegen_target(OffloadGenerate)
     add_custom_command(TARGET OffloadGenerate POST_BUILD COMMAND ${CLANG_FORMAT}
-        -i ${OFFLOAD_GENERATED_FILES})
+        -i ${TABLEGEN_OUTPUT})
     add_custom_command(TARGET OffloadGenerate POST_BUILD COMMAND ${CMAKE_COMMAND}
-        -E copy_if_different ${OFFLOAD_GENERATED_FILES} "${CMAKE_CURRENT_SOURCE_DIR}/../include/generated")
+        -E copy_if_different ${FILES_TO_COPY} ${GEN_DIR})
+    add_custom_command(TARGET OffloadGenerate POST_BUILD COMMAND ${CMAKE_COMMAND}
+        -E copy_if_different OffloadErrcodes.inc "${LIBOFFLOAD_ROOT}/../plugins-nextgen/common/include/OffloadErrcodes.inc")
 else()
     message(WARNING "clang-format was not found, so the OffloadGenerate target\
         will not be available. Offload will still build, but you will not be\
diff --git a/offload/liboffload/API/Common.td b/offload/liboffload/API/Common.td
index de7502b540618..12551d1eb5fd5 100644
--- a/offload/liboffload/API/Common.td
+++ b/offload/liboffload/API/Common.td
@@ -83,26 +83,30 @@ def : Typedef {
   let value = "void *";
 }
 
-def : Enum {
+def ErrorCode : Enum {
   let name = "ol_errc_t";
   let desc = "Defines Return/Error codes";
   let etors =[
-    Etor<"SUCCESS", "Success">,
-    Etor<"INVALID_VALUE", "Invalid Value">,
+    Etor<"SUCCESS", "Success", 0>,
+
+    // Universal errors
+    Etor<"INVALID_NULL_POINTER", "A pointer argument is null when it should not be">,
+    Etor<"INVALID_ARGUMENT", "An argument is invalid">,
+    Etor<"OUT_OF_RESOURCES", "Out of resources">,
+    Etor<"UNSUPPORTED", "generic error code for unsupported features and enums">,
+
+    // Liboffload specific errors
+    Etor<"INVALID_VALUE", "Invalid Value", 0x1000>,
     Etor<"INVALID_PLATFORM", "Invalid platform">,
     Etor<"INVALID_DEVICE", "Invalid device">,
     Etor<"INVALID_QUEUE", "Invalid queue">,
     Etor<"INVALID_EVENT", "Invalid event">,
     Etor<"INVALID_KERNEL_NAME", "Named kernel not found in the program binary">,
-    Etor<"OUT_OF_RESOURCES", "Out of resources">,
-    Etor<"UNSUPPORTED_FEATURE", "generic error code for unsupported features">,
-    Etor<"INVALID_ARGUMENT", "generic error code for invalid arguments">,
     Etor<"INVALID_NULL_HANDLE", "handle argument is not valid">,
-    Etor<"INVALID_NULL_POINTER", "pointer argument may not be nullptr">,
     Etor<"INVALID_SIZE", "invalid size or dimensions (e.g., must not be zero, or is out of bounds)">,
     Etor<"INVALID_ENUMERATION", "enumerator argument is not valid">,
-    Etor<"UNSUPPORTED_ENUMERATION", "enumerator argument is not supported by the device">,
-    Etor<"UNKNOWN", "Unknown or internal error">
+
+    Etor<"UNKNOWN", "Unknown or internal error", 0x10000>
   ];
 }
 
diff --git a/offload/liboffload/CMakeLists.txt b/offload/liboffload/CMakeLists.txt
index db12236ddfc7f..9927fa3c3400a 100644
--- a/offload/liboffload/CMakeLists.txt
+++ b/offload/liboffload/CMakeLists.txt
@@ -1,3 +1,5 @@
+set(LIBOFFLOAD_ROOT "${CMAKE_CURRENT_SOURCE_DIR}")
+
 add_subdirectory(API)
 
 add_llvm_library(
diff --git a/offload/liboffload/include/OffloadImpl.hpp b/offload/liboffload/include/OffloadImpl.hpp
index ec470a355309a..7d2c0c53fc85b 100644
--- a/offload/liboffload/include/OffloadImpl.hpp
+++ b/offload/liboffload/include/OffloadImpl.hpp
@@ -7,6 +7,7 @@
 //===----------------------------------------------------------------------===//
 #pragma once
 
+#include "PluginInterface.h"
 #include <OffloadAPI.h>
 #include <iostream>
 #include <memory>
@@ -19,6 +20,7 @@
 #include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/ADT/StringSet.h"
+#include "llvm/Support/Error.h"
 
 struct OffloadConfig {
   bool TracingEnabled = false;
@@ -88,8 +90,28 @@ struct ol_impl_result_t {
     Result = errors().emplace(std::move(Err)).first->get();
   }
 
+  static ol_impl_result_t fromError(llvm::Error &&Error) {
+    ol_errc_t ErrCode;
+    llvm::StringRef Details;
+    llvm::handleAllErrors(std::move(Error), [&](llvm::StringError &Err) {
+      ErrCode = GetErrorCode(Err.convertToErrorCode());
+      Details = errorStrs().insert(Err.getMessage()).first->getKeyData();
+    });
+
+    return ol_impl_result_t{ErrCode, Details};
+  }
+
   operator ol_result_t() { return Result; }
 
 private:
+  static ol_errc_t GetErrorCode(std::error_code Code) {
+    if (Code.category() == llvm::omp::target::plugin::make_error_code(
+                               llvm::omp::target::plugin::ErrorCode::SUCCESS)
+                               .category()) {
+      return static_cast<ol_errc_t>(Code.value());
+    }
+    return OL_ERRC_UNKNOWN;
+  }
+
   ol_result_t Result;
 };
diff --git a/offload/liboffload/include/generated/OffloadAPI.h b/offload/liboffload/include/generated/OffloadAPI.h
index ace31c57cf2f8..13a840ce772fb 100644
--- a/offload/liboffload/include/generated/OffloadAPI.h
+++ b/offload/liboffload/include/generated/OffloadAPI.h
@@ -17,6 +17,45 @@
 extern "C" {
 #endif
 
+///////////////////////////////////////////////////////////////////////////////
+/// @brief Defines Return/Error codes
+typedef enum ol_errc_t {
+  /// Success
+  OL_ERRC_SUCCESS = 0,
+  /// A pointer argument is null when it should not be
+  OL_ERRC_INVALID_NULL_POINTER = 1,
+  /// An argument is invalid
+  OL_ERRC_INVALID_ARGUMENT = 2,
+  /// Out of resources
+  OL_ERRC_OUT_OF_RESOURCES = 3,
+  /// generic error code for unsupported features and enums
+  OL_ERRC_UNSUPPORTED = 4,
+  /// Invalid Value
+  OL_ERRC_INVALID_VALUE = 4096,
+  /// Invalid platform
+  OL_ERRC_INVALID_PLATFORM = 4097,
+  /// Invalid device
+  OL_ERRC_INVALID_DEVICE = 4098,
+  /// Invalid queue
+  OL_ERRC_INVALID_QUEUE = 4099,
+  /// Invalid event
+  OL_ERRC_INVALID_EVENT = 4100,
+  /// Named kernel not found in the program binary
+  OL_ERRC_INVALID_KERNEL_NAME = 4101,
+  /// handle argument is not valid
+  OL_ERRC_INVALID_NULL_HANDLE = 4102,
+  /// invalid size or dimensions (e.g., must not be zero, or is out of bounds)
+  OL_ERRC_INVALID_SIZE = 4103,
+  /// enumerator argument is not valid
+  OL_ERRC_INVALID_ENUMERATION = 4104,
+  /// Unknown or internal error
+  OL_ERRC_UNKNOWN = 65536,
+  /// @cond
+  OL_ERRC_FORCE_UINT32 = 0x7fffffff
+  /// @endcond
+
+} ol_errc_t;
+
 ///////////////////////////////////////////////////////////////////////////////
 #ifndef OL_VERSION_MAJOR
 /// @brief Major version of the Offload API
@@ -101,47 +140,6 @@ typedef struct ol_program_impl_t *ol_program_handle_t;
 /// @brief Handle of kernel object
 typedef void *ol_kernel_handle_t;
 
-///////////////////////////////////////////////////////////////////////////////
-/// @brief Defines Return/Error codes
-typedef enum ol_errc_t {
-  /// Success
-  OL_ERRC_SUCCESS = 0,
-  /// Invalid Value
-  OL_ERRC_INVALID_VALUE = 1,
-  /// Invalid platform
-  OL_ERRC_INVALID_PLATFORM = 2,
-  /// Invalid device
-  OL_ERRC_INVALID_DEVICE = 3,
-  /// Invalid queue
-  OL_ERRC_INVALID_QUEUE = 4,
-  /// Invalid event
-  OL_ERRC_INVALID_EVENT = 5,
-  /// Named kernel not found in the program binary
-  OL_ERRC_INVALID_KERNEL_NAME = 6,
-  /// Out of resources
-  OL_ERRC_OUT_OF_RESOURCES = 7,
-  /// generic error code for unsupported features
-  OL_ERRC_UNSUPPORTED_FEATURE = 8,
-  /// generic error code for invalid arguments
-  OL_ERRC_INVALID_ARGUMENT = 9,
-  /// handle argument is not valid
-  OL_ERRC_INVALID_NULL_HANDLE = 10,
-  /// pointer argument may not be nullptr
-  OL_ERRC_INVALID_NULL_POINTER = 11,
-  /// invalid size or dimensions (e.g., must not be zero, or is out of bounds)
-  OL_ERRC_INVALID_SIZE = 12,
-  /// enumerator argument is not valid
-  OL_ERRC_INVALID_ENUMERATION = 13,
-  /// enumerator argument is not supported by the device
-  OL_ERRC_UNSUPPORTED_ENUMERATION = 14,
-  /// Unknown or internal error
-  OL_ERRC_UNKNOWN = 15,
-  /// @cond
-  OL_ERRC_FORCE_UINT32 = 0x7fffffff
-  /// @endcond
-
-} ol_errc_t;
-
 ///////////////////////////////////////////////////////////////////////////////
 /// @brief Details of the error condition returned by an API call
 typedef struct ol_error_struct_t {
@@ -477,7 +475,8 @@ OL_APIEXPORT ol_result_t OL_APICALL olMemFree(
 /// @brief Enqueue a memcpy operation.
 ///
 /// @details
-///    - For host pointers, use the device returned by olGetHostDevice
+///    - For host pointers, use the host device belonging to the
+///    OL_PLATFORM_BACKEND_HOST platform.
 ///    - If a queue is specified, at least one device must be a non-host device
 ///    - If a queue is not specified, the memcpy happens synchronously
 ///
diff --git a/offload/liboffload/include/generated/OffloadPrint.hpp b/offload/liboffload/include/generated/OffloadPrint.hpp
index 7f5e33aea6f73..e99bb2db669fb 100644
--- a/offload/liboffload/include/generated/OffloadPrint.hpp
+++ b/offload/liboffload/include/generated/OffloadPrint.hpp
@@ -49,6 +49,18 @@ inline llvm::raw_ostream &operator<<(llvm::raw_ostream &os,
   case OL_ERRC_SUCCESS:
     os << "OL_ERRC_SUCCESS";
     break;
+  case OL_ERRC_INVALID_NULL_POINTER:
+    os << "OL_ERRC_INVALID_NULL_POINTER";
+    break;
+  case OL_ERRC_INVALID_ARGUMENT:
+    os << "OL_ERRC_INVALID_ARGUMENT";
+    break;
+  case OL_ERRC_OUT_OF_RESOURCES:
+    os << "OL_ERRC_OUT_OF_RESOURCES";
+    break;
+  case OL_ERRC_UNSUPPORTED:
+    os << "OL_ERRC_UNSUPPORTED";
+    break;
   case OL_ERRC_INVALID_VALUE:
     os << "OL_ERRC_INVALID_VALUE";
     break;
@@ -67,30 +79,15 @@ inline llvm::raw_ostream &operator<<(llvm::raw_ostream &os,
   case OL_ERRC_INVALID_KERNEL_NAME:
     os << "OL_ERRC_INVALID_KERNEL_NAME";
     break;
-  case OL_ERRC_OUT_OF_RESOURCES:
-    os << "OL_ERRC_OUT_OF_RESOURCES";
-    break;
-  case OL_ERRC_UNSUPPORTED_FEATURE:
-    os << "OL_ERRC_UNSUPPORTED_FEATURE";
-    break;
-  case OL_ERRC_INVALID_ARGUMENT:
-    os << "OL_ERRC_INVALID_ARGUMENT";
-    break;
   case OL_ERRC_INVALID_NULL_HANDLE:
     os << "OL_ERRC_INVALID_NULL_HANDLE";
     break;
-  case OL_ERRC_INVALID_NULL_POINTER:
-    os << "OL_ERRC_INVALID_NULL_POINTER";
-    break;
   case OL_ERRC_INVALID_SIZE:
     os << "OL_ERRC_INVALID_SIZE";
     break;
   case OL_ERRC_INVALID_ENUMERATION:
     os << "OL_ERRC_INVALID_ENUMERATION";
     break;
-  case OL_ERRC_UNSUPPORTED_ENUMERATION:
-    os << "OL_ERRC_UNSUPPORTED_ENUMERATION";
-    break;
   case OL_ERRC_UNKNOWN:
     os << "OL_ERRC_UNKNOWN";
     break;
diff --git a/offload/liboffload/src/OffloadImpl.cpp b/offload/liboffload/src/OffloadImpl.cpp
index bef72a7d1851a..b50c7e0f87b7c 100644
--- a/offload/liboffload/src/OffloadImpl.cpp
+++ b/offload/liboffload/src/OffloadImpl.cpp
@@ -311,8 +311,7 @@ ol_impl_result_t olMemAlloc_impl(ol_device_handle_t Device,
   auto Alloc =
       Device->Device->dataAlloc(Size, nullptr, convertOlToPluginAllocTy(Type));
   if (!Alloc)
-    return {OL_ERRC_OUT_OF_RESOURCES,
-            formatv("Could not create allocation on device {0}", Device).str()};
+    return ol_impl_result_t::fromError(Alloc.takeError());
 
   *AllocationOut = *Alloc;
   allocInfoMap().insert_or_assign(*Alloc, AllocInfo{Device, Type});
@@ -330,7 +329,7 @@ ol_impl_result_t olMemFree_impl(void *Address) {
   auto Res =
       Device->Device->dataDelete(Address, convertOlToPluginAllocTy(Type));
   if (Res)
-    return {OL_ERRC_OUT_OF_RESOURCES, "Could not free allocation"};
+    return ol_impl_result_t::fromError(std::move(Res));
 
   allocInfoMap().erase(Address);
 
@@ -342,7 +341,7 @@ ol_impl_result_t olCreateQueue_impl(ol_device_handle_t Device,
   auto CreatedQueue = std::make_unique<ol_queue_impl_t>(nullptr, Device);
   auto Err = Device->Device->initAsyncInfo(&(CreatedQueue->AsyncInfo));
   if (Err)
-    return {OL_ERRC_UNKNOWN, "Could not initialize stream resource"};
+    return ol_impl_result_t::fromError(std::move(Err));
 
   *Queue = CreatedQueue.release();
   return OL_SUCCESS;
@@ -358,7 +357,7 @@ ol_impl_result_t olWaitQueue_impl(ol_queue_handle_t Queue) {
   if (Queue->AsyncInfo->Queue) {
     auto Err = Queue->Device->Device->synchronize(Queue->AsyncInfo);
     if (Err)
-      return {OL_ERRC_INVALID_QUEUE, "The queue failed to synchronize"};
+      return ol_impl_result_t::fromError(std::move(Err));
   }
 
   // Recreate the stream resource so the queue can be reused
@@ -366,7 +365,7 @@ ol_impl_result_t olWaitQueue_impl(ol_queue_handle_t Queue) {
   // it to begin with.
   auto Res = Queue->Device->Device->initAsyncInfo(&Queue->AsyncInfo);
   if (Res)
-    return {OL_ERRC_UNKNOWN, "Could not reinitialize the stream resource"};
+    return ol_impl_result_t::fromError(std::move(Res));
 
   return OL_SUCCESS;
 }
@@ -374,7 +373,7 @@ ol_impl_result_t olWaitQueue_impl(ol_queue_handle_t Queue) {
 ol_impl_result_t olWaitEvent_impl(ol_event_handle_t Event) {
   auto Res = Event->Queue->Device->Device->syncEvent(Event->EventInfo);
   if (Res)
-    return {OL_ERRC_INVALID_EVENT, "The event failed to synchronize"};
+    return ol_impl_result_t::fromError(std::move(Res));
 
   return OL_SUCCESS;
 }
@@ -390,13 +389,17 @@ ol_impl_result_t olDestroyEvent_impl(ol_event_handle_t Event) {
 ol_event_handle_t makeEvent(ol_queue_handle_t Queue) {
   auto EventImpl = std::make_unique<ol_event_impl_t>(nullptr, Queue);
   auto Res = Queue->Device->Device->createEvent(&EventImpl->EventInfo);
-  if (Res)
+  if (Res) {
+    llvm::consumeError(std::move(Res));
     return nullptr;
+  }
 
   Res = Queue->Device->Device->recordEvent(EventImpl->EventInfo,
                                            Queue->AsyncInfo);
-  if (Res)
+  if (Res) {
+    llvm::consumeError(std::move(Res));
     return nullptr;
+  }
 
   return EventImpl.release();
 }
@@ -422,16 +425,16 @@ ol_impl_result_t olMemcpy_impl(ol_queue_handle_t Queue, void *DstPtr,
   if (DstDevice == HostDevice()) {
     auto Res = SrcDevice->Device->dataRetrieve(DstPtr, SrcPtr, Size, QueueImpl);
     if (Res)
-      return {OL_ERRC_UNKNOWN, "The data retrieve operation failed"};
+      return ol_impl_result_t::fromError(std::move(Res));
   } else if (SrcDevice == HostDevice()) {
     auto Res = DstDevice->Device->dataSubmit(DstPtr, SrcPtr, Size, QueueImpl);
     if (Res)
-      return {OL_ERRC_UNKNOWN, "The data submit operation failed"};
+      return ol_impl_result_t::fromError(std::move(Res));
   } else {
     auto Res = SrcDevice->Device->dataExchange(SrcPtr, *DstDevice->Device,
                                                DstPtr, Size, QueueImpl);
     if (Res)
-      return {OL_ERRC_UNKNOWN, "The data exchange operation failed"};
+      return ol_impl_result_t::fromError(std::move(Res));
   }
 
   if (EventOut)
@@ -459,7 +462,7 @@ ol_impl_result_t olCreateProgram_impl(ol_device_handle_t Device,
       Device->Device->loadBinary(Device->Device->Plugin, &Prog->DeviceImage);
   if (!Res) {
     delete Prog;
-    return OL_ERRC_INVALID_VALUE;
+    return ol_impl_result_t::fromError(Res.takeError());
   }
 
   Prog->Image = *Res;
@@ -483,7 +486,7 @@ ol_impl_result_t olGetKernel_impl(ol_program_handle_t Program,
 
   auto Err = KernelImpl->init(Device, *Program->Image);
   if (Err)
-    return {OL_ERRC_UNKNOWN, "Could not initialize the kernel"};
+    return ol_impl_result_t::fromError(std::move(Err));
 
   *Kernel = &*KernelImpl;
 
@@ -526,7 +529,7 @@ olLaunchKernel_impl(ol_queue_handle_t Queue, ol_device_handle_t Device,
 
   AsyncInfoWrapper.finalize(Err);
   if (Err)
-    return {OL_ERRC_UNKNOWN, "Could not finalize the AsyncInfoWrapper"};
+    return ol_impl_result_t::fromError(std::move(Err));
 
   if (EventOut)
     *EventOut = makeEvent(Queue);
diff --git a/offload/plugins-nextgen/common/include/OffloadErrcodes.inc b/offload/plugins-nextgen/common/include/OffloadErrcodes.inc
new file mode 100644
index 0000000000000..146a2cd0ce0bf
--- /dev/null
+++ b/offload/plugins-nextgen/common/include/OffloadErrcodes.inc
@@ -0,0 +1,37 @@
+//===- Auto-generated file, part of the LLVM/Offload project --------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef OFFLOAD_ERRC
+#error Please define the macro OFFLOAD_ERRCODE(Name, Desc, Value)
+#endif
+
+// Error codes are shared between PluginInterface and liboffload.
+// To add new error codes, add them to offload/liboffload/API/Common.td and run
+// the GenerateOffload target.
+
+OFFLOAD_ERRC(SUCCESS, "Success", 0)
+OFFLOAD_ERRC(INVALID_NULL_POINTER,
+             "A pointer argument is null when it should not be", 1)
+OFFLOAD_ERRC(INVALID_ARGUMENT, "An argument is invalid", 2)
+OFFLOAD_ERRC(OUT_OF_RESOURCES, "Out of resources", 3)
+OFFLOAD_ERRC(UNSUPPORTED,
+             "generic error code for unsupported features and enums", 4)
+OFFLOAD_ERRC(INVALID_VALUE, "Invalid Value", 4096)
+OFFLOAD_ERRC(INVALID_PLATFORM, "Invalid platform", 4097)
+OFFLOAD_ERRC(INVALID_DEVICE, "Invalid device", 4098)
+OFFLOAD_ERRC(INVALID_QUEUE, "Invalid queue", 4099)
+OFFLOAD_ERRC(INVALID_EVENT, "Invalid event", 4100)
+OFFLOAD_ERRC(INVALID_KERNEL_NAME,
+             "Named kernel not found in the program binary", 4101)
+OFFLOAD_ERRC(INVALID_NULL_HANDLE, "handle argument is not valid", 4102)
+OFFLOAD_ERRC(
+    INVALID_SIZE,
+    "invalid size or dimensions (e.g., must not be zero, or is out of bounds)",
+    4103)
+OFFLOAD_ERRC(INVALID_ENUMERATION, "enumerator argument is not valid", 4104)
+OFFLOAD_ERRC(UNKNOWN, "Unknown or internal error", 65536)
diff --git a/offload/plugins-nextgen/common/include/PluginInterface.h b/offload/plugins-nextgen/common/include/PluginInterface.h
index e54a8afdd3f4f..19fe26b10760b 100644
--- a/offload/plugins-nextgen/common/include/PluginInterface.h
+++ b/offload/plugins-nextgen/common/include/PluginInterface.h
@@ -58,6 +58,30 @@ struct GenericKernelTy;
 struct GenericDeviceTy;
 struct RecordReplayTy;
 
+enum class ErrorCode {
+#define OFFLOAD_ERRC(Name, _, Value) Name = Value,
+#include "OffloadErrcodes.inc"
+#undef OFFLOAD_ERRC
+};
+
+class OffloadErrorCategory : public std::error_category {
+  const char *name() const noexcept override { return "Offload Error"; }
+  std::string message(int ev) const override {
+    switch (static_cast<ErrorCode>(ev)) {
+#define OFFLOAD_ERRC(Name, Desc, Value)                                        \
+  case ErrorCode::Name:                                                        \
+    return #Desc;
+#include "OffloadErrcodes.inc"
+#undef OFFLOAD_ERRC
+    }
+  }
+};
+
+inline std::error_code make_error_code(ErrorCode EC) {
+  static OffloadErrorCategory Cat{};
+  return {static_cast<int>(EC), Cat};
+}
+
 /// Class that wraps the __tgt_async_info to simply its usage. In case the
 /// object is constructed without a valid __tgt_async_info, the object will use
 /// an internal one and will synchronize the current thread with the pending
@@ -1385,7 +1409,13 @@ static inline Error success() { return Error::success(); }
 /// Create a string error.
 template <typename... ArgsTy>
 static Error error(const char *ErrFmt, ArgsTy... Args) {
-  return createStringError(inconvertibleErrorCode(), ErrFmt, Args...);
+  return createStringError(ErrorCode::UNKNOWN, ErrFmt, Args...);
+}
+
+/// Create a strin...
[truncated]

@@ -477,7 +475,8 @@ OL_APIEXPORT ol_result_t OL_APICALL olMemFree(
/// @brief Enqueue a memcpy operation.
///
/// @details
/// - For host pointers, use the device returned by olGetHostDevice
/// - For host pointers, use the host device belonging to the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't part of this MR, but I assume a previous update to the comments didn't get regen-ed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's my bad. If we continue to check in the generated files it would be useful to have CI catch this by checking there's no diff after regenerating.

Copy link

github-actions bot commented May 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Etor<"UNSUPPORTED_ENUMERATION", "enumerator argument is not supported by the device">,
Etor<"UNKNOWN", "Unknown or internal error">

Etor<"UNKNOWN", "Unknown or internal error", 0x10000>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this have the same enum value as invalid value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, this is 10,000 while invalid value is 1000.

Would UINT32_MAX make more sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, missed the extra zero. Why do these need to be so separate in the first place? I really don't think there should be a distinction between these errors at all, so we'd just want 0x1000 to be a 'generic' error code and then everything after that to be something more specific both the 'plugin' implementation and the API use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design I'm considering is to partition errors into several categories:

  • "Common" errors that are shared across all components.
  • Errors that only make sense returned from liboffload.
  • HSA specific errors.
  • Cuda specific errors.

There are a number of errors that only make sense when returned from liboffload (a plugin will never return INVALID_DEVICE since it doesn't handle device handles), which is why there is a division there. Especially since in theory these error codes could be used by things other than liboffload.

UNKNOWN being the highest value makes sense to me, but I can't exactly articulate why so I'll just move it into the "common" group.

@callumfare HSA and Cuda also have specific features that only make sense for them and no other backend. But I agree that maybe the "NATIVE_ERROR" could just map to "UNKNOWN". It's something to probably discuss in the next MR though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my head the whole point of this is to abstract over these platforms. If we're just providing hooks into the underlying implementation people can just use that instead and cut out the middle man. What's the benefit to having these specific error codes instead of just having a completely common set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was that it was desirable to give backends the ability to pass backend-specific error codes to PluginInterface users. If that is not the case, I'm perfectly happy to axe the categories thing and have them all be in the same pool.

Thinking out loud here, maybe having a field for the CUResult/hsa_status_t in the ol_result_t struct might be a better solution to this? But that's a future discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some concept of interop, but I don't really remember how fleshed out it is. The idea with interop is that you'd get the underlying object for that platform and then you could do you own error handling on it. I don't think that requires us to forward their error codes, since we already convert from the CUDA / HSA error codes to the LLVM error type ASAP currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just pushed a new commit that removes the "partitioning" entirely.

@RossBrunton
Copy link
Contributor Author

RossBrunton commented May 2, 2025

RossBrunton@2b4bcd3 Example of how this new error reporting works in practice (figured It'd be best to split it into a new PR once this is merged).

@callumfare
Copy link
Contributor

RossBrunton@2b4bcd3 Example of how this new error reporting works in practice (figured It'd be best to split it into a new commit).

Overall LGTM to me but one thing I don't like about this proposed use is that the plugin-specific error categories make it hard for runtimes and other users to handle errors from the library in a generic way. Especially OL_ERRC_CUDA_NATIVE_UNKNOWN_ERROR and OL_ERRC_HSA_NATIVE_UNKNOWN_ERROR - why not just have the plugins return the generic UNKNOWN_ERROR code? In SYCL these codes can end up propagated up to a sycl::exception. Having to special case for CUDA, HSA specific errors defeats the point a bit. The detail strings still contain backend-specific information which I think is fine.

A new ErrorCode enumeration is present in PluginInterface which can
be used when returning an llvm::Error from offload and PluginInterface
functions.

This enum must be kept up to sync with liboffload's ol_errc_t enum, so
both are automatically generated from liboffload's enum definition.

Some error codes have also been shuffled around to allow for future
work. Note that this patch only adds the machinery; actual error codes
will be added in a future patch.
// the GenerateOffload target.

OFFLOAD_ERRC(SUCCESS, "Success", 0)
OFFLOAD_ERRC(UNKNOWN, "Unknown or internal error", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we even need these string arguments. The main benefit of doing errors the way we do with the LLVM error type and a tuple is that we can create error codes that are more descriptive. I suppose they serve as a good default if someone doesn't specify, but realistically the intended string should be generated where the error is created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::error_code requires a message function that returns a string description of the error code, so if you pull the error code out of the llvm::Error and query it directly (or just create a new one out of the blue), it will print that message. I don't know why you'd do that, but you can.

I imagine having descriptions for the error codes is also probably useful for users of the C API that just want to use the error code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess it serves as a form of comment for usage, but I'm just saying in general that these will likely be unused internally.

Comment on lines +80 to +83
inline std::error_code make_error_code(ErrorCode EC) {
static OffloadErrorCategory Cat{};
return {static_cast<int>(EC), Cat};
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we really need std::error_code for? I thought we were just going from LLVM error to error code + string pair in the API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error codes inside llvm::Error are a std::error_code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering if this is really necessary, going on comments like this which do something similar to what you're doing here

// FIXME: This class is only here to support the transition to llvm::Error. It
.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The createStringError message takes a std::error_code as a parameter, which is why I added all the error_code machinery.

I also think it makes sense to use here, since it also allows us to inspect whether a given error code is an "offload" error code or some other kind of error that has snuck in (which is what GetErrorCode does).

Copy link
Contributor

@jhuber6 jhuber6 May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but shouldn't we be able to use https://en.cppreference.com/w/cpp/error/errc/make_error_code for the generic category? I don't think having an offload category helps us here since it's not used for anything. The string message is going to come from the person who created the error, so we only care about the numerical part of the error code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants