
Correct usage of cuda.core._memory.Buffer? #557

Closed
@carterbox

Description

I am trying to allocate workspace for cublasLt using cuda.core. First, I allocate a memory Buffer like so:

from cuda.core.experimental import Device  # assuming the cuda.core.experimental namespace

device = Device()
device.set_current()
buffer = device.allocate(size=size, stream=stream)  # stream-ordered allocation of `size` bytes
raw_workspace_ptr: int = buffer.handle.getPtr()
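
For what it's worth, buffer.handle appears to be a driver-level CUdeviceptr, and I am not sure whether getPtr() or a plain int() conversion is the intended way to get the device address. The alternative would be (a sketch, assuming handle really is a cuda.bindings CUdeviceptr):

raw_workspace_ptr: int = int(buffer.handle)  # value of the device pointer itself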

Later, I pass this pointer to cublasLt via the nvmath-python bindings like so:

cublaslt.matmul(
    self.handle,
    self.mm_desc,
    self.alpha.ctypes.data,
    a.data_ptr,
    self.a_layout_ptr,
    b.data_ptr,
    self.b_layout_ptr,
    self.beta.ctypes.data,
    c_ptr,
    self.c_layout_ptr,
    self.result.data_ptr,
    self.d_layout_ptr,
    algorithm_struct.ctypes.data,
    raw_workspace_ptr,  # the pointer extracted above
    self.workspace_size,  # same size used to allocate the buffer
    stream_holder.ptr,
)

The problem is that when I use this Buffer abstraction from cuda.core, I get errors from the CUDA runtime. For example, when running under compute-sanitizer:

========= Invalid __global__ write of size 4 bytes
=========     at void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>>>(T4)+0xd70
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f1ac5d49420 is out of bounds
=========     and is 139697130345438 bytes after the nearest allocation at 0xd00000000 of size 67 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x7ee3eb5] in libcublasLt.so.12
=========         Host Frame:  [0x7f4a3f7] in libcublasLt.so.12
=========         Host Frame:  [0x1b1ab14] in libcublasLt.so.12
=========         Host Frame:  [0x1b1c010] in libcublasLt.so.12
=========         Host Frame:  [0xf81c1d] in libcublasLt.so.12
=========         Host Frame:  [0x10c0b58] in libcublasLt.so.12
=========         Host Frame: cublasLtMatmul [0x10c4dcc] in libcublasLt.so.12
=========         Host Frame: __pyx_f_6nvmath_8bindings_10cycublasLt_cublasLtMatmul(void*, void*, void const*, void const*, void*, void const*, void*, void const*, void const*, void*, void*, void*, cublasLtMatmulAlgo_t const*, void*, unsigned long, CUstream_st*) [0x57b5] in cycublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_f_6nvmath_8bindings_8cublasLt_matmul(long, long, long, long, long, long, long, long, long, long, long, long, long, long, unsigned long, long, int) [0x5ca7d] in cublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_pw_6nvmath_8bindings_8cublasLt_13matmul(_object*, _object* const*, long, _object*) [0x78fae] in cublasLt.cpython-312-x86_64-linux-gnu.so

The sanitizer seems to be reporting that the workspace pointer is an invalid device address. When I use the allocators provided by CuPy or PyTorch instead, there are no errors.
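
For comparison, the working CuPy path looks roughly like this (a sketch; as far as I know, cupy.cuda.alloc returns a MemoryPointer whose ptr attribute is the raw device address as a Python int):

import cupy

mem = cupy.cuda.alloc(self.workspace_size)  # MemoryPointer from CuPy's memory pool
raw_workspace_ptr: int = mem.ptr  # raw device address; no sanitizer errors with this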

Looking for opinions on:

  • Whether I am allocating / using this Buffer in the expected manner
  • How I could create a reproducer or another memory-validation test that doesn't require setting up an entire matmul for cublasLt (a first attempt at such a sketch is below)
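
Here is a first attempt at a minimal reproducer that skips cublasLt entirely. It is a sketch, assuming the cuda.core.experimental namespace, that buffer.handle.getPtr() is the same extraction as above, and that cuda.bindings.driver.cuMemsetD8 will surface an invalid device address either via its return code or under compute-sanitizer:

from cuda.core.experimental import Device
from cuda.bindings import driver

device = Device()
device.set_current()

size = 1024
buffer = device.allocate(size=size)

# Same pointer extraction as in the failing cublasLt call above
raw_workspace_ptr: int = buffer.handle.getPtr()

# Fill the whole allocation: if raw_workspace_ptr is not a valid device
# address for `size` bytes, this should fail outright or be flagged by
# compute-sanitizer.
(err,) = driver.cuMemsetD8(raw_workspace_ptr, 0xAB, size)
assert err == driver.CUresult.CUDA_SUCCESS, err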
