Description
I am trying to allocate workspace for cuBLASLt using `cuda.core`. First, I allocate a memory `Buffer` like so:

```python
from cuda.core.experimental import Device

device = Device()
device.set_current()
buffer = device.allocate(size=size, stream=stream)  # size and stream defined earlier
raw_workspace_ptr: int = buffer.handle.getPtr()
```
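One thing I have not ruled out is whether `getPtr()` yields the device address itself or the host address of the wrapper object holding it. As a host-side `ctypes` analogy (no CUDA involved), the value of a pointer variable and the address of the variable itself are two different numbers, and passing the wrong one downstream would produce exactly this kind of out-of-bounds write:

```python
import ctypes

# Host-side analogy: the *value* stored in a pointer variable vs. the
# *address of the variable itself* are different numbers. If an accessor
# returned the latter, downstream code would write to a bogus address.
buf = ctypes.create_string_buffer(64)
ptr_var = ctypes.cast(buf, ctypes.c_void_p)

pointee_addr = ptr_var.value             # address of the 64-byte buffer
holder_addr = ctypes.addressof(ptr_var)  # address of ptr_var itself

assert pointee_addr != holder_addr
```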
Then later I pass this pointer to cuBLASLt via the nvmath-python bindings like so:

```python
cublaslt.matmul(
    self.handle,
    self.mm_desc,
    self.alpha.ctypes.data,
    a.data_ptr,
    self.a_layout_ptr,
    b.data_ptr,
    self.b_layout_ptr,
    self.beta.ctypes.data,
    c_ptr,
    self.c_layout_ptr,
    self.result.data_ptr,
    self.d_layout_ptr,
    algorithm_struct.ctypes.data,
    raw_workspace_ptr,    # pointer from above
    self.workspace_size,  # same size used to allocate the buffer
    stream_holder.ptr,
)
```
The problem is that when I use this `Buffer` abstraction from `cuda.core`, I get errors from the CUDA runtime. For example, when running under compute-sanitizer:

```
========= Invalid __global__ write of size 4 bytes
=========     at void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>>>(T4)+0xd70
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f1ac5d49420 is out of bounds
=========     and is 139697130345438 bytes after the nearest allocation at 0xd00000000 of size 67 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x7ee3eb5] in libcublasLt.so.12
=========     Host Frame: [0x7f4a3f7] in libcublasLt.so.12
=========     Host Frame: [0x1b1ab14] in libcublasLt.so.12
=========     Host Frame: [0x1b1c010] in libcublasLt.so.12
=========     Host Frame: [0xf81c1d] in libcublasLt.so.12
=========     Host Frame: [0x10c0b58] in libcublasLt.so.12
=========     Host Frame: cublasLtMatmul [0x10c4dcc] in libcublasLt.so.12
=========     Host Frame: __pyx_f_6nvmath_8bindings_10cycublasLt_cublasLtMatmul(void*, void*, void const*, void const*, void*, void const*, void*, void const*, void const*, void*, void*, void*, cublasLtMatmulAlgo_t const*, void*, unsigned long, CUstream_st*) [0x57b5] in cycublasLt.cpython-312-x86_64-linux-gnu.so
=========     Host Frame: __pyx_f_6nvmath_8bindings_8cublasLt_matmul(long, long, long, long, long, long, long, long, long, long, long, long, long, long, unsigned long, long, int) [0x5ca7d] in cublasLt.cpython-312-x86_64-linux-gnu.so
=========     Host Frame: __pyx_pw_6nvmath_8bindings_8cublasLt_13matmul(_object*, _object* const*, long, _object*) [0x78fae] in cublasLt.cpython-312-x86_64-linux-gnu.so
```
It seems to be reporting that the workspace pointer is an invalid memory address. When I use the allocators provided by CuPy or PyTorch instead, there are no errors.
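For reference, the CuPy path that works for me looks roughly like this (a sketch; `cupy.cuda.alloc` returns a `MemoryPointer` whose `.ptr` is already a plain `int` device address, and the guard keeps it a no-op on machines without CuPy or a GPU):

```python
# Sketch of the working CuPy allocator path; raw_workspace_ptr stays None
# if CuPy is unavailable or no GPU is visible.
raw_workspace_ptr = None
try:
    import cupy

    workspace = cupy.cuda.alloc(1 << 20)  # 1 MiB MemoryPointer from the pool
    raw_workspace_ptr = workspace.ptr     # plain int device address
except Exception:
    pass
```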
Looking for opinions on:
- Whether I am allocating and using this `Buffer` in the expected manner
- How I could create a reproducer or another memory-validation test that doesn't require setting up an entire cuBLASLt matmul