Correct usage of cuda.core._memory.Buffer? #557


Closed
carterbox opened this issue Apr 11, 2025 · 7 comments · Fixed by #573
Labels
triage Needs the team's attention

Comments

@carterbox
Contributor

carterbox commented Apr 11, 2025

I am trying to allocate workspace for cublaslt using cuda.core. First, I allocate a memory Buffer like so:

from cuda.core.experimental import Device

device = Device()
device.set_current()
# size and stream are defined elsewhere in the calling code
buffer = device.allocate(size=size, stream=stream)
raw_workspace_ptr: int = buffer.handle.getPtr()

Then later I pass this pointer to cublaslt via the nvmath-python bindings like so:

cublaslt.matmul(
    self.handle,
    self.mm_desc,
    self.alpha.ctypes.data,
    a.data_ptr,
    self.a_layout_ptr,
    b.data_ptr,
    self.b_layout_ptr,
    self.beta.ctypes.data,
    c_ptr,
    self.c_layout_ptr,
    self.result.data_ptr,
    self.d_layout_ptr,
    algorithm_struct.ctypes.data,
    raw_workspace_ptr,  # pointer here
    self.workspace_size,  # same size used here as to allocate the buffer
    stream_holder.ptr,
)

The problem is that when I use this Buffer abstraction from cuda.core, I get errors from CUDA runtime. For example, when running with compute-sanitizer:

========= Invalid __global__ write of size 4 bytes
=========     at void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>>>(T4)+0xd70
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f1ac5d49420 is out of bounds
=========     and is 139697130345438 bytes after the nearest allocation at 0xd00000000 of size 67 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x7ee3eb5] in libcublasLt.so.12
=========         Host Frame:  [0x7f4a3f7] in libcublasLt.so.12
=========         Host Frame:  [0x1b1ab14] in libcublasLt.so.12
=========         Host Frame:  [0x1b1c010] in libcublasLt.so.12
=========         Host Frame:  [0xf81c1d] in libcublasLt.so.12
=========         Host Frame:  [0x10c0b58] in libcublasLt.so.12
=========         Host Frame: cublasLtMatmul [0x10c4dcc] in libcublasLt.so.12
=========         Host Frame: __pyx_f_6nvmath_8bindings_10cycublasLt_cublasLtMatmul(void*, void*, void const*, void const*, void*, void const*, void*, void const*, void const*, void*, void*, void*, cublasLtMatmulAlgo_t const*, void*, unsigned long, CUstream_st*) [0x57b5] in cycublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_f_6nvmath_8bindings_8cublasLt_matmul(long, long, long, long, long, long, long, long, long, long, long, long, long, long, unsigned long, long, int) [0x5ca7d] in cublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_pw_6nvmath_8bindings_8cublasLt_13matmul(_object*, _object* const*, long, _object*) [0x78fae] in cublasLt.cpython-312-x86_64-linux-gnu.so

It seems to be reporting that the buffer is an invalid memory address. When I use the allocators provided by CuPy or pytorch, there are no errors.

Looking for opinions on:

  • Whether I am allocating / using this Buffer in the expected manner
  • How I could create a reproducer or another memory-validation test that doesn't require setting up an entire matmul for cublaslt
github-actions bot added the triage (Needs the team's attention) label on Apr 11, 2025
@carterbox
Contributor Author

I should check that the address reported by compute sanitizer is near the integer pointer address that I get from the Buffer object.
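A minimal sketch of that check, assuming buffer from the first snippet is still in scope and copying the address from the sanitizer report above:

raw_workspace_ptr: int = buffer.handle.getPtr()  # the value that was passed to cublaslt.matmul
reported_address = 0x7F1AC5D49420  # out-of-bounds address from the compute-sanitizer report
print(hex(raw_workspace_ptr), hex(reported_address))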

@kkraus14
Collaborator

@carterbox I think the simplest example would look something like:

from cuda.core.experimental import Device

size = ...  # define this

device = Device()
device.set_current()
stream = device.create_stream()

buffer1 = device.allocate(size=size, stream=stream)
buffer2 = device.allocate(size=size, stream=stream)
buffer1.copy_to(buffer2, stream=stream)

Based on your error above, my best guess is that either something is wrong with the size you're passing into the device.allocate call, or the value of stream_holder.ptr isn't referring to the same stream as stream.

@carterbox
Contributor Author

I checked that the streams used at allocation time and at use time were the same, and noticed that the stream pointers I was getting from CuPy/Torch were not the same as the ones from cuda.core for the same stream. This led me to realize that I was doing something wrong when converting cuda.core Python objects into addresses of the underlying C objects. For example:

raw_workspace_ptr: int = buffer.handle.getPtr()

This is incorrect! It returns the pointer to the Python cuda.bindings object, not the address of the actual memory buffer. Instead we should do this:

raw_workspace_ptr: int = int(buffer.handle)

Which I guess is Pythonic, but it is neither obvious nor documented in the Buffer or Stream documentation.
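For reference, a minimal sketch of the corrected allocation path, assuming size is defined as before and the stream is a cuda.core stream:

from cuda.core.experimental import Device

device = Device()
device.set_current()
stream = device.create_stream()
buffer = device.allocate(size=size, stream=stream)

# int() on the handle gives the address of the underlying C object (here, the
# device pointer), which is what cublasLt expects for the workspace argument.
raw_workspace_ptr: int = int(buffer.handle)

# The same pattern applies to streams created by cuda.core:
raw_stream_ptr: int = int(stream.handle)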

@kkraus14
Collaborator

This is documented here: https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#getting-the-address-of-underlying-c-objects-from-the-low-level-bindings

But I agree this isn't the clearest, and it's prone to exactly the situation you ran into.

@carterbox
Contributor Author

I'm thinking I want to contribute a documentation fix which either:

  1. Adds a note to all the cuda.core classes that you need to call int(Class().handle) to get the pointer address of the C object.
  2. Adds a docstring to __int__() for all cuda.bindings classes so that it's obvious from the documentation page that this is a valid operation. I maintain that it's not obvious enough that this operator is defined for these classes (a sketch follows below).

Number 2 is probably the better approach?
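To illustrate option 2, here is a hypothetical sketch of what such a docstring could look like; CUdeviceptr below is a plain-Python stand-in for a generated cuda.bindings wrapper, not the real implementation:

class CUdeviceptr:  # hypothetical stand-in for a cuda.bindings handle class
    def __init__(self, address: int = 0):
        self._address = address

    def __int__(self) -> int:
        """Return the address of the underlying C object as a Python int.

        Use int(obj) when another library needs the raw address (for example a
        device pointer or a CUstream); getPtr() returns a different address, as
        discussed above.
        """
        return self._address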

@kkraus14
Collaborator

@carterbox this was discussed in an offline meeting and it was generally agreed that we aren't happy with the current state of things with regards to getPtr() vs __int__() and the lack of intuitiveness for a Python developer.

I'm going to write up a new issue and close this one that captures the discussion and some next steps.

@leofang
Member

leofang commented Apr 22, 2025

> it was generally agreed that we aren't happy with the current state of things with regards to getPtr() vs __int__() and the lack of intuitiveness for a Python developer.
>
> I'm going to write up a new issue and close this one that captures the discussion and some next steps.

We discussed further offline, and to move away from __int__() we will implement #564. @carterbox, could you make the necessary doc changes to cuda.core to clarify the status quo, as you suggested earlier?
