Correct usage of cuda.core._memory.Buffer? #557


Closed
carterbox opened this issue Apr 11, 2025 · 7 comments · Fixed by #573
Labels
triage Needs the team's attention

Comments

@carterbox
Contributor

carterbox commented Apr 11, 2025

I am trying to allocate workspace for cublaslt using cuda.core. First, I allocate a memory Buffer like so:

from cuda.core.experimental import Device

device = Device()
device.set_current()
# size and stream are defined elsewhere in the calling code
buffer = device.allocate(size=size, stream=stream)
raw_workspace_ptr: int = buffer.handle.getPtr()

Then later I pass this pointer to cublaslt via the nvmath-python bindings like so:

cublaslt.matmul(
    self.handle,
    self.mm_desc,
    self.alpha.ctypes.data,
    a.data_ptr,
    self.a_layout_ptr,
    b.data_ptr,
    self.b_layout_ptr,
    self.beta.ctypes.data,
    c_ptr,
    self.c_layout_ptr,
    self.result.data_ptr,
    self.d_layout_ptr,
    algorithm_struct.ctypes.data,
    raw_workspace_ptr,  # pointer here
    self.workspace_size,  # same size used here as to allocate the buffer
    stream_holder.ptr,
)

The problem is that when I use this Buffer abstraction from cuda.core, I get errors from CUDA runtime. For example, when running with compute-sanitizer:

========= Invalid __global__ write of size 4 bytes
=========     at void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>>>(T4)+0xd70
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f1ac5d49420 is out of bounds
=========     and is 139697130345438 bytes after the nearest allocation at 0xd00000000 of size 67 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x7ee3eb5] in libcublasLt.so.12
=========         Host Frame:  [0x7f4a3f7] in libcublasLt.so.12
=========         Host Frame:  [0x1b1ab14] in libcublasLt.so.12
=========         Host Frame:  [0x1b1c010] in libcublasLt.so.12
=========         Host Frame:  [0xf81c1d] in libcublasLt.so.12
=========         Host Frame:  [0x10c0b58] in libcublasLt.so.12
=========         Host Frame: cublasLtMatmul [0x10c4dcc] in libcublasLt.so.12
=========         Host Frame: __pyx_f_6nvmath_8bindings_10cycublasLt_cublasLtMatmul(void*, void*, void const*, void const*, void*, void const*, void*, void const*, void const*, void*, void*, void*, cublasLtMatmulAlgo_t const*, void*, unsigned long, CUstream_st*) [0x57b5] in cycublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_f_6nvmath_8bindings_8cublasLt_matmul(long, long, long, long, long, long, long, long, long, long, long, long, long, long, unsigned long, long, int) [0x5ca7d] in cublasLt.cpython-312-x86_64-linux-gnu.so
=========         Host Frame: __pyx_pw_6nvmath_8bindings_8cublasLt_13matmul(_object*, _object* const*, long, _object*) [0x78fae] in cublasLt.cpython-312-x86_64-linux-gnu.so

It seems to be reporting that the buffer is an invalid memory address. When I use the allocators provided by CuPy or pytorch, there are no errors.

Looking for opinions on:

  • Whether I am allocating / using this Buffer in the expected manner
  • How I could create a reproducer or another memory-validation test that doesn't require setting up an entire matmul for cublaslt
github-actions bot added the triage (Needs the team's attention) label on Apr 11, 2025
@carterbox
Contributor Author

I should check that the address reported by compute sanitizer is near the integer pointer address that I get from the Buffer object.
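A minimal sketch of that check, assuming buffer from the first snippet is still in scope and copying the address from the sanitizer report above:

raw_workspace_ptr: int = buffer.handle.getPtr()  # the value that was passed to cublaslt.matmul
reported_address = 0x7F1AC5D49420  # out-of-bounds address from the compute-sanitizer report
print(hex(raw_workspace_ptr), hex(reported_address))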

@kkraus14
Collaborator

@carterbox I think the simplest example would look something like:

from cuda.core.experimental import Device

size = ...  # define this

device = Device()
device.set_current()
stream = device.create_stream()

buffer1 = device.allocate(size=size, stream=stream)
buffer2 = device.allocate(size=size, stream=stream)
buffer1.copy_to(buffer2, stream=stream)

Based on your error above, my best guess is that either something is wrong with the size you're passing into the device.allocate call, or the value of stream_holder.ptr isn't referring to the same stream as stream.

@carterbox
Contributor Author

I checked that the streams used at allocation time and at use time were the same, and noticed that the stream pointers I was getting from CuPy/Torch were not the same as the ones from cuda.core for the same stream. This led me to realize that I was doing something wrong when converting cuda.core Python objects into addresses of the underlying C objects. For example:

raw_workspace_ptr: int = buffer.handle.getPtr()

This is incorrect! It returns the pointer to the Python cuda.bindings object, not the address of the actual memory buffer. Instead we should do this:

raw_workspace_ptr: int = int(buffer.handle)

Which I guess is Pythonic, but it is neither obvious nor documented in the Buffer or Stream documentation.
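For reference, a minimal sketch of the corrected allocation path, assuming size is defined as before and the stream is a cuda.core stream:

from cuda.core.experimental import Device

device = Device()
device.set_current()
stream = device.create_stream()
buffer = device.allocate(size=size, stream=stream)

# int() on the handle gives the address of the underlying C object (here, the
# device pointer), which is what cublasLt expects for the workspace argument.
raw_workspace_ptr: int = int(buffer.handle)

# The same pattern applies to streams created by cuda.core:
raw_stream_ptr: int = int(stream.handle)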

@kkraus14
Collaborator

This is documented here: https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#getting-the-address-of-underlying-c-objects-from-the-low-level-bindings

But I agree this isn't the clearest, and it's prone to exactly the situation you ran into.

@carterbox
Contributor Author

I'm thinking I want to contribute a documentation fix which either:

  1. Adds a note to all the cuda.core classes that you need to call int(Class().handle) to get the pointer address of the C object.
  2. Adds a docstring to __int__() for all cuda.bindings classes so that it's obvious from the documentation page that this is a valid operation. I maintain that it's not obvious enough that this operator is defined for these classes (a sketch follows below).

Number 2 is probably the better approach?
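To illustrate option 2, here is a hypothetical sketch of what such a docstring could look like; CUdeviceptr below is a plain-Python stand-in for a generated cuda.bindings wrapper, not the real implementation:

class CUdeviceptr:  # hypothetical stand-in for a cuda.bindings handle class
    def __init__(self, address: int = 0):
        self._address = address

    def __int__(self) -> int:
        """Return the address of the underlying C object as a Python int.

        Use int(obj) when another library needs the raw address (for example a
        device pointer or a CUstream); getPtr() returns a different address, as
        discussed above.
        """
        return self._address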

@kkraus14
Collaborator

@carterbox this was discussed in an offline meeting and it was generally agreed that we aren't happy with the current state of things with regards to getPtr() vs __int__() and the lack of intuitiveness for a Python developer.

I'm going to write up a new issue and close this one that captures the discussion and some next steps.

@leofang
Member

leofang commented Apr 22, 2025

> it was generally agreed that we aren't happy with the current state of things with regards to getPtr() vs __int__() and the lack of intuitiveness for a Python developer.
>
> I'm going to write up a new issue and close this one that captures the discussion and some next steps.

We discussed further offline, and to move away from __int__() we will implement #564. @carterbox, could you make the necessary doc changes to cuda.core to clarify the status quo, as you suggested earlier?
