[BUG]: compute-sanitizer finds CUDA_ERROR_INVALID_CONTEXT with default Device constructor #562

Open

carterbox opened this issue Apr 21, 2025 · 6 comments · May be fixed by #566
Labels: bug (Something isn't working), cuda.core (Everything related to the cuda.core module), P0 (High priority - Must do!)

Comments

carterbox (Contributor) commented Apr 21, 2025

Is this a duplicate?

Type of Bug

Silent Failure

Component

cuda.core v0.2.0, cuda-bindings v12.8.0

Describe the bug

Trying to initialize a Device with the current context when there is no current context causes a CUDA_ERROR_INVALID_CONTEXT.

How to Reproduce

Run the following script with compute-sanitizer:

import cuda.core.experimental as ccx

if __name__ == "__main__":
    d = ccx.Device(None)
    d.set_current()
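
For example, with the script saved as no-context.py (the filename that appears in the log later in this thread):

compute-sanitizer python no-context.py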

Expected behavior

Device(None) should choose a default device (perhaps System.devices[0]?) and create a context for it if there is no existing current context.

Otherwise, if you are writing library code and want the current device's context, but there is no guarantee that the user has already created one, how do you get it without raising an error? (One possible pattern is sketched below.)

The error should be caught internally and remain invisible to compute-sanitizer.
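
For illustration, here is a rough sketch of one defensive pattern library code could use with the low-level driver bindings; the helper name and the fall-back-to-device-0 policy are hypothetical, and error checking is elided for brevity:

from cuda.bindings import driver

def get_or_create_current_context(fallback_ordinal=0):
    # cuInit is idempotent, so it is safe to call here.
    (err,) = driver.cuInit(0)
    # cuCtxGetCurrent returns CUDA_SUCCESS with a NULL context
    # when no context is bound to the calling thread.
    err, ctx = driver.cuCtxGetCurrent()
    if err == driver.CUresult.CUDA_SUCCESS and int(ctx) != 0:
        return ctx
    # No current context: fall back to the primary context of a default device.
    err, dev = driver.cuDeviceGet(fallback_ordinal)
    err, ctx = driver.cuDevicePrimaryCtxRetain(dev)
    (err,) = driver.cuCtxSetCurrent(ctx)
    return ctx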

Operating System

WSL2

nvidia-smi output

Mon Apr 21 21:09:28 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.02              Driver Version: 566.03         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 1000 Ada Gene...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   36C    P4              7W /   35W |       0MiB /   6141MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
carterbox added the bug label Apr 21, 2025
kkraus14 added the P0 and cuda.core labels Apr 21, 2025
carterbox (Author) commented:

I think this bug may have already been fixed on main in the cuda.bindings package. I tried to explore a fix, but when I compiled cuda-bindings from main locally, my reproducer no longer triggered the error. When I reverted the cuda-bindings package to the v12.8.0 tag, I was able to reproduce it.

leofang (Member) commented Apr 22, 2025

I think this bug may have been fixed on main already in the cuda.bindings package.

I actually don't understand what's happening and need to look into it. The cuda.core code hasn't changed much since you last worked on it... Are you suggesting the bug is in cuda.bindings, not cuda.core?

leofang added the triage label Apr 22, 2025
carterbox (Author) commented:

Are you suggesting the bug is in cuda.bindings not cuda.core?

Yes. I should have included the log output for this error; here it is:

(test1) dching@NV-6V0FV44:~/Documents/cuda-python/cuda_core/tests$ compute-sanitizer python no-context.py
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========         Host Frame:  [0x504f7] in cydriver.cpython-313-x86_64-linux-gnu.so
=========         Host Frame:  [0x1b7b5] in cyruntime.cpython-313-x86_64-linux-gnu.so
=========         Host Frame:  [0x10df9] in cyruntime.cpython-313-x86_64-linux-gnu.so
=========         Host Frame:  [0x1e0619] in runtime.cpython-313-x86_64-linux-gnu.so
=========         Host Frame: PyObject_Vectorcall in call.c:327 [0x1b08fd] in python
=========         Host Frame: _PyEval_EvalFrameDefault.cold in generated_cases.c.h:813 [0x9e87c] in python
=========         Host Frame: slot_tp_new in typeobject.c:9832 [0x203de4] in python
=========         Host Frame: _PyObject_MakeTpCall in call.c:242 [0x1ae79e] in python
=========         Host Frame: _PyEval_EvalFrameDefault.cold in generated_cases.c.h:813 [0x9e87c] in python
=========         Host Frame: PyEval_EvalCode in ceval.c:604 [0x26d7c0] in python
=========         Host Frame: run_eval_code_obj in pythonrun.c:1381 [0x2aca8f] in python
=========         Host Frame: run_mod in pythonrun.c:1466 [0x2aa49b] in python
=========         Host Frame: pyrun_file in pythonrun.c:1295 [0x2a7435] in python
=========         Host Frame: _PyRun_SimpleFileObject in pythonrun.c:517 [0x2a7087] in python
=========         Host Frame: _PyRun_AnyFileObject in pythonrun.c:77 [0x2a6e7b] in python
=========         Host Frame: Py_RunMain in main.c:775 [0x2a508d] in python
=========         Host Frame: Py_BytesMain in main.c:829 [0x2596c6] in python
=========         Host Frame:  [0x29d8f] in libc.so.6
=========         Host Frame: __libc_start_main [0x29e3f] in libc.so.6
=========         Host Frame:  [0x258abd] in python
=========         Host Frame: __new__ in _device.py:960
=========         Host Frame: <module> in no-context.py:8
=========
========= ERROR SUMMARY: 1 error

carterbox (Author) commented:

We need to figure out what changes were made to the main branch that fixed this error and backport them to the 11.8.x branch.

carterbox (Author) commented:

In offline discussion with @leofang, we decided that it's too much effort to backport the fixes to CTK11 because the appropriate backport is probably #517.

carterbox linked a pull request (#566) Apr 25, 2025 that will close this issue
leofang removed the triage label Apr 25, 2025
leofang (Member) commented Apr 26, 2025

This is the minimal reproducer based on the traceback that Daniel provided above:

$ compute-sanitizer python -c "import cuda.bindings.runtime as runtime; runtime.cudaGetDevice()"

This happens because in the public releases cuda.bindings.runtime is implemented on top of the driver bindings, whereas after #517 that is no longer the case. So we are seeing a bug in the cudart re-implementation (cc @vzhurba01 for visibility). I suggested documenting this as a bug fix in 12.9.0 and as a known issue / won't fix in 11.8.7.
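
In the meantime, a possible workaround for the affected releases (an assumption based on the analysis above, not a documented fix) is to make a context current via the driver API before the first runtime call, so that the shim's internal cuCtxGetDevice probe never runs without a context:

from cuda.bindings import driver, runtime

# Workaround sketch; error checking elided.
(err,) = driver.cuInit(0)
err, dev = driver.cuDeviceGet(0)
err, ctx = driver.cuDevicePrimaryCtxRetain(dev)
(err,) = driver.cuCtxSetCurrent(ctx)

# With a context current, this call should no longer trip the
# CUDA_ERROR_INVALID_CONTEXT report under compute-sanitizer.
err, ordinal = runtime.cudaGetDevice()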
