Skip to content

Fix dangling pointer problem in _linker.py #516

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 14, 2025

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Mar 13, 2025

This code

passes a pointer to a temporary object (for the name argument).

cuLinkAddData() documentation for easy reference:

cuLinkAddData() does not make a copy of const char* name, therefore we have to provide a pointer that outlives the CUlinkState object.

Copy link
Contributor

copy-pr-bot bot commented Mar 13, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 13, 2025

/ok to test

This comment has been minimized.

@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 13, 2025

cuLinkAddData() does not make a copy of const char* name, therefore we have to provide a pointer that outlives the CUlinkState object.

This doesn't sound correct. The docs state:

Ownership of data is retained by the caller. No reference is retained to any inputs after this call returns.

This makes it sound like it makes a copy of the const char* name input.

From my perspective it sounds like we were just creating a dangling pointer that referenced garbage by the time the cuLinkAddData() call was hit? Presumably because of something related to the implicit conversion to const char* of f"{object_code._handle}_{object_code._code_type}".encode()?

@kkraus14
Copy link
Collaborator

Looks like we're hitting the second case of here: https://cython.readthedocs.io/en/latest/src/tutorial/strings.html#encoding-text-to-bytes

But we found a pattern that the Cython compiler doesn't catch as expected.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 13, 2025

I only shared this on chat before:

https://chatgpt.com/share/67d33da7-cc8c-8008-ba5a-1c9ca5614614

This isn't conclusive, but after seeing that I stopped looking further.

cuLinkAddData() does not make a copy of const char* name, therefore we have to provide a pointer that outlives the CUlinkState object.

This doesn't sound correct. The docs state:

Ownership of data is retained by the caller. No reference is retained to any inputs after this call returns.

That refers specifically to the data argument.

It is very common that const char * are not getting copied. (I had runs-ins with that many times.)

From my perspective it sounds like we were just creating a dangling pointer that referenced garbage by the time the cuLinkAddData() call was hit? Presumably because of something related to the implicit conversion to const char* of f"{object_code._handle}_{object_code._code_type}".encode()?

The temporary Python object only goes out of scope after the cuLinkAddData() call finishes.

BUT Caveat: We can be more certain that my fix is actually working only after QA re-testing succeeds.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 13, 2025

I made a silly mistake, fix coming in a minute.

A nice sideeffect is that we now have this list of platforms that surely exercise the code path with the fix:

32 failing checks

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 13, 2025

/ok to test

@leofang
Copy link
Member

leofang commented Mar 13, 2025

@kkraus14 @rwgk I raised a thread internally. I think this is the correct fix.

@rwgk could you add a release note entry for bug fix? I assume this exists since v0.1.1.

@leofang leofang added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Mar 13, 2025
@leofang leofang added this to the cuda.core beta 3 milestone Mar 13, 2025
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 14, 2025

I'll merge this as-is, so we get this through QA testing tonight (with some luck). I think if I add the release notes in this PR, pushing the commit will mark your approvals as stale. We're missing a couple release notes already, so I'll take care of them all together in another PR.

@rwgk rwgk merged commit b500268 into NVIDIA:main Mar 14, 2025
76 checks passed
@rwgk rwgk deleted the cuLinkAddData_const_char_keep_alive branch March 14, 2025 02:48
Copy link

Doc Preview CI
Preview removed because the pull request was closed or merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuda.core Everything related to the cuda.core module P0 High priority - Must do!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants