[CIR][CUDA] Miscellaneous bugfixes #1462
Merged
This PR deals with several issues currently present in CUDA CodeGen. Each of them requires only a few lines to fix, so they're combined in a single PR.
Bug 1.
Suppose we write a kernel with two parameters, `a` and `b`. When we call this kernel with `cudaLaunchKernel`, the 4th argument to that function is something of the form `void *kernel_args[2] = {&a, &b}`. OG allocates the space for it with `alloca ptr, i32 2`, but that doesn't seem to be feasible in CIR, so we allocate `alloca [2 x ptr], i32 1` instead. This means there must be an extra GEP compared to OG. In CIR, it means we must add an `array_to_ptrdecay` cast before accessing the array elements. I missed that in #1332.

Bug 2.
We missed a load instruction for the 6th argument to `cudaLaunchKernel`. It's added back in this PR.

Bug 3.
When we launch a kernel, we first retrieve the return value of `__cudaPopCallConfiguration`. If it's zero, then the call succeeds and we should proceed to call the device stub. In #1348 we did exactly the opposite, calling the device stub only if it's not zero. It's fixed here.

Issue 4.
CallConvLowering is required to make `cudaLaunchKernel` correct. The codepath is unblocked by adding a `getIndirectResult` at the same place as OG does; the function is already implemented, so we can just call it.

After this (and other pending PRs), CIR is now able to compile real CUDA programs. There are still missing features, which will be followed up later.