Skip to content

[AMDGPU] StackSlotColoring before LowerSGPRSpill causes memfault in Blender 4.1 on gfx908 #96353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
yxsamliu opened this issue Jun 21, 2024 · 9 comments · Fixed by #93668
Closed
Assignees

Comments

@yxsamliu
Copy link
Collaborator

#93668 caused a regression in Blender 4.1 on gfx908 for all scenes.

Fra:48 Mem:1531.14M (Peak 1531.14M) | Time:06:58.63 | Mem:2448.05M, Peak:2448.05M | Scene, View Layer | Sample 0/256
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 103 "blender" received signal SIGSEGV, Segmentation fault.
[Switching to thread 103, lane 9 (AMDGPU Lane 1:1:1:1/9 (0,0,0)[9,0,0])]
0x00007ffd7689a68c in integrator_intersect_shadow(KernelGlobalsGPU const*, int) () from memory://192866#offset=0x7ffd77f01240&size=4555968
@llvmbot
Copy link
Member

llvmbot commented Jun 21, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

https://github.com//pull/93668 caused a regression in Blender 4.1 on gfx908 for all scenes.
Fra:48 Mem:1531.14M (Peak 1531.14M) | Time:06:58.63 | Mem:2448.05M, Peak:2448.05M | Scene, View Layer | Sample 0/256
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 103 "blender" received signal SIGSEGV, Segmentation fault.
[Switching to thread 103, lane 9 (AMDGPU Lane 1:1:1:1/9 (0,0,0)[9,0,0])]
0x00007ffd7689a68c in integrator_intersect_shadow(KernelGlobalsGPU const*, int) () from memory://192866#offset=0x7ffd77f01240&size=4555968

@vg0204 vg0204 self-assigned this Jun 24, 2024
vg0204 added a commit to vg0204/llvm-project that referenced this issue Jun 24, 2024
This reverts commit 4b9112e. A separate
issue(llvm#96353) describing it has been opened to further keep its track.
@vg0204
Copy link
Contributor

vg0204 commented Jun 27, 2024

After spending time on ISA of good/bad cases, I found that s[0:3] represents the ScratchRsrcReg here, which is used as buffer descriptor(input) for buffer loads & stores, and thus most probably never get redefined or clobbered {since they are reserved & marked live throughout function}. Would it be better if I execute the given assembly & try to debug it?

@yxsamliu
Copy link
Collaborator Author

It seems DC's patches fix the issue, so we do not need to revert your patch.

93526 : [AMDGPU] Split VGPR regalloc pipeline

93525: [CodeGen] change prototype of regalloc filter function

@vg0204
Copy link
Contributor

vg0204 commented Jun 27, 2024

DC's patches

It seems DC's patches fix the issue, so we do not need to revert your patch.

93526 : [AMDGPU] Split VGPR regalloc pipeline

93525: [CodeGen] change prototype of regalloc filter function

How should I ensure that, as I already reverted the patch, before reverting the revert commit.

@yxsamliu
Copy link
Collaborator Author

I have verified that with those patches on top of amd-staging branch.

I will verify them with LLVM main branch.

Since you have reverted your PR. Let's re-enable blender test in buildbot and wait for these two PR's committed then reland your PR.

@vg0204
Copy link
Contributor

vg0204 commented Jun 27, 2024

Yeah, sounds good!

AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this issue Jul 9, 2024
This reverts commit 4b9112e. A separate
issue(llvm#96353) describing it has been opened to further keep its track.
@yxsamliu
Copy link
Collaborator Author

yxsamliu commented Oct 1, 2024

Yeah, sounds good!

CD has committed his change #93526

I think now you can reland your patch. Thanks.

vg0204 added a commit to vg0204/llvm-project that referenced this issue Oct 3, 2024
This reverts commit c2fc7f7. As the
dependent patch about split vgpr regalloc pipeline solved the issue(llvm#96353).
@vg0204
Copy link
Contributor

vg0204 commented Nov 5, 2024

@yxsamliu , I relanded the patch, can you please once check for the failing blender test to verify its correctness, so this issue can be finally closed!

@yxsamliu
Copy link
Collaborator Author

yxsamliu commented Nov 5, 2024

@yxsamliu , I relanded the patch, can you please once check for the failing blender test to verify its correctness, so this issue can be finally closed!

test is passing. thanks

@yxsamliu yxsamliu closed this as completed Nov 5, 2024
@vg0204 vg0204 linked a pull request Dec 26, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants