Description
Answers checklist.
- I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
- I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
- I have searched the issue tracker for a similar issue and not found a similar issue.
IDF version.
v5.2.2, also v5.2.2-639-g43098fc4de
Espressif SoC revision.
ESP32-C3 (QFN32) (revision v0.4)
Operating System used.
Linux
How did you build your project?
Command line with idf.py
If you are using Windows, please specify command line type.
None
Development Kit.
SEEED XIAO ESP32-C3
Power Supply used.
USB
What is the expected behavior?
Load SP register with a valid address (inside the current task's stack region) without Debug Assist hardware Stack Protection triggering.
What is the actual behavior?
Loading the SP register seems to intermittently trigger a hardware stack protector interrupt. All of the reported addresses look valid for the running task, i.e. there was no stack overflow or SP corruption.
Steps to reproduce.
Reproduction currently requires the MicroPython master branch and some Python code that sends a lot of data over Wi-Fi. (The original bug is micropython/micropython#15667)
It is probably possible to make a simpler reproducer. My best guess is that the key features are:
- Frequent context switches and/or interrupts (or maybe something else to do with Wi-Fi activity, but my guess is interrupts).
- Execution in the task is jumping around using a setjmp/longjmp style mechanism. For MicroPython this is "native NLR" implemented here: https://github.com/micropython/micropython/blob/master/py/nlrrv32.c#L53 however the original issue report was using libc setjmp/longjmp.
Note that all of the jumps are happening within the same task, and the stack pointer is saved and restored each time to/from a valid value for the current executing task.
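As a starting point for a simpler reproducer, here is a minimal sketch of the jump pattern described above using libc setjmp/longjmp. It is untested against the fault itself: the helper names (`jump_from_depth`, `count_jump_round_trips`) and the 64-byte padding are hypothetical, and on target it would presumably need to run in a FreeRTOS task while Wi-Fi traffic or some other interrupt source is active.

```c
#include <setjmp.h>

/* Jump target shared by the helpers below (hypothetical names, illustration only). */
static jmp_buf env;

/* Recurse a few frames so SP moves away from the value saved by setjmp(),
 * then longjmp() back. This mirrors "restore SP to a valid address inside
 * the same task's stack". */
static void jump_from_depth(int depth)
{
    volatile char pad[64];      /* force some stack usage in every frame */
    pad[0] = (char)depth;
    if (depth == 0) {
        longjmp(env, 1);        /* reloads the callee-saved registers and SP */
    }
    jump_from_depth(depth - 1);
}

/* Perform `iterations` setjmp/longjmp round trips and return how many landed. */
static int count_jump_round_trips(int iterations)
{
    volatile int landed = 0;    /* volatile: modified between setjmp and longjmp */
    volatile int i;

    for (i = 0; i < iterations; i++) {
        if (setjmp(env) == 0) {
            jump_from_depth(4); /* does not return; control resumes at setjmp */
        } else {
            landed++;
        }
    }
    return landed;
}
```

Calling `count_jump_round_trips()` in a tight loop from the task under test should exercise the same SP reload many times per second. Whether that alone is enough to trigger the stack protection fault is an open question.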
Debug Logs.
Here's a sample crash:
MPY version : v1.24.0-preview.201.g24aa8ed762.dirty on 2024-08-28
IDF version : v5.2.2
Machine : ESP32C3 module with ESP32C3
Guru Meditation Error: Core 0 panic'ed (Stack protection fault).
Detected in task "mp_task" at 0x4200b1ee
0x4200b1ee: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55
Stack pointer: 0x3fca7ff0
Stack bounds: 0x3fca43a4 - 0x3fca83a0
Core 0 register dump:
Stack dump detected
MEPC : 0x4200b200 RA : 0x403829fa SP : 0x3fca7ff0 GP : 0x3fc96e00
0x4200b200: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55
0x403829fa: mp_execute_bytecode at /home/gus/ry/george/micropython/py/vm.c:285
TP : 0x3fc6b838 T0 : 0x3fca7fa0 T1 : 0x40390f52 T2 : 0x0000003f
0x40390f52: vTaskSuspend at /home/gus/ry/george/esp-idf-v5/components/freertos/FreeRTOS-Kernel/tasks.c:1960 (discriminator 1)
S0/FP : 0x3fcabbe0 S1 : 0x3fcabc30 A0 : 0x3fca8010 A1 : 0x00000054
A2 : 0x00000000 A3 : 0x3fcc99c0 A4 : 0x3fcc99c0 A5 : 0x3fca80e0
A6 : 0x00000002 A7 : 0x21400000 S2 : 0x3c17034c S3 : 0x3fcc9950
S4 : 0x00000001 S5 : 0x00000062 S6 : 0x00000068 S7 : 0x3c16dc1c
S8 : 0x0000001b S9 : 0x3c16e000 S10 : 0x3c178419 S11 : 0x3c1781b6
T3 : 0x00000000 T4 : 0x0003877f T5 : 0x00000003 T6 : 0x00000001
MSTATUS : 0x00001881 MTVEC : 0x40380001 MCAUSE : 0x0000001b MTVAL : 0x00004505
0x40380001: _vector_table at ??:?
MHARTID : 0x00000000
Backtrace:
0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
55 __asm volatile (
#0 0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
#1 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
ELF file SHA256: 7e6b188d6
Note that the Stack pointer address in the dump is valid for the bounds of the task.
This crash dump was created with a couple of additions in the nlr_jump function to try and get extra debug info:
4200b184 <nlr_jump>:
"sw x2, 60(x10) \n" // Store SP.
"jal x0, nlr_push_tail \n" // Jump to the C part.
);
}
NORETURN void nlr_jump(void *val) {
4200b184: 1141 addi sp,sp,-16
4200b186: c226 sw s1,4(sp)
4200b188: c04a sw s2,0(sp)
4200b18a: c606 sw ra,12(sp)
4200b18c: 84aa mv s1,a0
MP_NLR_JUMP_HEAD(val, top)
4200b18e: dc1fc0ef jal ra,42007f4e <mp_thread_get_state>
4200b192: 01452903 lw s2,20(a0)
4200b196: c422 sw s0,8(sp)
4200b198: 00091563 bnez s2,4200b1a2 <nlr_jump+0x1e>
4200b19c: 8526 mv a0,s1
4200b19e: ec2fa0ef jal ra,42005860 <nlr_jump_fail>
4200b1a2: 842a mv s0,a0
4200b1a4: 00992223 sw s1,4(s2)
4200b1a8: 854a mv a0,s2
4200b1aa: 71a420ef jal ra,4204d8c4 <nlr_call_jump_callbacks>
4200b1ae: 00092783 lw a5,0(s2)
4200b1b2: c85c sw a5,20(s0)
__asm volatile (
4200b1b4: 854a mv a0,s2
4200b1b6: 000102b3 add t0,sp,zero // Note: stored pre-restore SP to t0
4200b1ba: 00852083 lw ra,8(a0)
4200b1be: 4540 lw s0,12(a0)
4200b1c0: 4904 lw s1,16(a0)
4200b1c2: 01452903 lw s2,20(a0)
4200b1c6: 01852983 lw s3,24(a0)
4200b1ca: 01c52a03 lw s4,28(a0)
4200b1ce: 02052a83 lw s5,32(a0)
4200b1d2: 02452b03 lw s6,36(a0)
4200b1d6: 02852b83 lw s7,40(a0)
4200b1da: 02c52c03 lw s8,44(a0)
4200b1de: 03052c83 lw s9,48(a0)
4200b1e2: 03452d03 lw s10,52(a0)
4200b1e6: 03852d83 lw s11,56(a0)
4200b1ea: 03c52103 lw sp,60(a0)
4200b1ee: 0001 nop // <-- address the Debug Assist reports
4200b1f0: 0001 nop
4200b1f2: 0001 nop
4200b1f4: 0001 nop
4200b1f6: 0001 nop
4200b1f8: 0001 nop
4200b1fa: 0001 nop
4200b1fc: 0001 nop
4200b1fe: 0001 nop
4200b200: 4505 li a0,1 // <-- MEPC when the protection actually triggers
4200b202: 00008067 ret
- The debug assist always points to the instruction after loading SP as the one which triggered protection.
- Adding the `add t0,sp,zero` means temp register t0 holds the "before restore" SP value in the crash dump. Note that this SP value is also inside the task bounds.
- Note that neither SP value is close to the stack limit. I doubled the task stack size and re-tested just in case, it crashes the same.
- Adding the NOPs at the end means that the exception register dump is valid for all register values at the time of triggering (otherwise the CPU exception triggers a couple of instructions after returning, which makes it harder to follow). This is also why the Backtrace doesn't decode here (the SP doesn't point to the executing frame as it's just been updated). The stack isn't corrupt though: if you take the NOPs out, the Backtrace decodes correctly.
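The bounds claim can be checked with plain arithmetic on the values from the crash dump above. The sketch below is only an illustration: the macro and function names are hypothetical, and treating "valid" as "between the two reported stack bounds" is an assumption about what the hardware guard compares. The stack grows downward on RISC-V, so the headroom is measured from the low bound.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the crash dump above. */
#define STACK_LOW   0x3fca43a4u  /* "Stack bounds" lower address */
#define STACK_HIGH  0x3fca83a0u  /* "Stack bounds" upper address */
#define SP_AFTER    0x3fca7ff0u  /* SP register: value loaded by lw sp,60(a0) */
#define SP_BEFORE   0x3fca7fa0u  /* T0 register: SP before the restore */

/* True if sp lies inside [low, high]. */
static bool sp_in_bounds(uint32_t sp, uint32_t low, uint32_t high)
{
    return sp >= low && sp <= high;
}

/* Bytes remaining between sp and the low end of the stack. */
static uint32_t sp_headroom(uint32_t sp, uint32_t low)
{
    return sp - low;
}
```

With these values the stack region is 16380 bytes. The restored SP leaves 15436 bytes of headroom and the pre-restore SP leaves 15356 bytes, so both are inside the bounds and far from the limit.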
More Information.
- Suspect probably a context switch or an interrupt that triggers immediately before or after the `lw sp,60(a0)` instruction is causing the stack protection to trigger.
- Tried some simple patches in `components/esp_system/port/include/private/esp_private/hw_stack_guard.h` such as adding `fence` instructions and big `nop` blocks at the end of `ESP_HW_STACK_GUARD_MONITOR_STOP_CPU0` and `ESP_HW_STACK_GUARD_MONITOR_START_CPU0` macros, in case there was some race with the Debug Assist registers changing during a context switch. Still crashes, however I don't really know what I'm doing there.
- Have NOT tried disabling interrupts inside `nlr_jump`. That seems like a possible workaround but also doesn't seem like it should be necessary...?
Happy to try anything you recommend, might even be able to provide a C reproducer that uses setjmp/longjmp.