ESP32-C3 Stack Protection Debug Assist module triggering on SP load (IDFGH-13568) #14456

@projectgus

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.2.2, also v5.2.2-639-g43098fc4de

Espressif SoC revision.

ESP32-C3 (QFN32) (revision v0.4)

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

SEEED XIAO ESP32-C3

Power Supply used.

USB

What is the expected behavior?

Load SP register with a valid address (inside the current task's stack region) without Debug Assist hardware Stack Protection triggering.

What is the actual behavior?

Loading the SP register seems to intermittently trigger a hardware stack protector interrupt. All of the reported addresses look valid for the running task, i.e. there was no stack overflow or SP corruption.

Steps to reproduce.

Reproduction currently requires the MicroPython master branch and some Python code that sends a lot of data over Wi-Fi. (The original bug is micropython/micropython#15667)

It is probably possible to make a simpler reproducer. My best guess is that the key features are:

  • Frequent context switches and/or interrupts (or maybe something else to do with Wi-Fi activity, but my guess is interrupts).
  • Execution in the task is jumping around using a setjmp/longjmp style mechanism. For MicroPython this is "native NLR", implemented here: https://github.com/micropython/micropython/blob/master/py/nlrrv32.c#L53. However, the original issue report was using libc setjmp/longjmp.

Note that all of the jumps are happening within the same task, and the stack pointer is saved and restored each time to/from a valid value for the current executing task.
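
For illustration, the jump pattern described above can be sketched in plain C with libc setjmp/longjmp. This is only a minimal sketch of the second ingredient: all names (`jump_env`, `raise_nonlocal`, `run_jumps`) are hypothetical, it is not the MicroPython NLR code, and it has not been confirmed to reproduce the fault. On the target it would presumably need to run in a loop inside a FreeRTOS task while Wi-Fi traffic generates interrupts.

```c
#include <setjmp.h>

/* Hypothetical names throughout; this is not MicroPython or ESP-IDF code. */
static jmp_buf jump_env;

/* Stand-in for a deep call chain that unwinds non-locally. */
static void raise_nonlocal(int value) {
    longjmp(jump_env, value);   /* restores SP and callee-saved registers */
}

/* Performs `iterations` setjmp/longjmp round trips and returns how many
 * times control came back through longjmp. */
int run_jumps(int iterations) {
    volatile int landed = 0;
    for (volatile int i = 0; i < iterations; i++) {
        if (setjmp(jump_env) == 0) {
            raise_nonlocal(1);  /* never returns normally */
        } else {
            landed++;           /* reached via longjmp */
        }
    }
    return landed;
}
```

Every jump here stays within one thread of execution, and the SP value restored by longjmp is always one that setjmp saved on the same stack, which matches the situation described in the note above.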

Debug Logs.

Here's a sample crash:

MPY version : v1.24.0-preview.201.g24aa8ed762.dirty on 2024-08-28
IDF version : v5.2.2
Machine     : ESP32C3 module with ESP32C3

Guru Meditation Error: Core  0 panic'ed (Stack protection fault). 

Detected in task "mp_task" at 0x4200b1ee
0x4200b1ee: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55

Stack pointer: 0x3fca7ff0
Stack bounds: 0x3fca43a4 - 0x3fca83a0


Core  0 register dump:
Stack dump detected
MEPC    : 0x4200b200  RA      : 0x403829fa  SP      : 0x3fca7ff0  GP      : 0x3fc96e00  
0x4200b200: nlr_jump at /home/gus/ry/george/micropython/py/nlrrv32.c:55
0x403829fa: mp_execute_bytecode at /home/gus/ry/george/micropython/py/vm.c:285

TP      : 0x3fc6b838  T0      : 0x3fca7fa0  T1      : 0x40390f52  T2      : 0x0000003f  
0x40390f52: vTaskSuspend at /home/gus/ry/george/esp-idf-v5/components/freertos/FreeRTOS-Kernel/tasks.c:1960 (discriminator 1)

S0/FP   : 0x3fcabbe0  S1      : 0x3fcabc30  A0      : 0x3fca8010  A1      : 0x00000054  
A2      : 0x00000000  A3      : 0x3fcc99c0  A4      : 0x3fcc99c0  A5      : 0x3fca80e0  
A6      : 0x00000002  A7      : 0x21400000  S2      : 0x3c17034c  S3      : 0x3fcc9950  
S4      : 0x00000001  S5      : 0x00000062  S6      : 0x00000068  S7      : 0x3c16dc1c  
S8      : 0x0000001b  S9      : 0x3c16e000  S10     : 0x3c178419  S11     : 0x3c1781b6  
T3      : 0x00000000  T4      : 0x0003877f  T5      : 0x00000003  T6      : 0x00000001  
MSTATUS : 0x00001881  MTVEC   : 0x40380001  MCAUSE  : 0x0000001b  MTVAL   : 0x00004505  
0x40380001: _vector_table at ??:?

MHARTID : 0x00000000  


Backtrace:


0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
55          __asm volatile (
#0  0x4200b200 in nlr_jump (val=0x3fcabc30) at /home/gus/ry/george/micropython/py/nlrrv32.c:55
#1  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
ELF file SHA256: 7e6b188d6

Note that the Stack pointer address in the dump is valid for the bounds of the task.

This crash dump was created with a couple of additions in the nlr_jump function to try and get extra debug info:

4200b184 <nlr_jump>:
        "sw   x2, 60(x10)       \n" // Store SP.
        "jal  x0, nlr_push_tail \n" // Jump to the C part.
        );
}

NORETURN void nlr_jump(void *val) {
4200b184:       1141                    addi    sp,sp,-16
4200b186:       c226                    sw      s1,4(sp)
4200b188:       c04a                    sw      s2,0(sp)
4200b18a:       c606                    sw      ra,12(sp)
4200b18c:       84aa                    mv      s1,a0
    MP_NLR_JUMP_HEAD(val, top)
4200b18e:       dc1fc0ef                jal     ra,42007f4e <mp_thread_get_state>
4200b192:       01452903                lw      s2,20(a0)
4200b196:       c422                    sw      s0,8(sp)
4200b198:       00091563                bnez    s2,4200b1a2 <nlr_jump+0x1e>
4200b19c:       8526                    mv      a0,s1
4200b19e:       ec2fa0ef                jal     ra,42005860 <nlr_jump_fail>
4200b1a2:       842a                    mv      s0,a0
4200b1a4:       00992223                sw      s1,4(s2)
4200b1a8:       854a                    mv      a0,s2
4200b1aa:       71a420ef                jal     ra,4204d8c4 <nlr_call_jump_callbacks>
4200b1ae:       00092783                lw      a5,0(s2)
4200b1b2:       c85c                    sw      a5,20(s0)
    __asm volatile (
4200b1b4:       854a                    mv      a0,s2
4200b1b6:       000102b3                add     t0,sp,zero  // Note: stored pre-restore SP to t0
4200b1ba:       00852083                lw      ra,8(a0)
4200b1be:       4540                    lw      s0,12(a0)
4200b1c0:       4904                    lw      s1,16(a0)
4200b1c2:       01452903                lw      s2,20(a0)
4200b1c6:       01852983                lw      s3,24(a0)
4200b1ca:       01c52a03                lw      s4,28(a0)
4200b1ce:       02052a83                lw      s5,32(a0)
4200b1d2:       02452b03                lw      s6,36(a0)
4200b1d6:       02852b83                lw      s7,40(a0)
4200b1da:       02c52c03                lw      s8,44(a0)
4200b1de:       03052c83                lw      s9,48(a0)
4200b1e2:       03452d03                lw      s10,52(a0)
4200b1e6:       03852d83                lw      s11,56(a0)
4200b1ea:       03c52103                lw      sp,60(a0)
4200b1ee:       0001                    nop  // <-- address the Debug Assist reports
4200b1f0:       0001                    nop
4200b1f2:       0001                    nop
4200b1f4:       0001                    nop
4200b1f6:       0001                    nop
4200b1f8:       0001                    nop
4200b1fa:       0001                    nop
4200b1fc:       0001                    nop
4200b1fe:       0001                    nop
4200b200:       4505                    li      a0,1  // <-- MEPC when the protection actually triggers
4200b202:       00008067                ret
  • The debug assist always points to the instruction after loading SP as the one which triggered protection.
  • Adding the add t0,sp,zero means temp register t0 holds the "before restore" SP value in the crash dump. Note that this SP value is also inside the task bounds.
  • Note that neither SP value is close to the stack limit. I doubled the task stack size and re-tested just in case, and it crashes the same way.
  • Adding the NOPs at the end means that the exception register dump is valid for all register values at the time of triggering (otherwise the CPU exception triggers a couple of instructions after returning, which makes it harder to follow). This is also why the Backtrace doesn't decode here (the SP doesn't point to the executing frame as it has just been updated). The stack isn't corrupt, though: if you take the NOPs out, the Backtrace decodes correctly.
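
The claim that both SP values sit well inside the task's stack can be checked with simple arithmetic on the numbers from the crash dump. A small sketch, where the helper names are hypothetical and the constants are copied from the "Stack bounds" line and the SP and T0 registers above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the crash dump above. */
#define STACK_LOW   0x3fca43a4u  /* lower "Stack bounds" address */
#define STACK_HIGH  0x3fca83a0u  /* upper "Stack bounds" address */
#define SP_AFTER    0x3fca7ff0u  /* SP register: value loaded by lw sp,60(a0) */
#define SP_BEFORE   0x3fca7fa0u  /* T0 register: SP saved before the restore */

/* Hypothetical helper: true if addr lies within [low, high]. */
static bool in_stack_bounds(uint32_t addr, uint32_t low, uint32_t high) {
    return addr >= low && addr <= high;
}

/* Hypothetical helper: bytes left between addr and the low limit
 * (the stack grows downwards towards `low`). */
static uint32_t headroom_bytes(uint32_t addr, uint32_t low) {
    return addr - low;
}
```

With these values, both addresses pass `in_stack_bounds`, and `headroom_bytes` gives 15436 bytes for the restored SP and 15356 bytes for the pre-restore SP, out of a 16380-byte stack region.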

More Information.

  • I suspect that a context switch or an interrupt firing immediately before or after the lw sp,60(a0) instruction is causing the stack protection to trigger.
  • Tried some simple patches in components/esp_system/port/include/private/esp_private/hw_stack_guard.h, such as adding fence instructions and big nop blocks at the end of the ESP_HW_STACK_GUARD_MONITOR_STOP_CPU0 and ESP_HW_STACK_GUARD_MONITOR_START_CPU0 macros, in case there was some race with the Debug Assist registers changing during a context switch. It still crashes; however, I don't really know what I'm doing there.
  • Have NOT tried disabling interrupts inside nlr_jump. That seems like a possible workaround but also doesn't seem like it should be necessary...?

Happy to try anything you recommend. I might even be able to provide a C reproducer that uses setjmp/longjmp.
