gh-129987: Disable GCC SLP autovectorization for the interpreter loop on x86-64 #132295


Merged — 1 commit merged into python:main on Apr 9, 2025

Conversation

@mpage (Contributor) commented Apr 9, 2025

#131750 mysteriously caused a ~6% regression for the free-threaded build. The cause was poor code generation of opcode dispatch in the interpreter loop. Before the change the dispatch code looked like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19cd0a: mov    -0x268(%rbp),%rsi
          19cd11: movzbl %ah,%ecx
          19cd14: movzbl %al,%eax
          19cd17: mov    %ecx,%r10d
          19cd1a: jmp    *(%rsi,%rax,8)

After the change, the dispatch code looked like:

# Shared dispatch code
/root/src/cpython/Python/generated_cases.c.h:81 [BINARY_OP]
            DISPATCH();

          19dd67: mov    -0x280(%rbp),%r10
          19dd6e: movzbl %ah,%ecx
          19dd71: movzbl %al,%eax
          19dd74: mov    %ecx,%r14d
          19dd77: mov    -0x270(%rbp),%rcx
          19dd7e: mov    (%rcx,%rax,8),%rdx
          19dd82: nopw   0x0(%rax,%rax,1)
          19dd88: movq   -0x258(%rbp),%xmm0
          19dd90: movq   %r12,%xmm4
          19dd95: punpcklqdq %xmm4,%xmm0
          19dd99: movhlps %xmm0,%xmm3
          19dd9c: movq   %xmm0,%r15
          19dda1: movq   %xmm3,%r11
          19dda6: mov    %r11,%rcx
          19dda9: jmp    *%rdx
          
# Duplicated dispatch code
/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19dde4: movzbl %ah,%ecx
          19dde7: movzbl %al,%eax
          19ddea: mov    %ecx,%r14d
          19dded: mov    -0x270(%rbp),%rcx
          19ddf4: mov    (%rcx,%rax,8),%rdx
          19ddf8: jmp    19dd99 <_PyEval_EvalFrameDefault+0x289>

There are two problems:

  1. We now have two jumps (one direct jump to the shared dispatch logic and one indirect jump to the next opcode handler) instead of one (the indirect jump to the opcode handler).
  2. There's a significant amount of register shuffling in the shared dispatch code.

Both of these problems appear to be caused by GCC's SLP autovectorizer. After the change, it decides to pack both the next_instr pointer and the stack_pointer into a single 128-bit register in the shared basic block that contains the opcode dispatch. This is introduced in the slp1 pass (tree dump below):

  _24061 = VIEW_CONVERT_EXPR<long unsigned int>(stack_pointer_14587);
  _24062 = VIEW_CONVERT_EXPR<long unsigned int>(next_instr_14097);
  _24063 = {_24062, _24061};

  <bb 19> [count: 1658034300]:
  # frame_2363(ab) = PHI <frame_20485(4258), frame_20519(18)>
  # oparg_1245(ab) = PHI <oparg_20252(4258), oparg_14635(18)>
  # next_instr_1246(ab) = PHI <next_instr_11924(4258), next_instr_14097(18)>
  # stack_pointer_2976(ab) = PHI <stack_pointer_20484(4258), stack_pointer_14587(18)>
  # _3209 = PHI <_20217(4258), _20681(18)>

  # 
  # Combination of next_instr and stack_pointer:
  # 

  # vect_next_instr_1246.7061_24064 = PHI <vect_next_instr_11924.7060_24060(4258), _24063(18)>
  _24067 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 64>;
  _24068(ab) = (union _PyStackRef *) _24067;
  _24065 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 0>;
  _24066(ab) = (union _Py_CODEUNIT *) _24065;

  # DEBUG stack_pointer => stack_pointer_2976(ab)
  # DEBUG next_instr => next_instr_1246(ab)
  # DEBUG oparg => oparg_1245(ab)
  # DEBUG frame => frame_2363(ab)
  goto _3209;

Disabling the SLP autovectorization pass for the interpreter loop fixes both problems. After this change the opcode dispatch code looks like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19aa37: mov    -0x260(%rbp),%rsi
          19aa3e: movzbl %ah,%ecx
          19aa41: movzbl %al,%eax
          19aa44: movslq %ecx,%r15
          19aa47: jmp    *(%rsi,%rax,8)

Performance improves by ~8% for the free-threaded build.

Surprisingly, this also seems to improve performance for the default build by ~4%. I don't understand why and I don't fully trust the result: the generated dispatch code for the default build looks unaffected by this change, and measuring instructions retired with fastbench shows a negligible change for the default build, versus an ~8% reduction for the free-threaded build.

@pinskia commented Apr 9, 2025

I think this is the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777 .

@mpage mpage requested review from colesbury and Yhg1s April 9, 2025 15:50
@mpage mpage marked this pull request as ready for review April 9, 2025 15:50
@mpage mpage requested a review from markshannon as a code owner April 9, 2025 15:50
@colesbury (Contributor) left a comment:

Nice!

@mpage mpage merged commit 1f5682f into python:main Apr 9, 2025
55 checks passed
@mpage mpage deleted the gh-129987-no-slp-vectorize branch April 9, 2025 17:34
seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
…r loop on x86-64 (python#132295)

The SLP autovectorizer can cause poor code generation for opcode dispatch, negating any benefit we get from vectorization elsewhere in the interpreter loop.