gh-129987: Disable GCC SLP autovectorization for the interpreter loop on x86-64 #132295


Merged — 1 commit merged into python:main on Apr 9, 2025

Conversation

@mpage (Contributor) commented Apr 9, 2025

#131750 mysteriously caused a ~6% regression for the free-threaded build. The cause was poor code generation of opcode dispatch in the interpreter loop. Before the change the dispatch code looked like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19cd0a: mov    -0x268(%rbp),%rsi
          19cd11: movzbl %ah,%ecx
          19cd14: movzbl %al,%eax
          19cd17: mov    %ecx,%r10d
          19cd1a: jmp    *(%rsi,%rax,8)

After the change, the dispatch code looked like:

# Shared dispatch code
/root/src/cpython/Python/generated_cases.c.h:81 [BINARY_OP]
            DISPATCH();

          19dd67: mov    -0x280(%rbp),%r10
          19dd6e: movzbl %ah,%ecx
          19dd71: movzbl %al,%eax
          19dd74: mov    %ecx,%r14d
          19dd77: mov    -0x270(%rbp),%rcx
          19dd7e: mov    (%rcx,%rax,8),%rdx
          19dd82: nopw   0x0(%rax,%rax,1)
          19dd88: movq   -0x258(%rbp),%xmm0
          19dd90: movq   %r12,%xmm4
          19dd95: punpcklqdq %xmm4,%xmm0
          19dd99: movhlps %xmm0,%xmm3
          19dd9c: movq   %xmm0,%r15
          19dda1: movq   %xmm3,%r11
          19dda6: mov    %r11,%rcx
          19dda9: jmp    *%rdx
          
# Duplicated dispatch code
/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19dde4: movzbl %ah,%ecx
          19dde7: movzbl %al,%eax
          19ddea: mov    %ecx,%r14d
          19dded: mov    -0x270(%rbp),%rcx
          19ddf4: mov    (%rcx,%rax,8),%rdx
          19ddf8: jmp    19dd99 <_PyEval_EvalFrameDefault+0x289>

There are two problems:

  1. We now have two jumps (one direct jump to the shared dispatch logic and one indirect jump to the next opcode handler) instead of one (the indirect jump to the opcode handler).
  2. There's a significant amount of register shuffling in the shared dispatch code.

Both of these problems appear to be caused by GCC's SLP autovectorizer. After the change, it decides to pack both the next_instr pointer and the stack_pointer into a single 128-bit register in the shared basic block that contains the opcode dispatch. This is introduced in the slp1 pass (tree dump below):

  _24061 = VIEW_CONVERT_EXPR<long unsigned int>(stack_pointer_14587);
  _24062 = VIEW_CONVERT_EXPR<long unsigned int>(next_instr_14097);
  _24063 = {_24062, _24061};

  <bb 19> [count: 1658034300]:
  # frame_2363(ab) = PHI <frame_20485(4258), frame_20519(18)>
  # oparg_1245(ab) = PHI <oparg_20252(4258), oparg_14635(18)>
  # next_instr_1246(ab) = PHI <next_instr_11924(4258), next_instr_14097(18)>
  # stack_pointer_2976(ab) = PHI <stack_pointer_20484(4258), stack_pointer_14587(18)>
  # _3209 = PHI <_20217(4258), _20681(18)>

  # 
  # Combination of next_instr and stack_pointer:
  # 

  # vect_next_instr_1246.7061_24064 = PHI <vect_next_instr_11924.7060_24060(4258), _24063(18)>
  _24067 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 64>;
  _24068(ab) = (union _PyStackRef *) _24067;
  _24065 = BIT_FIELD_REF <vect_next_instr_1246.7061_24064, 64, 0>;
  _24066(ab) = (union _Py_CODEUNIT *) _24065;

  # DEBUG stack_pointer => stack_pointer_2976(ab)
  # DEBUG next_instr => next_instr_1246(ab)
  # DEBUG oparg => oparg_1245(ab)
  # DEBUG frame => frame_2363(ab)
  goto _3209;

Disabling the SLP autovectorization pass for the interpreter loop fixes both problems. After this change the opcode dispatch code looks like:

/root/src/cpython/Python/generated_cases.c.h:8808 [LOAD_FAST_BORROW]
            DISPATCH();

          19aa37: mov    -0x260(%rbp),%rsi
          19aa3e: movzbl %ah,%ecx
          19aa41: movzbl %al,%eax
          19aa44: movslq %ecx,%r15
          19aa47: jmp    *(%rsi,%rax,8)

Performance improves by ~8% for the free-threaded build.

Surprisingly, this also seems to improve performance for the default build by ~4%. I don't understand why and I don't fully trust the result: the generated dispatch code for the default build looks unaffected by this change, and measuring instructions retired with fastbench shows a negligible change for the default build, versus an ~8% reduction for the free-threaded build.

@pinskia commented Apr 9, 2025

I think this is the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777 .

@mpage mpage requested review from colesbury and Yhg1s April 9, 2025 15:50
@mpage mpage marked this pull request as ready for review April 9, 2025 15:50
@mpage mpage requested a review from markshannon as a code owner April 9, 2025 15:50
@colesbury (Contributor) left a comment:

Nice!

@mpage mpage merged commit 1f5682f into python:main Apr 9, 2025
55 checks passed
@mpage mpage deleted the gh-129987-no-slp-vectorize branch April 9, 2025 17:34
seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
…r loop on x86-64 (python#132295)

The SLP autovectorizer can cause poor code generation for opcode dispatch, negating any benefit we get from vectorization elsewhere in the interpreter loop.