From 47a682fd2a7153a5164673d38d8e129fcaf4d944 Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Thu, 1 Feb 2024 20:54:41 +0800 Subject: [PATCH 01/11] What's new in Python 3.13: JIT compiler --- Doc/whatsnew/3.13.rst | 124 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 124 insertions(+) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 6b94a3771406fa..647a3760bdd4f2 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -82,6 +82,14 @@ Important deprecations, removals or restrictions: followed by three years of security fixes. +Interpreter improvements: + +* A basic :ref:`JIT compiler ` was added. + It is currently disabled by default (though we may turn it on later). + Performance improvements are modest -- we expect to be improving this + over the next few releases. + + New Features ============ @@ -478,6 +486,122 @@ Optimizations (Contributed by Jakub Kulik in :gh:`113117`.) +.. _whatsnew313-jit-compiler: + +Experimental JIT Compiler +========================= + +:Editor: Guido van Rossum, Ken Jin + +When CPython is configured using the ``--enable-experimental-jit`` option, +a just-in-time compiler is added which can speed up some Python programs. +The internal architecture is roughly as follows. + +Intermediate Representation +--------------------------- + +We start with specialized *Tier 1 bytecode*. +See :ref:`What's new in 3.11 ` for details. + +When the Tier 1 bytecode gets hot enough, the interpreter creates +straight-line sequences of bytecode known as "traces", and translates that +to a new, purely internal *Tier 2 IR*, a.k.a. micro-ops ("uops"). +These straight-line sequences can cross function call boundaries, +allowing more effective optimizations, listed in the next section. + +The Tier 2 IR uses the same stack-based VM as Tier 1, but the +instruction format is better suited to translation to machine code. 
+ +(Tier 2 IR contributed by Mark Shannon and Guido van Rossum.) + +Optimizations +------------- + +We have several optimization and analysis passes for Tier 2 IR, which +are applied before Tier 2 IR is interpreted or translated to machine code. +These optimizations take unoptimized Tier 2 IR and produce optimized Tier 2 +IR: + +* Type propagation -- through forward data-flow analysis, we infer + and deduce information about types. This allows us to eliminate + much of the overhead associated with dynamic typing in the future. + +* Constant propagation -- through forward data-flow analysis, we can reduce + expressions like :: + + a = 1 + b = 2 + c = a + b + + to :: + + a = 1 + b = 2 + c = 3 + +* Guard elimination -- through a combination of constant and type information, + we can eliminate type checks and other guards associated with operations. + As a proof of concept, we managed to eliminate over 70% + of integer type checks in our own benchmarks. + +* Loop splitting -- after the first iteration, we gain a lot more type + information. Thus, we peel the first iteration of loops to produce + an optimized body that exploits this additional type information. + This also achieves a similar effect to an optimization called + loop-invariant code motion, but only for guards. + +* Globals to constant promotion -- global value loads become constant + loads, speeding them up and also allowing for more constant propagation. + +* This section is non-exhaustive and will be updated with further + optimizations, up till CPython 3.13's release. + +(Tier 2 optimizer contributed by Ken Jin, with implementation help +by Guido van Rossum, Mark Shanno, and Jules Poon. Special thanks +to Manuel Rigger and Martin Henz.) + +Execution Engine +---------------- + +There are two execution engines for Tier 2 IR. + +The first is the Tier 2 interpreter, but it is mostly intended for debugging +the earlier stages of the optimization pipeline. 
If the JIT is not +enabled, the Tier 2 interpreter can be invoked by passing Python the +``-X uops`` option or by setting the ``PYTHON_UOPS`` environment +variable to ``1``. + +The second is machine code. When the ``--enable-experimental-jit`` +option is used, the optimized Tier 2 IR is translated to machine +code, which is then executed. This does not require additional +runtime options. + +The machine code translation process uses an architecture called +*copy-and-patch*. It has no runtime dependencies, but there is a new +build-time dependency on LLVM. + +(Copy-and-patch JIT compiler contributed by Brandt Bucher, +directly inspired by the paper "Copy-and-Patch Compilation" +by Haoran Xu and Fredrik Kjolstad). + +Results and Future Work +----------------------- + +The final performance results will be updated before CPython 3.13's release. + +The JIT compiler is rather unoptimized, and serves as the foundation +for significant optimizations in future releases. As such, we do not +expect the first iteration of the JIT compiler to produce a significant +speedup. + +About +----- + +This work was done by the Faster CPython team, and many other external +contributors. The team consists of engineers from Microsoft, Meta +Quansight, and Bloomberg, who are either paid in part to do this, or +volunteer in their free time. + Deprecated ========== From 2ff848af875039f29ee62e8b6ba0ee1180abbdf0 Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Thu, 1 Feb 2024 21:03:13 +0800 Subject: [PATCH 02/11] correct Mark's name --- Doc/whatsnew/3.13.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 647a3760bdd4f2..f36722b57fca41 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -557,7 +557,7 @@ IR: optimizations, up till CPython 3.13's release. 
(Tier 2 optimizer contributed by Ken Jin, with implementation help -by Guido van Rossum, Mark Shanno, and Jules Poon. Special thanks +by Guido van Rossum, Mark Shannon, and Jules Poon. Special thanks to Manuel Rigger and Martin Henz.) Execution Engine From b6a94a4c79f5c3363aeb354edca435697dc54a91 Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Thu, 1 Feb 2024 21:05:30 +0800 Subject: [PATCH 03/11] missing comma --- Doc/whatsnew/3.13.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index f36722b57fca41..81f21d284df782 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -598,7 +598,7 @@ About ----- This work was done by the Faster CPython team, and many other external -contributors. The team consists of engineers from Microsoft, Meta +contributors. The team consists of engineers from Microsoft, Meta, Quansight, and Bloomberg, who are either paid in part to do this, or volunteer in their free time. From 15b4cead2222b692b3aad6a8d6635a8c0d8301ba Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Thu, 1 Feb 2024 23:15:45 +0800 Subject: [PATCH 04/11] address review --- Doc/whatsnew/3.13.rst | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 81f21d284df782..ae151c66a804de 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -493,9 +493,9 @@ Experimental JIT Compiler :Editor: Guido van Rossum, Ken Jin -When CPython is configured using the ``--enable-experimental-jit`` option, -a just-in-time compiler is added which can speed up some Python programs. -The internal architecture is roughly as follows. +When CPython is configured using the ``--enable-experimental-jit`` build-time +option, a just-in-time compiler is added which can speed up some Python +programs. 
The internal architecture is roughly as follows. Intermediate Representation --------------------------- @@ -541,8 +541,10 @@ IR: * Guard elimination -- through a combination of constant and type information, we can eliminate type checks and other guards associated with operations. - As a proof of concept, we managed to eliminate over 70% - of integer type checks in our own benchmarks. + These guards validate specialized operations, but add a slight bit of + overhead. For example, integer addition needs a type check that checks + both operands are integers. As a proof of concept, we managed to eliminate + over 70% of integer type checks in our own benchmarks. * Loop splitting -- after the first iteration, we gain a lot more type information. Thus, we peel the first iteration of loops to produce @@ -572,7 +574,7 @@ enabled, the Tier 2 interpreter can be invoked by passing Python the variable to ``1``. The second is machine code. When the ``--enable-experimental-jit`` -option is used, the optimized Tier 2 IR is translated to machine +build-time option is used, the optimized Tier 2 IR is translated to machine code, which is then executed. This does not require additional runtime options. @@ -581,8 +583,10 @@ The machine code translation process uses an architecture called build-time dependency on LLVM. (Copy-and-patch JIT compiler contributed by Brandt Bucher, -directly inspired by the paper "Copy-and-Patch Compilation" -by Haoran Xu and Fredrik Kjolstad). +directly inspired by the paper +`Copy-and-Patch Compilation `_ +by Haoran Xu and Fredrik Kjolstad. For more information, +`a talk `_ is available.) 
Results and Future Work ----------------------- From 4f4d4ce8cf48e65728fd75e646463bab01258d51 Mon Sep 17 00:00:00 2001 From: Ken Jin Date: Fri, 2 Feb 2024 01:59:05 +0800 Subject: [PATCH 05/11] fix upstream merge changes --- Doc/whatsnew/3.13.rst | 54 ++++++------------------------------------- 1 file changed, 7 insertions(+), 47 deletions(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 3659c04e96a4ae..7d8e7222881ea8 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -89,14 +89,6 @@ Interpreter improvements: over the next few releases. -Interpreter improvements: - -* A basic :ref:`JIT compiler ` was added. - It is currently disabled by default (though we may turn it on later). - Performance improvements are modest -- we expect to be improving this - over the next few releases. - - New Features ============ @@ -492,49 +484,11 @@ Optimizations FreeBSD and Solaris. See the ``subprocess`` section above for details. (Contributed by Jakub Kulik in :gh:`113117`.) -.. _whatsnew313-jit-compiler: - -Experimental JIT Compiler -========================= - -When CPython is configured using the ``--enable-experimental-jit`` option, -a just-in-time compiler is added which can speed up some Python programs. - -The internal architecture is roughly as follows. - -* We start with specialized *Tier 1 bytecode*. - See :ref:`What's new in 3.11 ` for details. - -* When the Tier 1 bytecode gets hot enough, it gets translated - to a new, purely internal *Tier 2 IR*, a.k.a. micro-ops ("uops"). - -* The Tier 2 IR uses the same stack-based VM as Tier 1, but the - instruction format is better suited to translation to machine code. - -* We have several optimization passes for Tier 2 IR, which are applied - before it is interpreted or translated to machine code. - -* There is a Tier 2 interpreter, but it is mostly intended for debugging - the earlier stages of the optimization pipeline. 
If the JIT is not - enabled, the Tier 2 interpreter can be invoked by passing Python the - ``-X uops`` option or by setting the ``PYTHON_UOPS`` environment - variable to ``1``. - -* When the ``--enable-experimental-jit`` option is used, the optimized - Tier 2 IR is translated to machine code, which is then executed. - This does not require additional runtime options. - -* The machine code translation process uses an architecture called - *copy-and-patch*. It has no runtime dependencies, but there is a new - build-time dependency on LLVM. - -(JIT by Brandt Bucher, inspired by a paper by Haoran Xu and Fredrik Kjolstad. -Tier 2 IR by Mark Shannon and Guido van Rossum. -Tier 2 optimizer by Ken Jin.) .. _whatsnew313-jit-compiler: + Experimental JIT Compiler ========================= @@ -544,6 +498,7 @@ When CPython is configured using the ``--enable-experimental-jit`` build-time option, a just-in-time compiler is added which can speed up some Python programs. The internal architecture is roughly as follows. + Intermediate Representation --------------------------- @@ -561,6 +516,7 @@ instruction format is better suited to translation to machine code. (Tier 2 IR contributed by Mark Shannon and Guido van Rossum.) + Optimizations ------------- @@ -609,6 +565,7 @@ IR: by Guido van Rossum, Mark Shannon, and Jules Poon. Special thanks to Manuel Rigger and Martin Henz.) + Execution Engine ---------------- @@ -635,6 +592,7 @@ directly inspired by the paper by Haoran Xu and Fredrik Kjolstad. For more information, `a talk `_ is available.) + Results and Future Work ----------------------- @@ -645,6 +603,7 @@ for significant optimizations in future releases. As such, we do not expect the first iteration of the JIT compiler to produce a significant speedup. + About ----- @@ -653,6 +612,7 @@ contributors. The team consists of engineers from Microsoft, Meta, Quansight, and Bloomberg, who are either paid in part to do this, or volunteer in their free time. 
+ Deprecated ========== From 22b056570c558c2ff78da05b4fcb21000461d76b Mon Sep 17 00:00:00 2001 From: Ken Jin Date: Fri, 2 Feb 2024 03:05:58 +0800 Subject: [PATCH 06/11] talk about copy and patch compilation speed --- Doc/whatsnew/3.13.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 7d8e7222881ea8..151638fef9079e 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -582,9 +582,11 @@ build-time option is used, the optimized Tier 2 IR is translated to machine code, which is then executed. This does not require additional runtime options. -The machine code translation process uses an architecture called +The machine code translation process uses a technique called *copy-and-patch*. It has no runtime dependencies, but there is a new -build-time dependency on LLVM. +build-time dependency on LLVM. The main benefit of this technique is +rapid compilation times, reported as orders of magnitudes faster versus +traditional compilation techniques in the paper linked below. (Copy-and-patch JIT compiler contributed by Brandt Bucher, directly inspired by the paper From 63fc77bdb0a9c112e438a13a00d104c6aed72614 Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Fri, 2 Feb 2024 11:51:53 +0800 Subject: [PATCH 07/11] rapid -> fast --- Doc/whatsnew/3.13.rst | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 151638fef9079e..2ce1b2526ebec5 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -585,8 +585,10 @@ runtime options. The machine code translation process uses a technique called *copy-and-patch*. It has no runtime dependencies, but there is a new build-time dependency on LLVM. The main benefit of this technique is -rapid compilation times, reported as orders of magnitudes faster versus -traditional compilation techniques in the paper linked below. 
+fast compilation, reported as orders of magnitudes faster versus +traditional compilation techniques in the paper linked below. The code +produced is slightly less optimized, but suitable for a baseline JIT +compiler. (Copy-and-patch JIT compiler contributed by Brandt Bucher, directly inspired by the paper From a3185540f488369c2842c8cba8b3a759465715c5 Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Fri, 2 Feb 2024 11:53:19 +0800 Subject: [PATCH 08/11] Update 3.13.rst --- Doc/whatsnew/3.13.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 2ce1b2526ebec5..dfd03733740e27 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -577,7 +577,7 @@ enabled, the Tier 2 interpreter can be invoked by passing Python the ``-X uops`` option or by setting the ``PYTHON_UOPS`` environment variable to ``1``. -The second is machine code. When the ``--enable-experimental-jit`` +The second is the JIT compiler. When the ``--enable-experimental-jit`` build-time option is used, the optimized Tier 2 IR is translated to machine code, which is then executed. This does not require additional runtime options. From 68d60e4fb82699682a8e727d006a031ff02dd5aa Mon Sep 17 00:00:00 2001 From: Ken Jin Date: Fri, 2 Feb 2024 19:08:11 +0800 Subject: [PATCH 09/11] Apply suggestions from code review Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com> --- Doc/whatsnew/3.13.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index dfd03733740e27..e78782a96dbb41 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -559,7 +559,7 @@ IR: loads, speeding them up and also allowing for more constant propagation. * This section is non-exhaustive and will be updated with further - optimizations, up till CPython 3.13's release. + optimizations, up till CPython 3.13's beta release. 
(Tier 2 optimizer contributed by Ken Jin, with implementation help by Guido van Rossum, Mark Shannon, and Jules Poon. Special thanks @@ -600,7 +600,7 @@ by Haoran Xu and Fredrik Kjolstad. For more information, Results and Future Work ----------------------- -The final performance results will be updated before CPython 3.13's release. +The final performance results will be updated before CPython 3.13's beta release. The JIT compiler is rather unoptimized, and serves as the foundation for significant optimizations in future releases. As such, we do not From 18c4ba80911ec6c4526fb1c73121932ea5e598b6 Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Sat, 3 Feb 2024 02:27:20 +0800 Subject: [PATCH 10/11] Apply changes by Michael Co-Authored-By: Michael Droettboom --- Doc/whatsnew/3.13.rst | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index e78782a96dbb41..22f7a8bbc86263 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -559,7 +559,7 @@ IR: loads, speeding them up and also allowing for more constant propagation. * This section is non-exhaustive and will be updated with further - optimizations, up till CPython 3.13's beta release. + optimizations, until CPython 3.13's beta release. (Tier 2 optimizer contributed by Ken Jin, with implementation help by Guido van Rossum, Mark Shannon, and Jules Poon. Special thanks @@ -569,9 +569,10 @@ to Manuel Rigger and Martin Henz.) Execution Engine ---------------- -There are two execution engines for Tier 2 IR. +There are two execution engines for Tier 2 IR: +the Tier 2 interpreter and the Just-in-Time (JIT) compiler. -The first is the Tier 2 interpreter, but it is mostly intended for debugging +The Tier 2 interpreter is mostly intended for debugging the earlier stages of the optimization pipeline. 
If the JIT is not enabled, the Tier 2 interpreter can be invoked by passing Python the ``-X uops`` option or by setting the ``PYTHON_UOPS`` environment @@ -584,7 +585,8 @@ runtime options. The machine code translation process uses a technique called *copy-and-patch*. It has no runtime dependencies, but there is a new -build-time dependency on LLVM. The main benefit of this technique is +build-time dependency on `LLVM `_. +The main benefit of this technique is fast compilation, reported as orders of magnitudes faster versus traditional compilation techniques in the paper linked below. The code produced is slightly less optimized, but suitable for a baseline JIT From 2312af0964480f1a5cdb6eaecffb0daf4363d1c3 Mon Sep 17 00:00:00 2001 From: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Date: Wed, 14 Feb 2024 00:33:14 +0800 Subject: [PATCH 11/11] Address Guido's review --- Doc/whatsnew/3.13.rst | 58 ++++++++++++++++--------------------------- 1 file changed, 21 insertions(+), 37 deletions(-) diff --git a/Doc/whatsnew/3.13.rst b/Doc/whatsnew/3.13.rst index 22f7a8bbc86263..18b8f3af5d52fd 100644 --- a/Doc/whatsnew/3.13.rst +++ b/Doc/whatsnew/3.13.rst @@ -525,29 +525,22 @@ are applied before Tier 2 IR is interpreted or translated to machine code. These optimizations take unoptimized Tier 2 IR and produce optimized Tier 2 IR: -* Type propagation -- through forward data-flow analysis, we infer - and deduce information about types. This allows us to eliminate - much of the overhead associated with dynamic typing in the future. - -* Constant propagation -- through forward data-flow analysis, we can reduce - expressions like :: - - a = 1 - b = 2 - c = a + b +* This section is non-exhaustive and will be updated with further + optimizations, until CPython 3.13's beta release. - to :: +* Type propagation -- through forward + `data-flow analysis `_, + we infer and deduce information about types. 
- a = 1 - b = 2 - c = 3 +* Constant propagation -- through forward data-flow analysis, we can + evaluate in advance bytecode that we know operates on constants. * Guard elimination -- through a combination of constant and type information, we can eliminate type checks and other guards associated with operations. These guards validate specialized operations, but add a slight bit of overhead. For example, integer addition needs a type check that checks - both operands are integers. As a proof of concept, we managed to eliminate - over 70% of integer type checks in our own benchmarks. + both operands are integers. If we know that an integer guard's operands + are guaranteed to be integers, we can safely eliminate it. * Loop splitting -- after the first iteration, we gain a lot more type information. Thus, we peel the first iteration of loops to produce @@ -557,13 +550,12 @@ IR: * Globals to constant promotion -- global value loads become constant loads, speeding them up and also allowing for more constant propagation. + This work relies on dictionary watchers, implemented in 3.12. + (Contributed by Mark Shannon in :gh:`113710`.) -* This section is non-exhaustive and will be updated with further - optimizations, until CPython 3.13's beta release. - -(Tier 2 optimizer contributed by Ken Jin, with implementation help -by Guido van Rossum, Mark Shannon, and Jules Poon. Special thanks -to Manuel Rigger and Martin Henz.) +(Tier 2 optimizer contributed by Ken Jin and Mark Shannon, +with implementation help by Guido van Rossum. Special thanks +to Manuel Rigger.) Execution Engine @@ -590,33 +582,25 @@ The main benefit of this technique is fast compilation, reported as orders of magnitudes faster versus traditional compilation techniques in the paper linked below. The code produced is slightly less optimized, but suitable for a baseline JIT -compiler. +compiler. Fast compilation is critical to reduce the runtime overhead +of the JIT compiler. 
(Copy-and-patch JIT compiler contributed by Brandt Bucher, directly inspired by the paper `Copy-and-Patch Compilation `_ by Haoran Xu and Fredrik Kjolstad. For more information, -`a talk `_ is available.) +`a talk `_ by Brandt Bucher +is available.) Results and Future Work ----------------------- -The final performance results will be updated before CPython 3.13's beta release. +The final performance results will be published here before +CPython 3.13's beta release. The JIT compiler is rather unoptimized, and serves as the foundation -for significant optimizations in future releases. As such, we do not -expect the first iteration of the JIT compiler to produce a significant -speedup. - - -About ------ - -This work was done by the Faster CPython team, and many other external -contributors. The team consists of engineers from Microsoft, Meta, -Quansight, and Bloomberg, who are either paid in part to do this, or -volunteer in their free time. +for significant optimizations in future releases. Deprecated