From d3ce64dfd78b2b2776273c2245bbc9ab8eff5356 Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Tue, 23 Aug 2022 23:01:01 -0700 Subject: [PATCH 01/12] Start describing the 3.11 bytecode interpreter (DON'T REVIEW) This commit is just a backup in case my machine crashes. --- internals/index.rst | 1 + internals/interpreter.rst | 214 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 215 insertions(+) create mode 100644 internals/interpreter.rst diff --git a/internals/index.rst b/internals/index.rst index ce04c665c1..335ac8d350 100644 --- a/internals/index.rst +++ b/internals/index.rst @@ -9,3 +9,4 @@ CPython's Internals parser compiler garbage-collector + interpreter diff --git a/internals/interpreter.rst b/internals/interpreter.rst new file mode 100644 index 0000000000..a6c62fb207 --- /dev/null +++ b/internals/interpreter.rst @@ -0,0 +1,214 @@ +.. _interpreter: + +===================================== +The CPython 3.11 Bytecode Interpreter +===================================== + +.. highlight:: none + +Preface +======= + +The CPython 3.11 bytecode interpreter (a.k.a. virtual machine) has a number of improvements over 3.10. +We describe the inner workings of the 3.11 interpreter here, with an emphasis on understanding not just the code but its design. +While the interpreter is forever evolving, and the 3.12 design will undoubtedly be different again, understanding the 3.11 design will help you understand future improvements to the interpreter. + +Introduction +============ + +The bytecode interpreter's job is to execute Python code. +Its main input is a code object, although this is not a direct argument to the interpreter. +The interpreter is structured as a (potentially recursive) function taking a thread state (``tstate``) and a stack frame (``frame``). +The function also takes an integer ``throwflag``, which is used by the implementation of ``generator.throw()``. +It returns a new reference to a Python object (``PyObject *``) or an error indicator, ``NULL``. +Since :pep:`523` this function is configurable by setting ``interp->eval_frame``; we describe only the default function, ``_PyEval_EvalFrameDefault()``. +(This function's signature has evolved and no longer matches what PEP 523 specifies; the thread state argument is added and the stack frame argument is no longer an object.) + +The interpreter finds the code object by looking in the stack frame (``frame->f_code``). +Various other items needed by the interpreter (e.g. globals and builtins) are also accessed via the stack frame. +The thread state stores exception information and a variety of other information, such as the recursion depth. +The thread state is also used to access per-interpreter state (``tstate->interp``) and per-runtime (i.e., truly global) state (``tstate->interp->runtime``). + +Note the slightly confusing terminology here. +"Interpreter" refers to the bytecode interpreter, a recursive function. +"Interpreter state" refers to state shared by threads, each of which may be running its own bytecode interpreter. +A single process may even host multiple interpreters, each with their own interpreter state, but sharing runtime state. +The topic of multiple interpreters is covered by several PEPs, notably :pep:`684`, :pep:`630`, and :pep:`554` (and more coming). +The current document focuses on the bytecode interpreter. + +Code objects +============ + +The interpreter uses as its starting point a code object (```frame->f_code``). 
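+
+A hedged sketch (ours; the local variable names are illustrative, not CPython's) of how the interpreter reaches the code object and the other state mentioned above::
+
+    PyCodeObject *code = frame->f_code;            // the starting point
+    PyObject *globals = frame->f_globals;          // module namespace
+    PyObject *builtins = frame->f_builtins;        // builtins namespace
+    PyInterpreterState *interp = tstate->interp;   // per-interpreter state
+    _PyRuntimeState *runtime = interp->runtime;    // per-runtime (global) state
+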
+Code objects contain many fields used by the interpreter, as well as some for use by debuggers and other tools. +In 3.11, the final field of a code object is an array of indeterminate length containing the bytecode, ``code->co_code_adaptive``. +(In previous versions the code object was a ``bytes`` object, ``code->co_code``; it was changed to save an allocation and to allow it to be mutated.) + +Code objects are typically produced by the bytecode :ref:``compiler``, although often they are written to disk by one process and read back in by another. +The disk version of a code object is serialized using the `marshal protocol `_. +Some code objects are pre-loaded into the interpreter using ``Tools/scripts/deepfreeze.py``, which writes ``Python/deepfreeze/deepfreeze.c``. + +Code objects are nominally immutable. +Some fields (including ``co_code_adaptive``) are mutable, but mutable fields are not included when code objects are hashed or compared. + +Instruction decoding +==================== + +The first task of the interpreter is to decode the bytecode instructions. +Bytecode is stored as an array of 16-bit code units (``_Py_CODEUNIT``). +Each code unit contains an 8-bit ``opcode`` and an 8-bit argument (``oparg``), both unsigned. +In order to make the bytecode format independent of the machine architecture when stored on disk, ``opcode`` is always the first byte and ``oparg`` is always the second byte. +Macros are used to extract the ``opcode`` and ``oparg`` from a code unit (``_Py_OPCODE(word)`` and ``_Py_OPARG(word)``). +Some instructions (e.g. ``NOP`` or ``POP_TOP``) have no argument -- in this case we ignore ``oparg``. + +A simple instruction decoding loop would look like this:: + + _Py_CODEUNIT *first_instr = code->co_code_adaptive; + _Py_CODEUNIT *next_instr = first_instr; + while (1) { + _Py_CODEUNIT word = *next_instr; + unsigned char opcode = _Py_OPCODE(word); + unsigned char oparg = _Py_OPARG(word); + next_instr++; + + switch (opcode) { + case NOP: + break; + + // ... A case for each known opcode ... + + default: + PyErr_SetString(PyExc_SystemError, "unknown opcode"); + return NULL; + } + } + +This format supports 256 different opcodes, which is sufficient. +However, it also limits ``oparg`` to 8-bit values, which is not. +To overcome this, the ``EXTENDED_ARG`` opcode allows us to prefix any instruction with one or more additional data bytes. +For example, this sequence of code units:: + + EXTENDED_ARG 1 + EXTENDED_ARG 0 + LOAD_CONST 0 + +would set ``opcode`` to ``LOAD_CONST`` and ``oparg`` to ``65536`` (i.e., ``2**16``). +The compiler should limit itself to at most three ``EXTENDED_ARG`` prefixes, to allow the resulting ``oparg`` to fit in 32 bits, but the interpreter does not check this. +A series of code units starting with ``EXTENDED_ARG`` is called a complete instruction, to distinguish it from code unit, which is always two bytes. + +If we allow ourselves the use of ``goto``, the decoding loop (still far from realistic) could look like this:: + + _Py_CODEUNIT *first_instr = code->co_code_adaptive; + _Py_CODEUNIT *next_instr = first_instr; + while (1) { + _Py_CODEUNIT word = *next_instr; + unsigned char opcode = _Py_OPCODE(word); + unsigned int oparg = _Py_OPARG(word); + next_instr++; + + dispatch_opcode: + switch (opcode) { + case NOP: + break; + + // ... A case for each known opcode ... 
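+
+            // An illustrative extra case (ours, not the real interpreter
+            // loop): a relative jump simply adjusts next_instr; see the
+            // "Jumps" section below.
+            case JUMP_FORWARD:
+                next_instr += oparg;
+                break;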
+ + case EXTENDED_ARG: + word = *next_instr; + opcode = _Py_OPCODE(word); + oparg *= 256; + oparg += _Py_OPARG(word); + next_instr++; + goto dispatch_opcode; + + default: + PyErr_SetString(PyExc_SystemError, "unknown opcode"); + return NULL; + } + } + +Jumps +===== + +Note that in the switch statement, ``next_instr`` (the "instruction offset") already points to the next instruction. +Thus, jump instructions can be implemented by manipulating ``next_instr``: + +- An absolute jump (``JUMP_ABSOLUTE``) sets ``next_instr = first_instr + oparg``. +- A relative jump forward (``JUMP_FORWARD``) sets ``next_instr += oparg``. +- A relative jump backward sets ``next_instr -= oparg``. + +A relative jump whose ``oparg`` is zero is a no-op. + +Inline cache entries +==================== + +Some (usually specialized) instructions have an associated "inline cache". +The inline cache consists of one or more two-byte entries included in the bytecode array. +The size of the inline cache for a particular instruction is fixed by its ``opcode`` alone. +Cache entries are reserved by the compiler and initialized with zeros. +If an instruction has an inline cache, the layout of its cache can be described by a ``struct`` definition and the address of the cache is given by casting ``next_instr`` to a pointer to the cache ``struct``. +The size of such a ``struct`` must be independent of the machine architecture and word size. +Even though inline cache entries are represented by code units, they do not have to conform to the ``opcode``/``oparg`` format. + +The instruction implementation is responsible for advancing ``next_instr`` past the inline cache. +For example, if an instruction's inline cache is four bytes (two code units) in size, the code for the instruction must contain ``next_instr += 2;``. +This is equivalent to a relative forward jump by that many code units. + +Serializing non-zero cache entries would present a problem because the serialization (``marshal``) format must be independent of the machine byte order. + +More information about the use of inline caches can be found in :pep:`659` (search for "ancillary data"). + +The evaluation stack +==================== + +Apart from unconditional jumps, almost all instructions read or write some data in the form of object references (``PyObject *``). +The CPython bytecode interpreter is a stack machine, meaning that it operates by pushing data onto and popping it off the stack. +For example, the "add" instruction (which used to be called ``BINARY_ADD`` but is now ``BINARY_OP 0``) pops two objects off the stack and pushes the result back onto the stack. +An interesting property of the CPython bytecode interpreter is that the stack size required to evaluate a given function is known in advance. +The stack size is computed by the bytecode compiler and is stored in ``code->co_stacksize``. +The interpreter uses this information to allocate stack. + +The stack grows up in memory; the operation ``PUSH(x)`` is equivalent to ``*stack_pointer++ = x``, whereas ``x = POP()`` means ``x = *--stack_pointer``. +There is no overflow or underflow check (except when compiled in debug mode) -- it would be too expensive, so we really trust the compiler. + +At any point during execution, the stack level is knowable based on the instruction pointer alone, and some properties of each item on the stack are also known. +In particular, only a few instructions may push a ``NULL`` onto the stack, and the positions that may be ``NULL`` are known. 
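+
+For example, in 3.11 ``LOAD_GLOBAL`` pushes an extra ``NULL`` before the value it loads when the low bit of its ``oparg`` is set (a calling convention used by ``CALL``); a hedged sketch (ours)::
+
+    if (oparg & 1) {
+        PUSH(NULL);   // reserved slot, consumed later by CALL
+    }
+    PUSH(value);      // 'value' is the global just looked up (illustrative)
+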
+A few other instructions (``GET_ITER``, ``FOR_ITER``) push or pop an object that is known to be an interator. + +Do not confuse the evaluation stack with the call stack, which is used to implement calling and returning from functions. + +Error handling +============== + +When an instruction like encounters an error, an exception is raised. +At this point a traceback entry is added to the exception (by ``PyTraceBack_Here()``) and cleanup is performed. +In the simplest case (absent any ``try`` blocks) this results in the remaining objects being popped off the evaluation stack and their reference count (if not ``NULL``) decremented. +Then the interpreter function (``_PyEval_EvalFrameDefault()``) returns ``NULL``. + +However, if an exception is raised in a ``try`` block, the interpreter must jump to the corresponding ``except`` or ``finally`` block. +In 3.10 and before there was a separate "block stack" which was used to keep track of nesting ``try`` blocks. +In 3.11 this mechanism has been replaced by a statically generated table, `code->co_exceptiontable``. +The advantage of this approach is that entering and leaving a ``try`` block normally does not execute any code, making execution faster. +But of course the table needs to be generated by the compiler, and decoded (by ``get_exception_handler``) when an exception happens. +(A Python version of the decoder exists as ``_parse_exception_table()`` in ``dis.py``.) + +Python-to-Python calls +====================== + +The ``_PyEval_EvalFrameDefault()`` function is recursive, because sometimes the interpreter calls some C function that calls back into the interpreter. +In 3.10 and before this was the case even when a Python function called another Python function: +The ``CALL`` instruction would call the ``tp_call`` dispatch function of the callee, which would extract the code object, create a new frame for the call stack, and then call back into the interpreter. +This approach is very general but consumes several C stack frames for each nested Python call, thereby increasing the risk of an (unrecoverable) C stack overflow. + +In 3.11 the ``CALL`` instruction special-cases function objects to "inline" the call. +When a call gets inlined, a new frame gets pushed onto the call stack and the interpreter "jumps" to the start of the callee's bytecode. +When the callee executes a ``RETURN_VALUE`` instruction, the frame is popped off the call stack and the interpreter returns to the caller. +There is a flag in the frame (``frame->is_entry``) that indicates whether the frame was inlined. +If ``RETURN_VALUE`` returns to a caller where this flag is set, it performs the usual cleanup and return from ``_PyEval_EvalFrameDefault()``. + +A similar check is performed when an unhandled exception occurs. + +The call stack +============== + +XXX From 8b73280c3ab84ec05ddd7c98d740d9146ae6d56f Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Fri, 2 Sep 2022 22:04:43 -0700 Subject: [PATCH 02/12] Exc.table, loc.table, a bit about variables, and TODO stuff --- internals/interpreter.rst | 80 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 78 insertions(+), 2 deletions(-) diff --git a/internals/interpreter.rst b/internals/interpreter.rst index a6c62fb207..5cdba3e9b3 100644 --- a/internals/interpreter.rst +++ b/internals/interpreter.rst @@ -190,7 +190,57 @@ In 3.10 and before there was a separate "block stack" which was used to keep tra In 3.11 this mechanism has been replaced by a statically generated table, `code->co_exceptiontable``. 
The advantage of this approach is that entering and leaving a ``try`` block normally does not execute any code, making execution faster. But of course the table needs to be generated by the compiler, and decoded (by ``get_exception_handler``) when an exception happens. -(A Python version of the decoder exists as ``_parse_exception_table()`` in ``dis.py``.) + +Exception table format +---------------------- + +The table is conceptually a list of records, each containing four variable-length integer fields (in a unique format, see below): + +- start: start of `try` block, in code units from the start of the bytecode +- length: size of the `try` block, in code units +- target: start of the first instruction of the `except` or `finally` block, in code units from the start of the bytecode +- depth_and_lasti: the low bit gives the "lasti" flag, the remaining bits give the stack depth + +The stack depth is used to clean up evaluation stack entries above this depth. +The "lasti" flag indicates whether, after stack cleanup, the instruction offset of the raising instruction should be pushed. +For more information on the design, see the file ``Objects/exception_handling_notes.txt``. + +Each varint is encoded as one or more bytes. +The high bit (bit 7) is reserved for random access -- it is set for the first varint of a record. +The second bit (bit 6) indicates whether this is the last byte or not -- it is set for all but the last bytes of a varint. +The low 6 bits (bits 0-5) are used for the integer value, in big-endian order. + +To find the table entry (if any) for a given instruction offset, we can use bisection without decoding the whole table. +We bisect the raw bytes, at each probe finding the start of the record by scanning back for a byte with the high bit set, and then decode the first varint. +See ``get_exception_handler()`` for the exact code (like all bisection algorithms, the code is a bit subtle). + +The locations table +------------------- + +Whenever an exception is raised, we add a traceback entry to the exception. +The ``tb_lineno`` field of a traceback entry must be set to the line number of the instruction that raised it. +This field is computed from the locations table, ``co_linetable`` (this name is an understatement), using ``PyCode_Addr2Line()``. +This table has an entry for every instruction rather than for every ``try`` block, so a compact format is very important. + +The full design of the 3.11 locations table is written up in ``Objects/locations.md``. +While there are rumors that this file is slightly out of date, it is still the best reference we have. +Don't be confused by ``lnotab_notes.txt``, which describes the 3.10 format. +For backwards compatibility this format is still supported by the ``co_lnotab`` property. + +The 3.11 location table format is different because it stores not just the starting line number for each instruction, but also the end line number, *and* the start and end column numbers. +Note that traceback objects don't store all this information -- they store the start line number, for backward compatibility, and the "last instruction" value. +The rest can be computed from the last instruction (``tb_lasti``) with the help of the locations table. +For Python code, a convenient method exists, ``co_positions()``, which returns an iterator of *(line, endline, column, endcolumn)* tuples, one per instruction. +There is also ``co_lines()`` which returns an interator of *(start, end, line)* tuples, where *start* and *end* are bytecode offsets. 
+The latter is described by :pep:`626`. +It is more compact, but doesn't return end line numbers or column offsets. +For C code, you have to call ``PyCode_Addr2Location()``. + +Fortunately, the locations table is only consulted by exception handling (to set ``tb_lineno``) and by tracing (to pass the line number to the tracing function). +In order to reduce the overhead during tracing, the mapping from instruction offset to linenumber is cached in the ``_co_linearray`` field. + +XXX exception chaining +---------------------- Python-to-Python calls ====================== @@ -211,4 +261,30 @@ A similar check is performed when an unhandled exception occurs. The call stack ============== -XXX +XXX Also frame layout and use, locals "plus" + +All sorts of variables +====================== + +The bytecode compiler determines for each variable name in which scope it is defined and generates instructions accordingly. +For example, loading a local variable onto the stack is done using ``LOAD_FAST``, while loading a global is done using ``LOAD_GLOBAL``. +The key types of variables are: +- fast locals: used in functions +- (slow or regular) locals: used in classes and at the top level +- globals and builtins: the compiler does not distinguish between globals and builtins (though the specializing interpreter does) +- cells -- used for nonlocal references + +XXX More + +XXX Getting variable names + +XXX More +======== + +- co_consts, co_names, co_varnames, and their ilk +- How calls work (how args are transferred, return, exceptions) +- Generators, async functions, async generators, and ``yield from`` (next, send, throw, close; and await; and how this code breaks the interpreter abstraction) +- Eval breaker (interrupts, GIL) +- Tracing +- Setting the current lineno (debugger-induced jumps) +- Specialization, inline caches etc. From e4d0641d2ed3995b5bd876a0c64bae9d992b7d82 Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Sat, 3 Sep 2022 15:45:04 -0700 Subject: [PATCH 03/12] Shorten instr.decoding; dd exc.chaining; various tweaks --- internals/interpreter.rst | 82 +++++++++++++++------------------------ 1 file changed, 32 insertions(+), 50 deletions(-) diff --git a/internals/interpreter.rst b/internals/interpreter.rst index 5cdba3e9b3..4abb9db267 100644 --- a/internals/interpreter.rst +++ b/internals/interpreter.rst @@ -39,12 +39,12 @@ The current document focuses on the bytecode interpreter. Code objects ============ -The interpreter uses as its starting point a code object (```frame->f_code``). +The interpreter uses as its starting point a code object (``frame->f_code``). Code objects contain many fields used by the interpreter, as well as some for use by debuggers and other tools. In 3.11, the final field of a code object is an array of indeterminate length containing the bytecode, ``code->co_code_adaptive``. (In previous versions the code object was a ``bytes`` object, ``code->co_code``; it was changed to save an allocation and to allow it to be mutated.) -Code objects are typically produced by the bytecode :ref:``compiler``, although often they are written to disk by one process and read back in by another. +Code objects are typically produced by the bytecode :ref:`compiler`, although often they are written to disk by one process and read back in by another. The disk version of a code object is serialized using the `marshal protocol `_. Some code objects are pre-loaded into the interpreter using ``Tools/scripts/deepfreeze.py``, which writes ``Python/deepfreeze/deepfreeze.c``. 
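+
+A hedged sketch (ours) of the marshal round trip from C (both functions are declared in ``marshal.h``)::
+
+    PyObject *blob = PyMarshal_WriteObjectToString((PyObject *)code, Py_MARSHAL_VERSION);
+    // ... the bytes can be written to disk and read back by another process ...
+    PyObject *code2 = PyMarshal_ReadObjectFromString(
+        PyBytes_AS_STRING(blob), PyBytes_GET_SIZE(blob));
+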
@@ -57,7 +57,7 @@ Instruction decoding The first task of the interpreter is to decode the bytecode instructions. Bytecode is stored as an array of 16-bit code units (``_Py_CODEUNIT``). Each code unit contains an 8-bit ``opcode`` and an 8-bit argument (``oparg``), both unsigned. -In order to make the bytecode format independent of the machine architecture when stored on disk, ``opcode`` is always the first byte and ``oparg`` is always the second byte. +In order to make the bytecode format independent of the machine byte order when stored on disk, ``opcode`` is always the first byte and ``oparg`` is always the second byte. Macros are used to extract the ``opcode`` and ``oparg`` from a code unit (``_Py_OPCODE(word)`` and ``_Py_OPARG(word)``). Some instructions (e.g. ``NOP`` or ``POP_TOP``) have no argument -- in this case we ignore ``oparg``. @@ -66,21 +66,11 @@ A simple instruction decoding loop would look like this:: _Py_CODEUNIT *first_instr = code->co_code_adaptive; _Py_CODEUNIT *next_instr = first_instr; while (1) { - _Py_CODEUNIT word = *next_instr; + _Py_CODEUNIT word = *next_instr++; unsigned char opcode = _Py_OPCODE(word); - unsigned char oparg = _Py_OPARG(word); - next_instr++; - + unsigned int oparg = _Py_OPARG(word); switch (opcode) { - case NOP: - break; - - // ... A case for each known opcode ... - - default: - PyErr_SetString(PyExc_SystemError, "unknown opcode"); - return NULL; - } + // ... A case for each opcode ... } This format supports 256 different opcodes, which is sufficient. @@ -92,45 +82,21 @@ For example, this sequence of code units:: EXTENDED_ARG 0 LOAD_CONST 0 -would set ``opcode`` to ``LOAD_CONST`` and ``oparg`` to ``65536`` (i.e., ``2**16``). +would set ``opcode`` to ``LOAD_CONST`` and ``oparg`` to ``65536`` (i.e., ``0x1_00_00``). The compiler should limit itself to at most three ``EXTENDED_ARG`` prefixes, to allow the resulting ``oparg`` to fit in 32 bits, but the interpreter does not check this. A series of code units starting with ``EXTENDED_ARG`` is called a complete instruction, to distinguish it from code unit, which is always two bytes. +The following loop, to be inserted just above the ``switch`` statement, will make it decode a complete instruction:: -If we allow ourselves the use of ``goto``, the decoding loop (still far from realistic) could look like this:: - - _Py_CODEUNIT *first_instr = code->co_code_adaptive; - _Py_CODEUNIT *next_instr = first_instr; - while (1) { - _Py_CODEUNIT word = *next_instr; - unsigned char opcode = _Py_OPCODE(word); - unsigned int oparg = _Py_OPARG(word); - next_instr++; - - dispatch_opcode: - switch (opcode) { - case NOP: - break; - - // ... A case for each known opcode ... - - case EXTENDED_ARG: - word = *next_instr; - opcode = _Py_OPCODE(word); - oparg *= 256; - oparg += _Py_OPARG(word); - next_instr++; - goto dispatch_opcode; - - default: - PyErr_SetString(PyExc_SystemError, "unknown opcode"); - return NULL; - } + while (opcode == EXTENDED_ARG) { + word = *next_instr++; + opcode = _Py_OPCODE(word); + oparg = (oparg << 8) | _Py_OPARG(word); } Jumps ===== -Note that in the switch statement, ``next_instr`` (the "instruction offset") already points to the next instruction. +Note that when the switch statement is reached, ``next_instr`` (the "instruction offset") already points to the next instruction. Thus, jump instructions can be implemented by manipulating ``next_instr``: - An absolute jump (``JUMP_ABSOLUTE``) sets ``next_instr = first_instr + oparg``. 
@@ -148,11 +114,12 @@ The size of the inline cache for a particular instruction is fixed by its ``opco Cache entries are reserved by the compiler and initialized with zeros. If an instruction has an inline cache, the layout of its cache can be described by a ``struct`` definition and the address of the cache is given by casting ``next_instr`` to a pointer to the cache ``struct``. The size of such a ``struct`` must be independent of the machine architecture and word size. -Even though inline cache entries are represented by code units, they do not have to conform to the ``opcode``/``oparg`` format. +Even though inline cache entries are represented by code units, they do not have to conform to the ``opcode`` / ``oparg`` format. The instruction implementation is responsible for advancing ``next_instr`` past the inline cache. For example, if an instruction's inline cache is four bytes (two code units) in size, the code for the instruction must contain ``next_instr += 2;``. This is equivalent to a relative forward jump by that many code units. +(The proper way to code this is ``JUMPBY(n)``, where ``n`` is the number of code units to jump, typically given as a named constant.) Serializing non-zero cache entries would present a problem because the serialization (``marshal``) format must be independent of the machine byte order. @@ -175,6 +142,12 @@ At any point during execution, the stack level is knowable based on the instruct In particular, only a few instructions may push a ``NULL`` onto the stack, and the positions that may be ``NULL`` are known. A few other instructions (``GET_ITER``, ``FOR_ITER``) push or pop an object that is known to be an interator. +Instruction sequences that do not allow statically knowing the stack depth are deemed illegal (and never generated by the bytecode compiler). +For example, the following sequence is illegal, because it keeps pushing items on the stack:: + + LOAD_FAST 0 + JUMP_BACKWARD 2 + Do not confuse the evaluation stack with the call stack, which is used to implement calling and returning from functions. Error handling @@ -239,8 +212,17 @@ For C code, you have to call ``PyCode_Addr2Location()``. Fortunately, the locations table is only consulted by exception handling (to set ``tb_lineno``) and by tracing (to pass the line number to the tracing function). In order to reduce the overhead during tracing, the mapping from instruction offset to linenumber is cached in the ``_co_linearray`` field. -XXX exception chaining ----------------------- +Exception chaining +------------------ + +When an exception is raised during exception handling, the new exception is chained to the old one. +This is done by making the ``__context__`` field of the new exception point to the old one. +This is the responsibility of ``_PyErr_SetObject()`` (which is ultimately called by all ``PyErr_Set*()`` functions). +Separately, if a statement of the form ``raise X from Y`` is executed, the ``__cause__`` field of the raised exception (``X``) is set to ``Y``. +This is done by ``PyException_SetCause()``, called in response to all ``RAISE_VARARGS`` instructions. +A special case is ``raise X from None``, which sets the ``__cause__`` field to ``None`` (at the C level, it sets ``cause`` to ``NULL``). 
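+
+A hedged C sketch (ours; simplified from ``do_raise()`` in ``Python/ceval.c``, with all error checking omitted) of how the cause is attached::
+
+    static void
+    set_cause_and_raise(PyThreadState *tstate, PyObject *exc, PyObject *cause)
+    {
+        // 'cause' may be NULL (the `raise X from None` case);
+        // PyException_SetCause steals a reference to it.
+        PyException_SetCause(exc, cause);
+        _PyErr_SetObject(tstate, (PyObject *)Py_TYPE(exc), exc);
+    }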
+ +XXX Other exception details Python-to-Python calls ====================== From bbd233b1581c83e0708f5ba5ffb0272323872260 Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Sat, 3 Sep 2022 16:57:47 -0700 Subject: [PATCH 04/12] A bit on the frame stack --- internals/interpreter.rst | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/internals/interpreter.rst b/internals/interpreter.rst index 4abb9db267..3dc3f80e07 100644 --- a/internals/interpreter.rst +++ b/internals/interpreter.rst @@ -93,6 +93,8 @@ The following loop, to be inserted just above the ``switch`` statement, will mak oparg = (oparg << 8) | _Py_OPARG(word); } +For various reasons the actual decoding code is more complicated; we'll get to the reasons. + Jumps ===== @@ -235,7 +237,7 @@ This approach is very general but consumes several C stack frames for each neste In 3.11 the ``CALL`` instruction special-cases function objects to "inline" the call. When a call gets inlined, a new frame gets pushed onto the call stack and the interpreter "jumps" to the start of the callee's bytecode. When the callee executes a ``RETURN_VALUE`` instruction, the frame is popped off the call stack and the interpreter returns to the caller. -There is a flag in the frame (``frame->is_entry``) that indicates whether the frame was inlined. +There is a flag in the frame (``frame->is_entry``) that indicates whether the frame was inlined (set if it wasn't). If ``RETURN_VALUE`` returns to a caller where this flag is set, it performs the usual cleanup and return from ``_PyEval_EvalFrameDefault()``. A similar check is performed when an unhandled exception occurs. @@ -243,6 +245,29 @@ A similar check is performed when an unhandled exception occurs. The call stack ============== +Up through 3.10 the call stack used to be implemented as a singly-linked list of ``PyFrameObject``s. +This was expensive because each call would require a heap allocation for the stack frame. +(There was some optimization using a free list, but this was not always effective, because frames are variable length.) + +In 3.11 frames are no longer fully-fledged objects. +Instead, a leaner internal ``_PyInterpreterFrame`` structure is used, which is allocated using a custom allocator, ``_PyThreadState_BumpFramePointer()``. +Usually a frame allocation is just a pointer bump, which improves memory locality. +The function ``_PyEvalFramePushAndInit()`` allocates and initializes a frame structure. + +Sometimes an actual ``PyFrameObject`` is needed, usually because some Python code calls ``sys._getframe()`` or an extension module calls ``PyEval_GetFrame()``. +In this case we allocate a proper ``PyFrameObject`` and initialize it from the ``_PyInterpreterFrame``. +This would be a pessimization, but fortunately this happens rarely (introspecting frames is not a common operation). + +Things get more complicated when generators are involved, since those don't follow the push/pop model. +(The same applies to async functions, which are implemented using the same infrastructure.) +A generator object has space for a ``_PyInterpreterFrame`` structure, including the variable-size part (used for locals and eval stack). +When a generator (or async) function is first called, a special opcode, ``RETURN_GENERATOR`` is executed, which is responsible for creating the generator object. +The generator object's ``_PyInterpreterFrame`` is initialized with a copy of the current stack frame. 
+The current stack frame is then popped off the stack and the generator object is returned. +(Details differ depending on the ``is_entry`` flag.) +When the generator is resumed, the interpreter pushes the ``_PyInterpreterFrame`` onto the stack and resumes execution. +(There is more hairiness for generators and their ilk, we'll discuss these in a later section in more detail.) + XXX Also frame layout and use, locals "plus" All sorts of variables From 772c2138d02e338796d11b729a720de983eca8c8 Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Sat, 14 Jan 2023 11:13:06 -0800 Subject: [PATCH 05/12] Tie a bow on it and accept that it's not going to be finished right now --- internals/interpreter.rst | 42 +++++++++++++++++++++------------------ 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/internals/interpreter.rst b/internals/interpreter.rst index 3dc3f80e07..30d7be4c5d 100644 --- a/internals/interpreter.rst +++ b/internals/interpreter.rst @@ -18,7 +18,7 @@ Introduction The bytecode interpreter's job is to execute Python code. Its main input is a code object, although this is not a direct argument to the interpreter. -The interpreter is structured as a (potentially recursive) function taking a thread state (``tstate``) and a stack frame (``frame``). +The interpreter is structured as a (recursive) function taking a thread state (``tstate``) and a stack frame (``frame``). The function also takes an integer ``throwflag``, which is used by the implementation of ``generator.throw()``. It returns a new reference to a Python object (``PyObject *``) or an error indicator, ``NULL``. Since :pep:`523` this function is configurable by setting ``interp->eval_frame``; we describe only the default function, ``_PyEval_EvalFrameDefault()``. @@ -71,6 +71,7 @@ A simple instruction decoding loop would look like this:: unsigned int oparg = _Py_OPARG(word); switch (opcode) { // ... A case for each opcode ... + } } This format supports 256 different opcodes, which is sufficient. @@ -93,7 +94,7 @@ The following loop, to be inserted just above the ``switch`` statement, will mak oparg = (oparg << 8) | _Py_OPARG(word); } -For various reasons the actual decoding code is more complicated; we'll get to the reasons. +For various reasons (mostly efficiency, given that ``EXTENDED_ARG`` is rare) the actual code is different; we'll get to the reasons. Jumps ===== @@ -110,16 +111,18 @@ A relative jump whose ``oparg`` is zero is a no-op. Inline cache entries ==================== -Some (usually specialized) instructions have an associated "inline cache". +Some (specialized or specializable) instructions have an associated "inline cache". The inline cache consists of one or more two-byte entries included in the bytecode array. The size of the inline cache for a particular instruction is fixed by its ``opcode`` alone. +Moreover, the inline cache size for a family of specialized/specializable instructions (e.g., ``LOAD_ATTR``, ``LOAD_ATTR_SLOT``, ``LOAD_ATTR_MODULE``) must all be the same. Cache entries are reserved by the compiler and initialized with zeros. If an instruction has an inline cache, the layout of its cache can be described by a ``struct`` definition and the address of the cache is given by casting ``next_instr`` to a pointer to the cache ``struct``. -The size of such a ``struct`` must be independent of the machine architecture and word size. +The size of such a ``struct`` must be independent of the machine architecture, word size and alignment requirements. 
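+
+For example, a cache layout might be described by a ``struct`` like this hedged sketch (ours, loosely modeled on the caches in ``pycore_code.h``; real layouts vary per instruction family)::
+
+    typedef struct {
+        _Py_CODEUNIT counter;     // countdown used by the specializing interpreter
+        _Py_CODEUNIT version[2];  // a 32-bit version tag, split over two code units
+        _Py_CODEUNIT index;       // a 16-bit index
+    } _PyExampleCache;            // hypothetical name, not CPython's
+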
+For 32-bit fields, the ``struct`` should use ``_Py_CODEUNIT field[2]``. Even though inline cache entries are represented by code units, they do not have to conform to the ``opcode`` / ``oparg`` format. The instruction implementation is responsible for advancing ``next_instr`` past the inline cache. -For example, if an instruction's inline cache is four bytes (two code units) in size, the code for the instruction must contain ``next_instr += 2;``. +For example, if an instruction's inline cache is four bytes (i.e., two code units) in size, the code for the instruction must contain ``next_instr += 2;``. This is equivalent to a relative forward jump by that many code units. (The proper way to code this is ``JUMPBY(n)``, where ``n`` is the number of code units to jump, typically given as a named constant.) @@ -131,8 +134,8 @@ The evaluation stack ==================== Apart from unconditional jumps, almost all instructions read or write some data in the form of object references (``PyObject *``). -The CPython bytecode interpreter is a stack machine, meaning that it operates by pushing data onto and popping it off the stack. -For example, the "add" instruction (which used to be called ``BINARY_ADD`` but is now ``BINARY_OP 0``) pops two objects off the stack and pushes the result back onto the stack. +The CPython 3.11 bytecode interpreter is a stack machine, meaning that it operates by pushing data onto and popping it off the stack. +For example, the "add" instruction (which used to be called ``BINARY_ADD`` in 3.10 but is now ``BINARY_OP 0``) pops two objects off the stack and pushes the result back onto the stack. An interesting property of the CPython bytecode interpreter is that the stack size required to evaluate a given function is known in advance. The stack size is computed by the bytecode compiler and is stored in ``code->co_stacksize``. The interpreter uses this information to allocate stack. @@ -144,7 +147,8 @@ At any point during execution, the stack level is knowable based on the instruct In particular, only a few instructions may push a ``NULL`` onto the stack, and the positions that may be ``NULL`` are known. A few other instructions (``GET_ITER``, ``FOR_ITER``) push or pop an object that is known to be an interator. -Instruction sequences that do not allow statically knowing the stack depth are deemed illegal (and never generated by the bytecode compiler). +Instruction sequences that do not allow statically knowing the stack depth are deemed illegal. +The bytecode compiler never generates such sequences. For example, the following sequence is illegal, because it keeps pushing items on the stack:: LOAD_FAST 0 @@ -155,7 +159,7 @@ Do not confuse the evaluation stack with the call stack, which is used to implem Error handling ============== -When an instruction like encounters an error, an exception is raised. +When an instruction like ``BINARY_OP`` encounters an error, an exception is raised. At this point a traceback entry is added to the exception (by ``PyTraceBack_Here()``) and cleanup is performed. In the simplest case (absent any ``try`` blocks) this results in the remaining objects being popped off the evaluation stack and their reference count (if not ``NULL``) decremented. Then the interpreter function (``_PyEval_EvalFrameDefault()``) returns ``NULL``. 
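+
+A hedged sketch (ours, using the ``POP`` and ``STACK_LEVEL`` macros from ``Python/ceval.c``; the real cleanup is more involved) of that simplest case::
+
+    // pop whatever is left on the evaluation stack, then bail out
+    while (STACK_LEVEL() > 0) {
+        PyObject *o = POP();
+        Py_XDECREF(o);   // entries may be NULL, hence the X variant
+    }
+    return NULL;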
@@ -177,7 +181,7 @@ The table is conceptually a list of records, each containing four variable-lengt - depth_and_lasti: the low bit gives the "lasti" flag, the remaining bits give the stack depth The stack depth is used to clean up evaluation stack entries above this depth. -The "lasti" flag indicates whether, after stack cleanup, the instruction offset of the raising instruction should be pushed. +The "lasti" flag indicates whether, after stack cleanup, the instruction offset of the raising instruction should be pushed (as a ``PyLongObject *``). For more information on the design, see the file ``Objects/exception_handling_notes.txt``. Each varint is encoded as one or more bytes. @@ -209,7 +213,7 @@ For Python code, a convenient method exists, ``co_positions()``, which returns a There is also ``co_lines()`` which returns an interator of *(start, end, line)* tuples, where *start* and *end* are bytecode offsets. The latter is described by :pep:`626`. It is more compact, but doesn't return end line numbers or column offsets. -For C code, you have to call ``PyCode_Addr2Location()``. +From C code, you have to call ``PyCode_Addr2Location()``. Fortunately, the locations table is only consulted by exception handling (to set ``tb_lineno``) and by tracing (to pass the line number to the tracing function). In order to reduce the overhead during tracing, the mapping from instruction offset to linenumber is cached in the ``_co_linearray`` field. @@ -224,7 +228,7 @@ Separately, if a statement of the form ``raise X from Y`` is executed, the ``__c This is done by ``PyException_SetCause()``, called in response to all ``RAISE_VARARGS`` instructions. A special case is ``raise X from None``, which sets the ``__cause__`` field to ``None`` (at the C level, it sets ``cause`` to ``NULL``). -XXX Other exception details +(TODO: Other exception details.) Python-to-Python calls ====================== @@ -245,7 +249,7 @@ A similar check is performed when an unhandled exception occurs. The call stack ============== -Up through 3.10 the call stack used to be implemented as a singly-linked list of ``PyFrameObject``s. +Up through 3.10 the call stack used to be implemented as a singly-linked list of ``PyFrameObject`` objects. This was expensive because each call would require a heap allocation for the stack frame. (There was some optimization using a free list, but this was not always effective, because frames are variable length.) @@ -256,7 +260,7 @@ The function ``_PyEvalFramePushAndInit()`` allocates and initializes a frame str Sometimes an actual ``PyFrameObject`` is needed, usually because some Python code calls ``sys._getframe()`` or an extension module calls ``PyEval_GetFrame()``. In this case we allocate a proper ``PyFrameObject`` and initialize it from the ``_PyInterpreterFrame``. -This would be a pessimization, but fortunately this happens rarely (introspecting frames is not a common operation). +This is a pessimization, but fortunately this happens rarely (introspecting frames is not a common operation). Things get more complicated when generators are involved, since those don't follow the push/pop model. (The same applies to async functions, which are implemented using the same infrastructure.) @@ -268,7 +272,7 @@ The current stack frame is then popped off the stack and the generator object is When the generator is resumed, the interpreter pushes the ``_PyInterpreterFrame`` onto the stack and resumes execution. 
(There is more hairiness for generators and their ilk, we'll discuss these in a later section in more detail.) -XXX Also frame layout and use, locals "plus" +(TODO: Also frame layout and use, and "locals plus".) All sorts of variables ====================== @@ -281,12 +285,12 @@ The key types of variables are: - globals and builtins: the compiler does not distinguish between globals and builtins (though the specializing interpreter does) - cells -- used for nonlocal references -XXX More +(TODO: Write the rest of this section. Alas, the author got distracted and won't have time to continue this for a while.) -XXX Getting variable names +Other topics +============ -XXX More -======== +(TODO: Each of the following probably deserves its own section.) - co_consts, co_names, co_varnames, and their ilk - How calls work (how args are transferred, return, exceptions) From 1f9544085d7b0441032928d06f9961fd7d2824be Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Sat, 14 Jan 2023 12:09:01 -0800 Subject: [PATCH 06/12] Fix 'make check' --- internals/interpreter.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/internals/interpreter.rst b/internals/interpreter.rst index 30d7be4c5d..38a321d7eb 100644 --- a/internals/interpreter.rst +++ b/internals/interpreter.rst @@ -166,7 +166,7 @@ Then the interpreter function (``_PyEval_EvalFrameDefault()``) returns ``NULL``. However, if an exception is raised in a ``try`` block, the interpreter must jump to the corresponding ``except`` or ``finally`` block. In 3.10 and before there was a separate "block stack" which was used to keep track of nesting ``try`` blocks. -In 3.11 this mechanism has been replaced by a statically generated table, `code->co_exceptiontable``. +In 3.11 this mechanism has been replaced by a statically generated table, ``code->co_exceptiontable``. The advantage of this approach is that entering and leaving a ``try`` block normally does not execute any code, making execution faster. But of course the table needs to be generated by the compiler, and decoded (by ``get_exception_handler``) when an exception happens. @@ -175,9 +175,9 @@ Exception table format The table is conceptually a list of records, each containing four variable-length integer fields (in a unique format, see below): -- start: start of `try` block, in code units from the start of the bytecode -- length: size of the `try` block, in code units -- target: start of the first instruction of the `except` or `finally` block, in code units from the start of the bytecode +- start: start of ``try`` block, in code units from the start of the bytecode +- length: size of the ``try`` block, in code units +- target: start of the first instruction of the ``except`` or ``finally`` block, in code units from the start of the bytecode - depth_and_lasti: the low bit gives the "lasti" flag, the remaining bits give the stack depth The stack depth is used to clean up evaluation stack entries above this depth. From 78faee2dec85df07d5f8de3433d68b47c8bd4330 Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Mon, 16 Jan 2023 09:00:40 -0800 Subject: [PATCH 07/12] Apply nearly all of CAM's improvement (And one by HvK) Co-authored-by: C.A.M. 
Gerlach Co-authored-by: Hugo van Kemenade --- internals/index.rst | 2 +- internals/interpreter.rst | 112 ++++++++++++++++++++------------------ 2 files changed, 59 insertions(+), 55 deletions(-) diff --git a/internals/index.rst b/internals/index.rst index 335ac8d350..1611135609 100644 --- a/internals/index.rst +++ b/internals/index.rst @@ -8,5 +8,5 @@ CPython's Internals exploring parser compiler - garbage-collector interpreter + garbage-collector diff --git a/internals/interpreter.rst b/internals/interpreter.rst index 38a321d7eb..10e92cf747 100644 --- a/internals/interpreter.rst +++ b/internals/interpreter.rst @@ -1,27 +1,27 @@ .. _interpreter: -===================================== -The CPython 3.11 Bytecode Interpreter -===================================== +=============================== +The Bytecode Interpreter (3.11) +=============================== -.. highlight:: none +.. highlight:: c Preface ======= The CPython 3.11 bytecode interpreter (a.k.a. virtual machine) has a number of improvements over 3.10. We describe the inner workings of the 3.11 interpreter here, with an emphasis on understanding not just the code but its design. -While the interpreter is forever evolving, and the 3.12 design will undoubtedly be different again, understanding the 3.11 design will help you understand future improvements to the interpreter. +While the interpreter is forever evolving, and the 3.12 design will undoubtedly be different again, knowing the 3.11 design will help you understand future improvements to the interpreter. Introduction ============ -The bytecode interpreter's job is to execute Python code. +The job of the bytecode interpreter, in :cpy-file:`Python/ceval.c`, is to execute Python code. Its main input is a code object, although this is not a direct argument to the interpreter. The interpreter is structured as a (recursive) function taking a thread state (``tstate``) and a stack frame (``frame``). -The function also takes an integer ``throwflag``, which is used by the implementation of ``generator.throw()``. +The function also takes an integer ``throwflag``, which is used by the implementation of :func:`generator.throw`. It returns a new reference to a Python object (``PyObject *``) or an error indicator, ``NULL``. -Since :pep:`523` this function is configurable by setting ``interp->eval_frame``; we describe only the default function, ``_PyEval_EvalFrameDefault()``. +Per :pep:`523`, this function is configurable by setting ``interp->eval_frame``; we describe only the default function, ``_PyEval_EvalFrameDefault()``. (This function's signature has evolved and no longer matches what PEP 523 specifies; the thread state argument is added and the stack frame argument is no longer an object.) The interpreter finds the code object by looking in the stack frame (``frame->f_code``). @@ -33,20 +33,20 @@ Note the slightly confusing terminology here. "Interpreter" refers to the bytecode interpreter, a recursive function. "Interpreter state" refers to state shared by threads, each of which may be running its own bytecode interpreter. A single process may even host multiple interpreters, each with their own interpreter state, but sharing runtime state. -The topic of multiple interpreters is covered by several PEPs, notably :pep:`684`, :pep:`630`, and :pep:`554` (and more coming). +The topic of multiple interpreters is covered by several PEPs, notably :pep:`684`, :pep:`630`, and :pep:`554` (with more coming). The current document focuses on the bytecode interpreter. 
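+
+For reference, a hedged sketch (ours) of the default function's signature as of 3.11:
+
+.. code-block:: c
+
+   PyObject *
+   _PyEval_EvalFrameDefault(PyThreadState *tstate,
+                            _PyInterpreterFrame *frame, int throwflag);
+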
Code objects ============ -The interpreter uses as its starting point a code object (``frame->f_code``). +The interpreter uses a code object (``frame->f_code``) as its starting point. Code objects contain many fields used by the interpreter, as well as some for use by debuggers and other tools. In 3.11, the final field of a code object is an array of indeterminate length containing the bytecode, ``code->co_code_adaptive``. -(In previous versions the code object was a ``bytes`` object, ``code->co_code``; it was changed to save an allocation and to allow it to be mutated.) +(In previous versions the code object was a :class:`bytes` object, ``code->co_code``; it was changed to save an allocation and to allow it to be mutated.) -Code objects are typically produced by the bytecode :ref:`compiler`, although often they are written to disk by one process and read back in by another. -The disk version of a code object is serialized using the `marshal protocol `_. -Some code objects are pre-loaded into the interpreter using ``Tools/scripts/deepfreeze.py``, which writes ``Python/deepfreeze/deepfreeze.c``. +Code objects are typically produced by the bytecode :ref:`compiler `, although they are often written to disk by one process and read back in by another. +The disk version of a code object is serialized using the :mod:`marshal` protocol. +Some code objects are pre-loaded into the interpreter using :cpy-file:`Tools/scripts/deepfreeze.py`, which writes :cpy-file:`Python/deepfreeze/deepfreeze.c`. Code objects are nominally immutable. Some fields (including ``co_code_adaptive``) are mutable, but mutable fields are not included when code objects are hashed or compared. @@ -61,7 +61,9 @@ In order to make the bytecode format independent of the machine byte order when Macros are used to extract the ``opcode`` and ``oparg`` from a code unit (``_Py_OPCODE(word)`` and ``_Py_OPARG(word)``). Some instructions (e.g. ``NOP`` or ``POP_TOP``) have no argument -- in this case we ignore ``oparg``. -A simple instruction decoding loop would look like this:: +A simple instruction decoding loop would look like this: + +.. code-block:: c _Py_CODEUNIT *first_instr = code->co_code_adaptive; _Py_CODEUNIT *next_instr = first_instr; @@ -81,12 +83,14 @@ For example, this sequence of code units:: EXTENDED_ARG 1 EXTENDED_ARG 0 - LOAD_CONST 0 + LOAD_CONST 2 -would set ``opcode`` to ``LOAD_CONST`` and ``oparg`` to ``65536`` (i.e., ``0x1_00_00``). +would set ``opcode`` to ``LOAD_CONST`` and ``oparg`` to ``65538`` (i.e., ``0x1_00_02``). The compiler should limit itself to at most three ``EXTENDED_ARG`` prefixes, to allow the resulting ``oparg`` to fit in 32 bits, but the interpreter does not check this. -A series of code units starting with ``EXTENDED_ARG`` is called a complete instruction, to distinguish it from code unit, which is always two bytes. -The following loop, to be inserted just above the ``switch`` statement, will make it decode a complete instruction:: +A series of code units starting with zero to three ``EXTENDED_ARG`` opcodes followed by a primary opcode is called a complete instruction, to distinguish it from a single code unit, which is always two bytes. +The following loop, to be inserted just above the ``switch`` statement, will make the above snippet decode a complete instruction: + +.. 
code-block:: c while (opcode == EXTENDED_ARG) { word = *next_instr++; @@ -94,12 +98,12 @@ The following loop, to be inserted just above the ``switch`` statement, will mak oparg = (oparg << 8) | _Py_OPARG(word); } -For various reasons (mostly efficiency, given that ``EXTENDED_ARG`` is rare) the actual code is different; we'll get to the reasons. +For various reasons we'll get to later (mostly efficiency, given that ``EXTENDED_ARG`` is rare) the actual code is different. Jumps ===== -Note that when the switch statement is reached, ``next_instr`` (the "instruction offset") already points to the next instruction. +Note that when the ``switch`` statement is reached, ``next_instr`` (the "instruction offset") already points to the next instruction. Thus, jump instructions can be implemented by manipulating ``next_instr``: - An absolute jump (``JUMP_ABSOLUTE``) sets ``next_instr = first_instr + oparg``. @@ -126,9 +130,9 @@ For example, if an instruction's inline cache is four bytes (i.e., two code unit This is equivalent to a relative forward jump by that many code units. (The proper way to code this is ``JUMPBY(n)``, where ``n`` is the number of code units to jump, typically given as a named constant.) -Serializing non-zero cache entries would present a problem because the serialization (``marshal``) format must be independent of the machine byte order. +Serializing non-zero cache entries would present a problem because the serialization (:mod`marshal`) format must be independent of the machine byte order. -More information about the use of inline caches can be found in :pep:`659` (search for "ancillary data"). +More information about the use of inline caches :pep:`can be found in PEP 659 <659#ancillary-data>`. The evaluation stack ==================== @@ -145,7 +149,7 @@ There is no overflow or underflow check (except when compiled in debug mode) -- At any point during execution, the stack level is knowable based on the instruction pointer alone, and some properties of each item on the stack are also known. In particular, only a few instructions may push a ``NULL`` onto the stack, and the positions that may be ``NULL`` are known. -A few other instructions (``GET_ITER``, ``FOR_ITER``) push or pop an object that is known to be an interator. +A few other instructions (``GET_ITER``, ``FOR_ITER``) push or pop an object that is known to be an iterator. Instruction sequences that do not allow statically knowing the stack depth are deemed illegal. The bytecode compiler never generates such sequences. @@ -160,15 +164,15 @@ Error handling ============== When an instruction like ``BINARY_OP`` encounters an error, an exception is raised. -At this point a traceback entry is added to the exception (by ``PyTraceBack_Here()``) and cleanup is performed. -In the simplest case (absent any ``try`` blocks) this results in the remaining objects being popped off the evaluation stack and their reference count (if not ``NULL``) decremented. +At this point, a traceback entry is added to the exception (by ``PyTraceBack_Here()``) and cleanup is performed. +In the simplest case (absent any ``try`` blocks), this results in the remaining objects being popped off the evaluation stack and their reference count decremented (if not ``NULL``) . Then the interpreter function (``_PyEval_EvalFrameDefault()``) returns ``NULL``. However, if an exception is raised in a ``try`` block, the interpreter must jump to the corresponding ``except`` or ``finally`` block. 
-In 3.10 and before there was a separate "block stack" which was used to keep track of nesting ``try`` blocks. -In 3.11 this mechanism has been replaced by a statically generated table, ``code->co_exceptiontable``. +In 3.10 and before, there was a separate "block stack" which was used to keep track of nesting ``try`` blocks. +In 3.11, this mechanism has been replaced by a statically generated table, ``code->co_exceptiontable``. The advantage of this approach is that entering and leaving a ``try`` block normally does not execute any code, making execution faster. -But of course the table needs to be generated by the compiler, and decoded (by ``get_exception_handler``) when an exception happens. +But of course, this table needs to be generated by the compiler, and decoded (by ``get_exception_handler``) when an exception happens. Exception table format ---------------------- @@ -182,7 +186,7 @@ The table is conceptually a list of records, each containing four variable-lengt The stack depth is used to clean up evaluation stack entries above this depth. The "lasti" flag indicates whether, after stack cleanup, the instruction offset of the raising instruction should be pushed (as a ``PyLongObject *``). -For more information on the design, see the file ``Objects/exception_handling_notes.txt``. +For more information on the design, see :cpy-file:`Objects/exception_handling_notes.txt`. Each varint is encoded as one or more bytes. The high bit (bit 7) is reserved for random access -- it is set for the first varint of a record. @@ -191,42 +195,41 @@ The low 6 bits (bits 0-5) are used for the integer value, in big-endian order. To find the table entry (if any) for a given instruction offset, we can use bisection without decoding the whole table. We bisect the raw bytes, at each probe finding the start of the record by scanning back for a byte with the high bit set, and then decode the first varint. -See ``get_exception_handler()`` for the exact code (like all bisection algorithms, the code is a bit subtle). +See ``get_exception_handler()`` in :cpy-file:`Python/ceval.c` for the exact code (like all bisection algorithms, the code is a bit subtle). The locations table ------------------- Whenever an exception is raised, we add a traceback entry to the exception. The ``tb_lineno`` field of a traceback entry must be set to the line number of the instruction that raised it. -This field is computed from the locations table, ``co_linetable`` (this name is an understatement), using ``PyCode_Addr2Line()``. +This field is computed from the locations table, ``co_linetable`` (this name is an understatement), using :c:func:`PyCode_Addr2Line`. This table has an entry for every instruction rather than for every ``try`` block, so a compact format is very important. -The full design of the 3.11 locations table is written up in ``Objects/locations.md``. +The full design of the 3.11 locations table is written up in :cpy-file:`Objects/locations.md`. While there are rumors that this file is slightly out of date, it is still the best reference we have. -Don't be confused by ``lnotab_notes.txt``, which describes the 3.10 format. +Don't be confused by :cpy-file:`Objects/lnotab_notes.txt`, which describes the 3.10 format. For backwards compatibility this format is still supported by the ``co_lnotab`` property. The 3.11 location table format is different because it stores not just the starting line number for each instruction, but also the end line number, *and* the start and end column numbers. 
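+
+A hedged usage sketch (ours; ``code`` is assumed to be a ``PyCodeObject *`` and ``lasti`` an instruction offset in bytes) of the C-level lookup mentioned below::
+
+    int line, col, endline, endcol;
+    // fills in the start/end line and column for the instruction at 'lasti'
+    PyCode_Addr2Location(code, lasti, &line, &col, &endline, &endcol);
+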
Note that traceback objects don't store all this information -- they store the start line number, for backward compatibility, and the "last instruction" value.
The rest can be computed from the last instruction (``tb_lasti``) with the help of the locations table.

-For Python code, a convenient method exists, ``co_positions()``, which returns an iterator of *(line, endline, column, endcolumn)* tuples, one per instruction.
-There is also ``co_lines()`` which returns an interator of *(start, end, line)* tuples, where *start* and *end* are bytecode offsets.
-The latter is described by :pep:`626`.
-It is more compact, but doesn't return end line numbers or column offsets.
-From C code, you have to call ``PyCode_Addr2Location()``.
+For Python code, a convenient method exists, :meth:`~codeobject.co_positions`, which returns an iterator of :samp:`({line}, {endline}, {column}, {endcolumn})` tuples, one per instruction.
+There is also ``co_lines()`` which returns an iterator of :samp:`({start}, {end}, {line})` tuples, where :samp:`{start}` and :samp:`{end}` are bytecode offsets.
+The latter is described by :pep:`626`; it is more compact, but doesn't return end line numbers or column offsets.
+From C code, you have to call :c:func:`PyCode_Addr2Location`.

Fortunately, the locations table is only consulted by exception handling (to set ``tb_lineno``) and by tracing (to pass the line number to the tracing function).
-In order to reduce the overhead during tracing, the mapping from instruction offset to linenumber is cached in the ``_co_linearray`` field.
+In order to reduce the overhead during tracing, the mapping from instruction offset to line number is cached in the ``_co_linearray`` field.

Exception chaining
------------------

When an exception is raised during exception handling, the new exception is chained to the old one.
This is done by making the ``__context__`` field of the new exception point to the old one.
-This is the responsibility of ``_PyErr_SetObject()`` (which is ultimately called by all ``PyErr_Set*()`` functions).
-Separately, if a statement of the form ``raise X from Y`` is executed, the ``__cause__`` field of the raised exception (``X``) is set to ``Y``.
-This is done by ``PyException_SetCause()``, called in response to all ``RAISE_VARARGS`` instructions.
-A special case is ``raise X from None``, which sets the ``__cause__`` field to ``None`` (at the C level, it sets ``cause`` to ``NULL``).
+This is the responsibility of ``_PyErr_SetObject()`` in :cpy-file:`Python/errors.c` (which is ultimately called by all ``PyErr_Set*()`` functions).
+Separately, if a statement of the form :samp:`raise {X} from {Y}` is executed, the ``__cause__`` field of the raised exception (:samp:`{X}`) is set to :samp:`{Y}`.
+This is done by :c:func:`PyException_SetCause`, called in response to all ``RAISE_VARARGS`` instructions.
+A special case is `:samp:`raise {X} from None`, which sets the ``__cause__`` field to ``None`` (at the C level, it sets ``cause`` to ``NULL``).

(TODO: Other exception details.)

@@ -234,11 +237,11 @@ Python-to-Python calls
======================

The ``_PyEval_EvalFrameDefault()`` function is recursive, because sometimes the interpreter calls some C function that calls back into the interpreter.
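As a quick illustration of the exception chaining described a few paragraphs back, the C-level mechanics can be sketched roughly as follows.
This is not the actual code; ``get_handled_exception()`` is a hypothetical stand-in for the lookup that ``_PyErr_SetObject()`` performs on the thread state, and ``new_exc`` and ``cause`` are assumed to be set up by the caller::

    /* Implicit chaining: the exception currently being handled (if any)
       becomes the new exception's __context__.  PyException_SetContext
       steals a reference, so we create one to hand over. */
    PyObject *context = get_handled_exception(tstate);  /* hypothetical */
    if (context != NULL && context != new_exc) {
        PyException_SetContext(new_exc, Py_NewRef(context));
    }

    /* Explicit chaining, only for "raise X from Y" forms; for
       "raise X from None", cause is simply NULL.  PyException_SetCause
       steals its reference as well. */
    PyException_SetCause(new_exc, cause);

Now, back to the recursion: as just noted, the interpreter sometimes calls a C function that re-enters the interpreter.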
-In 3.10 and before this was the case even when a Python function called another Python function:
+In 3.10 and before, this was the case even when a Python function called another Python function:
The ``CALL`` instruction would call the ``tp_call`` dispatch function of the callee, which would extract the code object, create a new frame for the call stack, and then call back into the interpreter.
This approach is very general but consumes several C stack frames for each nested Python call, thereby increasing the risk of an (unrecoverable) C stack overflow.

-In 3.11 the ``CALL`` instruction special-cases function objects to "inline" the call.
+In 3.11, the ``CALL`` instruction special-cases function objects to "inline" the call.
When a call gets inlined, a new frame gets pushed onto the call stack and the interpreter "jumps" to the start of the callee's bytecode.
When the callee executes a ``RETURN_VALUE`` instruction, the frame is popped off the call stack and the interpreter returns to the caller.
There is a flag in the frame (``frame->is_entry``) that indicates whether the frame was inlined (set if it wasn't).
@@ -249,41 +252,42 @@ A similar check is performed when an unhandled exception occurs.

The call stack
==============

-Up through 3.10 the call stack used to be implemented as a singly-linked list of ``PyFrameObject`` objects.
+Up through 3.10, the call stack used to be implemented as a singly-linked list of :c:type:`PyFrameObject` objects.
This was expensive because each call would require a heap allocation for the stack frame.
(There was some optimization using a free list, but this was not always effective, because frames are variable length.)

-In 3.11 frames are no longer fully-fledged objects.
+In 3.11, frames are no longer fully-fledged objects.
Instead, a leaner internal ``_PyInterpreterFrame`` structure is used, which is allocated using a custom allocator, ``_PyThreadState_BumpFramePointer()``.
Usually a frame allocation is just a pointer bump, which improves memory locality.
The function ``_PyEvalFramePushAndInit()`` allocates and initializes a frame structure.

-Sometimes an actual ``PyFrameObject`` is needed, usually because some Python code calls ``sys._getframe()`` or an extension module calls ``PyEval_GetFrame()``.
+Sometimes an actual ``PyFrameObject`` is needed, usually because some Python code calls :func:`sys._getframe` or an extension module calls :c:func:`PyEval_GetFrame`.
In this case we allocate a proper ``PyFrameObject`` and initialize it from the ``_PyInterpreterFrame``.
-This is a pessimization, but fortunately this happens rarely (introspecting frames is not a common operation).
+This is a pessimization, but fortunately happens rarely (as introspecting frames is not a common operation).

Things get more complicated when generators are involved, since those don't follow the push/pop model.
(The same applies to async functions, which are implemented using the same infrastructure.)
A generator object has space for a ``_PyInterpreterFrame`` structure, including the variable-size part (used for locals and eval stack).
-When a generator (or async) function is first called, a special opcode, ``RETURN_GENERATOR`` is executed, which is responsible for creating the generator object.
+When a generator (or async) function is first called, a special opcode ``RETURN_GENERATOR`` is executed, which is responsible for creating the generator object.
The generator object's ``_PyInterpreterFrame`` is initialized with a copy of the current stack frame.
The current stack frame is then popped off the stack and the generator object is returned.
(Details differ depending on the ``is_entry`` flag.)
When the generator is resumed, the interpreter pushes the ``_PyInterpreterFrame`` onto the stack and resumes execution.
-(There is more hairiness for generators and their ilk, we'll discuss these in a later section in more detail.)
+(There is more hairiness for generators and their ilk; we'll discuss these in a later section in more detail.)

(TODO: Also frame layout and use, and "locals plus".)

All sorts of variables
======================

-The bytecode compiler determines for each variable name in which scope it is defined and generates instructions accordingly.
+The bytecode compiler determines the scope in which each variable name is defined, and generates instructions accordingly.
For example, loading a local variable onto the stack is done using ``LOAD_FAST``, while loading a global is done using ``LOAD_GLOBAL``.
The key types of variables are:
+
- fast locals: used in functions
- (slow or regular) locals: used in classes and at the top level
- globals and builtins: the compiler does not distinguish between globals and builtins (though the specializing interpreter does)
-- cells -- used for nonlocal references
+- cells: used for nonlocal references

(TODO: Write the rest of this section.
Alas, the author got distracted and won't have time to continue this for a while.)

From c9ae3dd1704c95dc07eae1b3f0d2b3b2d2ce1598 Mon Sep 17 00:00:00 2001
From: Guido van Rossum
Date: Mon, 16 Jan 2023 09:14:01 -0800
Subject: [PATCH 08/12] Fix markup

---
 internals/interpreter.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/internals/interpreter.rst b/internals/interpreter.rst
index 10e92cf747..2fde66c58f 100644
--- a/internals/interpreter.rst
+++ b/internals/interpreter.rst
@@ -19,7 +19,7 @@ Introduction
The job of the bytecode interpreter, in :cpy-file:`Python/ceval.c`, is to execute Python code.
Its main input is a code object, although this is not a direct argument to the interpreter.
The interpreter is structured as a (recursive) function taking a thread state (``tstate``) and a stack frame (``frame``).
-The function also takes an integer ``throwflag``, which is used by the implementation of :func:`generator.throw`.
+The function also takes an integer ``throwflag``, which is used by the implementation of ``generator.throw``.
It returns a new reference to a Python object (``PyObject *``) or an error indicator, ``NULL``.
Per :pep:`523`, this function is configurable by setting ``interp->eval_frame``; we describe only the default function, ``_PyEval_EvalFrameDefault()``.
(This function's signature has evolved and no longer matches what PEP 523 specifies; the thread state argument is added and the stack frame argument is no longer an object.)
@@ -130,7 +130,7 @@ For example, if an instruction's inline cache is four bytes (i.e., two code unit
This is equivalent to a relative forward jump by that many code units.
(The proper way to code this is ``JUMPBY(n)``, where ``n`` is the number of code units to jump, typically given as a named constant.)

-Serializing non-zero cache entries would present a problem because the serialization (:mod`marshal`) format must be independent of the machine byte order.
+Serializing non-zero cache entries would present a problem because the serialization (:mod:`marshal`) format must be independent of the machine byte order.
More information about the use of inline caches :pep:`can be found in PEP 659 <659#ancillary-data>`.
@@ -229,7 +229,7 @@ This is done by making the ``__context__`` field of the new exception point to t
This is the responsibility of ``_PyErr_SetObject()`` in :cpy-file:`Python/errors.c` (which is ultimately called by all ``PyErr_Set*()`` functions).
Separately, if a statement of the form :samp:`raise {X} from {Y}` is executed, the ``__cause__`` field of the raised exception (:samp:`{X}`) is set to :samp:`{Y}`.
This is done by :c:func:`PyException_SetCause`, called in response to all ``RAISE_VARARGS`` instructions.
-A special case is `:samp:`raise {X} from None`, which sets the ``__cause__`` field to ``None`` (at the C level, it sets ``cause`` to ``NULL``).
+A special case is :samp:`raise {X} from None`, which sets the ``__cause__`` field to ``None`` (at the C level, it sets ``cause`` to ``NULL``).

(TODO: Other exception details.)

From 47d6593bf8e5137f8b717400ba6e943a606febd8 Mon Sep 17 00:00:00 2001
From: Guido van Rossum
Date: Mon, 16 Jan 2023 09:19:26 -0800
Subject: [PATCH 09/12] Don't link to deepfreeze.{py,c}

Neither link works:
- deepfreeze.py was moved between 3.11 and 3.12
- deepfreeze.c is not in the repo

---
 internals/interpreter.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/internals/interpreter.rst b/internals/interpreter.rst
index 2fde66c58f..285dff7685 100644
--- a/internals/interpreter.rst
+++ b/internals/interpreter.rst
@@ -46,7 +46,7 @@ In 3.11, the final field of a code object is an array of indeterminate length co
Code objects are typically produced by the bytecode :ref:`compiler <compiler>`, although they are often written to disk by one process and read back in by another.
The disk version of a code object is serialized using the :mod:`marshal` protocol.
-Some code objects are pre-loaded into the interpreter using :cpy-file:`Tools/scripts/deepfreeze.py`, which writes :cpy-file:`Python/deepfreeze/deepfreeze.c`.
+Some code objects are pre-loaded into the interpreter using ``Tools/scripts/deepfreeze.py``, which writes ``Python/deepfreeze/deepfreeze.c``.

Code objects are nominally immutable.
Some fields (including ``co_code_adaptive``) are mutable, but mutable fields are not included when code objects are hashed or compared.

From a0f95aa96f3a02c4993af8c27b72564891eed98d Mon Sep 17 00:00:00 2001
From: Guido van Rossum
Date: Mon, 16 Jan 2023 09:25:39 -0800
Subject: [PATCH 10/12] Clarify where the inline cache lives

---
 internals/interpreter.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/internals/interpreter.rst b/internals/interpreter.rst
index 285dff7685..253991f8e2 100644
--- a/internals/interpreter.rst
+++ b/internals/interpreter.rst
@@ -116,7 +116,7 @@ Inline cache entries
====================

Some (specialized or specializable) instructions have an associated "inline cache".
-The inline cache consists of one or more two-byte entries included in the bytecode array.
+The inline cache consists of one or more two-byte entries included in the bytecode array as additional words following the ``opcode``/``oparg`` pair.
The size of the inline cache for a particular instruction is fixed by its ``opcode`` alone.
Moreover, the inline cache size for a family of specialized/specializable instructions (e.g., ``LOAD_ATTR``, ``LOAD_ATTR_SLOT``, ``LOAD_ATTR_MODULE``) must all be the same.
Cache entries are reserved by the compiler and initialized with zeros.
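To illustrate how an instruction might consume its inline cache at run time, here is a hedged sketch.
The opcode name and cache fields below are made up for the example, and the real 3.11 code reads the cache through generated structs and macros rather than by raw indexing, but the mechanics are the same: the cache entries are the code units immediately after the instruction, and the instruction must skip them before dispatching onward::

    case LOAD_SOMETHING_SPECIALIZED: {        /* hypothetical opcode */
        /* next_instr already points just past the opcode/oparg word,
           so the cache entries are the next code units. */
        uint16_t counter = next_instr[0];     /* first cache entry */
        uint16_t version = next_instr[1];     /* second cache entry */
        /* ... consult counter and version to validate the
           specialization, falling back to a generic path if stale ... */
        next_instr += 2;                      /* skip the cache, like JUMPBY(2) */
        break;
    }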
From 5378f8ee3fd87a2fcf0eb2f1c3c6aa5aaae35a3e Mon Sep 17 00:00:00 2001
From: Guido van Rossum
Date: Mon, 16 Jan 2023 09:27:31 -0800
Subject: [PATCH 11/12] Clarify the nature of the stack a bit

---
 internals/interpreter.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/internals/interpreter.rst b/internals/interpreter.rst
index 253991f8e2..efb2307521 100644
--- a/internals/interpreter.rst
+++ b/internals/interpreter.rst
@@ -139,6 +139,7 @@ The evaluation stack

Apart from unconditional jumps, almost all instructions read or write some data in the form of object references (``PyObject *``).
The CPython 3.11 bytecode interpreter is a stack machine, meaning that it operates by pushing data onto and popping it off the stack.
+The stack is a pre-allocated array of object references.
For example, the "add" instruction (which used to be called ``BINARY_ADD`` in 3.10 but is now ``BINARY_OP 0``) pops two objects off the stack and pushes the result back onto the stack.
An interesting property of the CPython bytecode interpreter is that the stack size required to evaluate a given function is known in advance.
The stack size is computed by the bytecode compiler and is stored in ``code->co_stacksize``.

From 4b7a9710c5920ed9a497a741ab9b3cfe80f4a031 Mon Sep 17 00:00:00 2001
From: Guido van Rossum
Date: Mon, 16 Jan 2023 09:34:49 -0800
Subject: [PATCH 12/12] Try to clarify what happens when an inlined call returns

---
 internals/interpreter.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/internals/interpreter.rst b/internals/interpreter.rst
index efb2307521..5c5cfed03d 100644
--- a/internals/interpreter.rst
+++ b/internals/interpreter.rst
@@ -244,9 +244,10 @@ This approach is very general but consumes several C stack frames for each neste

In 3.11, the ``CALL`` instruction special-cases function objects to "inline" the call.
When a call gets inlined, a new frame gets pushed onto the call stack and the interpreter "jumps" to the start of the callee's bytecode.
-When the callee executes a ``RETURN_VALUE`` instruction, the frame is popped off the call stack and the interpreter returns to the caller.
+When an inlined callee executes a ``RETURN_VALUE`` instruction, the interpreter returns to its caller,
+by popping the frame off the call stack and "jumping" to the return address.
There is a flag in the frame (``frame->is_entry``) that indicates whether the frame was inlined (set if it wasn't).
-If ``RETURN_VALUE`` returns to a caller where this flag is set, it performs the usual cleanup and return from ``_PyEval_EvalFrameDefault()``.
+If ``RETURN_VALUE`` finds this flag set, it performs the usual cleanup and returns from ``_PyEval_EvalFrameDefault()`` altogether, to a C caller.

A similar check is performed when an unhandled exception occurs.
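Putting the inlining pieces together, the return path can be sketched as follows.
This is a rough sketch of the design described above, not the actual code; ``pop_frame()`` is a hypothetical helper standing in for the real frame cleanup, and the ``resume_frame`` label stands for the point where the interpreter picks up the (now current) frame's bytecode::

    case RETURN_VALUE: {
        PyObject *retval = POP();
        /* ... usual bookkeeping ... */
        if (frame->is_entry) {
            /* Not an inlined call: leave _PyEval_EvalFrameDefault()
               entirely and return to the C caller. */
            return retval;
        }
        /* Inlined call: discard this frame and resume the caller
           within the same C invocation of the interpreter. */
        frame = pop_frame(tstate, frame);   /* hypothetical helper */
        _PyFrame_StackPush(frame, retval);  /* result goes to the caller's stack */
        goto resume_frame;
    }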