We need to be consistent in our use of instruction/codeunit/bytecode/opcode, etc. #94437

markshannon · 2022-06-30T10:58:40Z

Documentation

We use the terms opcode, bytecode, instruction, and codeunit, in the code, comments and documentation.

However, we aren't consistent, nor do we define those terms properly anywhere. The best docs are in dis.rst, which is the wrong place for them.

A glossary

First of all we want some sort of glossary like this:

Instruction: The element of execution used by the front end to describe execution. All instructions have a name. Most, but not all, also have an operand.
Execution-Unit: These can be considered to be the "real instructions" used by the interpreter. The assembler converts each instruction into zero or more execution-units. Instructions that are converted to anything but one execution-unit with the same name are called "pseudo-instructions".
Code-Unit: A pair of bytes consisting of an opcode and oparg. In the bytecode, an execution-unit is represented by one or more code-units.
Bytecode: A sequence of code-units that represents the code of a function, class or module (or other code entity).

Representation of instruction at runtime:
The assembler converts each instruction to zero or more execution-units, and each of those is converted to one or more code-units.
An execution-unit is composed of:

Zero or more operand extensions. These are code units whose opcode == EXTENDED_ARG and whose oparg is 8 of the high bits of the instruction's operand.
One core code unit, whose opcode represents the name of the instruction, and whose oparg == (operand & 255).
Zero or more cache entries. The exact number depends on the execution-unit name and is exactly determined by that name.
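The composition above can be sketched in a few lines of Python. This is only an illustration, not CPython's assembler: the function name `encode_execution_unit` and the tuple representation are hypothetical, opcodes are kept as symbolic names because their numeric values differ between versions, and the number of cache entries is passed in by the caller rather than looked up by name. It takes the core code unit's oparg to be the low 8 bits of the operand, with the remaining high bytes carried by EXTENDED_ARG code units, most significant first.

```python
def encode_execution_unit(name, operand, cache_entries=0):
    """Return the (opcode_name, oparg) code units for one execution-unit.

    Hypothetical helper for illustration; `cache_entries` is supplied by
    the caller instead of being derived from `name`.
    """
    if operand < 0:
        raise ValueError("operand must be non-negative")

    # Operand extensions: one EXTENDED_ARG per non-zero-prefixed high byte.
    extensions = []
    high = operand >> 8
    while high:
        extensions.append(("EXTENDED_ARG", high & 255))
        high >>= 8
    extensions.reverse()  # most significant byte is emitted first

    # Core code unit: carries the name and the low 8 bits of the operand.
    core = [(name, operand & 255)]

    # Cache entries: placeholder code units following the core unit.
    caches = [("CACHE", 0)] * cache_entries

    return extensions + core + caches


if __name__ == "__main__":
    for unit in encode_execution_unit("LOAD_ATTR", 515, cache_entries=2):
        print(unit)
```

The sketch does not enforce any upper limit on the operand size, which a real assembler would need to do.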

Although the bytecode, co.co_code, is presented as a sequence of bytes, it should be viewed as a sequence of code-units, with the opcode preceding the oparg. The dis module will disassemble bytecode to a list of code-units.
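As a rough illustration of that view, the bytes of `co_code` can be paired up into (opcode, oparg) code units. The helper name `code_units` and the sample function `add` are made up for this sketch; `co_code` and `dis.opname` are the only interpreter APIs used. The sketch assumes the two-byte code-unit layout described above, and the names it prints depend on the Python version it runs on.

```python
import dis


def code_units(code):
    """Split code.co_code into (opcode_name, oparg) pairs.

    Hypothetical helper: assumes every code unit is exactly two bytes,
    opcode first, oparg second.
    """
    raw = code.co_code
    return [(dis.opname[raw[i]], raw[i + 1]) for i in range(0, len(raw), 2)]


def add(a, b):
    return a + b


if __name__ == "__main__":
    for name, oparg in code_units(add.__code__):
        print(name, oparg)
```

Note that this lists raw code units, so any EXTENDED_ARG prefixes and cache entries appear as separate rows rather than being folded into one instruction.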

Why do this?

Doing this will expose inconsistencies in our terminology and tools and allow us to consider better tooling in the future.

For example, shouldn't dis output a list of instructions, not codeunits?

Could we support an assembler, allowing backwards compatible assembly code?
We could convert a list of 3.10 instructions to 3.11 bytecode. At the instruction level, they aren't so different, even though the bytecode is quite different.

The set of names is infinite, allowing us more flexibility to add new instructions, and support old ones.

Examples

The BINARY_ADD instruction is also an execution-unit in 3.10, but could be a pseudo-instruction in 3.11+.
Likewise SETUP_FINALLY. The difference is that the 3.11 front-end emits SETUP_FINALLY, but not BINARY_ADD.

Instruction: LOAD_METHOD "spam"
Execution unit: LOAD_ATTR 515
Code units: EXTENDED_ARG 2 LOAD_ATTR 3 CACHE 0*6


arhadthedev · 2022-06-30T11:11:38Z

I have a feeling that all three (instruction, execution unit, and code unit) can be seen as instructions, just of different tiers. The tiers can be named, for example:

stressing their fineness: generalized instructions, specialized instructions, and finetuned instructions
or following an LLVM-esque frontend-middleend-backend pipeline: frontend instructions, backend instructions, and primitive instructions.

gvanrossum · 2022-07-22T22:33:40Z

I like instruction, pseudo-instruction, code-unit, and bytecode, but I'm not excited about execution-unit (too long, not quite self-explanatory enough, and I don't think we've been using that term). Maybe the latter could be "concrete instructions"? Longer, but more self-explanatory.

If this was a classical assembly language, those pseudo-instructions would be macros, and I wouldn't object to reusing that term instead of pseudo- (or virtual-?) instructions.

In Mark's definitions, are cache entries code-units or not? The definition of bytecode seems to imply they are, but the definition of code-unit seems to imply they're not (cache entries don't have an opcode and oparg).

Agreed that this documentation doesn't belong in dis.rst.

warsaw · 2022-07-22T23:58:44Z

Agreed that this documentation doesn't belong in dis.rst.

There's a lot of good information about how the CPython interpreter works in the devguide. I wonder if it makes sense to split that out into a separate document solely focused on how the CPython interpreter (parser, etc.) works? Then such definitions would make sense going there. Maybe it still makes sense to cover this in the devguide in the meantime?

gvanrossum · 2022-07-23T00:12:19Z

Oh, I always forget about the devguide! The section on the PEG parser is great (thanks Pablo) but the compiler design section seems outdated (there is no NEXT_BLOCK macro in compile.c any more, and it even still lists the long-dead peephole.c file).

I don't know whether those chapters would be served by moving them into yet another document -- we already have too many.

markshannon · 2022-07-23T12:21:53Z

If this was a classical assembly language, those pseudo-instructions would be macros

No, they are instructions that do not map directly to an execution-unit (or concrete instruction, using your terminology).
Maybe some of them could be defined using macros, but there is no way to define SETUP_FINALLY using a macro.

This is consistent with assemblers for hardware machines, which have pseudo-instructions that are built into the assembler, not defined as macros.

gvanrossum · 2022-07-23T15:47:42Z

You are correct.

markshannon added the docs Documentation in the Doc dir label Jun 30, 2022

markshannon assigned iritkatriel Jun 30, 2022

markshannon mentioned this issue Jul 22, 2022

Don't use EXTENDED_ARG_QUICK in unquickened code #95113

Closed

iritkatriel removed their assignment May 27, 2024
