
We need to be consistent in our use of instruction/codeunit/bytecode/opcode, etc. #94437


Open
markshannon opened this issue Jun 30, 2022 · 6 comments
Labels
docs Documentation in the Doc dir

Comments

@markshannon
Member

markshannon commented Jun 30, 2022

Documentation

We use the terms opcode, bytecode, instruction, and codeunit, in the code, comments and documentation.

However, we aren't consistent, nor do we define those terms properly anywhere. The best docs are in dis.rst, which is the wrong place for them.

A glossary

First of all we want some sort of glossary like this:

  • Instruction: The unit the front end uses to describe execution. All instructions have a name. Most, but not all, also have an operand.
  • Execution-Unit: These can be considered to be the "real instructions" used by the interpreter. The assembler converts each instruction into zero or more execution-units. Instructions that are converted to anything but one execution-unit with the same name are called "pseudo-instructions".
  • Code-Unit: A pair of bytes consisting of an opcode and oparg. In the bytecode, an execution-unit is represented by one or more code-units.
  • Bytecode: A sequence of code-units that represents the code of a function, class or module (or other code entity).

Representation of instructions at runtime:
The assembler converts each instruction to zero or more execution-units, and each of those is converted to one or more code-units.
An execution-unit is composed of:

  • Zero or more operand extensions. These are code units whose opcode == EXTENDED_ARG and whose oparg is 8 of the high bits of the instruction's operand.
  • One core code unit, whose opcode represents the name of the instruction, and whose oparg == (operand & 255)
  • Zero or more cache entries. The exact number depends on the execution-unit name and is exactly determined by that name.
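The layout described in the list above can be sketched in a few lines of Python. This is only an illustration of the proposed terminology, not CPython's assembler: the function name `encode_execution_unit` and the use of opcode names (rather than numeric opcodes) are assumptions made for readability, and the cache count is passed in by the caller instead of being looked up from the execution-unit name.

```python
# Hypothetical sketch: encode one execution-unit as a list of code-units.
# Code-units are modelled as (opcode_name, oparg) pairs; real bytecode
# stores a numeric opcode byte followed by an oparg byte.

def encode_execution_unit(name, operand, n_cache=0):
    """Return the code-units for one execution-unit.

    n_cache is supplied by the caller here; in the proposal it is fully
    determined by the execution-unit name.
    """
    if not 0 <= operand < (1 << 32):
        raise ValueError("operand must fit in 32 bits")

    # Operand extensions: one EXTENDED_ARG per non-zero high byte group,
    # most significant byte first.
    extensions = []
    high = operand >> 8
    while high:
        extensions.append(("EXTENDED_ARG", high & 255))
        high >>= 8
    extensions.reverse()

    core = [(name, operand & 255)]       # low 8 bits of the operand
    caches = [("CACHE", 0)] * n_cache    # cache entries follow the core unit
    return extensions + core + caches
```

For instance, `encode_execution_unit("LOAD_ATTR", 515, 6)` yields `EXTENDED_ARG 2`, `LOAD_ATTR 3`, and six `CACHE 0` entries, since 515 == 2 * 256 + 3.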

Although the bytecode, co.co_code, is presented as a sequence of bytes, it should be viewed as a sequence of code-units, with the opcode preceding the oparg. The dis module will disassemble bytecode to a list of code-units.
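As a rough illustration of that view, the bytes of `co_code` can be paired up by hand. The helper name `code_units` and the sample function `add_one` are made up for this sketch; `dis.opname` is the standard table mapping opcode numbers to names. Whether cache entries count as code-units is an open question in this thread; the sketch simply treats every byte pair the same way.

```python
import dis

def code_units(co_code):
    """Split raw bytecode into (opcode_name, oparg) pairs."""
    if len(co_code) % 2:
        raise ValueError("bytecode length must be even")
    return [
        (dis.opname[co_code[i]], co_code[i + 1])  # opcode byte, then oparg byte
        for i in range(0, len(co_code), 2)
    ]

def add_one(x):  # sample function, only used to have some bytecode to inspect
    return x + 1

for name, oparg in code_units(add_one.__code__.co_code):
    print(name, oparg)
```

The exact names printed depend on the Python version, which is part of the motivation for defining the terms independently of any one release.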

Why do this?

Doing this will expose inconsistencies in our terminology and tools and allow us to consider better tooling in the future.

For example, shouldn't dis output a list of instructions, not code-units?

Could we support an assembler, allowing backwards compatible assembly code?
We could convert a list of 3.10 instructions to 3.11 bytecode. At the instruction level, they aren't so different, even though the bytecode is quite different.

The set of names is infinite, allowing us more flexibility to add new instructions, and support old ones.

Examples

The BINARY_ADD instruction is also an execution-unit in 3.10, but could be a pseudo-instruction in 3.11+.
The same goes for SETUP_FINALLY. The difference is that the 3.11 front-end emits SETUP_FINALLY, but not BINARY_ADD.
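To make the BINARY_ADD case concrete, here is a minimal sketch of how an assembler might lower a pseudo-instruction into execution-units. The table `LOWERING` and the function `lower` are hypothetical names, not anything in CPython. The one mapping shown relies on 3.11 expressing addition as `BINARY_OP` with oparg 0 (`NB_ADD`). SETUP_FINALLY is deliberately left out: as discussed later in the thread, it cannot be expressed as a simple table entry like this.

```python
# Hypothetical lowering table: pseudo-instruction name -> execution-units.
# Instructions not listed are assumed to map to one execution-unit with
# the same name and operand.
LOWERING = {
    "BINARY_ADD": [("BINARY_OP", 0)],  # 0 is NB_ADD in 3.11
}

def lower(instructions):
    """Convert (name, operand) instructions into execution-units."""
    execution_units = []
    for name, operand in instructions:
        execution_units.extend(LOWERING.get(name, [(name, operand)]))
    return execution_units

# 3.10-style instructions for "return a + b"; operands are local slots.
old_style = [
    ("LOAD_FAST", 0),
    ("LOAD_FAST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
print(lower(old_style))
```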

  • Instruction: LOAD_METHOD "spam"
  • Execution unit: LOAD_ATTR 515
  • Code units: EXTENDED_ARG 2, LOAD_ATTR 3, CACHE 0 (repeated 6 times)
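Going the other way, the code units in this example can be folded back into the execution unit. The sketch below is an assumption-laden illustration (the name `decode_execution_units` is invented, and code units are again modelled as name/oparg pairs): it accumulates EXTENDED_ARG prefixes into the operand and counts trailing CACHE entries.

```python
def decode_execution_units(code_units):
    """Group (opcode_name, oparg) code units into execution units.

    Returns a list of (name, operand, cache_count) tuples.
    """
    result = []
    pending = 0  # high bits collected from EXTENDED_ARG prefixes
    for name, oparg in code_units:
        if name == "EXTENDED_ARG":
            pending = (pending << 8) | oparg
        elif name == "CACHE":
            if not result:
                raise ValueError("cache entry before any core code unit")
            result[-1][2] += 1  # attach to the preceding execution unit
        else:
            result.append([name, (pending << 8) | oparg, 0])
            pending = 0
    return [tuple(unit) for unit in result]

example = [("EXTENDED_ARG", 2), ("LOAD_ATTR", 3)] + [("CACHE", 0)] * 6
print(decode_execution_units(example))  # [('LOAD_ATTR', 515, 6)]
```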

@markshannon markshannon added the docs Documentation in the Doc dir label Jun 30, 2022
@arhadthedev
Member

I have a feeling that all three (instruction, execution unit, and code unit) can be seen as instructions, just of different tiers. The tiers can be named, for example:

  • stressing their fineness: generalized instructions, specialized instructions, finetuned instructions
  • or following an LLVM-esque pipeline of frontend-middleend-backend: frontend instructions, backend instructions, and primitive instructions.

@gvanrossum
Member

I like instruction, pseudo-instruction, code-unit, and bytecode, but I'm not excited about execution-unit (too long, not quite self-explanatory enough, and I don't think we've been using that term). Maybe the latter could be "concrete instructions"? Longer, but more self-explanatory.

If this was a classical assembly language, those pseudo-instructions would be macros, and I wouldn't object to reusing that term instead of pseudo- (or virtual-?) instructions.

In Mark's definitions, are cache entries code-units or not? The definition of bytecode seems to imply they are, but the definition of code-unit seems to imply they're not (cache entries don't have an opcode and oparg).

Agreed that this documentation doesn't belong in dis.rst.

@warsaw
Member

warsaw commented Jul 22, 2022

Agreed that this documentation doesn't belong in dis.rst.

There's a lot of good information about how the CPython interpreter works in the devguide. I wonder if it makes sense to split that out into a separate document solely focused on how the CPython interpreter (parser, etc.) works? Then such definitions would make sense going there. Maybe it still makes sense to cover this in the devguide in the meantime?

@gvanrossum
Member

Oh, I always forget about the devguide! The section on the PEG parser is great (thanks Pablo) but the compiler design section seems outdated (there is no NEXT_BLOCK macro in compile.c any more, and it even still lists the long-dead peephole.c file).

I don't know whether those chapters would be served by moving them into yet another document -- we already have too many.

@markshannon
Member Author

If this was a classical assembly language, those pseudo-instructions would be macros

No, they are instructions that do not map directly to an execution-unit (or concrete instruction, using your terminology).
Maybe some of them could be defined using macros, but there is no way to define SETUP_FINALLY using a macro.

This is consistent with assemblers for hardware machines, which have pseudo-instructions that are built into the assembler, not defined as macros.

@gvanrossum
Member

You are correct.

@iritkatriel iritkatriel removed their assignment May 27, 2024