
We need to be consistent in our use of instruction/codeunit/bytecode/opcode, etc. #94437

Open
@markshannon


Documentation

We use the terms opcode, bytecode, instruction, and codeunit in the code, comments, and documentation.

However, we aren't consistent, nor do we define those terms properly anywhere. The best docs are in dis.rst, which is the wrong place for them.

A glossary

First of all we want some sort of glossary like this:

  • Instruction: The unit the front end uses to describe execution. All instructions have a name. Most, but not all, also have an operand.
  • Execution-Unit: These can be considered the "real instructions" used by the interpreter. The assembler converts each instruction into zero or more execution-units. Instructions that are converted to anything other than exactly one execution-unit with the same name are called "pseudo-instructions".
  • Code-Unit: A pair of bytes consisting of an opcode and an oparg. In the bytecode, an execution-unit is represented by one or more code-units.
  • Bytecode: A sequence of code-units that represents the code of a function, class or module (or other code entity).

Representation of an instruction at runtime:
The assembler converts each instruction to zero or more execution-units, and each of those is converted to one or more code-units.
An execution-unit is composed of the following, in order:

  • Zero or more operand extensions. These are code-units whose opcode == EXTENDED_ARG and whose oparg is 8 of the high bits of the execution-unit's operand.
  • One core code-unit, whose opcode represents the name of the execution-unit, and whose oparg == (operand & 255).
  • Zero or more cache entries. The number of entries is fully determined by the execution-unit's name.
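The layout above can be sketched in a few lines of Python. This is only an illustration of the proposed terminology, not CPython's assembler: `encode_execution_unit` is a hypothetical helper, code-units are modelled as `(name, oparg)` tuples rather than real opcode numbers, and the number of cache entries is passed in by the caller because the real value depends on the execution-unit name and the CPython version.

```python
from typing import List, Tuple

CodeUnit = Tuple[str, int]  # (opcode name, oparg), with oparg in 0..255


def encode_execution_unit(name: str, operand: int, cache_entries: int = 0) -> List[CodeUnit]:
    """Lay out one execution-unit as a list of code-units (hypothetical helper)."""
    if operand < 0:
        raise ValueError("operand must be non-negative")

    # Collect the bytes above the low 8 bits, least significant first.
    high_bytes = []
    rest = operand >> 8
    while rest:
        high_bytes.append(rest & 255)
        rest >>= 8

    # Operand extensions come first, most significant byte first.
    units: List[CodeUnit] = [("EXTENDED_ARG", b) for b in reversed(high_bytes)]
    # The core code-unit carries the low 8 bits of the operand.
    units.append((name, operand & 255))
    # Cache entries follow the core code-unit.
    units.extend(("CACHE", 0) for _ in range(cache_entries))
    return units


units = encode_execution_unit("LOAD_ATTR", 515, cache_entries=6)
print(units[:2])  # the operand extension and the core code-unit
```

Modelling opcodes by name keeps the sketch independent of any particular version's opcode numbering.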

Although the bytecode, co.co_code, is presented as a sequence of bytes, it should be viewed as a sequence of code-units, with the opcode preceding the oparg. The dis module will disassemble bytecode to a list of code-units.
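The code-unit view of co.co_code can be seen by pairing up the bytes. A minimal sketch, assuming a CPython version where every code-unit is exactly two bytes (3.6 and later); `code_units` is a hypothetical helper name, and the exact names it prints vary between versions.

```python
import dis
from typing import List, Tuple


def code_units(code) -> List[Tuple[str, int]]:
    """Split co_code into (opcode name, oparg) pairs (hypothetical helper)."""
    raw = code.co_code
    # Each code-unit is two bytes: the opcode first, then the oparg.
    return [(dis.opname[raw[i]], raw[i + 1]) for i in range(0, len(raw), 2)]


def add(a, b):
    return a + b


for name, oparg in code_units(add.__code__):
    print(f"{name:<20} {oparg}")
```

Note that this lists raw code-units: operand extensions and cache entries, where present, appear as separate rows rather than being folded into one execution-unit.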

Why do this?

Doing this will expose inconsistencies in our terminology and tools and allow us to consider better tooling in the future.

For example, shouldn't dis output a list of instructions, not code-units?

Could we support an assembler, allowing backwards compatible assembly code?
We could convert a list of 3.10 instructions to 3.11 bytecode. At the instruction level, they aren't so different, even though the bytecode is quite different.

Unlike opcodes, which must fit in a single byte, the set of instruction names is unbounded, giving us more flexibility to add new instructions and to keep supporting old ones.

Examples

The BINARY_ADD instruction is also an execution-unit in 3.10, but could be a pseudo-instruction in 3.11+.
Likewise SETUP_FINALLY. The difference is that the 3.11 front end emits SETUP_FINALLY, but not BINARY_ADD.
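To make the idea of pseudo-instructions concrete, here is a rough sketch of a lowering step from instructions to execution-units. Everything in it is an assumption for illustration: the `LOWERING` table and `lower` function are hypothetical, the BINARY_ADD rule assumes 3.11's BINARY_OP with operand 0 for addition, and the SETUP_FINALLY rule assumes it produces no execution-units because 3.11 records handlers in the exception table instead.

```python
from typing import Callable, Dict, List, Tuple

Instruction = Tuple[str, int]    # (name, operand) as emitted by the front end
ExecutionUnit = Tuple[str, int]  # (name, operand) as run by the interpreter

NB_ADD = 0  # assumed operand that BINARY_OP uses for addition in 3.11

# Hypothetical lowering rules for pseudo-instructions; illustrative only.
LOWERING: Dict[str, Callable[[int], List[ExecutionUnit]]] = {
    "BINARY_ADD": lambda operand: [("BINARY_OP", NB_ADD)],
    "SETUP_FINALLY": lambda operand: [],  # assumed: becomes exception-table data
}


def lower(instructions: List[Instruction]) -> List[ExecutionUnit]:
    """Convert instructions to execution-units (hypothetical helper)."""
    out: List[ExecutionUnit] = []
    for name, operand in instructions:
        rule = LOWERING.get(name)
        if rule is None:
            # Ordinary instruction: exactly one execution-unit, same name.
            out.append((name, operand))
        else:
            # Pseudo-instruction: zero or more execution-units.
            out.extend(rule(operand))
    return out


old_style = [("LOAD_FAST", 0), ("LOAD_FAST", 1), ("BINARY_ADD", 0), ("RETURN_VALUE", 0)]
print(lower(old_style))
```

A table like this is one way an assembler could accept 3.10-style instruction lists while emitting bytecode for a newer version.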

  • Instruction: LOAD_METHOD "spam"
  • Execution-unit: LOAD_ATTR 515
  • Code-units: EXTENDED_ARG 2, LOAD_ATTR 3, CACHE 0 (×6)
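The arithmetic behind this example is 515 == (2 << 8) | 3. A small sketch of reading the code-units back into a single execution-unit follows; `decode_execution_unit` is a hypothetical helper, and code-units are again modelled as `(name, oparg)` tuples rather than real opcode numbers.

```python
from typing import Iterable, Tuple


def decode_execution_unit(units: Iterable[Tuple[str, int]]) -> Tuple[str, int, int]:
    """Fold the code-units of one execution-unit into (name, operand, cache count)."""
    operand = 0
    it = iter(units)
    for name, oparg in it:
        # Each code-unit contributes 8 more bits; earlier units are more significant.
        operand = (operand << 8) | oparg
        if name != "EXTENDED_ARG":
            # This is the core code-unit; whatever follows should be cache entries.
            caches = sum(1 for n, _ in it if n == "CACHE")
            return name, operand, caches
    raise ValueError("no core code-unit found")


units = [("EXTENDED_ARG", 2), ("LOAD_ATTR", 3)] + [("CACHE", 0)] * 6
print(decode_execution_unit(units))  # ('LOAD_ATTR', 515, 6)
```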
