Bytecode compilation output depends on order of files compiled #129724

Open
konstin opened this issue Feb 6, 2025 · 3 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@konstin
konstin commented Feb 6, 2025

Bug report

Bug description:

This is a minimal reproduction of this downstream bug report: astral-sh/uv#10619

The output of compileall.compile_file depends on the order in which the files are compiled. This means compilation is non-deterministic if builds are distributed over a process pool.

This becomes a problem when building docker images, where you usually bytecode compile ahead of time for faster startup, and where the hash of the image depends on all files in the image, including the .pyc files.

Specifically, the output of

a = {"foo", 2, 3}

def f():
    b = {"foo", 2, 3}

is different if we previously compiled another file with

import foo

Reproducer script:

#!/bin/bash

set -e

script=$(cat << EOF
import compileall
import sys

for path in sys.argv[1:]:
    compileall.compile_file(path)
EOF
)

cat << EOF > a.py
import foo
EOF

cat << EOF > b.py
a = {"foo", 2, 3}

def f():
    b = {"foo", 2, 3}
EOF

# Both files
rm -rf __pycache__
python3.14 -c "$script" a.py b.py
sha256sum __pycache__/b.cpython-314.pyc

# For debugging
cp __pycache__/b.cpython-314.pyc b1.cpython-314.pyc

# Single file only
rm -rf __pycache__
python3.14 -c "$script" b.py
sha256sum __pycache__/b.cpython-314.pyc

# For debugging
cp __pycache__/b.cpython-314.pyc b2.cpython-314.pyc

This is caused by different refcounts in the marshalled files:

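import marshal
import sys

# File names match the copies made by the reproducer script above.
with open("b1.cpython-314.pyc", "rb") as f:
    f.read(16)  # Skip the 16-byte .pyc header
    pyc1 = marshal.load(f)

with open("b2.cpython-314.pyc", "rb") as f:
    f.read(16)  # Skip the 16-byte .pyc header
    pyc2 = marshal.load(f)

# Compare the refcount of the first constant in each module code object.
print(sys.getrefcount(pyc1.co_consts[0]))
print(sys.getrefcount(pyc2.co_consts[0]))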

This prints 2 and 3.
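
For context on the `f.read(16)` calls: a minimal sketch of what those 16 header bytes hold, assuming the PEP 552 layout (magic number, flags word, then either source mtime and size or a source hash). `read_pyc_header` is a hypothetical helper name, not a stdlib function.

```python
import struct
from importlib.util import MAGIC_NUMBER


def read_pyc_header(data: bytes) -> dict:
    """Split the 16-byte .pyc header into its fields (PEP 552 layout assumed)."""
    if len(data) < 16:
        raise ValueError("need at least 16 bytes for a .pyc header")
    (flags,) = struct.unpack("<I", data[4:8])
    header = {
        # True if the file was written by the running interpreter version.
        "magic_matches": data[:4] == MAGIC_NUMBER,
        "flags": flags,
    }
    if flags & 0x1:
        # Hash-based pyc: the remaining 8 bytes are a hash of the source file.
        header["source_hash"] = data[8:16]
    else:
        # Timestamp-based pyc: source mtime and source size, 4 bytes each.
        mtime, size = struct.unpack("<II", data[8:16])
        header["mtime"] = mtime
        header["source_size"] = size
    return header
```

Running it on the first 16 bytes of both `b1` and `b2` is a quick way to check whether the headers already differ before looking at the marshalled payload.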

The original report is from 3.13; I've reproduced it with 3.14.0a4. It happens at least on Linux and Windows.

CPython versions tested on:

3.14

Operating systems tested on:

Linux

@konstin konstin added the type-bug An unexpected behavior, bug, or error label Feb 6, 2025
@encukou encukou added type-feature A feature request or enhancement interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed type-bug An unexpected behavior, bug, or error labels Feb 6, 2025
@encukou
Member

encukou commented Feb 6, 2025

Yes, bytecode compilation is non-deterministic. See also this issue: #78274 (comment)

A PR to make it deterministic without sacrificing performance would be welcome. You can also use an external tool to post-process the pyc.
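
Before post-processing, it may be worth confirming that two differing .pyc files still decode to equal code objects, which would suggest the difference lies only in how they were serialized. A minimal sketch, assuming a 16-byte header; `code_from_pyc` and `same_code` are hypothetical helper names.

```python
import marshal


def code_from_pyc(data: bytes):
    """Decode the code object stored after the 16-byte .pyc header."""
    return marshal.loads(data[16:])


def same_code(pyc_a: bytes, pyc_b: bytes) -> bool:
    """Return True if both .pyc payloads decode to equal code objects."""
    return code_from_pyc(pyc_a) == code_from_pyc(pyc_b)
```

For the files from the reproducer, this would be called as `same_code(open("b1.cpython-314.pyc", "rb").read(), open("b2.cpython-314.pyc", "rb").read())`.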

@konstin
Author

konstin commented Feb 6, 2025

Is it valid to modify the serialized refcounts (to some specific value)? I'm having trouble following what the marshalled refcounts mean and when they'd be DECREF'd.

@encukou
Member

encukou commented Feb 6, 2025

They're not refcounts. FLAG_REF gets added if an object might appear later in the file, in which case it can later be referenced by TYPE_REF (using an index to the array of all objects with FLAG_REF).

One solution would be to flag all objects (inflating the array and the indices), or to do a second pass over the pyc (wasteful if not needed, so CPython doesn't do it)... or something like redesigning the serialization format entirely.
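
To see where two .pyc files first diverge, and whether that divergence looks like a FLAG_REF difference, something like the following could be used. It is a rough sketch: the value `0x80` for FLAG_REF is taken from CPython's `marshal.c` and is an implementation detail, `first_difference` is a hypothetical helper, and the check is only a heuristic because it does not verify that the differing byte is actually a type byte.

```python
FLAG_REF = 0x80  # high bit of a marshal type byte (CPython implementation detail)


def first_difference(a: bytes, b: bytes):
    """Return (offset, only_flag_ref) for the first mismatch, or None if equal.

    only_flag_ref is True when the two bytes differ in the FLAG_REF bit alone.
    """
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset, (x ^ y) == FLAG_REF
    if len(a) != len(b):
        # One input is a strict prefix of the other.
        return min(len(a), len(b)), False
    return None
```

Only the first mismatch is meaningful here: once one file flags an object and the other does not, later TYPE_REF indices shift, so the remaining bytes are expected to differ as well.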
