gh-115999: Implement thread-local bytecode and enable specialization for `BINARY_OP` #123926

mpage · 2024-09-10T22:53:56Z

This PR implements the foundational work necessary for making the specializing interpreter thread-safe in free-threaded builds and enables specialization for BINARY_OP as an end-to-end example. To enable future incremental work, specialization can now be toggled on a per-family basis. Subsequent PRs will enable specialization in free-threaded builds for the remaining families.

In free-threaded builds, each thread specializes a thread-local copy of the bytecode, created on the first RESUME. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Each thread reserves a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and releases the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.
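The indexing scheme described above can be pictured with a small Python model. This is only an illustrative sketch, not the C implementation: `TLBCIndexPool` and `ToyCode` are hypothetical names, and the reuse of released indices and the main thread receiving index 0 are assumptions made for the example.

```python
import threading


class TLBCIndexPool:
    """Toy pool of thread indices (hypothetical; models reserve/release only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._free = []   # released indices, assumed to be reusable
        self._next = 0    # next index that has never been handed out

    def reserve(self):
        # Called at (simulated) thread creation.
        with self._lock:
            if self._free:
                return self._free.pop()
            idx = self._next
            self._next += 1
            return idx

    def release(self, idx):
        # Called at (simulated) thread destruction.
        with self._lock:
            self._free.append(idx)


class ToyCode:
    """Toy code object whose co_tlbc[0] is the main bytecode itself."""

    def __init__(self, bytecode):
        self.main = bytearray(bytecode)
        self.co_tlbc = [self.main]      # entry 0 points at the main copy
        self._lock = threading.Lock()

    def get_tlbc(self, idx):
        # Lazily create the copy for this index, as on the first RESUME.
        with self._lock:
            while len(self.co_tlbc) <= idx:
                self.co_tlbc.append(None)
            if self.co_tlbc[idx] is None:
                self.co_tlbc[idx] = bytearray(self.main)
            return self.co_tlbc[idx]
```

In this model a single-threaded program only ever uses index 0, so `get_tlbc` returns the main bytecode and never copies anything.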

Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization.
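As a rough illustration of the two switches, the helper below builds the command line and environment for a child interpreter with thread-local bytecode turned off. The helper name `tlbc_disabled_launch` and the script name are hypothetical; only `-X tlbc=0` and `PYTHON_TLBC=0` come from the description above, and the sketch assumes the child is a free-threaded build.

```python
import os
import sys


def tlbc_disabled_launch(script, use_env=False):
    """Return (argv, env) for running `script` with thread-local bytecode off.

    Hypothetical helper: it only assembles the arguments and does not
    start the child process.
    """
    env = dict(os.environ)
    argv = [sys.executable]
    if use_env:
        env["PYTHON_TLBC"] = "0"      # environment-variable form
    else:
        argv += ["-X", "tlbc=0"]      # command-line form
    argv.append(script)
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env)`. Since disabling thread-local bytecode also disables specialization, this may be useful for comparing behavior with and without the specializing interpreter.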

Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
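The "do not overwrite an instrumented instruction" rule can be sketched as a compare-and-exchange on the opcode slot. This is a toy model under stated assumptions: `AtomicByte` emulates an atomic byte with a lock, the helper names are hypothetical, and the opcode numbers are invented for the example rather than taken from CPython.

```python
import threading

# Invented opcode numbers, for illustration only.
BINARY_OP = 1
BINARY_OP_ADD_INT = 2
INSTRUMENTED_LINE = 200


class AtomicByte:
    """Lock-based stand-in for an atomically accessed opcode byte."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def store(self, value):
        with self._lock:
            self._value = value

    def compare_exchange(self, expected, new):
        # Write `new` only if the slot still holds `expected`.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def try_specialize(slot, generic, specialized):
    """Specialize only if nobody (e.g. instrumentation) changed the slot."""
    return slot.compare_exchange(generic, specialized)


def instrument(slot, instrumented):
    """Instrumentation replaces the opcode unconditionally in this model."""
    slot.store(instrumented)
```

If instrumentation wins the race, `try_specialize` returns `False` and leaves the instrumented opcode in place, which is the property the paragraph above describes.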

Issue: Make the specializing interpreter thread-safe in --disable-gil builds #115999

- Fix a few places where we were not using atomics to (de)instrument opcodes.
- Fix a few places where we weren't using atomics to reset adaptive counters.
- Remove some redundant non-atomic resets of adaptive counters that presumably snuck in as merge artifacts of python#118064 and python#117144 landing close together.

mpage · 2024-10-22T00:02:52Z

@markshannon - Would you take a look at this, please?

markshannon · 2024-10-23T16:02:32Z

I'm still concerned about not counting the tlbc memory blocks in the refleaks test.

Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

mpage · 2024-10-24T04:49:06Z

!buildbot nogil refleak

bedevere-bot · 2024-10-24T04:49:09Z

🤖 New build scheduled with the buildbot fleet by @mpage for commit 07f9140 🤖

The command will test the builders whose names match following regular expression: nogil refleak

The builders matched are:

AMD64 CentOS9 NoGIL Refleaks PR
AMD64 Fedora Rawhide NoGIL refleaks PR
aarch64 Fedora Rawhide NoGIL refleaks PR
PPC64LE Fedora Rawhide NoGIL refleaks PR

mpage · 2024-10-24T05:26:55Z

> I'm still concerned about not counting the tlbc memory blocks in the refleaks test.
>
> Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

@markshannon - That would work, but I opted for clearing the cached TLBC for threads that aren't currently in use when we clear other internal caches. This should still catch leaks, doesn't require modifying refleaks.py, and is the same approach we use for tier2. Please have a look.
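The approach in this reply can be pictured with a toy function that drops cached copies belonging to indices no thread currently holds. It is a sketch only: `clear_unused_tlbc` is a hypothetical name, the list stands in for the co_tlbc array, and keeping entry 0 (the main copy) is an assumption based on the PR description.

```python
def clear_unused_tlbc(co_tlbc, in_use):
    """Free copies whose index is not in `in_use`; return how many were freed.

    Entry 0 is never cleared because, in this model, it is the main
    bytecode rather than a copy. The array itself is left in place.
    """
    freed = 0
    for idx in range(1, len(co_tlbc)):
        if idx not in in_use and co_tlbc[idx] is not None:
            co_tlbc[idx] = None
            freed += 1
    return freed
```

Because only unused copies are released, a leak check run after clearing caches would still see any copy that is never released.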

markshannon · 2024-10-29T12:25:53Z

Lib/test/test_sys.py

+            # code objects is a large fraction of the total number of
+            # references, this can cause the total number of allocated
+            # blocks to exceed the total number of references.
+            if not support.Py_GIL_DISABLED:


Now that we can free the unused tlbcs, can we replace this with sys._clear_internal_caches()?

Unfortunately, no. The assertion seems to be very sensitive to which kinds of objects are on the heap, as well as to the number of non-reference-counted allocations (blocks) per object. With the introduction of TLBC there is at least one additional block allocated per code object that is not reference counted, the _PyCodeArray, which is present even if we free the unused TLBCs. Its presence is enough to trigger the assertion.

This assertion feels pretty brittle and I'd be in favor of removing it, but that's probably worth doing in a separate PR.

Maybe replace it with a more meaningful test rather than remove it. But in another PR.

markshannon

Looks good.

One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

mpage · 2024-10-29T16:46:03Z

> One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

@markshannon - Unfortunately that doesn't help. See my reply inline.

markshannon

I still have concerns about memory use, but we can iterate on that in subsequent PRs.

mpage · 2024-11-04T19:10:45Z

JIT failures appear on main and are unrelated to this PR

encukou · 2024-11-05T09:23:17Z

After this PR was merged, test_gdb started failing; see e.g. https://buildbot.python.org/#/builders/506/builds/9149
Do you think it's possible to fix this in a day?

Yhg1s · 2024-11-05T11:53:30Z

Looks like the problem is only in --enable-shared builds, and it's because we're now looking up _PyInterpreterFrame too early (before the .so file is loaded). I'll have a fix in a few minutes.

Yhg1s · 2024-11-05T12:11:30Z

PR #126440 should fix the failure.

mpage added 30 commits September 10, 2024 13:24

Assign threads indices into bytecode copies

776a1e1

Replace most usage of PyCode_CODE

2b40870

Get bytecode copying working

344d7ad

Refactor remove_tools

f203d00

Refactor remove_line_tools

82b456a

Instrument thread-local bytecode

b021704

Use locks for instrumentation

aea69c5

Add ifdef guards for each specialization family

552277d

Specialize BINARY_OP

50a6089

Limit the amount of memory consumed by bytecode copies

3f1d941

Make thread-local bytecode limits user configurable

7d2eb27

Make branch taken recording thread-safe

e3b367a

Lock thread-local bytecode when specializing

b2375bf

Load bytecode on RESUME_CHECK

2707f8e

Load tlbc on generator.throw()

3fdcb28

Use tlbc instead of thread_local_bytecode

4a55ce5

Use tlbc everywhere

8b3ff60

Explicitly manage tlbc state

862afa1

Refactor API for fetching tlbc

0b4d952

Add unit tests

7795e99

Fix initconfig in default build

693a4cc

Fix instrumentation in default build

b43531e

Synchronize bytecode modifications between specialization and instrumentation using atomics

9025f43

Add a high-level comment

c44c7d9

Fix unused variable warning in default build

e2a6656

Fix test_config in free-threaded builds

e6513d1

Fix formatting

a18396f

Remove comment

81fe1a2

Fix data race in _PyInstruction_GetLength

837645e

Read the opcode atomically, the interpreter may be specializing it

Merge branch 'main' into pythongh-115999-thread-local-bytecode

b16ae5f

Yhg1s approved these changes Oct 19, 2024

Include/cpython/code.h

Python/ceval_macros.h

bedevere-app bot added awaiting merge and removed awaiting change review labels Oct 19, 2024

mpage added 3 commits October 23, 2024 11:36

Merge branch 'main' into pythongh-115999-thread-local-bytecode

176b24e

Clear TLBC when other caches are cleared

c107495

Remove _get_tlbc_blocks

07f9140

Yhg1s reviewed Oct 25, 2024

Lib/test/test_sys.py

markshannon reviewed Oct 29, 2024

markshannon self-requested a review October 29, 2024 16:54

markshannon approved these changes Oct 29, 2024

mpage added 4 commits October 30, 2024 09:55

Merge branch 'main' into pythongh-115999-thread-local-bytecode

4cbe237

Rename _PyCode_InitCounters back to _PyCode_Quicken

38ff315

We are quickening LOAD_CONST now.

Merge branch 'main' into pythongh-115999-thread-local-bytecode

338f7e5

Merge branch 'main' into pythongh-115999-thread-local-bytecode

bcd1bb2

mpage merged commit 2e95c5b into python:main Nov 4, 2024
51 of 57 checks passed

bedevere-app bot removed the awaiting merge label Nov 4, 2024

mpage deleted the gh-115999-thread-local-bytecode branch November 4, 2024 19:14

Yhg1s mentioned this pull request Nov 6, 2024

gh-115999: Add free-threaded specialization for COMPARE_OP #126410

Open
