-
-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-115999: Implement thread-local bytecode and enable specialization for BINARY_OP
#123926
Conversation
- Fix a few places where we were not using atomics to (de)instrument opcodes. - Fix a few places where we weren't using atomics to reset adaptive counters. - Remove some redundant non-atomic resets of adaptive counters that presumably snuck as merge artifacts of python#118064 and python#117144 landing close together.
…entation using atomics
Read the opcode atomically, the interpreter may be specializing it
@markshannon - Would you take a look at this, please? |
I'm still concerned about not counting the tlbc memory blocks in the refleaks test. Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks? |
!buildbot nogil refleak |
🤖 New build scheduled with the buildbot fleet by @mpage for commit 07f9140 🤖 The command will test the builders whose names match following regular expression: The builders matched are:
|
@markshannon - That would work, but I opted for clearing the cached TLBC for threads that aren't currently in use when we clear other internal caches. This should still catch leaks, doesn't require modifying |
# code objects is a large fraction of the total number of | ||
# references, this can cause the total number of allocated | ||
# blocks to exceed the total number of references. | ||
if not support.Py_GIL_DISABLED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we can free the unused tlbcs, can we replace this with sys._clear_internal_caches()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, no. It seems to be very sensitive to which kinds of objects are on the heap as well as the number of non reference counted allocations (blocks) per object. With the introduction of TLBC there is at least one additional block allocated per code object that is not reference counted, the _PyCodeArray, which is present even if we free the unused TLBCs. Its presence is enough to trigger the assertion.
This assertion feels pretty brittle and I'd be in favor of removing it, but that's probably worth doing in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe replace it with a more meaningful test rather than remove it. But in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches()
instead of making it conditional on not using free-threading?
@markshannon - Unfortunately that doesn't help. See my reply inline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have concerns about memory use, but we can iterate on that in subsequent PRs.
We are quickening LOAD_CONST now.
JIT failures appear on main and are unrelated to this PR |
After this PR was merged, |
Looks like the problem is only in --enable-shared builds, and it's because we're now looking up _PyInterpreterFrame too early (before the .so file is loaded). I'll have a fix in a few minutes. |
PR #126440 should fix the failure. |
This PR implements the foundational work necessary for making the specializing interpreter thread-safe in free-threaded builds and enables specialization for
BINARY_OP
as an end-to-end example. To enable future incremental work, specialization can now be toggled on a per-family basis. Subsequent PRs will enable specialization in free-threaded builds for the remaining families.Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the
co_tlbc
array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in allco_tlbc
arrays at thread creation and release the index at thread destruction. The first entry in everyco_tlbc
array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.Thread-local bytecode can be disabled at runtime by providing either
-X tlbc=0
orPYTHON_TLBC=0
. Disabling thread-local bytecode also disables specialization.Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
--disable-gil
builds #115999