Efficient storage of code object debug information #324
Replies: 8 comments 4 replies
-
What would this be used for? Can it not be derived from the offset and the next instruction's offset? |
Beta Was this translation helpful? Give feedback.
-
It could be derived, but it makes code using the table simpler if each entry is self contained. |
Beta Was this translation helpful? Give feedback.
-
Let's do this in two steps:
|
Beta Was this translation helpful? Give feedback.
-
Here's a possible scheme for the compact format, taking advantage of patterns in most Python code. The data is organized into 4bit nybbles. The first nybble describes the format. The following zero or more nybbles contain necessary data.
Variable length unsigned ints are encoded by splitting into 3 bit chunks and adding 8 to all but the last. |
Beta Was this translation helpful? Give feedback.
-
We need to be careful about the unpacking part because calling PyFrame_GetLineNumber is already one of the most heavy parts of profilers, especially since 3.10. The format looks convoluted but sensible. Further than that, we should get some numbers on how much this format will compress the existing format and the difference in decoding speed. I guess that since most cases will be in code 0-7 and those require little processing it may be fast. On the other hand, one thing that will happen with this format is that to get the current line number you need to parse all other debugging information (the column numbers), which may slow down the unpacking compared to only have to deal with separate structs for line numbers and column numbers. |
Beta Was this translation helpful? Give feedback.
-
@pablogsal The idea is to expand the table only once, when first needed. Scanning it repeated would be rather slow, as you point out. I think the following are true:
Therefore, it makes sense (to me) to have compact debug info for all code objects, which when first used is expanded into a form suitable for random access. |
Beta Was this translation helpful? Give feedback.
-
Although the above implementation is very compact, it is difficult to scan, and needs to be expanded, which needs extra code. I think we would be better doing something like the current line number table, which can be scanned backwards as well as forwards. To be able to scan backwards we need an additional bit per entry as a marker. Using 1 bit out of 4 is excessive, so we are back to using bytes rather than nybbles. Given the smallest viable entry (if location info is present) is two bytes, then we have a few bits spare to add the instruction delta, making the table self contained.
The variable sized int encoding requires all bytes start with 0, and one bit is needed to signal extension, so are composed of 6 bit chunks. |
Beta Was this translation helpful? Give feedback.
-
Assuming 2/3rds of instructions are handled by either the "short" or "indent" codes, and the remainder by the "long" form, |
Beta Was this translation helpful? Give feedback.
-
Code objects include a lot of information, much of it only used for tracing and debugging of various sorts.
It would be good if that information used less space when we aren't using it.
Classifying the fields of a code object we have (ignoring computed attributes and int fields):
Needed for normal execution:
Debug info, all bytes objects:
locals()
Currently this information uses about 8 bytes per instruction, in large part due to having column table entries for each cache entry.
A single table
I would like to replace the four debug objects with a single bytes object with the following format:
[0] 0 if compact, 1 if expanded.
[1 - 3] len(co_localspluskinds) (16 million max local variables seems sufficient)
[4 - 4+len] co_localspluskinds
[...] Location info in compact or expanded form.
The expanded form should be an easily searchable table; one fixed-width entry per instruction.
EXTENDED_ARG
s)EXTENDED_ARG
s prefix and inline cache)RESUME
). 1 bitI think we can impose some reasonable limits on the size of the fields, even in the expanded form.
E.g. We know the instruction length cannot exceed 10 or so, so we can fit it into 4 bits. It might make sense, then, to limit the start offset to 28 bits and fit both into 32 bits.
Then we can fit the whole entry into 96 bits:
This should save memory, as most tables would remain in their compact form, and it saves 3 pointers and 3 bytes object headers per code object.
It should also speed up tracing and debugging, as once expanded, the table is quickly searchable.
Beta Was this translation helpful? Give feedback.
All reactions