Improve generated code for some cases #135
Replies: 12 comments
-
Hi, Here's a patch that does that: heeres/cpython@458a9a3 I haven't benchmarked it, but I'm pretty sure it should indeed be faster. Now:
Results in:
|
Beta Was this translation helpful? Give feedback.
-
Yup, would need to use a proper algorithm for sequentializing parallel copies, e.g. https://github.com/pfalcon/parcopy , to handle cases like:
|
Beta Was this translation helpful? Give feedback.
-
I think Paul pointed out a bug in your implementation. |
Beta Was this translation helpful? Give feedback.
-
What you're looking for in the second case is a superoptimizer, which is just a fancy name for a table-driven peephole optimizer where the table is produced from an exhaustive search. Because side-effects cannot be re-ordered, there is little point in handling sequences of more than 5 or 6 instructions. Such a table would include the mapping |
Beta Was this translation helpful? Give feedback.
-
As for the first case, I think the AST optimizer should skip tuples that are on the RHS of multiple assignments and leave them for the CFG optimizer. |
Beta Was this translation helpful? Give feedback.
-
Right, I seem to have ignored the comment about aliasing. Here's a slightly different version that circumvents the issue: heeres/cpython@49d7013 Now the elements are pushed on the stack in the right order first, and popped afterwards. This still takes out the UNPACK_SEQUENCE instruction, but doesn't have any stack advantage. However, for now it seems like an easier solution than performing a check whether reordering etc is necessary. Note that doing this at compile time also makes it trivial to check the sequence lengths so that a compile time error can be generated; which I think is a good feature: heeres/cpython@e171f23 |
Beta Was this translation helpful? Give feedback.
-
Another thing that could be improved is the heuristic to determine when to use
dis.dis(func):
It is dubious converting |
Beta Was this translation helpful? Give feedback.
-
I gave this a naive shot in isidentical/cpython@f8f8fce though it showed zero changes (
|
Beta Was this translation helpful? Give feedback.
-
I'm guessing none of these will move the needle on realistic benchmarks. But maybe occasionally one of these observations will allow us to remove some code. (E.g. the code in the AST that turns tuples into constants?) And perhaps the observations will help us when designing specialized instructions. (Could LOAD_METHOD/CALL_METHOD be done at that level instead of in the compiler? The number of opcodes is the same, the difference is that two opcodes that may be arbitrarily far apart need to be specialized together.) |
Beta Was this translation helpful? Give feedback.
-
@isidentical Could you turn your experiment into a PR? The reason it would be valuable, despite showing no immediate improvement is that it allows us to specialize |
Beta Was this translation helpful? Give feedback.
-
How would it suit better? As is or if it were to be applied to all globals (instead of just |
Beta Was this translation helpful? Give feedback.
-
Just those that are imported, at least for now. |
Beta Was this translation helpful? Give feedback.
-
I just noticed that the generated code for this fragment could be better.
The disassembly is:
The
UNPACK_SEQUENCE
opcode is pretty complex, the code would probably run faster if we generated it like this:Another example:
Disassembled:
Why not
?
That saves an opcode and a stack level. You'd have to reason about aliases, but for fast locals that's pretty simple, and threads can't interfere.
Beta Was this translation helpful? Give feedback.
All reactions