Improve generated code for some cases #135

gvanrossum · 2021-05-22T21:35:12Z

gvanrossum
May 22, 2021
Maintainer

I just noticed that the generated code for this fragment could be better.

def f():
  x, y = 101, 102

The disassembly is:

  2           0 LOAD_CONST               1 ((101, 102))
              2 UNPACK_SEQUENCE          2
              4 STORE_FAST               0 (x)
              6 STORE_FAST               1 (y)

The UNPACK_SEQUENCE opcode is pretty complex; the code would probably run faster if we generated it like this:

LOAD_CONST 1 (101)
STORE_FAST 0 (x)
LOAD_CONST 2 (102)
STORE_FAST 1 (y)

Another example:

def f(a, b):
  x, y = a, b

Disassembled:

  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 ROT_TWO
              6 STORE_FAST               2 (x)
              8 STORE_FAST               3 (y)

Why not

LOAD_FAST 0 (a)
STORE_FAST 2 (x)
LOAD_FAST 1 (b)
STORE_FAST 3 (y)

?

That saves an opcode and a stack level. You'd have to reason about aliases, but for fast locals that's pretty simple, and threads can't interfere.
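
To reproduce the disassemblies above on a given interpreter, a minimal sketch along these lines may help. The helper name `opnames` is illustrative and not from the thread, and the opcodes actually emitted depend on the CPython version, so the result may differ from the listings shown here.

```python
import dis


def f():
    x, y = 101, 102


def g(a, b):
    x, y = a, b


def opnames(func):
    """Return the opcode names the running interpreter compiled for func."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # Print one line per function; the exact opcodes vary between versions.
    for func in (f, g):
        print(func.__name__, opnames(func))
```

Calling `dis.dis(f)` instead prints the full listing with offsets and operands, in the same layout as the dumps quoted above.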

heeres · 2021-05-28T12:49:41Z

heeres
May 28, 2021

Hi,

Here's a patch that does that: heeres/cpython@458a9a3

I haven't benchmarked it, but I'm pretty sure it should indeed be faster.

Now:

def f(x,y,z):
    a,b,c = x,y,z
    a,b = 1,2

Results in:

  4           0 LOAD_FAST                0 (x)
              2 STORE_FAST               3 (a)
              4 LOAD_FAST                1 (y)
              6 STORE_FAST               4 (b)
              8 LOAD_FAST                2 (z)
             10 STORE_FAST               5 (c)
  5          22 LOAD_CONST               2 (1)
             24 STORE_FAST               3 (a)
             26 LOAD_CONST               3 (2)
             28 STORE_FAST               4 (b)

pfalcon · 2021-05-28T19:58:18Z

pfalcon
May 28, 2021

> You'd have to reason about aliases

Yup, would need to use a proper algorithm for sequentializing parallel copies, e.g. https://github.com/pfalcon/parcopy , to handle cases like:

c, a, b = a, b, c
a, a, b = b, c, a
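
As a rough illustration of what sequentializing a parallel assignment between plain local names involves, here is a minimal sketch. It is not the algorithm from the parcopy repository or from any CPython patch. The names `sequentialize`, `run` and the scratch name `_tmp` are assumptions made for this example; in bytecode the scratch slot would more naturally be the value stack.

```python
def sequentialize(targets, sources, tmp="_tmp"):
    """Turn `targets = sources` (plain names only) into ordered (dst, src) moves."""
    # Targets are assigned left to right, so for a repeated target the
    # last source wins and earlier assignments to it can be dropped.
    pending = {}
    for dst, src in zip(targets, sources):
        pending[dst] = src
    # Self-assignments need no move at all.
    pending = {dst: src for dst, src in pending.items() if dst != src}

    moves = []
    while pending:
        still_read = set(pending.values())
        # A destination is safe to overwrite once no pending move reads it.
        ready = [dst for dst in pending if dst not in still_read]
        if ready:
            for dst in ready:
                moves.append((dst, pending.pop(dst)))
        else:
            # Only cycles remain: save one destination's old value in the
            # scratch name and redirect its readers there.
            victim = next(iter(pending))
            moves.append((tmp, victim))
            for dst in pending:
                if pending[dst] == victim:
                    pending[dst] = tmp
    return moves


def run(moves, env):
    """Apply the moves one at a time to a copy of env and return it."""
    env = dict(env)
    for dst, src in moves:
        env[dst] = env[src]
    return env


if __name__ == "__main__":
    print(sequentialize(["c", "a", "b"], ["a", "b", "c"]))
    print(sequentialize(["a", "a", "b"], ["b", "c", "a"]))
```

The first case is a pure cycle and needs the scratch slot; the second has a repeated target and can be done with two ordinary moves.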

gvanrossum · 2021-05-28T21:46:20Z

gvanrossum
May 28, 2021
Maintainer Author

> Here's a patch that does that: heeres/cpython@458a9a3

I think Paul pointed out a bug in your implementation.

markshannon · 2021-05-29T09:25:55Z

markshannon
May 29, 2021
Collaborator

What you're looking for in the second case is a superoptimizer, which is just a fancy name for a table-driven peephole optimizer where the table is produced from an exhaustive search.

Because side-effects cannot be re-ordered, there is little point in handling sequences of more than 5 or 6 instructions.
This makes for a small table that can be generated in a few seconds or less, and only needs a few kilobytes.

Such a table would include the mapping LOAD_FAST [1]; ROT_TWO; STORE_FAST [2] => STORE_FAST [2]; LOAD_FAST [1]
which in your example would change LOAD_FAST a; ROT_TWO; STORE_FAST b to STORE_FAST b; LOAD_FAST a.
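
The table-applying half of such an optimizer could look roughly like the sketch below. It works on symbolic `(opname, operand)` tuples rather than real bytecode, the table holds only the one mapping quoted above, and the exhaustive search that would generate the table is not shown. The names `RULES`, `match` and `peephole` are made up for this example. The requirement that distinct placeholders bind distinct operands is an assumption added here, since the rewrite would not be equivalent if both instructions named the same local.

```python
# Each rule maps a pattern to a replacement. Integer operands are placeholders
# (like [1] and [2] in the mapping above); None means "no operand".
RULES = [
    (
        [("LOAD_FAST", 1), ("ROT_TWO", None), ("STORE_FAST", 2)],
        [("STORE_FAST", 2), ("LOAD_FAST", 1)],
    ),
]


def match(pattern, window):
    """Return {placeholder: operand} if window fits pattern, else None."""
    if len(window) != len(pattern):
        return None
    binding = {}
    for (p_op, p_arg), (op, arg) in zip(pattern, window):
        if p_op != op:
            return None
        if p_arg is not None and binding.setdefault(p_arg, arg) != arg:
            return None
    # Assumption: different placeholders must name different locals.
    if len(set(binding.values())) != len(binding):
        return None
    return binding


def peephole(instrs):
    """Make one left-to-right pass, rewriting every window a rule matches."""
    out, i = [], 0
    while i < len(instrs):
        for pattern, replacement in RULES:
            binding = match(pattern, instrs[i:i + len(pattern)])
            if binding is not None:
                out.extend((op, binding.get(arg)) for op, arg in replacement)
                i += len(pattern)
                break
        else:
            # No rule matched at this position; copy the instruction through.
            out.append(instrs[i])
            i += 1
    return out


if __name__ == "__main__":
    code = [
        ("LOAD_FAST", "a"),
        ("LOAD_FAST", "b"),
        ("ROT_TWO", None),
        ("STORE_FAST", "x"),
        ("STORE_FAST", "y"),
    ]
    for instr in peephole(code):
        print(instr)
```

Applied to the symbolic form of `x, y = a, b`, the single rule is enough to produce the load/store, load/store sequence proposed in the opening post.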

markshannon · 2021-05-29T09:31:13Z

markshannon
May 29, 2021
Collaborator

As for the first case, I think the AST optimizer should skip tuples that are on the RHS of multiple assignments and leave them for the CFG optimizer.

heeres · 2021-05-29T09:51:00Z

heeres
May 29, 2021

Right, I seem to have ignored the comment about aliasing. Here's a slightly different version that circumvents the issue: heeres/cpython@49d7013

Now the elements are first pushed on the stack in the right order and then popped. This still removes the UNPACK_SEQUENCE instruction, but has no stack-depth advantage. For now, though, it seems an easier solution than checking whether any reordering is necessary.

Note that doing this at compile time also makes it trivial to check the sequence lengths, so that a compile-time error can be generated, which I think is a good feature: heeres/cpython@e171f23
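
The length check described above can be approximated at source level with the `ast` module. The sketch below is only an illustration of the idea, not the code in the linked commit; the function name `find_length_mismatches` is hypothetical, and starred targets or values are skipped because their lengths are not fixed.

```python
import ast


def find_length_mismatches(source):
    """Return (lineno, n_targets, n_values) for tuple assignments that cannot unpack."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        # Only look at assignments whose right-hand side is a literal tuple.
        if not isinstance(node, ast.Assign) or not isinstance(node.value, ast.Tuple):
            continue
        if any(isinstance(elt, ast.Starred) for elt in node.value.elts):
            continue
        n_values = len(node.value.elts)
        for target in node.targets:
            if not isinstance(target, (ast.Tuple, ast.List)):
                continue
            if any(isinstance(elt, ast.Starred) for elt in target.elts):
                continue
            if len(target.elts) != n_values:
                problems.append((node.lineno, len(target.elts), n_values))
    return problems


if __name__ == "__main__":
    print(find_length_mismatches("x, y = 1, 2, 3\na, b = 1, 2\n"))
```

Without such a check the first assignment still compiles and only fails with a `ValueError` when it is executed.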

markshannon · 2021-05-29T13:03:32Z

markshannon
May 29, 2021
Collaborator

Another thing that could be improved is the heuristic to determine when to use LOAD_METHOD

import foo
def func():
    return foo.bar()

dis.dis(func):

  2           0 LOAD_GLOBAL              0 (foo)
              2 LOAD_METHOD              1 (bar)
              4 CALL_METHOD              0
              6 RETURN_VALUE

Converting LOAD_ATTR bar; CALL_FUNCTION into LOAD_METHOD bar; CALL_METHOD is dubious whenever foo is a global.
It is almost always bad when import foo appears in the global scope.
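
To make the condition concrete, the sketch below finds calls of the form `name.attr(...)` where `name` is bound by a module-level `import` statement. It works on the Python AST purely for illustration and is not how the compiler makes this decision; the function name `module_attr_calls` is hypothetical, and the sketch ignores the possibility that a function rebinds the name locally.

```python
import ast


def module_attr_calls(source):
    """List 'name.attr' for every call through a name imported at module level."""
    tree = ast.parse(source)
    imported = set()
    for node in tree.body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                imported.add(alias.asname or alias.name.split(".")[0])

    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id in imported
        ):
            hits.append(f"{node.func.value.id}.{node.func.attr}")
    return hits


if __name__ == "__main__":
    example = "import foo\ndef func():\n    return foo.bar()\n"
    print(module_attr_calls(example))
```

Since `ast.parse` never executes the source, the module `foo` in the example does not need to exist.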

isidentical · 2021-05-29T14:10:45Z

isidentical
May 29, 2021

> Another thing that could be improved is the heuristic to determine when to use LOAD_METHOD

I gave this a naive shot in isidentical/cpython@f8f8fce, though it showed zero change (pyperf timeit "import foo; foo.bar()", with bar() being a function that returns None). I only did it for imported modules; the implementation could be simplified if it were applied to all globals.

  3           0 LOAD_GLOBAL              0 (foo)
              2 LOAD_ATTR                1 (bar)
              4 CALL_FUNCTION            0
              6 RETURN_VALUE

gvanrossum · 2021-05-29T21:00:27Z

gvanrossum
May 29, 2021
Maintainer Author

I'm guessing none of these will move the needle on realistic benchmarks. But maybe occasionally one of these observations will allow us to remove some code. (E.g. the code in the AST that turns tuples into constants?) And perhaps the observations will help us when designing specialized instructions. (Could LOAD_METHOD/CALL_METHOD be done at that level instead of in the compiler? The number of opcodes is the same, the difference is that two opcodes that may be arbitrarily far apart need to be specialized together.)

markshannon · 2021-06-02T14:58:44Z

markshannon
Jun 2, 2021
Collaborator

@isidentical Could you turn your experiment into a PR?

The reason it would be valuable, despite showing no immediate improvement, is that it allows us to specialize LOAD_ATTR for modules without needing to specialize LOAD_METHOD for modules as well.

isidentical · 2021-06-02T15:05:08Z

isidentical
Jun 2, 2021

> @isidentical Could you turn your experiment into a PR?

Which would suit better: as is, or applied to all globals (instead of just import statements)? I assume the latter, but want to confirm.

markshannon · 2021-06-02T15:20:00Z

markshannon
Jun 2, 2021
Collaborator

Just those that are imported, at least for now.
