-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Customized cost functions for faster contractions #171
Comments
Oh the other useful thing to note is that just taking the 'WRITE' cost - i.e. the sum of the sizes of all the intermediates - is the most relevant quantity for computing the space complexity of back propagating through a contraction (since roughly speaking all intermediates must be kept to compute gradients), which is often the more practical limit than actual time cost. |
Mostly for fun, here is the actual effect of changing [Cost and size of contractions are node and edge width+color respectively.] On the left is a |
This will be great to get in and seems to be an evolution of the different greedy approaches. We seem to have many choices that we wish to parametrize over time and give them simple names- an alternative approach could be leaning into the object-based count object such as: cost_fn = CostFunction(alpha=64, ...)
cost = cost_fn(indices, removed, ...) |
These are now available with #181. |
As part of some work on contracting some very large tensor networks we found that a good heuristic cost to optimize for is not just 'FLOPs' but also 'WRITE' - i.e. the size of the intermediate output tensors regardless of the cost to product it.
For example, an outer product with shapes like
(d, 1), (1, d)->(d, d)
has the same WRITE cost as matrix multiplication with shapes like(d, d), (d, d) -> (d, d)
despite having a FLOPs costd
times lower.If you take a cost as
FLOPS + alpha * WRITE
and optimize for this you can get some significant and fairly generic gains across hardware in terms of actual efficiency and contraction time. Relevant probably for e.g. #103.Here (you can ignore what the contraction is and the 'index sorting' bit), we have FLOPS and WRITE on top and actual contraction time on bottom on a single cpu thread:
You can see there's about a 5x speedup.
Here's the same for a larger contraction on an A100 (big fast gpu) with a speedup of 8x:
In both cases choosing
alpha=64
is very decent despite probably having different ratios of cpu cycles to memory speed etc etc. Generally there seems to be a window where neither the FLOPS or WRITE cost is compromised too much. EDIT: I should emphasize that the above contractions tend to produce very low arithmetic intensity contractions, so the speedups will not always be this large.I've had this in my own branch of
opt_einsum
for a while (just for thedp
optimizer, but which one can use to take any large contraction path/tree and iteratively 'reconfigure' subtrees of), but it would be nice to have it in the mainopt_einsum
and it also might be useful to others.Currently the interface is like:
@dgasmith, thoughts on including?
The text was updated successfully, but these errors were encountered: