opt_einsum thinks the largest intermediate will be small, but torch.einsum allocates 156 GiB #133
Comments
This is a bit of an edge case. What this is really telling you is that there are no contraction paths available that satisfy the memory footprint constraint that you have provided. Playing with it, you need to provide scratch space on the order of 1e7 for this to work:

```python
oe.contract_path(equation, *batches_of_small_cores, big_core, optimize="greedy", memory_limit=1e7)[1]
```
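A self-contained version of that suggestion might look as follows. This is only a sketch: the shapes and equation are copied from the snippet quoted later in the thread, and `info.largest_intermediate` is used to read off the predicted peak size (in elements) of the resulting path.

```python
import torch
import opt_einsum as oe

# Shapes and equation copied from the snippet later in this thread.
small_inds = "ijklmnopqrstuvwx"
equation = ",".join(f"αβγ{c}" for c in small_inds) + f",{small_inds}ω->αβγω"

big_core = torch.randn(*([2] * 17), dtype=torch.float64)
batches_of_small_cores = [torch.randn(512, 25, 25, 2, dtype=torch.float64) for _ in range(16)]

# With roughly 1e7 elements of scratch allowed, the greedy optimizer can find a pairwise path.
path, info = oe.contract_path(
    equation, *batches_of_small_cores, big_core,
    optimize="greedy", memory_limit=1e7,
)
print(info.largest_intermediate)  # predicted size (in elements) of the largest intermediate
```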
I am unsure if we will be in the business of supporting specific backends at this level. It's something that we could potentially do, but it would require us keeping a much closer eye on what others are doing in their `einsum` implementations.
Generally speaking, the task of finding a contraction path is to break the contraction into pairwise contractions, exponentially reducing the time complexity at some space cost - usually a not insignificant increase in intermediate memory. As you have found, introducing the `memory_limit` constraint can rule out every pairwise path. In general I'd say, due to the complexities of path finding, that `memory_limit` is best treated as a rough guide rather than a guarantee.

Generally the best way to reduce memory for a contraction is 'slicing' (#95, #125) certain indices once a good path has been found, which I've just checked works well for this case:

```python
import torch
import opt_einsum as oe

device = "cuda"
big_core = torch.randn(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, dtype=torch.float64, device=device)
batches_of_small_cores = [torch.randn(512, 25, 25, 2, dtype=torch.float64, device=device) for _ in range(16)]
equation = "αβγi,αβγj,αβγk,αβγl,αβγm,αβγn,αβγo,αβγp,αβγq,αβγr,αβγs,αβγt,αβγu,αβγv,αβγw,αβγx,ijklmnopqrstuvwxω->αβγω"

path, info = oe.contract_path(equation, *batches_of_small_cores, big_core, optimize="auto-hq")
```

Now we find the best indices to explicitly sum over:

```python
import cotengra as ctg

sf = ctg.SliceFinder(info, target_size=2**26)
inds_to_slice, cost_of_slicing = sf.search()

cost_of_slicing.size      # the new largest intermediate
# 40960000.0
cost_of_slicing.overhead  # theoretical 'slowdown'
# 1.0273594262607766
```

Finally, actually perform the contraction:

```python
import tqdm

sc = sf.SlicedContractor([*batches_of_small_cores, big_core])
result = sum(sc.contract_slice(i) for i in tqdm.trange(sc.nslices))
# 100%|██████████| 512/512 [00:55<00:00,  9.30it/s]
```

Maybe at some point the slicing functionality could be incorporated into opt_einsum itself.
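For intuition about what the slicing above is doing, here is a minimal hand-rolled sketch (reusing the names from the snippet above, untested): it fixes the single contracted index `i` (extent 2), contracts the correspondingly sliced operands, and sums the two partial results. `SliceFinder` simply automates the choice of which indices to treat this way and how many of them.

```python
# The original equation with the index 'i' removed from the two operands that carry it.
sliced_equation = (
    "αβγ,αβγj,αβγk,αβγl,αβγm,αβγn,αβγo,αβγp,αβγq,αβγr,"
    "αβγs,αβγt,αβγu,αβγv,αβγw,αβγx,jklmnopqrstuvwxω->αβγω"
)

result = 0
for i in range(2):  # 'i' has extent 2
    first = batches_of_small_cores[0][..., i]  # slice 'i' out of the first small core
    big = big_core[i]                          # slice 'i' out of the big core
    # Contract the sliced problem and accumulate; since 'i' is a summed index,
    # the partial results are simply added together.
    result = result + oe.contract(
        sliced_equation, first, *batches_of_small_cores[1:], big, optimize="auto-hq"
    )
```

Slicing `i` alone at most halves the largest intermediate; slicing several indices at once, as `SliceFinder` does above, compounds the reduction at a small cost in repeated work.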
@dgasmith Actually, with the path you found the problem appears as well. To be clear, I know how to calculate this particular contraction somewhat efficiently; I guess I'll have to construct the path on my own. For the sake of giving you more knowledge about how `torch.einsum` behaves here: the same over-allocation happens on both CPU and CUDA, with no gradient tracking involved.
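If it helps with constructing a path by hand: opt_einsum accepts an explicit list of pairwise steps through `optimize=`, and `contract_path` reports the predicted intermediate sizes for that path before anything is executed. Below is a sketch reusing the names above; the particular path shown is only a mechanical left-to-right pairing to illustrate the format, not a recommendation.

```python
import opt_einsum as oe

# An explicit path is a list of pairwise steps.  At each step the listed
# operands are removed from the operand list and the new intermediate is
# appended at the end.  With 17 operands there are 16 steps.
manual_path = [(0, 1)] * 16

path, info = oe.contract_path(
    equation, *batches_of_small_cores, big_core, optimize=manual_path
)
print(info)  # check the predicted largest intermediate before running the contraction
```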
Is there documentation on how PyTorch calculates its intermediates and forms the computation? To echo @jcmgray, you can slice along your large 512 index to obtain a linear decrease in memory footprint; we still will not be able to tell you exactly how much memory it will take, however.
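A sketch of slicing along the 512 index in practice (reusing the names from the snippet above; the chunk size of 32 is an arbitrary assumption): because α appears in the output, each chunk is an independent contraction and the results are concatenated rather than summed.

```python
import torch
import opt_einsum as oe

chunk = 32  # hypothetical chunk size; smaller chunks -> smaller intermediates, more calls

pieces = []
for start in range(0, 512, chunk):
    # Slice the α dimension of every small core; big_core carries no α index.
    small_chunks = [t[start:start + chunk] for t in batches_of_small_cores]
    pieces.append(oe.contract(equation, *small_chunks, big_core, optimize="auto"))

result = torch.cat(pieces, dim=0)  # stitch the output slices back together along α
```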
@philip-bl If you find a good contraction path that can contract this better than e.g. the sliced approach above, it would be interesting to see.
I want to perform a contraction described by the following code:
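A sketch of the setup, reconstructed from the definitions quoted in the comments above; the exact optimizer argument used is an assumption.

```python
import torch
import opt_einsum as oe

device = "cuda"
big_core = torch.randn(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                       dtype=torch.float64, device=device)
batches_of_small_cores = [
    torch.randn(512, 25, 25, 2, dtype=torch.float64, device=device) for _ in range(16)
]
equation = ("αβγi,αβγj,αβγk,αβγl,αβγm,αβγn,αβγo,αβγp,αβγq,αβγr,αβγs,αβγt,"
            "αβγu,αβγv,αβγw,αβγx,ijklmnopqrstuvwxω->αβγω")

# Ask for a path whose intermediates are no larger than the largest input.
path, info = oe.contract_path(
    equation, *batches_of_small_cores, big_core,
    optimize="auto", memory_limit="max_input",  # "auto" is an assumption
)
print(info)  # per the report below: a single-einsum path with a tiny predicted footprint

# Executing that path hands all 17 operands to one torch.einsum call.
result = oe.contract(equation, *batches_of_small_cores, big_core, optimize=path)
```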
I am predicting memory problems, so I ask opt_einsum to use `memory_limit="max_input"`. Well, it turns out that it doesn't work. opt_einsum reports that this contraction barely allocates anything, but opt_einsum is wrong. The path it returns contains one operation - a single `torch.einsum` - and that `torch.einsum` tries to allocate 156.25 GiB of memory.

To be clear, if I change `device = "cuda"` to `device = "cpu"`, approximately the same amount of memory is allocated. Also, no backpropagation and no gradient tracking is happening in this code snippet.

My guess of what is happening: opt_einsum assumes that `torch.einsum` is a dumb function which doesn't allocate any memory other than the memory for the output. In reality, `torch.einsum` allocates intermediate tensors, and I don't understand what logic it uses to choose how to allocate them.

If this is not fixable, I suggest updating the documentation to say that, with PyTorch, `memory_limit` is not reliable at all.
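One way to compare the actual allocation against the prediction (a sketch reusing the names from the reconstruction above; the CUDA memory counters are only meaningful for `device = "cuda"`):

```python
import torch
import opt_einsum as oe

torch.cuda.reset_peak_memory_stats()
try:
    result = oe.contract(equation, *batches_of_small_cores, big_core, optimize=path)
except RuntimeError as err:  # a CUDA out-of-memory error surfaces as a RuntimeError
    print(err)

peak_gib = torch.cuda.max_memory_allocated() / 2**30
predicted_gib = info.largest_intermediate * 8 / 2**30  # float64 -> 8 bytes per element
print(f"peak allocated: {peak_gib:.2f} GiB, predicted largest intermediate: {predicted_gib:.2f} GiB")
```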