
Fix difference of LLM export for the direct vs paged cache #347

Merged
sogartar merged 3 commits into nod-ai:main from fix-cache-in-sharded-llama-export on Oct 28, 2024

Conversation

sogartar (Contributor)

Before work on unifying the cache interfaces, there are some differences between the sharded, direct, and paged caches.
The direct cache uses a list of tensors, one per transformer block, while the paged cache has one slab and the sharded paged cache expects a list of shards.
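
For context, here is a minimal sketch of the three cache-state layouts described above. The shapes, sizes, and variable names are made up for illustration; the real layouts come from sharktank's cache classes and the model config.

import torch

# Hypothetical sizes, for illustration only.
num_blocks = 26              # transformer blocks
page_count, page_size = 128, 2048
shard_count = 2

# Direct cache: a list with one tensor per transformer block.
direct_state = [torch.empty(1, 2048, 32, 100) for _ in range(num_blocks)]

# Paged cache: a single flat slab shared by all blocks and pages.
paged_state = [torch.empty(page_count, page_size)]

# Sharded paged cache: a list of slab shards, one per device.
sharded_paged_state = [
    torch.empty(page_count, page_size // shard_count) for _ in range(shard_count)
]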


renxida commented Oct 28, 2024

Thanks for doing this!

Is this ready to merge? I would love to have it in main ASAP; I'm blocked by this and currently using some hacky workarounds.

(Please make sure this works for
bs=1
bs=1,4
bs=4
)


renxida commented Oct 28, 2024

Right now when I try to run this I get:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/xidaren2/SHARK-Platform/sharktank/sharktank/examples/export_paged_llm_v1.py", line 325, in <module>
    main()
  File "/home/xidaren2/SHARK-Platform/sharktank/sharktank/examples/export_paged_llm_v1.py", line 307, in main
    generate_batch_prefill(bs)
  File "/home/xidaren2/SHARK-Platform/sharktank/sharktank/examples/export_paged_llm_v1.py", line 163, in generate_batch_prefill
    cache, cache_shard_dim, cache_dynamic_shapes, arg_affinities = setup_cache(
                                                                   ^^^^^^^^^^^^
  File "/home/xidaren2/SHARK-Platform/sharktank/sharktank/examples/export_paged_llm_v1.py", line 147, in setup_cache
    return torch.stack(cache_state), shard_dim, dynamic_shapes, arg_affinities
                                     ^^^^^^^^^
UnboundLocalError: cannot access local variable 'shard_dim' where it is not associated with a value
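
For reference, this traceback matches the usual pattern where a variable is only assigned on one branch. A hypothetical reduction of the failure mode (not the actual setup_cache code):

import torch

def setup_cache_sketch(shard_count: int):
    # Hypothetical reduction: shard_dim is only bound on the sharded branch,
    # so the direct-cache path reaches the return with the name unassigned.
    cache_state = [torch.empty(4, 8), torch.empty(4, 8)]
    if shard_count > 1:
        shard_dim = 1
    # Fix: assign shard_dim (e.g. None) on every path before returning.
    return torch.stack(cache_state), shard_dim

setup_cache_sketch(shard_count=1)  # raises UnboundLocalError on shard_dim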


renxida commented Oct 28, 2024

We might need to test this manually because this file isn't exercised by the CI.

sogartar (Contributor, Author)

@renxida thank you for catching that. No matter how small the change, I can always make a mistake. After the fix I also tested the direct cache path.

Before work on unifying the cache interfaces, there are some differences
between the sharded, direct, and paged caches.
The direct cache uses a list of tensors, one per transformer block,
while the paged cache has one slab and the sharded paged cache expects a list of shards.
sogartar force-pushed the fix-cache-in-sharded-llama-export branch from e982607 to 62a037d on October 28, 2024 at 18:07
sogartar merged commit 98392d0 into nod-ai:main on Oct 28, 2024 (3 checks passed)

renxida commented Oct 28, 2024

Ack, export works but now compile doesn't:

Saving to '/home/xidaren2/xshortfin/goldens/exported_llama_model/model.mlir'

  • iree-compile /home/xidaren2/xshortfin/goldens/exported_llama_model/model.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx1100 -o /home/xidaren2/xshortfin/goldens/exported_llama_model/model.vmfb
    /home/xidaren2/xshortfin/goldens/exported_llama_model/model.mlir:12158:12: error: 'tm_tensor.scatter' op mismatch in shape of indices and update value at dim#0
    %357 = torch.aten.index_put %348, %356, %355, %false_114 : !torch.vtensor<[1,2048,32,100],f16>, !torch.list<optional>, !torch.vtensor<[32,100],f16>, !torch.bool -> !torch.vtensor<[1,2048,32,100],f16>
    ^
    /home/xidaren2/xshortfin/goldens/exported_llama_model/model.mlir:12158:12: note: see current operation:
    %677 = "tm_tensor.scatter"(%675, %676, %674) <{dimension_map = array<i64: 0, 1>, operandSegmentSizes = array<i32: 2, 1>, unique_indices = false}> ({
    ^bb0(%arg107: f16, %arg108: f16):
    "tm_tensor.yield"(%arg107) : (f16) -> ()
    }) : (tensor<32x1x1x32x100xf16>, tensor<1x2xi32>, tensor<1x2048x32x100xf16>) -> tensor<1x2048x32x100xf16>
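
For reference, a minimal eager-PyTorch sketch of the shapes in the failing op: the [1,2048,32,100] cache and the [32,100] update come from the error message, while the index values are made up. Eager mode accepts this write, so presumably the mismatch only shows up in the tm_tensor.scatter lowering.

import torch

# Shapes lifted from the error message; the index values are hypothetical.
cache = torch.zeros(1, 2048, 32, 100, dtype=torch.float16)
update = torch.ones(32, 100, dtype=torch.float16)

batch_idx = torch.tensor([0])
seq_idx = torch.tensor([5])

# Eager equivalent of the torch.aten.index_put above: write a [32, 100]
# update into one (batch, position) slot of the [1, 2048, 32, 100] cache.
cache.index_put_((batch_idx, seq_idx), update)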

dan-garvey (Member)

@renxida this is the issue I'm working on
