
[sharktank] Export Attention IRs for LLMs #175

Merged 25 commits into main on Oct 4, 2024

Conversation

archana-ramalingam (Collaborator)
Export Attention IRs for paged causal and non-causal LLMs.

sogartar (Contributor) commented Oct 2, 2024

We can't use torch ops directly with our tensor types.
https://github.com/nod-ai/SHARK-Platform/actions/runs/11148274191/job/30984526213?pr=175#step:6:123
In this example torch.concat would need to be substituted with sharktank.ops.cat.
In some instances we may lack the equivalent op, or it may have no implementation for certain tensor types, such as ReplicatedTensor. In that case we have to add it. For some tensor types this can be difficult, but in most cases it is easy. I have been meaning to write a general implementation for the case where all tensors are replicated across devices, but have not gotten to it yet. I will add the cat overload for the replicated tensor.
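The mechanism described above — an op that dispatches to a per-tensor-type implementation, raising when no overload is registered — can be sketched roughly as follows. This is an illustrative toy, not sharktank's actual code: `ReplicatedTensor` here is a hypothetical stand-in that models replication as plain Python lists, and the registry/decorator names are invented for the example.

```python
# Toy sketch of type-dispatched ops (hypothetical; not sharktank's implementation).
# Each op keeps a registry mapping tensor type -> implementation, so calling
# cat(...) on an unsupported type fails loudly until an overload is added.

_CAT_IMPLS = {}


def register_cat(tensor_type):
    """Register a cat implementation for a given tensor type."""
    def deco(fn):
        _CAT_IMPLS[tensor_type] = fn
        return fn
    return deco


class ReplicatedTensor:
    """Hypothetical stand-in: the same data replicated once per device."""
    def __init__(self, shards):
        self.shards = list(shards)  # one identical copy per device


def cat(tensors):
    """Dispatch on the type of the first tensor, like a type-overloaded op."""
    impl = _CAT_IMPLS.get(type(tensors[0]))
    if impl is None:
        raise NotImplementedError(
            f"cat has no implementation for {type(tensors[0]).__name__}"
        )
    return impl(tensors)


@register_cat(ReplicatedTensor)
def _cat_replicated(tensors):
    # Concatenate shard-wise: after cat, every device again holds the
    # full concatenated result.
    num_shards = len(tensors[0].shards)
    return ReplicatedTensor(
        [sum((t.shards[i] for t in tensors), []) for i in range(num_shards)]
    )
```

With this pattern, substituting `torch.concat` with the dispatching `cat` is what lets replicated (and other wrapped) tensor types participate; adding support for a new type is just registering another overload.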

@archana-ramalingam archana-ramalingam merged commit 5300b05 into main Oct 4, 2024
8 checks passed
@archana-ramalingam archana-ramalingam deleted the attention_microbenchmark branch October 4, 2024 04:58