
[draft] use flash_attention from cuda-free #12

Draft

wants to merge 1 commit into main
Conversation

@sijiac (Contributor) commented Sep 24, 2024

By switching to the attention implementation from the cuda-free repo, the Triton attention path now works correctly in the kernels repo.

The attention kernel currently in the kernels repo is missing support for the decoding case, where the length of Q and the length of K differ within the same batch.

python3 -m main llama_chat_completion --profile=False --benchmark=False --ckpt_dir="/home/sijiac/models/Meta-Llama-3-8B-Instruct/" --tokenizer_path="/home/sijiac/models/Meta-Llama-3-8B-Instruct/tokenizer.model" --use_triton=True
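To illustrate the decode case above, here is a minimal sketch using PyTorch's `scaled_dot_product_attention` as a stand-in reference (not the Triton kernel from this PR): during decoding a single new query token attends over the full KV cache, so the Q length and K length differ. The shapes and tensor names are illustrative, and it assumes a CUDA device is available.

```python
import torch
import torch.nn.functional as F

batch, heads, head_dim = 2, 8, 128
kv_len = 1024   # tokens already in the KV cache
q_len = 1       # one new token per decoding step

q = torch.randn(batch, heads, q_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, heads, kv_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, heads, kv_len, head_dim, device="cuda", dtype=torch.float16)

# q_len != kv_len here; a kernel that assumes equal Q/K lengths cannot serve
# this path, which is why the attention from the cuda-free repo is swapped in.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 1, 128])
```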

@adamomainz (Collaborator)

Can you run with benchmarking turned on and see the difference? I'd be curious to see the attention-specific latency here :) In that case you don't need to specify use_triton, since it will run both cases.
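For the attention-specific latency, a rough sketch of one way to measure it is below, using `triton.testing.do_bench` (which reports time in milliseconds). `triton_attention` is a hypothetical placeholder for the kernel this PR switches to; PyTorch SDPA serves as the baseline. Shapes match the decode example above and assume a CUDA device.

```python
import torch
import torch.nn.functional as F
import triton.testing

q = torch.randn(2, 8, 1, 128, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.float16)

# Baseline: PyTorch scaled_dot_product_attention
ms_eager = triton.testing.do_bench(lambda: F.scaled_dot_product_attention(q, k, v))
print(f"eager/SDPA: {ms_eager:.3f} ms")

# Hypothetical Triton kernel under test (placeholder name):
# ms_triton = triton.testing.do_bench(lambda: triton_attention(q, k, v))
```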
