This issue lists all feature requests and improvements slated for the Nov 2024 Tkw release.
Flash Attention performance is the highest priority
FP8 Functionality & FP16 Performance Improvement
Adjust k-width to maximize reads from shared memory and align layouts between the two matmuls
Scheduling
Packed Shuffles
Implement FP8 Attention Kernel
Scaling of Q has to happen after Q @ K
A linear offset has to be added (linear offset = 1.0 / max representable number in the FP8 format)
Causal mask (addition of a triangular matrix of 0s and -infinity); see the FP8 reference sketch after this list
Dynamic dimensions for sequence length
Paged Attention using vector.gather (see the gather sketch after this list)
Extend Attention (split-k vs warp reduction)
Prefill Attention
Decode Attention (M = 1, with dynamic dimensions)
Update Paper
Unaligned shapes for GEMMs
Debugger support (add breakpoints and inspect stack on GPU)
Profiling support
Ensure that mappings modify the index sequence
IGEMM Performance Results
GEMM Non-temporal loads
GEMM + SiLU fusion kernel (see the epilogue sketch after this list)
MoE Kernel
Buffer loads to load K directly to shared memory
Buffer loads for masking
Understand scheduling + multi-buffering in Tensile to be able to implement it in Wave
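A minimal NumPy reference for the FP8 attention numerics above (a sketch, not the Wave kernel). It assumes e4m3 with max representable 448.0, and it reads the linear-offset note as adding 1.0 / 448.0 to the softmax weights before the FP8 down-cast so small weights don't flush to zero; the exact placement in the real kernel may differ.

```python
import numpy as np

FP8_MAX = 448.0                # max representable in float8_e4m3fn (assumption)
LINEAR_OFFSET = 1.0 / FP8_MAX  # "1.0 / max representable number in fp format"

def fp8_attention_reference(q, k, v, sm_scale):
    # Q @ K^T first, scale after: the operands go through the matmul
    # unscaled and the softmax scale is applied to the fp32 accumulator.
    s = (q @ k.T) * sm_scale

    # Causal mask: add a triangular matrix of 0s (on/below the diagonal)
    # and -inf (above it).
    m, n = s.shape
    s = s + np.triu(np.full((m, n), -np.inf), 1)

    # Softmax in fp32; the linear offset keeps tiny weights representable
    # after the FP8 down-cast (the down-cast itself is elided here).
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p = p / p.sum(axis=-1, keepdims=True)
    p = p + LINEAR_OFFSET

    return p @ v
```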
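For the paged-attention item, a small sketch of the gather it implies: the KV cache lives in fixed-size physical pages, a per-sequence block table maps logical blocks to pages, and the indexed load at the end is the role vector.gather plays once lowered. Page size and names are illustrative.

```python
import numpy as np

PAGE_SIZE = 16  # tokens per physical page (illustrative)

def gather_k(kv_pages, block_table, seq_len):
    # kv_pages:    (num_pages, PAGE_SIZE, head_dim) physical KV cache
    # block_table: (num_blocks,) logical block index -> physical page id
    token_ids = np.arange(seq_len)
    page_ids = block_table[token_ids // PAGE_SIZE]  # page each token lives in
    offsets = token_ids % PAGE_SIZE                 # slot within that page
    # This indexed load is what would lower to vector.gather in the kernel.
    return kv_pages[page_ids, offsets]              # (seq_len, head_dim)
```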
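For the GEMM + SiLU fusion, the epilogue being fused is SiLU(x) = x * sigmoid(x), applied to the accumulator before write-back instead of in a separate elementwise kernel:

```python
import numpy as np

def gemm_silu(a, b):
    acc = a @ b                        # GEMM accumulator (fp32)
    return acc / (1.0 + np.exp(-acc))  # SiLU epilogue: x * sigmoid(x)
```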
================================================
Week 1 (Nov 8th)
Scheduling
Week 2 (Nov 15)
Ivan
Add support for using tensors from the kernel in mappings for reads and writes
Harsh
Create an FA page-table dataset for Ivan to test his PR on
Create a harness for SGLang Grok/Llama where we can measure baseline perf, then plug in our kernels and compare perf (with Sai)
Write a decode attention kernel
Unaligned sequence length & Unaligned head dim
Stan
Adjust k-width to maximize reads from shared memory and align layouts between the two matmuls
Schedule a meeting with Giuseppe to show the kernel and help him iterate
Nov 15th: meeting with the quantization team to show the FP8 kernel
=========================================================================================
Unassigned
Get kernels out of hipBLASLt where we can turn knobs and relate knob settings to the output kernels
Packed Shuffles
Dynamic & aligned attention fp16 (M & K2 not specified)
Week 3 (Nov 22)
Identify which knobs represent multi-buffering and investigate a strategy for multi-buffering
Week 4 (Nov 29)
Start drafting an implementation strategy for mimicking multi-buffering