Wave Dec 2024 Release #250

Open · 1 of 24 tasks
harsh-nod opened this issue Nov 4, 2024 · 0 comments

harsh-nod (Contributor) commented Nov 4, 2024
This issue lists all feature requests and improvements slated for the Dec 2024 TKW (Wave) release.

Flash Attention performance is the highest priority.

  • FP8 Functionality & FP16 Performance Improvement
  • Adjusting k-width to maximize reads from shared memory and align layouts between the two matmuls
  • Scheduling
  • Packed Shuffles
  • Implement FP8 Attention Kernel (sketched after this list)
  • Scaling of Q has to happen after Q @ K
  • Linear offset has to be added (linear offset = 1.0 / max representable number in the fp format)
  • Causal mask (addition of a triangular matrix of 0s and -infinity)
  • Dynamic dimensions for sequence length
  • Paged Attention using vector.gather ops (see the gather sketch below)
  • Extend Attention (split-k vs warp reduction; see the merge sketch below)
  • Prefill Attention
  • Decode Attention (M = 1, with dynamic dimensions)
  • Update Paper
  • Unaligned shapes for GEMMs
  • Debugger support (add breakpoints and inspect stack on GPU)
  • Profiling support
  • Ensure that mappings modify the index sequence
  • IGEMM Performance Results
  • GEMM non-temporal loads
  • GEMM + SiLU fusion kernel (see the fusion sketch below)
  • MoE Kernel
  • Buffer loads to load K directly to shared memory
  • Buffer loads for masking
  • Understand scheduling + multi-buffering in Tensile to be able to implement it in Wave
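
The FP8 attention items above pack several semantic requirements together; the numpy sketch below shows one plausible reading. The fp8 maximum, the placement of the linear offset, and all names are illustrative assumptions, not the Wave implementation.

```python
import numpy as np

FP8_MAX = 448.0                # assumed OCP e4m3 max; e4m3fnuz would be 240.0
LINEAR_OFFSET = 1.0 / FP8_MAX  # "1.0 / max representable number in fp format"

def fp8_attention_reference(q, k, v, causal=True):
    """fp32 reference semantics for the FP8 attention items above.

    q, k, v: [seq, head_dim] float32 arrays standing in for dequantized fp8.
    """
    d = q.shape[-1]
    # Scaling of Q happens AFTER Q @ K^T: pre-scaling fp8 Q by 1/sqrt(d)
    # would push small values out of the representable fp8 range.
    s = (q @ k.T) / np.sqrt(d)
    if causal:
        seq = s.shape[0]
        # Causal mask: add a triangular matrix of 0s (on/below the diagonal)
        # and -infinity (above it).
        s = s + np.triu(np.full((seq, seq), -np.inf), k=1)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p = p / p.sum(axis=-1, keepdims=True)
    # Linear offset; that it lands on P before the second matmul is an
    # assumption here -- only its value comes from the list above.
    p = p + LINEAR_OFFSET
    return p @ v
```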
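For the paged-attention item, the core of the kernel is a gather through a page table; here is a minimal numpy sketch of the addressing that would lower to a vector.gather on the GPU side. Page size, table layout, and all names are assumptions.

```python
import numpy as np

def gather_kv(kv_cache, page_table, seq_len, page_size=16):
    """Gather the K (or V) rows of one sequence from a paged cache.

    kv_cache:   [num_pages, page_size, head_dim] physical pages
    page_table: [max_logical_pages] logical-page -> physical-page mapping
    """
    pos = np.arange(seq_len)
    phys = page_table[pos // page_size]   # physical page per token
    off = pos % page_size                 # offset within the page
    # The indexed load below is the part that maps to vector.gather.
    return kv_cache[phys, off, :]         # [seq_len, head_dim]
```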
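The split-k flavor of extend/decode attention splits the K/V sequence across workgroups (or across waves, for the warp-reduction flavor) and then merges the partial results using their softmax statistics. The merge below is standard log-sum-exp combining; the layout and names are assumptions.

```python
import numpy as np

def merge_attention_splits(outs, maxes, sums):
    """Combine per-split partial attention outputs.

    outs:  [splits, M, head_dim]  unnormalized partial outputs (exp(S_i) @ V_i)
    maxes: [splits, M]            per-split row maxima of the scores
    sums:  [splits, M]            per-split sums of exp(scores - row max)
    """
    m = maxes.max(axis=0)                        # global row max across splits
    corr = np.exp(maxes - m)                     # per-split correction factor
    num = (outs * corr[..., None]).sum(axis=0)   # rescaled numerator
    den = (sums * corr).sum(axis=0)              # rescaled denominator
    return num / den[..., None]                  # [M, head_dim]
```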
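The GEMM + SiLU fusion item amounts to running the activation in the GEMM epilogue while the accumulator tile is still in registers, rather than as a separate elementwise pass over global memory. Reference semantics (a sketch, not the fused kernel):

```python
import numpy as np

def gemm_silu(a, b):
    """silu(a @ b), where silu(x) = x * sigmoid(x) = x / (1 + exp(-x))."""
    c = a @ b
    return c / (1.0 + np.exp(-c))
```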

================================================

Week 1 (Nov 8th)

  • Scheduling

Week 2 (Nov 15)

Ivan

  • Adding support for using tensors from the kernel in mappings for reads and writes

Harsh

  • Create a FA page table dataset for Ivan to test his PR on
  • Create a harness for SGLang grok / llama where we can test baseline perf, add our kernels, and see perf (with Sai)
  • Write a decode attention kernel
  • Unaligned sequence length & unaligned head dim

Stan

  • Adjusting k-width to maximize reads from shared memory and align layouts between the two matmuls
  • Scheduling meeting with Giuseppe to show the kernel and help him iterate
  • Meeting with the quantization team on the 15th to show the FP8 kernel

Unassigned

  • Getting kernels with hipBLASLt where we can turn knobs and relate knobs to output kernels
  • Packed Shuffles
  • Dynamic & aligned attention fp16 (M & K2 not specified)

Week 3 (Nov 22)

  • Identifying which knobs represent multi-buffering and investigating a strategy for multi-buffering

Week 4 (Nov 29)

  • Start drafting an implementation strategy for mimicking multi-buffering (see the sketch below)
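
As a starting point for that strategy, a minimal Python sketch of what multi-buffering reduces to in the simplest (double-buffered) case: prefetch tile k+1 into one shared-memory buffer while the matrix units consume tile k from the other. Function names and structure are assumptions, not Tensile's or Wave's actual scheduling.

```python
def double_buffered_loop(load_tile, compute_tile, num_k_tiles):
    """Ping-pong between two shared-memory buffers across the GEMM K loop.

    load_tile(k):      copy global-memory tile k into a fresh buffer
    compute_tile(buf): run the matmul work on one shared-memory buffer
    """
    bufs = [None, None]
    bufs[0] = load_tile(0)                    # prologue: prefetch first tile
    for k in range(num_k_tiles):
        if k + 1 < num_k_tiles:
            # Issue the next load early; on a GPU this overlaps with compute.
            bufs[(k + 1) % 2] = load_tile(k + 1)
        compute_tile(bufs[k % 2])
        # A real kernel needs a barrier here before the just-consumed
        # buffer is overwritten by the next iteration's load.
```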
harsh-nod changed the title from "Dec 2024 Release" to "Wave Dec 2024 Release" on Nov 4, 2024