This PR fixes a slowdown in interleaving the KV tensors. The chain

`.unsqueeze().expand().reshape()`

was slow even though the copies are non-strided, due to the overhead of 8 sequential copy2ds plus another copy to make the result contiguous. I've reverted to the original approach, which improves speed by another 20 tokens/sec. Further optimization is still needed: this step is down from 70µs to 25µs on a 4090, but that's still as long as the actual attention mechanism itself. For anyone else reading this: precomputed indexes aren't actually faster for interleaving the KV tensors, probably due to overhead.
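For context, here's a minimal sketch of the kind of interleaving chain discussed above, as used for grouped-query attention. The shapes are hypothetical (the model's real dimensions aren't shown in this PR); the point is that `expand()` is a zero-copy view, and it's the final `reshape()` that materializes the repeated slices:

```python
import torch

# Hypothetical GQA shapes: 8 query heads sharing 2 KV heads (n_rep = 4).
batch, n_kv_heads, seq_len, head_dim = 1, 2, 16, 8
n_rep = 4

kv = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# The chain from the PR: expand() is a free view, but reshape() must
# materialize it, issuing one copy per repeated slice plus a
# contiguous-ization pass.
interleaved = (
    kv.unsqueeze(2)  # (batch, n_kv_heads, 1, seq_len, head_dim)
    .expand(batch, n_kv_heads, n_rep, seq_len, head_dim)
    .reshape(batch, n_kv_heads * n_rep, seq_len, head_dim)
)

# Same semantics as a single repeat_interleave call along the head dim:
assert torch.equal(interleaved, kv.repeat_interleave(n_rep, dim=1))
```

Both forms produce each KV head repeated `n_rep` times consecutively along the head dimension; the difference is purely in how many kernel launches and copies the backend ends up issuing.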
Still behind the reference implementation with torch.compile: 160 tokens/sec on an Nvidia 4090, vs. the 250 tokens/sec reported by the maintainers. We're also becoming increasingly CPU-bound due to kernel launch overhead.
Next steps: