This PR fixes a slowdown in interleaving the KV tensors. The chain

`.unsqueeze().expand().reshape()`

was slow even though the copies are non-strided, due to the overhead of 8 sequential copy2ds plus another copy to make the result contiguous. I've reverted to the original approach, which improves speed by another 20 tokens/sec. Further optimization is still needed: this step is down from 70µs to 25µs on a 4090, but that's still as long as the actual attention mechanism itself. For anyone else reading this: precomputed indexes aren't actually faster for interleaving the KV tensors, probably due to overhead.
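For context, here's a minimal sketch of the kind of interleaving chain discussed above, as used for grouped-query attention. The shapes are hypothetical (the model's real dimensions aren't shown in this PR); the point is that `expand()` is a zero-copy view, and it's the final `reshape()` that materializes the repeated slices:

```python
import torch

# Hypothetical GQA shapes: 8 query heads sharing 2 KV heads (n_rep = 4).
batch, n_kv_heads, seq_len, head_dim = 1, 2, 16, 8
n_rep = 4

kv = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# The chain from the PR: expand() is a free view, but reshape() must
# materialize it, issuing one copy per repeated slice plus a
# contiguous-ization pass.
interleaved = (
    kv.unsqueeze(2)  # (batch, n_kv_heads, 1, seq_len, head_dim)
    .expand(batch, n_kv_heads, n_rep, seq_len, head_dim)
    .reshape(batch, n_kv_heads * n_rep, seq_len, head_dim)
)

# Same semantics as a single repeat_interleave call along the head dim:
assert torch.equal(interleaved, kv.repeat_interleave(n_rep, dim=1))
```

Both forms produce each KV head repeated `n_rep` times consecutively along the head dimension; the difference is purely in how many kernel launches and copies the backend ends up issuing.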
Still behind the reference implementation with torch.compile: 160 tokens/sec on an Nvidia 4090, vs. the 250 tokens/sec reported by the maintainers. We're also becoming increasingly CPU-bound due to kernel launch overhead.
Next steps: