v0.6.4
What's Changed
- frontend: enable kobold api by default by @AlpinDale in #803
- feat: add serviceinfo endpoint by @AlpinDale in #807
- feat: update to serviceinfo v0.2 by @AlpinDale in #808
- Mask dynatemp using min/max, rather than exp by @50h100a in #813
- fix: temperature issues by @50h100a in #814
- fix: --max-seq-len-to-capture arg by @AlpinDale in #818
- [IMPORTANT] updating unit tests by @AlpinDale in #769
- fix: tokenization api test by @AlpinDale in #821
- feat: add chat method for LLM class by @AlpinDale in #822
- feat: support chunked prefill with LoRA by @AlpinDale in #823
- SPMD optimizations by @AlpinDale in #824
- fix: sampler test with new transformers version by @AlpinDale in #826
- feat: add cuda sampling kernels for top_k and top_p by @AlpinDale in #828
- feat: add metrics for prefix cache hit rate by @AlpinDale in #829
- fix: unbound tokenizer error by @AlpinDale in #830
- feat: multi-step scheduling by @AlpinDale in #831
- feat: Add DRY (Do not Repeat Yourself) sampling by @selalipop in #827
- feat: add no_repeat_ngram sampler by @AlpinDale in #832
- feat: add skew sampling by @AlpinDale in #834
- fix: hidden states handling in batch expansion for spec decoding by @AlpinDale in #839
- chore: refactor executor classes for easier inheritance by @AlpinDale in #840
- fix: latency and serving benchmarks by @AlpinDale in #841
- feat: Machete Kernels for Hopper GPUs by @AlpinDale in #842
- feat: add sampler_priority by @AlpinDale in #837
- fix: disable awq_marlin override for awq models by @AlpinDale in #843
- chore: bump mistral_common to 1.5.0 by @AlpinDale in #844
- ci: bump version to 0.6.4 by @AlpinDale in #845
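
For readers unfamiliar with the no_repeat_ngram sampler added in #832, the general idea can be sketched in plain Python as below. This is only an illustration of the technique under stated assumptions, not the code from that PR: the function name `banned_next_tokens` and its parameters are hypothetical, and the actual sampler may well be implemented differently (for example, on batched tensors).

```python
from typing import List, Set


def banned_next_tokens(token_ids: List[int], ngram_size: int) -> Set[int]:
    """Return the token ids that would complete an n-gram already seen.

    Hypothetical helper for illustration only; not part of any library API.
    """
    # With no complete n-gram in the sequence yet, nothing can repeat.
    if ngram_size <= 0 or len(token_ids) < ngram_size:
        return set()

    # The last (n - 1) tokens are the prefix that the next token would extend.
    prefix_len = ngram_size - 1
    prefix = tuple(token_ids[len(token_ids) - prefix_len:])

    banned: Set[int] = set()
    # Scan every complete n-gram; if its first (n - 1) tokens match the
    # current prefix, its final token would create a repeat, so ban it.
    for start in range(len(token_ids) - ngram_size + 1):
        if tuple(token_ids[start:start + prefix_len]) == prefix:
            banned.add(token_ids[start + prefix_len])
    return banned


if __name__ == "__main__":
    # The sequence ends in (1, 2), and the trigram (1, 2, 3) was already seen.
    print(banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3))
```

A sampler built on this idea would typically set the logits of the banned ids to negative infinity before sampling the next token.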
New Contributors
- @dependabot made their first contribution in #796
- @selalipop made their first contribution in #827
Full Changelog: v0.6.3...v0.6.4