Releases · PygmalionAI/aphrodite-engine
v0.4.9
Another hotfix following v0.4.8. Fixes issues with GGUF and restores the Hadamard tensors in the wheel.
Full Changelog: v0.4.8...v0.4.9
v0.4.8
Full Changelog: v0.4.7...v0.4.8
Quick hotfix to v0.4.7, as it wasn't including LoRAs in the wheel.
v0.4.7
What's Changed
Lots of new additions after a long time.
New features and additions
- Dynamic Temperature. (@StefanGliga)
- Switch from Ray to NCCL for control-plane communications. Massive speedup for parallelism
- Support for prefix caching. The cached portion needs to be sent as `prefix + prompt` (see the sketch after this list). Not in the API servers yet
- Support for S-LoRA: load multiple LoRAs and pick one at inference time. Not in the API servers yet
- Speed up AWQ throughput with new kernels - close to GPTQ speeds now
- Custom all-reduce kernels for parallelism. Massive improvements to throughput - parallel setups are now faster than single-gpu setups even at low batch sizes
- Add Context-Free Grammar support. EBNF format is currently supported
- Add GGUF support
- Add QuIP# support
- Add Marlin support
- Add `/metrics` endpoint for the Kobold server (a scrape sketch follows this list)
- Add Quadratic Sampler
- Add Deepseek-MoE support with fused kernels
- Add Grafana + Prometheus production monitoring support
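A minimal sketch of the `prefix + prompt` convention from the prefix-caching item above, assuming the project's vLLM-style Python API (`LLM`, `SamplingParams`); whether `generate()` needs an extra argument to mark the prefix boundary isn't covered by these notes:

```python
from aphrodite import LLM, SamplingParams  # vLLM-style entry points (assumed)

# A long shared prefix: the engine caches its KV entries once and can
# reuse them for every request that starts with the same text.
prefix = "You are a helpful assistant. Answer briefly.\n\n"
prompts = ["What is the capital of France?", "What is the capital of Japan?"]

llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Each request is sent as prefix + prompt, as described above.
outputs = llm.generate([prefix + p for p in prompts], params)
for out in outputs:
    print(out.outputs[0].text)
```

And for the `/metrics` item, a Prometheus-format scrape can be as simple as the following; host and port are placeholders, not documented defaults:

```python
import requests

resp = requests.get("http://localhost:2242/metrics", timeout=5)  # placeholder host/port
resp.raise_for_status()
# Prometheus text format: '#' lines are comments, the rest are samples.
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        print(line)
```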
Bug Fixes and Small Optimizations
- Fix temperature always being set to 1
- Logprobs would crash the server if they contained NaN or -inf (@miku448)
- Switch to deques in the scheduler instead of lists. Reduces complexity from quadratic to linear
- Fix eager_mode performance by not excessively padding for every iteration
- Optimize memory usage with CUDA graphs by tying `max_num_seqs` to the captured batch size. Lower the value to lower memory usage (see the sketch after this list)
- Both safetensors and PyTorch bins were being downloaded
- Fix crash with `max_tokens=None`
- Fix multi-gpu on WSL
- Fix some outputs returning token_id=0 at high concurrency (@50h100a)
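As a rough sketch of the `max_num_seqs` knob from the CUDA graph item above: graphs are captured per batch size up to that cap, so lowering it shrinks the memory the captured graphs reserve. The constructor argument below assumes the vLLM-style `LLM` entry point and is illustrative, not a verified interface:

```python
from aphrodite import LLM  # assumed vLLM-style entry point

# Lowering max_num_seqs bounds the largest batch the CUDA graphs capture,
# which reduces the memory reserved for them (at the cost of peak batch size).
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    max_num_seqs=64,  # lower this value to lower memory usage
)
```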
v0.4.6
What's Changed
- Set CPU Affinity: Electric Boogaloo V2 by @KaraKaraWitch in #187
- chore: backlog 1 by @AlpinDale in #191
- feat: support GPTQ 2, 3, and 8bit quants by @AlpinDale in #181
- feat: FP8 KV Cache (ENG-4) by @AlpinDale in #185
- feat: tokenizer endpoint for OpenAI API by @AlpinDale in #195
- feat: rejection sampler by @AlpinDale in #197
- feat: better mixtral parallelism by @AlpinDale in #193
- fix: triton compile error by @AlpinDale in #200
- feat: reduce sampler overhead by making it less blocking by @AlpinDale in #198
- fix: test units by @AlpinDale in #201
- merge branch 'dev' into 'main' by @AlpinDale in #203
- feat: bump cuda to 12.1 by @AlpinDale in #205
- bump version to 0.4.6 by @AlpinDale in #204
New Contributors
- @KaraKaraWitch made their first contribution in #187
Full Changelog: v0.4.5...v0.4.6
v0.4.5
What's Changed
Important
Version 0.4.4 was skipped.
Quite a few changes this time around, most notably:
- Implement DeciLM by @AlpinDale in #158
- Support prompt logprobs by @AlpinDale in #162
- Support safetensors for Mixtral along with expert parallelism for better multi-gpu by @AlpinDale in #167
- Implement CUDA graphs for better multi-GPU and optimizing smaller models by @AlpinDale in #172
- Fix peak memory profiling to allow higher gmu (GPU memory utilization) values by @AlpinDale in #166
- Restore compatibility with Python 3.8 and 3.9 by @g4rg in #170
- Lazily import model classes to avoid import overhead by @AlpinDale in #165
- Add RoPE scaling support for Mixtral models by @g4rg in #174
- Make OpenAI API keys optional by @AlpinDale in #176
Full Changelog: v0.4.4...v0.4.5
v0.4.3
This is a big release! We've had many new and exciting changes.
What's New
- Mixtral 8x7B support by @AlpinDale in #155
- add ROCm support for MI200-300 GPUs by @AlpinDale in #95
- implement fused Add RMSNorm kernels by @AlpinDale in #125
- add SqueezeLLM support by @AlpinDale in #140
- add chat templates for the OpenAI endpoint by @AlpinDale in #138 (see the request sketch below)
- speed up compilation times by 2 to 3x by @AlpinDale in #130
- support Phi 1.5 models by @AlpinDale in #121
NOTE: You'll need to run `pip install megablocks` if you're using the wheels.
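For the chat-template item above, a request against the OpenAI-compatible endpoint might look like the sketch below; the server applies the model's chat template to the messages itself. Base URL, port, and model name are placeholders:

```python
import requests

payload = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
# Placeholder host/port; the chat template is applied server-side.
resp = requests.post("http://localhost:2242/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```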
Full Changelog: v0.4.2...v0.4.3
v0.4.2
What's Changed
- fix: correct auto ntk scaling_factor for 4k ctx case by @sandwichdoge in #101
- fix: cpu memory limit detection for containers by @g4rg in #103
- feat: yi support by @AlpinDale in #104
- fix: docker port by @Krisseck in #105
- feat: min_p by @StefanGliga in #106 (see the sketch after this list)
- chore: api keys for OAI server by @AlpinDale in #107
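The min_p idea, as commonly described (this mirrors the general technique, not necessarily the exact code in #106): keep only tokens whose probability is at least `min_p` times the top token's probability, then sample from what remains.

```python
import torch

def apply_min_p(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    """Mask out tokens whose probability falls below min_p * p(top token)."""
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))

# Example: filter, then sample from the renormalized distribution.
logits = torch.randn(32000)
filtered = apply_min_p(logits, min_p=0.1)
token = torch.multinomial(torch.softmax(filtered, dim=-1), num_samples=1)
```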
New Contributors
- @sandwichdoge made their first contribution in #101
- @Krisseck made their first contribution in #105
Full Changelog: v0.4.1...v0.4.2
v0.4.1
v0.4
What's Changed
- Make entrypoint executable by @city-unit in #83
- Correct Conda Env Creation in Dockerfile by @city-unit in #82
- feat: prompt logprobs and batched samplers by @AlpinDale in #77
- feat: add mistral support for GPTQ by @AlpinDale in #86
- feat: finish up tests and workflows by @AlpinDale in #87
- feat: flattened 1D tensor -> 2D tensor by @AlpinDale in #85
- chore: reformats by @AlpinDale in #90
- fix: pylint complaints by @AlpinDale in #91
- fix: remove unnecessary lines by @g4rg in #81
- fix: sync CPU delay in sampler by @AlpinDale in #93
- New Mirostatv2 implementation by @50h100a in #96
- feat: spaces between special tokens by @AlpinDale in #94
- chore: clean up endpoints by @AlpinDale in #98
- feat: add exllamav2 for GPTQ by @AlpinDale in #99
- fix: force v2 for ctxlen larger than 8192 by @AlpinDale in #100
New Contributors
- @city-unit made their first contribution in #83
Full Changelog: v0.3.7...v0.4
v0.3.7
What's Changed
- fix: prompt processing overhead introduced by #66 by @AlpinDale in #71
- fix: launch AWQ kernels on the current CUDAStream by @AlpinDale in #75
- Added `min_tokens` and reimplemented `ignore_eos` using a new logit processor by @50h100a in #70 (see the sketch after this list)
- feat: add PagedAttention V2 kernels by @AlpinDale in #76
- feat: Enable banning tokens by @StefanGliga in #80
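The `min_tokens` / `ignore_eos` rework in #70 is a logit-processor pattern. A generic sketch, following the common `(generated_token_ids, logits)` processor convention (names and signature are illustrative, not necessarily #70's exact code):

```python
from typing import List
import torch

def make_min_tokens_processor(eos_token_id: int, min_tokens: int):
    """Ban EOS until at least min_tokens tokens have been generated."""
    def processor(generated_token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
        if len(generated_token_ids) < min_tokens:
            # Forcing the EOS logit to -inf makes early termination impossible;
            # ignore_eos is the degenerate case where the ban never lifts.
            logits[eos_token_id] = float("-inf")
        return logits
    return processor
```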
Full Changelog: v0.3.6...v0.3.7