Releases: PygmalionAI/aphrodite-engine
v0.6.0
v0.6.0 - "Kept you waiting, huh?" Edition
What's Changed
- Fix quants installation on ROCM by @Naomiusearch in #469
- chore: add contribution guidelines + Code of Conduct by @AlpinDale in #507
- Remove `$` from the shell code blocks in README by @matthusby in #538
- [0.6.0] Release Candidate by @AlpinDale in #481
New Contributors
- @matthusby made their first contribution in #538
Full Changelog: v0.5.3...v0.6.0
v0.5.3
What's Changed
A new release, one that took too long again. We have some cool new features, however.
- ExllamaV2 tensor parallel: You can now run ExllamaV2 quantized models on multiple GPUs. This should be the fastest multi-gpu experience with exllamav2 models.
- Support for Command-R+
- Support for DBRX
- Support for Llama-3
- Support for Qwen 2 MoE
- min_tokens sampling param: You can now set a minimum number of tokens to generate (see the sketch after this list).
- Fused MoE for AWQ and GPTQ quants: The AWQ and GPTQ kernels have been updated with optimized fused MoE code. They should be significantly faster now.
- CMake build system: Slightly faster, much cleaner builds.
- CPU support: You can now run Aphrodite on CPU-only systems! Needs an AVX512-compatible CPU for now.
- Speculative Decoding: Speculative Decoding is finally here! You can either use a draft model, or use prompt lookup decoding with an ngram model (built-in).
- Chunked Prefill: Before this, Aphrodite would process prompts in chunks equal to the model's context length. Now, you can enable this option (via `--enable-chunked-prefill`) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit. Does not currently work with context shift or the FP8 KV cache.
- Context Shift reworked: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and re-use them.
- FP8 E4M3 KV Cache: This is for ROCm only; support will be extended to NVIDIA soon. E4M3 has higher quality compared to E5M2, but doesn't lead to any throughput increase.
- Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` to truncate any prompt larger than 1024 tokens.
- Support for Llava vision models: Currently 1.5 is supported. With the next release, we should have 1.6 along with a proper GPT4-V compatible API.
- LM Format Enforcer: You can now use LMFE for guided generations.
- EETQ Quantization: EETQ support has been added - a SOTA 8bit quantization method.
- Arbitrary GGUF model support: We were limited to only Llama models for GGUF; now any GGUF is supported. You will need to convert the model beforehand, however.
- Aphrodite CLI app: You no longer have to type `python -m aphrodite...`. Simply type `aphrodite run meta-llama/Meta-Llama-3-8B` to get started. Pass extra flags as normal.
- Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion needed.
- NVIDIA P100/GP100 support: Support has been restored.
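For reference, here is a minimal sketch of how the new `min_tokens` and `truncate_prompt_tokens` parameters might be passed to Aphrodite's OpenAI-compatible completions endpoint. The host, port, model name, and exact payload shape are assumptions for illustration, not values taken from this release.

```python
# Hypothetical request against a locally running Aphrodite OpenAI-compatible server.
# Adjust the URL, port, and model name to match your own deployment.
import requests

API_URL = "http://localhost:2242/v1/completions"  # placeholder host/port

payload = {
    "model": "meta-llama/Meta-Llama-3-8B",  # placeholder model name
    "prompt": "Once upon a time",
    "max_tokens": 256,
    "min_tokens": 32,                # generate at least 32 tokens (new in this release)
    "truncate_prompt_tokens": 1024,  # left-truncate any prompt longer than 1024 tokens
}

response = requests.post(API_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```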
Thanks to all the new contributors!
Full Changelog: v0.5.2...v0.5.3
v0.5.2
What's Changed
A few fixes and new additions:
- Support for CohereAI's command-r model: Currently, GGUF is unsupported. You can load the base model with `--load-in-4bit` or `--load-in-smooth` if you have an RTX 20xx series (or sm_75) GPU.
- Fix an issue where some GPU blocks were missing. This should give a significant boost to how much context you can use.
- Fix logprobs when they are -inf with some models.
Full Changelog: v0.5.1...v0.5.2
v0.5.1
What's Changed
- feat(openai): Apply chat template for GGUF loader by @drummerv in #312
- Calculate total memory usage. by @sgsdxzy in #316
- chore: add new iMatrix quants by @AlpinDale in #320
- fix: optimize AQLM dequantization by @AlpinDale in #325
New Contributors
Full Changelog: v0.5.0...v0.5.1
v0.5.0
Aphrodite Engine, Release v0.5.0: It's Quantin' Time Edition
It's been over a month since our last release. The notes below were rewritten with Opus from my crude hand-written release notes.
New Features
- Exllamav2 Quantization: Exllamav2 quantization has been added, although it's currently limited to a single GPU due to kernel constraints.
- On-the-Fly Quantization: With the help of `bitsandbytes` and `smoothquant+`, we now support on-the-fly quantization of FP16 models. Use `--load-in-4bit` for lightning-fast 4-bit quantization with `smoothquant+`, `--load-in-smooth` for 8-bit quantization using `smoothquant+`, and `--load-in-8bit` for 8-bit quantization using the `bitsandbytes` library (note: this option is quite slow). `--load-in-4bit` needs Ampere GPUs and above; the other two need Turing and above.
- Marlin Quantization: Marlin quantization support has arrived, promising improved speeds at high batch sizes. Convert your GPTQ models to Marlin, but keep in mind that they must be 4-bit, with a group_size of -1 or 128, and act_order set to False.
- AQLM Quantization: We now support the state-of-the-art 2-bit quantization scheme, AQLM. Please note that both quantization and inference are extremely slow with this method. Quantizing llama-2 70b on 8x A100s reportedly takes 12 days, and on a single 3090 it takes 70 seconds to reach the prompt processing phase. Use this option with caution, as the wait may cause the engine to time out (set to 60 seconds).
- INT8 KV Cache Quantization: In addition to fp8_e5m2, we now support an INT8 KV cache. Unlike FP8, it doesn't speed up throughput (it stays the same), but it should offer higher quality due to the calibration process. Uses the `smoothquant` algorithm for the quantization.
- Implicit GGUF Model Conversion: Simply point the `--model` flag to your GGUF file, and it will work out of the box. Be aware that this process requires a considerable amount of RAM to load the model, convert the tensors to a PyTorch state_dict, and then load them. Plan accordingly, or convert first if you're short on RAM.
- LoRA support in the API: The API now supports loading and inferencing LoRAs! Please refer to the wiki for detailed instructions.
- New Model Support: We've added support for a wide range of models, including OPT, Baichuan, Bloom, ChatGLM, Falcon, Gemma, GPT2, GPT Bigcode, InternLM2, MPT, OLMo, Qwen, Qwen2, and StableLM.
- Fused Mixtral MoE: Mixtral models (FP16 only) now utilize tensor parallelism with fused kernels, replacing the previous expert parallelism approach. Quantized Mixtrals still have this limitation, but we plan to address it by the next release.
- Fused Top-K Kernels for MoE: This improvement benefits Mixtral and DeepSeek-MoE models by accelerating the top-k operation using custom CUDA kernels instead of `torch.topk`.
- Enhanced OpenAI Endpoint: The OpenAI endpoint has been refactored, introducing JSON and Regex schemas, as well as a detokenization endpoint.
- LoRA Support for Mixtral Models: You can now use LoRA with Mixtral models.
- Fine-Grained Seeds: Introduce randomness to your requests with per-request seeds (see the sketch after this list).
- Context Shift: We have a naive context shifting mechanism. While it's not as effective as we'd like, it's available for experimentation purposes. Enable it using the `--context-shift` flag.
- Cubic Sampling: Building upon quadratic sampling's smoothing_factor, we now support smoothing_curve.
- Navi AMD GPU Support: GPUs like the 7900 XTX are now supported, although support is still experimental and requires significant compilation effort due to xformers.
- Kobold API Deprecation: The Kobold API has been deprecated and merged into the OpenAI API. Launch the OpenAI API with the `--launch-kobold-api` flag. Please note that the Kobold routes are not protected by the API key.
- LoRA Support for Quantized Models: We've added LoRA support for GPTQ and AWQ quantized models.
- Logging Experience Overhaul: We've revamped the logging experience using a custom `loguru` class, inspired by tabbyAPI's recent changes.
- Informative Logging Metrics: Logging has been enhanced to display model memory usage and reduce display bloat, among other improvements.
- Ray Worker Health Check: The engine now performs health checks on Ray workers, promptly reporting any silent failures or timeouts.
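To illustrate the per-request seeds mentioned above, here is a hedged sketch of an OpenAI-compatible completions request carrying a `seed` field; the URL, port, model name, and exact parameter handling are assumptions for illustration rather than confirmed defaults of this release.

```python
# Hypothetical example: two requests with the same seed (and otherwise identical
# sampling settings) should typically reproduce the same sample, while different
# seeds diverge. Adjust the URL and model name to your own deployment.
import requests

API_URL = "http://localhost:2242/v1/completions"  # placeholder host/port

def complete(seed: int) -> str:
    payload = {
        "model": "mistralai/Mixtral-8x7B-v0.1",  # placeholder model name
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 1.0,
        "seed": seed,  # fine-grained, per-request seed
    }
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(complete(1234))
print(complete(1234))  # expected to match the first output for a fixed seed
print(complete(5678))  # a different seed may yield a different continuation
```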
Bug Fixes
- Resolved an issue where `smoothing_factor` would break at high batch sizes.
- Fixed a bug with LoRA vocab embeddings.
- Addressed the missing CUDA suffixes in the version number (e.g., `0.5.0+cu118`). The suffix is now appended when using a CUDA version other than 12.1.
- Dynatemp has been split into min/max instead of a range. The Kobold endpoint still accepts a range as input.
- Fixed worker initialization in WSL.
- Removed the accidental inclusion of FP8 kernels in the ROCm build process.
- The EOS token is now removed by default from the output, unrelated to the API.
- Resolved memory leaks caused by NCCL CUDA graphs.
- Improved garbage collection for LoRAs.
- Optimized the execution of embedded runtime scripts.
Upcoming Improvements
Here's a sneak peek at what we're working on for the next release:
- Investigating tensor parallelism with Exllamav2
- Addressing the issue of missing GPU blocks for GGUF and Exl2 (we already have a fix for FP16, GPTQ, and AWQ)
New Contributors
- @anon998 made their first contribution in #253
- @sgsdxzy made their first contribution in #256
- @SwadicalRag made their first contribution in #268
- @thomas-xin made their first contribution in #260
- @StefanDanielSchwarz made their first contribution in #264
- @Pyroserenus made their first contribution in #296
- @Autumnlight02 made their first contribution in #288
Full Changelog: v0.4.9...v0.5.0
v0.4.9
Another hotfix to follow v0.4.8. Fixed issues with GGUF, and re-included the Hadamard tensors in the wheel.
Full Changelog: v0.4.8...v0.4.9
v0.4.8
Full Changelog: v0.4.7...v0.4.8
Quick hotfix to v0.4.7, as it wasn't including LoRAs in the wheel.
v0.4.7
What's Changed
Lots of new additions, after a long time.
New features and additions
- Dynamic Temperature. (@StefanGliga)
- Switch from Ray to NCCL for control-plane communications. Massive speedup for parallelism
- Support for prefix caching. Needs to be sent as `prefix + prompt`. Not in the API servers yet
- Support for S-LoRA. Basically, load multiple LoRAs and pick one for inference. Not in the API servers yet
- Speed up AWQ throughput with new kernels - close to GPTQ speeds now
- Custom all-reduce kernels for parallelism. Massive improvements to throughput - parallel setups are now faster than single-gpu setups even at low batch sizes
- Add Context-Free Grammar support. EBNF format is currently supported
- Add GGUF support
- Add QuIP# support
- Add Marlin support
- Add `/metrics` for the Kobold server (see the sketch after this list)
- Add Quadratic Sampler
- Add Deepseek-MoE support with fused kernels
- Add Grafana + Prometheus production monitoring support
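As a rough sketch of how the new `/metrics` route and the Prometheus/Grafana support can be consumed, the snippet below scrapes the endpoint over HTTP. The host and port are placeholders, and no assumptions are made about the specific metric names exposed by this release.

```python
# Hypothetical check of a Prometheus-format /metrics endpoint exposed by a
# running Aphrodite server; adjust the host and port to your deployment.
import requests

resp = requests.get("http://localhost:2242/metrics", timeout=10)
resp.raise_for_status()

# The Prometheus exposition format is plain text: one "name{labels} value" per line,
# with "# HELP" and "# TYPE" comment lines describing each metric.
for line in resp.text.splitlines():
    if line and not line.startswith("#"):
        print(line)
```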
Bug Fixes and Small Optimizations
- Fix temperature always being set to 1
- Logprobs would crash the server if it contained NaN or -inf (@miku448)
- Switch to deques in the scheduler instead of lists. Reduces complexity from quadratic to linear
- Fix eager_mode performance by not excessively padding for every iteration
- Optimize memory usage with CUDA graphs by tying `max_num_seqs` to the captured batch size. Lower the value to lower memory usage
- Fix both safetensors and PyTorch bins being downloaded
- Fix crash with `max_tokens=None`
- Fix multi-gpu on WSL
- Fix some outputs returning token_id=0 at high concurrency (@50h100a)
v0.4.6
What's Changed
- Set CPU Affinity: Electric Boogaloo V2 by @KaraKaraWitch in #187
- chore: backlog 1 by @AlpinDale in #191
- feat: support GPTQ 2, 3, and 8bit quants by @AlpinDale in #181
- feat: FP8 KV Cache (ENG-4) by @AlpinDale in #185
- feat: tokenizer endpoint for OpenAI API by @AlpinDale in #195
- feat: rejection sampler by @AlpinDale in #197
- feat: better mixtral parallelism by @AlpinDale in #193
- fix: triton compile error by @AlpinDale in #200
- feat: reduce sampler overhead by making it less blocking by @AlpinDale in #198
- fix: test units by @AlpinDale in #201
- merge branch 'dev' into 'main' by @AlpinDale in #203
- feat: bump cuda to 12.1 by @AlpinDale in #205
- bump version to 0.4.6 by @AlpinDale in #204
New Contributors
- @KaraKaraWitch made their first contribution in #187
Full Changelog: v0.4.5...v0.4.6
v0.4.5
What's Changed
Important
Version 0.4.4 was skipped.
Quite a few changes this time around, most notably:
- Implement DeciLM by @AlpinDale in #158
- Support prompt logprobs by @AlpinDale in #162
- Support safetensors for Mixtral along with expert parallelism for better multi-gpu by @AlpinDale in #167
- Implement CUDA graphs for better multi-GPU and optimizing smaller models by @AlpinDale in #172
- Fix peak memory profiling to allow higher gmu values by @AlpinDale in #166
- Restore compatibility with Python 3.8 and 3.9 by @g4rg in #170
- Lazily import model classes to avoid import overhead by @AlpinDale in #165
- Add RoPE scaling support for Mixtral models by @g4rg in #174
- Make OpenAI API keys optional by @AlpinDale in #176
Full Changelog: v0.4.4...v0.4.5