Releases: PygmalionAI/aphrodite-engine
v0.6.2.post1
What's Changed
- fix: kobold api for horde by @AlpinDale in #763
- Fix for a crash from token bans by @Pyroserenus in #764
- Modified throughput benchmark to allow --max-num-seqs by @Pyroserenus in #770
- Simplify construction of sampling_metadata by @50h100a in #766
- Add OLMoE by @fizzAI in #772
- feat: ministral support by @AlpinDale in #776
- Make amd usable by @Naomiusearch in #775
- docker: apply AMD patch in the dockerfile by @AlpinDale in #777
- fix: demote skip_special_tokens assertion to logger error by @AlpinDale in #778
- ci: bump version to 0.6.2.post1 by @AlpinDale in #779
Full Changelog: v0.6.2...v0.6.2.post1
v0.6.2
What's Changed
- feat: FP8 quantization support for AMD ROCm by @AlpinDale in #729
- feat: add experts_int8 support by @AlpinDale in #730
- chore: move update_flash_attn_metadata to attn backend by @AlpinDale in #731
- chore: register lora functions as torch ops by @AlpinDale in #732
- feat: dynamo support for ScalarType by @AlpinDale in #733
- fix: types in AQLM and GGUF for dynamo support by @AlpinDale in #736
- fix: `custom_ar` check by @AlpinDale in #737
- fix: clear engine ref in RPC server by @AlpinDale in #738
- fix: use nvml to get consistent device names by @AlpinDale in #739
- feat: add Exaone model support by @shing100 in #743
- fix: minor bug fixes & clean-ups by @AlpinDale in #744
- chore: refactor `MultiModalConfig` initialization and profiling by @AlpinDale in #745
- chore: various TPU fixes and optimizations by @AlpinDale in #746
- fix: metrics endpoint with RPC server by @AlpinDale in #747
- chore: refactor llama3 rope by @AlpinDale in #748
- feat: add XTC Sampling by @AlpinDale in #740
- ci: fix dep install using pnpm by @ahme-dev in #749
- ci: fix docs deployment by @ahme-dev in #750
- chore: re-enable custom token bans by @AlpinDale in #751
- feat: bring back dynatemp by @AlpinDale in #754
- feat: quant_llm support by @AlpinDale in #755
- fix: add pandas to requirements by @AlpinDale in #756
- docs: update readme and quant docs by @AlpinDale in #757
- ci: bump version to 0.6.2 by @AlpinDale in #758
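The XTC sampling (#740) and dynamic temperature (#754) changes above are exposed as extra sampling parameters on the OpenAI-compatible server. Below is a minimal sketch of passing them from the official `openai` Python client; the field names (`xtc_threshold`, `xtc_probability`, `dynatemp_min`, `dynatemp_max`), the port, and the model name are assumptions, so check the sampling docs for your build before relying on them.

```python
# Hedged sketch: send Aphrodite-specific sampling parameters through the
# OpenAI-compatible API using extra_body. Field names are assumed, not verified.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Write one sentence about the sea."}],
    max_tokens=64,
    extra_body={
        "xtc_threshold": 0.1,    # assumed name: XTC probability cutoff
        "xtc_probability": 0.5,  # assumed name: chance of applying XTC per token
        "dynatemp_min": 0.5,     # assumed names: dynamic temperature bounds
        "dynatemp_max": 1.5,
    },
)
print(resp.choices[0].message.content)
```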
Full Changelog: v0.6.1.post1...v0.6.2
v0.6.1.post1
What's Changed
- chore: register custom torch ops for flash-attn and flashinfer by @AlpinDale in #724
- feat: launch API server with uvloop by @AlpinDale in #725
- chore: fix return statement in `Detokenizer.decode_sequence_inplace` by @AlpinDale in #727
- Fix tensor parallelism, libcudart path for some versions of pytorch by @miku448 in #726
- ci: bump to 0.6.1.post1 by @AlpinDale in #728
Full Changelog: v0.6.1...v0.6.1.post1
v0.6.1
Aphrodite Engine - v0.6.1
What's Changed
- ci: exclude cu118 from build and add py_limited_api by @AlpinDale in #639
- fix: better async request cancellation by @AlpinDale in #641
- fix: gracefully handle missing chat template by @AlpinDale in #642
- chore: deduplicate nvlink check to cuda platform by @AlpinDale in #643
- fix: hardcoded float16 in embedding mode check by @AlpinDale in #645
- quadratic sampling: separate diff from logits to filter out NaNs. by @50h100a in #644
- fix: RSLoRA support by @AlpinDale in #647
- feat: introduce `BaseAphroditeParameter` by @AlpinDale in #646
- fix: move zeromq rpc frontend to IPC instead of TCP by @AlpinDale in #652
- fix: input processor in internvl2 by @AlpinDale in #653
- fix: multiprocessing timeout by @AlpinDale in #654
- fix: GPTQ/AWQ on Colab by @AlpinDale in #655
- fix: make `merge_async_iterators.is_cancelled()` optional by @AlpinDale in #656
- fix: flashinfer outputs by @AlpinDale in #657
- fix: max_num_batched_tokens should not be limited for lora by @AlpinDale in #658
- fix: lora with pipeline parallel by @AlpinDale in #659
- fix: kill api server when pinging dead engine by @AlpinDale in #660
- fix: `get_num_blocks_touched` logic by @AlpinDale in #661
- chore: update the env.py script and the bug report template by @AlpinDale in #662
- feat: add INT8 W8A16 quant for TPU by @AlpinDale in #663
- feat: allow serving encoder-decoder models in the API server by @AlpinDale in #664
- fix: deps with TPU dockerfile by @AlpinDale in #665
- optimization: reduce end-to-end overhead from python obj allocation by @AlpinDale in #666
- fix: minor adjustments to scheduler and block manager by @AlpinDale in #667
- feat: enable using fp8 kv and prefix caching with chunked prefill by @AlpinDale in #668
- fix: mlpspeculator with padded vocab by @AlpinDale in #669
- feat: option to apply temperature scaling last by @AlpinDale in #670
- chore: decouple `should_modify_greedy_probs_inplace` by @AlpinDale in #671
- chore: better stream termination in async engine by @AlpinDale in #672
- chore: mamba cache single buffer by @AlpinDale in #673
- feat: mamba model support by @AlpinDale in #674
- fix: reinit procedure in `ModelInputForGPUBuilder` by @AlpinDale in #675
- feat: embeddings support for batched OAI endpoint by @AlpinDale in #676
- fix: fp8 checkpoints with fused linear modules by @AlpinDale in #677
- feat: add numpy implementation of `compute_slot_mapping` by @AlpinDale in #678
- fix: chunked prefill with v2 block manager by @AlpinDale in #679
- fix: phi3v batch inference with different aspect ratio images by @AlpinDale in #680
- chore: use mark_dynamic to reduce TPU compile times by @AlpinDale in #681
- chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockerfiles by @AlpinDale in #682
- refactor: base worker input refactor for multi-step by @AlpinDale in #683
- build: add empty device by @AlpinDale in #684
- chore: update flashinfer to v0.1.3 by @AlpinDale in #685
- feat: allow image embeddings for VLM input by @AlpinDale in #686
- feat: add progress bar for loading individual weight modules by @AlpinDale in #640
- chore: use public ECR for neuron image by @AlpinDale in #687
- fix: logit softcapping in flash-attn by @AlpinDale in #688
- chore: use scalar type to dispatch to different `gptq_marlin` kernels by @AlpinDale in #689
- fix: allow passing float for GiB arguments by @AlpinDale in #690
- build: bump cmake to 3.26 by @AlpinDale in #691
- fix: shut down ray dag workers cleanly by @AlpinDale in #692
- feat: add lora loading/unloading api endpoint by @AlpinDale in #693
- feat: add load/unload endpoints for soft-prompts by @AlpinDale in #694
- fix: loading chameleon model with TP>1 by @AlpinDale in #695
- fix: consolidated `is_tpu()` and suppress tpu import warning by @AlpinDale in #696
- fix: manually install triton for other devices to prevent outlines errors by @AlpinDale in #697
- feat: support for Audio modality by @AlpinDale in #698
- chore: migrate gptq_marlin to AphroditeParameters by @AlpinDale in #699
- chore: update fused MoE weight loading by @AlpinDale in #700
- feat: add Solar model support by @AlpinDale in #701
- feat: migrate awq and awq_marlin to AphroditeParameter by @AlpinDale in #702
- chore: spawn engine process from api server process by @AlpinDale in #703
- chore: use the `compressed-tensors` library to avoid code reuse by @AlpinDale in #704
- feat: add aphrodite plugin system by @AlpinDale in #705
- Revert "chore: use the
compressed-tensors
library to avoid code reuse (#704)" by @AlpinDale in #706 - feat: add support for multi-host TPU by @AlpinDale in #707
- fix: import ray under a guard by @AlpinDale in #708
- fix: empty sampler output when temperature is too low by @AlpinDale in #709
- fix: disable embeddings API for chat models by @AlpinDale in #710
- feat: implement mistral tokenizer mode by @AlpinDale in #711
- feat: support profiling with multiple multi-modal inputs per prompt by @AlpinDale in #712
- chore: multi-step args and sequence modifications by @AlpinDale in #713
- chore: set per-rank XLA cache for TPU by @AlpinDale in #714
- chore: add support for up to 2048 block size by @AlpinDale in #715
- fix: install protobuf for cpu by @AlpinDale in #716
- fix: weight loading for scalars by @AlpinDale in #718
- chore: quant config for speculative draft models by @AlpinDale in #719
- feat: enable prompt logprobs in OpenAI API by @AlpinDale in #720
- chore: update grafana template by @AlpinDale in #721
- ci: bump aphrodite to 0.6.1 by @AlpinDale in #722
Full Changelog: v0.6.0.post1...v0.6.1
v0.6.0.post1
What's Changed
- feat: add siglip encoder for llava family by @AlpinDale in #626
- readme: fix model name typo by @Trapper4888 in #627
- feat: multi-image input for minicpmv by @AlpinDale in #628
- feat: Add support for GPU device selection in SpecDecodeBaseSampler by @AlpinDale in #629
- feat: per-tensor token epilogue kernels by @AlpinDale in #630
- chore: optimize evictor v2 performance by @AlpinDale in #631
- feat: initial encoder-decoder support with BART model by @AlpinDale in #633
- fix: default api port and attention selector by @AlpinDale in #634
- fix: clean up incorrect log in worker by @AlpinDale in #636
- bump to v0.6.0.post1 by @AlpinDale in #635
New Contributors
- @Trapper4888 made their first contribution in #627
Full Changelog: v0.6.0...v0.6.0.post1
v0.6.0
v0.6.0 - "Kept you waiting, huh?" Edition
What's Changed
- Fix quants installation on ROCM by @Naomiusearch in #469
- chore: add contribution guidelines + Code of Conduct by @AlpinDale in #507
- Remove `$` from the shell code blocks in README by @matthusby in #538
- [0.6.0] Release Candidate by @AlpinDale in #481
New Contributors
- @matthusby made their first contribution in #538
Full Changelog: v0.5.3...v0.6.0
v0.5.3
What's Changed
A new release, one that took too long again. We have some cool new features, however.
- ExllamaV2 tensor parallel: You can now run ExllamaV2 quantized models on multiple GPUs. This should be the fastest multi-gpu experience with exllamav2 models.
- Support for Command-R+
- Support for DBRX
- Support for Llama-3
- Support for Qwen 2 MoE
- `min_tokens` sampling param: You can now set a minimum number of tokens to generate.
- Fused MoE for AWQ and GPTQ quants: AWQ and GPTQ kernels have been updated with optimized fused MoE code. They should be significantly faster now.
- CMake build system: Slightly faster, much cleaner builds.
- CPU support: You can now run aphrodite on CPU only systems! Needs an AVX512-compatible CPU for now.
- Speculative Decoding: Speculative Decoding is finally here! You can either use a draft model, or use prompt lookup decoding with an ngram model (built-in).
- Chunked Prefill: Before this, Aphrodite would process prompts in chunks equal to the model's context length. Now, you can enable this option (via `--enable-chunked-prefill`) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit. It does not currently work with context shift or the FP8 KV cache.
- Context Shift reworked: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and re-use them.
- FP8 E4M3 KV Cache: This is for ROCm only. Support will be extended to NVIDIA soon. E4M3 has higher quality compared to E5M2, but doesn't lead to any throughput increase.
- Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` to truncate any prompt larger than 1024 tokens (see the example after this list).
- Support for Llava vision models: Currently 1.5 is supported. With the next release, we should have 1.6 along with a proper GPT4-V compatible API.
- LM Format Enforcer: You can now use LMFE for guided generations.
- EETQ Quantization: EETQ support has been added - a SOTA 8bit quantization method.
- Arbitrary GGUF model support: We were previously limited to Llama models for GGUF; now any GGUF model is supported. You will need to convert the model beforehand, however.
- Aphrodite CLI app: You no longer have to type `python -m aphrodite...`. Simply type `aphrodite run meta-llama/Meta-Llama-3-8B` to get started. Pass extra flags as normal.
- Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion needed.
- NVIDIA P100/GP100 support: Support has been restored.
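To make the new flags concrete, here is a minimal sketch that assumes the server was launched with the CLI app and chunked prefill enabled, then sends a request using the new `min_tokens` and `truncate_prompt_tokens` parameters via the `openai` client's `extra_body`. The port and exact request fields are assumptions; adjust them to your deployment.

```python
# Hedged sketch, assuming the server was started with something like:
#   aphrodite run meta-llama/Meta-Llama-3-8B --enable-chunked-prefill
# The port and extra_body field names are assumptions based on this changelog.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="A very long prompt goes here.",
    max_tokens=256,
    extra_body={
        "min_tokens": 32,                # generate at least 32 tokens
        "truncate_prompt_tokens": 1024,  # left-truncate prompts longer than 1024 tokens
    },
)
print(resp.choices[0].text)
```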
Thanks to all the new contributors!
Full Changelog: v0.5.2...v0.5.3
v0.5.2
What's Changed
A few fixes and new additions:
- Support for CohereAI's command-r model: Currently, GGUF is unsupported. You can load the base model with `--load-in-4bit` or `--load-in-smooth` if you have an RTX 20xx series (or sm_75).
- Fix an issue where some GPU blocks were missing. This should give a significant boost to how much context you can use.
- Fix logprobs when they are -inf with some models.
Full Changelog: v0.5.1...v0.5.2
v0.5.1
What's Changed
- feat(openai): Apply chat template for GGUF loader by @drummerv in #312
- Calculate total memory usage. by @sgsdxzy in #316
- chore: add new iMatrix quants by @AlpinDale in #320
- fix: optimize AQLM dequantization by @AlpinDale in #325
Full Changelog: v0.5.0...v0.5.1
v0.5.0
Aphrodite Engine, Release v0.5.0: It's Quantin' Time Edition
It's been over a month since our last release. Below is re-written using Opus from my crude hand-written release notes.
New Features
- Exllamav2 Quantization: Exllamav2 quantization has been added, although it's currently limited to a single GPU due to kernel constraints.
- On-the-Fly Quantization: With the help of `bitsandbytes` and `smoothquant+`, we now support on-the-fly quantization of FP16 models. Use `--load-in-4bit` for lightning-fast 4-bit quantization with `smoothquant+`, `--load-in-smooth` for 8-bit quantization using `smoothquant+`, and `--load-in-8bit` for 8-bit quantization using the `bitsandbytes` library (note: this option is quite slow). `--load-in-4bit` needs Ampere GPUs and above; the other two need Turing and above.
- Marlin Quantization: Marlin quantization support has arrived, promising improved speeds at high batch sizes. Convert your GPTQ models to Marlin, but keep in mind that they must be 4-bit, with a group_size of -1 or 128, and act_order set to False.
- AQLM Quantization: We now support the state-of-the-art 2-bit quantization scheme, AQLM. Please note that both quantization and inference are extremely slow with this method. Quantizing llama-2 70b on 8x A100s reportedly takes 12 days, and on a single 3090 it takes 70 seconds to reach the prompt processing phase. Use this option with caution, as the wait may cause the engine to time out (set to 60 seconds).
- INT8 KV Cache Quantization: In addition to fp8_e5m2, we now support INT8 KV Cache. Unlike FP8, it doesn't speed up throughput (it stays the same), but it should offer higher quality due to the calibration process. Uses the `smoothquant` algorithm for the quantization.
- Implicit GGUF Model Conversion: Simply point the `--model` flag to your GGUF file, and it will work out of the box (see the sketch after this list). Be aware that this process requires a considerable amount of RAM to load the model, convert tensors to a PyTorch state_dict, and then load them. Plan accordingly or convert first if you're short on RAM.
- LoRA support in the API: The API now supports loading and inferencing LoRAs! Please refer to the wiki for detailed instructions.
- New Model Support: We've added support for a wide range of models, including OPT, Baichuan, Bloom, ChatGLM, Falcon, Gemma, GPT2, GPT Bigcode, InternLM2, MPT, OLMo, Qwen, Qwen2, and StableLM.
- Fused Mixtral MoE: Mixtral models (FP16 only) now utilize tensor parallelism with fused kernels, replacing the previous expert parallelism approach. Quantized Mixtrals still have this limitation, but we plan to address it by the next release.
- Fused Top-K Kernels for MoE: This improvement benefits Mixtral and DeepSeek-MoE models by accelerating the top-k operation using custom CUDA kernels instead of `torch.topk`.
- Enhanced OpenAI Endpoint: The OpenAI endpoint has been refactored, introducing JSON and Regex schemas, as well as a detokenization endpoint.
- LoRA Support for Mixtral Models: You can now use LoRA with Mixtral models.
- Fine-Grained Seeds: Introduce randomness to your requests with per-request seeds.
- Context Shift: We have a naive context shifting mechanism. While it's not as effective as we'd like, it's available for experimentation purposes. Enable it using the `--context-shift` flag.
- Cubic Sampling: Building upon quadratic sampling's smoothing_factor, we now support smoothing_curve.
- Navi AMD GPU Support: GPUs like the 7900 XTX are now supported, although still experimental and requiring significant compilation effort due to xformers.
- Kobold API Deprecation: The Kobold API has been deprecated and merged into the OpenAI API. Launch the OpenAI API using the `--launch-kobold-api` flag. Please note that Kobold routes are not protected with the API key.
- LoRA Support for Quantized Models: We've added LoRA support for GPTQ and AWQ quantized models.
- Logging Experience Overhaul: We've revamped the logging experience using a custom `loguru` class, inspired by tabbyAPI's recent changes.
- Informative Logging Metrics: Logging has been enhanced to display model memory usage and reduce display bloat, among other improvements.
- Ray Worker Health Check: The engine now performs health checks on Ray workers, promptly reporting any silent failures or timeouts.
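As a rough illustration of the implicit GGUF conversion and per-request seeds described above, here is a minimal sketch using the offline Python API (mirroring the vLLM-style `LLM`/`SamplingParams` interface). The GGUF path and the `seed` keyword are assumptions and may differ between versions.

```python
# Hedged sketch: load a GGUF checkpoint directly and request a seeded completion.
# The model path is a placeholder; conversion happens at load time and needs RAM.
from aphrodite import LLM, SamplingParams

llm = LLM(model="/path/to/model.gguf")  # implicit GGUF -> PyTorch conversion on load

# Per-request seed: re-running with the same seed should reproduce the output.
params = SamplingParams(temperature=0.8, max_tokens=64, seed=1234)
outputs = llm.generate(["Once upon a time,"], params)
print(outputs[0].outputs[0].text)
```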
Bug Fixes
- Resolved an issue where `smoothing_factor` would break at high batch sizes.
- Fixed a bug with LoRA vocab embeddings.
- Addressed the missing CUDA suffixes in the version number (e.g., `0.5.0+cu118`). The suffix is now appended when using a CUDA version other than 12.1.
- Dynatemp has been split into separate min/max parameters instead of a single range. The Kobold endpoint still accepts a range as input.
- Fixed worker initialization in WSL.
- Removed the accidental inclusion of FP8 kernels in the ROCm build process.
- The EOS token is now removed by default from the output, unrelated to the API.
- Resolved memory leaks caused by NCCL CUDA graphs.
- Improved garbage collection for LoRAs.
- Optimized the execution of embedded runtime scripts.
Upcoming Improvements
Here's a sneak peek at what we're working on for the next release:
- Investigating tensor parallelism with Exllamav2
- Addressing the issue of missing GPU blocks for GGUF and Exl2 (we already have a fix for FP16, GPTQ, and AWQ)
New Contributors
- @anon998 made their first contribution in #253
- @sgsdxzy made their first contribution in #256
- @SwadicalRag made their first contribution in #268
- @thomas-xin made their first contribution in #260
- @StefanDanielSchwarz made their first contribution in #264
- @Pyroserenus made their first contribution in #296
- @Autumnlight02 made their first contribution in #288
Full Changelog: v0.4.9...v0.5.0