Releases: PygmalionAI/aphrodite-engine
v0.6.5
What's Changed
- xpu: refactor XPU worker & executor by @AlpinDale in #861
- build: add jinja2 to requirements file by @AlpinDale in #862
- attention: add `AttentionState` abstraction by @AlpinDale in #863
- xpu: disable punica kernels for XPU by @AlpinDale in #864
- executor: pipe `worker_class_fn` arg in executor by @AlpinDale in #865
- server: log the process occupying our port by @AlpinDale in #866
- feat: AWQ quantization for InternVL by @AlpinDale in #867
- Rewrite DRY sampler to be a lot faster by @50h100a in #868
- fix: ROCm build by @Naomiusearch in #817
- fix: temp_last warning being repeated for every output token by @AlpinDale in #869
- feat: add support for chunked prefill + prefix caching by @AlpinDale in #871
- async: avoid premature exit in the async generator by @AlpinDale in #872
- cpu: fix `mm_limits` initialization by @AlpinDale in #873
- spec decoding: set the draft model ctxlen to target model by @AlpinDale in #874
- sampler: pad dry sequence breakers tensor by @AlpinDale in #875
- fix: `add_generation_template` -> `add_generation_prompt` in llm by @AlpinDale in #877
- Update README.md by @NoahBPeterson in #876
- api: fix crashes under very high loads by @AlpinDale in #878
- build: pass `PYTHONPATH` from setup.py to cmake by @AlpinDale in #879
- async: disable multi-step scheduling for sync engine by @AlpinDale in #880
- api: better startup failure UX by @AlpinDale in #881
- chore: consolidate environment variables within one file by @AlpinDale in #882
- core: fix spec decode metrics and envs circular import by @AlpinDale in #889
- feat: add support for audio models by @AlpinDale in #891
- distributed: fix issue for when nodes have multiple network interfaces by @AlpinDale in #892
- rocm: fix compile issues with rocm 6.2 by @AlpinDale in #893
- build: fix invalid path for envs.py in setup by @AlpinDale in #894
- kernel: use `cub::BlockReduce` instead of custom impl by @AlpinDale in #895
- fix: Phi 3.5 Vision model loading by @AlpinDale in #896
- api: add client timeouts for the ZeroMQ server by @AlpinDale in #897
- feat: add torch.compile for GemmaRMSNorm by @AlpinDale in #898
- spec decode: add support for EAGLE by @AlpinDale in #899
- fix: `ShardedStateLoader` with fp8 quant by @AlpinDale in #900
- kernel: do not compile machete for cuda 11 and below by @AlpinDale in #901
- chore: add AphroditeParameter support for FP8 quant by @AlpinDale in #902
- spec decode: fix logprobs when using speculative decoding by @AlpinDale in #904
- api: error suppression cleanup + timeout suppression on aborts by @AlpinDale in #905
- ray: better error when placement group topology is incorrect by @AlpinDale in #906
- xpu: refactor the model runner for tensor parallelism by @AlpinDale in #910
- fix: empty prompt crashing the server by @AlpinDale in #912
- quantization: update marlin to use `AphroditeParameters` by @AlpinDale in #913
- core: add multi-step scheduling support for the synchronous engine by @AlpinDale in #914
- api: add json_schema to OpenAI server by @AlpinDale in #915 (usage sketch after this list)
- fix: phi3v crash with unusual image sizes by @AlpinDale in #916
- feat: multi-image input support for Phi3V by @AlpinDale in #917
- spec decode: streamline batch expansion tensor manipulation by @AlpinDale in #918
- api: use fp32 for base64 embeddings by @AlpinDale in #919
- core: improve warmup times for prefix caching in block manager v2 by @AlpinDale in #920
- quants: update `qqq` and `gptq_marlin_24` to use AphroditeParameters by @AlpinDale in #921
- distributed: fix custom allreduce p2p cache file generation by @AlpinDale in #922
- neuron: add support for tensor parallelism by @AlpinDale in #923
- quants: update compressed tensors lifecycle to remove `prefix` from `create_weights` by @AlpinDale in #924
- feat: add async postprocessor by @AlpinDale in #925
- api: add endpoint for loading and unloading the model by @AlpinDale in #926
- feat: add single user mode by @AlpinDale in #927
- api: add inline model loading by @AlpinDale in #928
- api: support aphrodite_config.yaml with inline loading by @AlpinDale in #929
- fix: inline model loading conflicts with lora by @AlpinDale in #930
- core: do not compile for profiling by @AlpinDale in #931
- xpu: support pipeline parallel by @AlpinDale in #932
- fix: phi3v image_idx in async server by @AlpinDale in #933
- feat: add fused Marlin MoE kernel by @AlpinDale in #934
- chore: multi-image support for llava-next by @AlpinDale in #935
- model: add support for paligemma2 by @AlpinDale in #936
- vlm: stack multimodal tensors to represent multiple images within each prompt by @AlpinDale in #937
- core: do not compile ScalarType for torch < 2.4.0 by @AlpinDale in #938
- core: add virtual engine for async outproc by @AlpinDale in #939
- api: log prompt truncation by @AlpinDale in #940
- vlm: fix incompatibility between nested tensors and multi-image llava-next by @AlpinDale in #941
- vlm: fix persimmon and fuyu issues with transformers 4.45 by @AlpinDale in #942
- Fix SentencePieceTokenizer error when generating on Mistral Large 2411 with `--tokenizer-mode mistral` by @khanonnie in #943
- core: use flashinfer for FP8 KV when available by @AlpinDale in #944
- tests: update flashinfer test for #944 by @AlpinDale in #945
- quants: add triton kernels for AWQ by @AlpinDale in #946
- tests: add kernel tests for causal_conv1d and mamba_ssm by @AlpinDale in #947
- fix: do not register punica with torch if using older torch by @AlpinDale in #948
- tpu: avoid dynamo guard eval overhead by @AlpinDale in #949
- fix: issues with flashinfer fp8 kv by @AlpinDale in #950
- api: optimize zeromq frontend performance by @AlpinDale in #951
- tpu: remove torch._dynamo.reset() by @AlpinDale in #952
- vlm: fix errors on ragged NestedTensors by @AlpinDale …
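As a rough illustration of the `json_schema` support added in #915: the OpenAI-compatible server should accept a `response_format` of type `json_schema`, mirroring OpenAI's own API. A minimal sketch, assuming the default Aphrodite port (2242) and an OpenAI-shaped payload; the exact field layout is an assumption, not confirmed by these notes:

```python
# Sketch: constrain a chat completion to a JSON schema via the
# OpenAI-compatible API. The response_format shape mirrors OpenAI's;
# exact support in Aphrodite is an assumption based on #915.
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

resp = requests.post(
    "http://localhost:2242/v1/chat/completions",  # default Aphrodite port
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "messages": [{"role": "user", "content": "Describe a person as JSON."}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "person", "schema": schema},
        },
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```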
v0.6.4.post1
What's Changed
- add linux arm64/aarch64/GH200 installation tips by @qpwo in #851
- DRY Fix: Add output_tokens to sampler by @selalipop in #849
- sampler: fix DRY concurrency issue by @AlpinDale in #852
- sampler: add range parameter for DRY by @AlpinDale in #855 (usage sketch after this list)
- sampler: optimize DRY performance using z-algorithm by @AlpinDale in #856
- sampler: allow parsing sampler order using strings by @AlpinDale in #858
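To make the DRY changes above concrete, here is a hedged request sketch. The knob names (`dry_multiplier`, `dry_base`, `dry_allowed_length`, `dry_range`, `dry_sequence_breakers`) follow the convention from the original DRY proposal; verify the exact spellings against the Aphrodite sampler docs:

```python
# Sketch: enabling DRY repetition penalties on a completion request.
# Parameter names are assumptions modeled on the common DRY convention;
# dry_range corresponds to the range parameter added in #855.
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",  # default Aphrodite port
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "prompt": "Once upon a time",
        "max_tokens": 128,
        "dry_multiplier": 0.8,    # 0 disables DRY entirely
        "dry_base": 1.75,         # penalty growth per extra repeated token
        "dry_allowed_length": 2,  # repeats up to this length go unpenalized
        "dry_range": 2048,        # how far back to scan for repeats (#855)
        "dry_sequence_breakers": ["\n", ":", "\"", "*"],
    },
)
print(resp.json()["choices"][0]["text"])
```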
Full Changelog: v0.6.4...v0.6.4.post1
v0.6.4
What's Changed
- frontend: enable kobold api by default by @AlpinDale in #803
- feat: add serviceinfo endpoint by @AlpinDale in #807
- feat: update to serviceinfo v0.2 by @AlpinDale in #808
- Mask dynatemp using min/max, rather than exp by @50h100a in #813
- fix: temperature issues by @50h100a in #814
- fix: --max-seq-len-to-capture arg by @AlpinDale in #818
- [IMPORTANT] updating test units by @AlpinDale in #769
- fix: tokenization api test by @AlpinDale in #821
- feat: add chat method for LLM class by @AlpinDale in #822 (usage sketch after this list)
- feat: support chunked prefill with LoRA by @AlpinDale in #823
- SPMD optimizations by @AlpinDale in #824
- fix: sampler test with new transformers version by @AlpinDale in #826
- feat: add cuda sampling kernels for top_k and top_p by @AlpinDale in #828
- feat: add metrics for prefix cache hit rate by @AlpinDale in #829
- fix: unbound tokenizer error by @AlpinDale in #830
- feat: multi-step scheduling by @AlpinDale in #831
- feat: Add DRY (Do not Repeat Yourself) sampling by @selalipop in #827
- feat: add no_repeat_ngram sampler by @AlpinDale in #832
- feat: add skew sampling by @AlpinDale in #834
- fix: hidden states handling in batch expansion for spec decoding by @AlpinDale in #839
- chore: refactor executor classes for easier inheritance by @AlpinDale in #840
- fix: latency and serving benchmarks by @AlpinDale in #841
- feat: Machete Kernels for Hopper GPUs by @AlpinDale in #842
- feat: add sampler_priority by @AlpinDale in #837
- fix: disable awq_marlin override for awq models by @AlpinDale in #843
- chore: bump mistral_common to 1.5.0 by @AlpinDale in #844
- ci: bump version to 0.6.4 by @AlpinDale in #845
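A minimal sketch of the `chat` method added to the `LLM` class in #822, assuming Aphrodite keeps a vLLM-style offline-inference signature; the model name is illustrative:

```python
# Sketch: offline chat inference via LLM.chat() (#822).
# Assumes a vLLM-style signature; check the Aphrodite API docs.
from aphrodite import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.chat(
    [{"role": "user", "content": "Give me one fun fact about llamas."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```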
New Contributors
- @dependabot made their first contribution in #796
- @selalipop made their first contribution in #827
Full Changelog: v0.6.3...v0.6.4
v0.6.3.post1
What's Changed
- build(deps): bump rollup from 4.21.0 to 4.24.3 in /docs by @dependabot in #796
- fix: compilation of gptq_marlin_gemm object by @AlpinDale in #800
- ci: bump to 0.6.3.post1 by @AlpinDale in #801
New Contributors
- @dependabot made their first contribution in #796
Full Changelog: v0.6.3...v0.6.3.post1
v0.6.3
What's Changed
- Stream models rather than load them completely into RAM. by @50h100a in #785
- feat: windows support by @AlpinDale in #790
- fix: windows wheel url by @AlpinDale in #794
- fix: kobold lite embedded UI on windows by @AlpinDale in #797
- feat: add HQQ quantization support by @AlpinDale in #795 (usage sketch after this list)
- frontend: minor logging improvements by @AlpinDale in #787
- ci: bump version to 0.6.3 by @AlpinDale in #799
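To illustrate the HQQ support from #795, a hedged offline-loading sketch; whether `"hqq"` is the exact quantization identifier is an assumption, and the checkpoint name is hypothetical:

```python
# Sketch: loading an HQQ-quantized checkpoint (#795).
# The quantization identifier "hqq" and the model name are assumptions;
# see the Aphrodite quantization docs for the authoritative values.
from aphrodite import LLM

llm = LLM(
    model="some-org/llama-3-8b-hqq",  # hypothetical HQQ checkpoint
    quantization="hqq",
)
print(llm.generate("Hello,")[0].outputs[0].text)
```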
Full Changelog: v0.6.2.post1...v0.6.3
v0.6.2.post1
What's Changed
- fix: kobold api for horde by @AlpinDale in #763
- Fix for a crash from token bans by @Pyroserenus in #764 (usage sketch after this list)
- Modified throughput benchmark to allow --max-num-seqs by @Pyroserenus in #770
- Simplify construction of sampling_metadata by @50h100a in #766
- Add OLMoE by @fizzAI in #772
- feat: ministral support by @AlpinDale in #776
- Make AMD usable by @Naomiusearch in #775
- docker: apply AMD patch in the dockerfile by @AlpinDale in #777
- fix: demote skip_special_tokens assertion to logger error by @AlpinDale in #778
- ci: bump version to 0.6.2.post1 by @AlpinDale in #779
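The token-ban crash fixed in #764 involves the custom token bans feature (re-enabled in v0.6.2 below, #751). A hedged request sketch; the `custom_token_bans` field name and its request-level placement are assumptions inferred from those entries:

```python
# Sketch: banning specific token IDs from generation (#751/#764).
# The custom_token_bans field name is an assumption inferred from the
# changelog; the token IDs shown are arbitrary examples.
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",  # default Aphrodite port
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "prompt": "The secret word is",
        "max_tokens": 32,
        "custom_token_bans": [128001, 27],  # token IDs to exclude
    },
)
print(resp.json()["choices"][0]["text"])
```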
Full Changelog: v0.6.2...v0.6.2.post1
v0.6.2
What's Changed
- feat: FP8 quantization support for AMD ROCm by @AlpinDale in #729
- feat: add experts_int8 support by @AlpinDale in #730
- chore: move update_flash_attn_metadata to attn backend by @AlpinDale in #731
- chore: register lora functions as torch ops by @AlpinDale in #732
- feat: dynamo support for ScalarType by @AlpinDale in #733
- fix: types in AQLM and GGUF for dynamo support by @AlpinDale in #736
- fix: `custom_ar` check by @AlpinDale in #737
- fix: clear engine ref in RPC server by @AlpinDale in #738
- fix: use nvml to get consistent device names by @AlpinDale in #739
- feat: add Exaone model support by @shing100 in #743
- fix: minor bug fixes & clean-ups by @AlpinDale in #744
- chore: refactor `MultiModalConfig` initialization and profiling by @AlpinDale in #745
- chore: various TPU fixes and optimizations by @AlpinDale in #746
- fix: metrics endpoint with RPC server by @AlpinDale in #747
- chore: refactor llama3 rope by @AlpinDale in #748
- feat: add XTC Sampling by @AlpinDale in #740 (usage sketch after this list)
- ci: fix dep install using pnpm by @ahme-dev in #749
- ci: fix docs deployment by @ahme-dev in #750
- chore: re-enable custom token bans by @AlpinDale in #751
- feat: bring back dynatemp by @AlpinDale in #754
- feat: quant_llm support by @AlpinDale in #755
- fix: add pandas to requirements by @AlpinDale in #756
- docs: update readme and quant docs by @AlpinDale in #757
- ci: bump version to 0.6.2 by @AlpinDale in #758
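For the XTC (Exclude Top Choices) sampler added in #740, a hedged request sketch. With some probability, XTC removes every candidate token above a threshold except the least likely of them, steering output away from the most predictable continuations; the field names follow the original XTC write-up and are assumptions here:

```python
# Sketch: enabling XTC sampling (#740). Field names (xtc_threshold,
# xtc_probability) are assumptions based on the original XTC proposal.
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",  # default Aphrodite port
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "prompt": "It was a dark and stormy night",
        "max_tokens": 128,
        "xtc_threshold": 0.1,    # tokens above this prob. become removable
        "xtc_probability": 0.5,  # chance XTC triggers on a sampling step
    },
)
print(resp.json()["choices"][0]["text"])
```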
Full Changelog: v0.6.1.post1...v0.6.2
v0.6.1.post1
What's Changed
- chore: register custom torch ops for flash-attn and flashinfer by @AlpinDale in #724
- feat: launch API server with uvloop by @AlpinDale in #725
- chore: fix return statement in `Detokenizer.decode_sequence_inplace` by @AlpinDale in #727
- Fix tensor parallelism, libcudart path for some versions of pytorch by @miku448 in #726
- ci: bump to 0.6.1.post1 by @AlpinDale in #728
Full Changelog: v0.6.1...v0.6.1.post1
v0.6.1
What's Changed
- ci: exclude cu118 from build and add py_limited_api by @AlpinDale in #639
- fix: better async request cancellation by @AlpinDale in #641
- fix: gracefully handle missing chat template by @AlpinDale in #642
- chore: deduplicate nvlink check to cuda platform by @AlpinDale in #643
- fix: hardcoded float16 in embedding mode check by @AlpinDale in #645
- quadratic sampling: separate diff from logits to filter out NaNs. by @50h100a in #644
- fix: RSLoRA support by @AlpinDale in #647
- feat: introduce `BaseAphroditeParameter` by @AlpinDale in #646
- fix: move zeromq rpc frontend to IPC instead of TCP by @AlpinDale in #652
- fix: input processor in internvl2 by @AlpinDale in #653
- fix: multiprocessing timeout by @AlpinDale in #654
- fix: GPTQ/AWQ on Colab by @AlpinDale in #655
- fix: make `merge_async_iterators.is_cancelled()` optional by @AlpinDale in #656
- fix: flashinfer outputs by @AlpinDale in #657
- fix: max_num_batched_tokens should not be limited for lora by @AlpinDale in #658
- fix: lora with pipeline parallel by @AlpinDale in #659
- fix: kill api server when pinging dead engine by @AlpinDale in #660
- fix: `get_num_blocks_touched` logic by @AlpinDale in #661
- chore: update the env.py script and the bug report template by @AlpinDale in #662
- feat: add INT8 W8A16 quant for TPU by @AlpinDale in #663
- feat: allow serving encoder-decoder models in the API server by @AlpinDale in #664
- fix: deps with TPU dockerfile by @AlpinDale in #665
- optimization: reduce end-to-end overhead from python obj allocation by @AlpinDale in #666
- fix: minor adjustments to scheduler and block manager by @AlpinDale in #667
- feat: enable using fp8 kv and prefix caching with chunked prefill by @AlpinDale in #668
- fix: mlpspeculator with padded vocab by @AlpinDale in #669
- feat: option to apply temperature scaling last by @AlpinDale in #670 (usage sketch after this list)
- chore: decouple `should_modify_greedy_probs_inplace` by @AlpinDale in #671
- chore: better stream termination in async engine by @AlpinDale in #672
- chore: mamba cache single buffer by @AlpinDale in #673
- feat: mamba model support by @AlpinDale in #674
- fix: reinit procedure in `ModelInputForGPUBuilder` by @AlpinDale in #675
- feat: embeddings support for batched OAI endpoint by @AlpinDale in #676
- fix: fp8 checkpoints with fused linear modules by @AlpinDale in #677
- feat: add numpy implementation of `compute_slot_mapping` by @AlpinDale in #678
- fix: chunked prefill with v2 block manager by @AlpinDale in #679
- fix: phi3v batch inference with different aspect ratio images by @AlpinDale in #680
- chore: use mark_dynamic to reduce TPU compile times by @AlpinDale in #681
- chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockerfiles by @AlpinDale in #682
- refactor: base worker input refactor for multi-step by @AlpinDale in #683
- build: add empty device by @AlpinDale in #684
- chore: update flashinfer to v0.1.3 by @AlpinDale in #685
- feat: allow image embeddings for VLM input by @AlpinDale in #686
- feat: add progress bar for loading individual weight modules by @AlpinDale in #640
- chore: use public ECR for neuron image by @AlpinDale in #687
- fix: logit softcapping in flash-attn by @AlpinDale in #688
- chore: use scalar type to dispatch to different `gptq_marlin` kernels by @AlpinDale in #689
- fix: allow passing float for GiB arguments by @AlpinDale in #690
- build: bump cmake to 3.26 by @AlpinDale in #691
- fix: shut down ray dag workers cleanly by @AlpinDale in #692
- feat: add lora loading/unloading api endpoint by @AlpinDale in #693
- feat: add load/unload endpoints for soft-prompts by @AlpinDale in #694
- fix: loading chameleon model with TP>1 by @AlpinDale in #695
- fix: consolidated `is_tpu()` and suppress tpu import warning by @AlpinDale in #696
- fix: manually install triton for other devices to prevent outlines errors by @AlpinDale in #697
- feat: support for Audio modality by @AlpinDale in #698
- chore: migrate gptq_marlin to AphroditeParameters by @AlpinDale in #699
- chore: update fused MoE weight loading by @AlpinDale in #700
- feat: add Solar model support by @AlpinDale in #701
- feat: migrate awq and awq_marlin to AphroditeParameter by @AlpinDale in #702
- chore: spawn engine process from api server process by @AlpinDale in #703
- chore: use the `compressed-tensors` library to avoid code reuse by @AlpinDale in #704
- feat: add aphrodite plugin system by @AlpinDale in #705
- Revert "chore: use the
compressed-tensors
library to avoid code reuse (#704)" by @AlpinDale in #706 - feat: add support for multi-host TPU by @AlpinDale in #707
- fix: import ray under a guard by @AlpinDale in #708
- fix: empty sampler output when temperature is too low by @AlpinDale in #709
- fix: disable embeddings API for chat models by @AlpinDale in #710
- feat: implement mistral tokenizer mode by @AlpinDale in #711
- feat: support profiling with multiple multi-modal inputs per prompt by @AlpinDale in #712
- chore: multi-step args and sequence modifications by @AlpinDale in #713
- chore: set per-rank XLA cache for TPU by @AlpinDale in #714
- chore: add support for up to 2048 block size by @AlpinDale in #715
- fix: install protobuf for cpu by @AlpinDale in #716
- fix: weight loading for scalars by @AlpinDale in #718
- chore: quant config for speculative draft models by @AlpinDale in #719
- feat: enable prompt logprobs in OpenAI API by @AlpinDale in #720
- chore: update grafana template by @AlpinDale in #721
- ci: bump aphrodite to 0.6.1 by @AlpinDale in #722
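The temperature-last option from #670 moves temperature to the end of the sampler chain, so truncation filters such as min-p see the unscaled distribution before the logits are flattened or sharpened. A hedged sketch; the `temperature_last` field name is inferred from the `temp_last` warning fixed in #869 (v0.6.5 above) and should be treated as an assumption:

```python
# Sketch: applying temperature after the other samplers (#670), so that
# filters like min_p operate on the unscaled distribution. The
# temperature_last field name is an assumption inferred from #869.
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",  # default Aphrodite port
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "prompt": "Write a haiku about GPUs.",
        "max_tokens": 48,
        "temperature": 1.3,
        "min_p": 0.05,
        "temperature_last": True,
    },
)
print(resp.json()["choices"][0]["text"])
```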
Full Changelog: v0.6.0.post1...v0.6.1
v0.6.0.post1
What's Changed
- feat: add siglip encoder for llava family by @AlpinDale in #626
- readme: fix model name typo by @Trapper4888 in #627
- feat: multi-image input for minicpmv by @AlpinDale in #628
- feat: Add support for GPU device selection in SpecDecodeBaseSampler by @AlpinDale in #629
- feat: per-tensor token epilogue kernels by @AlpinDale in #630
- chore: optimize evictor v2 performance by @AlpinDale in #631
- feat: initial encoder-decoder support with BART model by @AlpinDale in #633
- fix: default api port and attention selector by @AlpinDale in #634
- fix: clean up incorrect log in worker by @AlpinDale in #636
- bump to v0.6.0.post1 by @AlpinDale in #635
New Contributors
- @Trapper4888 made their first contribution in #627
Full Changelog: v0.6.0...v0.6.0.post1