Releases: PygmalionAI/aphrodite-engine
v0.6.2.post1
What's Changed
- fix: kobold api for horde by @AlpinDale in #763
- Fix for a crash from token bans by @Pyroserenus in #764
- Modified throughput benchmark to allow --max-num-seqs by @Pyroserenus in #770
- Simplify construction of sampling_metadata by @50h100a in #766
- Add OLMoE by @fizzAI in #772
- feat: ministral support by @AlpinDale in #776
- Make amd usable by @Naomiusearch in #775
- docker: apply AMD patch in the dockerfile by @AlpinDale in #777
- fix: demote skip_special_tokens assertion to logger error by @AlpinDale in #778
- ci: bump version to 0.6.2.post1 by @AlpinDale in #779
Full Changelog: v0.6.2...v0.6.2.post1
v0.6.2
What's Changed
- feat: FP8 quantization support for AMD ROCm by @AlpinDale in #729
- feat: add experts_int8 support by @AlpinDale in #730
- chore: move update_flash_attn_metadata to attn backend by @AlpinDale in #731
- chore: register lora functions as torch ops by @AlpinDale in #732
- feat: dynamo support for ScalarType by @AlpinDale in #733
- fix: types in AQLM and GGUF for dynamo support by @AlpinDale in #736
- fix: `custom_ar` check by @AlpinDale in #737
- fix: clear engine ref in RPC server by @AlpinDale in #738
- fix: use nvml to get consistent device names by @AlpinDale in #739
- feat: add Exaone model support by @shing100 in #743
- fix: minor bug fixes & clean-ups by @AlpinDale in #744
- chore: refactor `MultiModalConfig` initialization and profiling by @AlpinDale in #745
- chore: various TPU fixes and optimizations by @AlpinDale in #746
- fix: metrics endpoint with RPC server by @AlpinDale in #747
- chore: refactor llama3 rope by @AlpinDale in #748
- feat: add XTC Sampling by @AlpinDale in #740
- ci: fix dep install using pnpm by @ahme-dev in #749
- ci: fix docs deployment by @ahme-dev in #750
- chore: re-enable custom token bans by @AlpinDale in #751
- feat: bring back dynatemp by @AlpinDale in #754
- feat: quant_llm support by @AlpinDale in #755
- fix: add pandas to requirements by @AlpinDale in #756
- docs: update readme and quant docs by @AlpinDale in #757
- ci: bump version to 0.6.2 by @AlpinDale in #758
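The XTC sampling (#740) and dynamic temperature (#754) changes above are exposed as extra sampling parameters on the OpenAI-compatible server. Below is a minimal sketch of passing them from the official `openai` Python client; the field names (`xtc_threshold`, `xtc_probability`, `dynatemp_min`, `dynatemp_max`), the port, and the model name are assumptions, so check the sampling docs for your build before relying on them.

```python
# Hedged sketch: send Aphrodite-specific sampling parameters through the
# OpenAI-compatible API using extra_body. Field names are assumed, not verified.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Write one sentence about the sea."}],
    max_tokens=64,
    extra_body={
        "xtc_threshold": 0.1,    # assumed name: XTC probability cutoff
        "xtc_probability": 0.5,  # assumed name: chance of applying XTC per token
        "dynatemp_min": 0.5,     # assumed names: dynamic temperature bounds
        "dynatemp_max": 1.5,
    },
)
print(resp.choices[0].message.content)
```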
Full Changelog: v0.6.1.post1...v0.6.2
v0.6.1.post1
What's Changed
- chore: register custom torch ops for flash-attn and flashinfer by @AlpinDale in #724
- feat: launch API server with uvloop by @AlpinDale in #725
- chore: fix return statement in `Detokenizer.decode_sequence_inplace` by @AlpinDale in #727
- Fix tensor parallelism, libcudart path for some versions of pytorch by @miku448 in #726
- ci: bump to 0.6.1.post1 by @AlpinDale in #728
Full Changelog: v0.6.1...v0.6.1.post1
v0.6.1
Aphrodite Engine - v0.6.1
What's Changed
- ci: exclude cu118 from build and add py_limited_api by @AlpinDale in #639
- fix: better async request cancellation by @AlpinDale in #641
- fix: gracefully handle missing chat template by @AlpinDale in #642
- chore: deduplicate nvlink check to cuda platform by @AlpinDale in #643
- fix: hardcoded float16 in embedding mode check by @AlpinDale in #645
- quadratic sampling: separate diff from logits to filter out NaNs. by @50h100a in #644
- fix: RSLoRA support by @AlpinDale in #647
- feat: introduce `BaseAphroditeParameter` by @AlpinDale in #646
- fix: move zeromq rpc frontend to IPC instead of TCP by @AlpinDale in #652
- fix: input processor in internvl2 by @AlpinDale in #653
- fix: multiprocessing timeout by @AlpinDale in #654
- fix: GPTQ/AWQ on Colab by @AlpinDale in #655
- fix: make `merge_async_iterators.is_cancelled()` optional by @AlpinDale in #656
- fix: flashinfer outputs by @AlpinDale in #657
- fix: max_num_batched_tokens should not be limited for lora by @AlpinDale in #658
- fix: lora with pipeline parallel by @AlpinDale in #659
- fix: kill api server when pinging dead engine by @AlpinDale in #660
- fix: `get_num_blocks_touched` logic by @AlpinDale in #661
- chore: update the env.py script and the bug report template by @AlpinDale in #662
- feat: add INT8 W8A16 quant for TPU by @AlpinDale in #663
- feat: allow serving encoder-decoder models in the API server by @AlpinDale in #664
- fix: deps with TPU dockerfile by @AlpinDale in #665
- optimization: reduce end-to-end overhead from python obj allocation by @AlpinDale in #666
- fix: minor adjustments to scheduler and block manager by @AlpinDale in #667
- feat: enable using fp8 kv and prefix caching with chunked prefill by @AlpinDale in #668
- fix: mlpspeculator with padded vocab by @AlpinDale in #669
- feat: option to apply temperature scaling last by @AlpinDale in #670
- chore: decouple `should_modify_greedy_probs_inplace` by @AlpinDale in #671
- chore: better stream termination in async engine by @AlpinDale in #672
- chore: mamba cache single buffer by @AlpinDale in #673
- feat: mamba model support by @AlpinDale in #674
- fix: reinit procedure in `ModelInputForGPUBuilder` by @AlpinDale in #675
- feat: embeddings support for batched OAI endpoint by @AlpinDale in #676
- fix: fp8 checkpoints with fused linear modules by @AlpinDale in #677
- feat: add numpy implementation of `compute_slot_mapping` by @AlpinDale in #678
- fix: chunked prefill with v2 block manager by @AlpinDale in #679
- fix: phi3v batch inference with different aspect ratio images by @AlpinDale in #680
- chore: use mark_dynamic to reduce TPU compile times by @AlpinDale in #681
- chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockerfiles by @AlpinDale in #682
- refactor: base worker input refactor for multi-step by @AlpinDale in #683
- build: add empty device by @AlpinDale in #684
- chore: update flashinfer to v0.1.3 by @AlpinDale in #685
- feat: allow image embeddings for VLM input by @AlpinDale in #686
- feat: add progress bar for loading individual weight modules by @AlpinDale in #640
- chore: use public ECR for neuron image by @AlpinDale in #687
- fix: logit softcapping in flash-attn by @AlpinDale in #688
- chore: use scalar type to dispatch to different `gptq_marlin` kernels by @AlpinDale in #689
- fix: allow passing float for GiB arguments by @AlpinDale in #690
- build: bump cmake to 3.26 by @AlpinDale in #691
- fix: shut down ray dag workers cleanly by @AlpinDale in #692
- feat: add lora loading/unloading api endpoint by @AlpinDale in #693
- feat: add load/unload endpoints for soft-prompts by @AlpinDale in #694
- fix: loading chameleon model with TP>1 by @AlpinDale in #695
- fix: consolidated `is_tpu()` and suppress tpu import warning by @AlpinDale in #696
- fix: manually install triton for other devices to prevent outlines errors by @AlpinDale in #697
- feat: support for Audio modality by @AlpinDale in #698
- chore: migrate gptq_marlin to AphroditeParameters by @AlpinDale in #699
- chore: update fused MoE weight loading by @AlpinDale in #700
- feat: add Solar model support by @AlpinDale in #701
- feat: migrate awq and awq_marlin to AphroditeParameter by @AlpinDale in #702
- chore: spawn engine process from api server process by @AlpinDale in #703
- chore: use the `compressed-tensors` library to avoid code reuse by @AlpinDale in #704
- feat: add aphrodite plugin system by @AlpinDale in #705
- Revert "chore: use the
compressed-tensors
library to avoid code reuse (#704)" by @AlpinDale in #706 - feat: add support for multi-host TPU by @AlpinDale in #707
- fix: import ray under a guard by @AlpinDale in #708
- fix: empty sampler output when temperature is too low by @AlpinDale in #709
- fix: disable embeddings API for chat models by @AlpinDale in #710
- feat: implement mistral tokenizer mode by @AlpinDale in #711
- feat: support profiling with multiple multi-modal inputs per prompt by @AlpinDale in #712
- chore: multi-step args and sequence modifications by @AlpinDale in #713
- chore: set per-rank XLA cache for TPU by @AlpinDale in #714
- chore: add support for up to 2048 block size by @AlpinDale in #715
- fix: install protobuf for cpu by @AlpinDale in #716
- fix: weight loading for scalars by @AlpinDale in #718
- chore: quant config for speculative draft models by @AlpinDale in #719
- feat: enable prompt logprobs in OpenAI API by @AlpinDale in #720
- chore: update grafana template by @AlpinDale in #721
- ci: bump aphrodite to 0.6.1 by @AlpinDale in #722
Full Changelog: v0.6.0.post1...v0.6.1
v0.6.0.post1
What's Changed
- feat: add siglip encoder for llava family by @AlpinDale in #626
- readme: fix model name typo by @Trapper4888 in #627
- feat: multi-image input for minicpmv by @AlpinDale in #628
- feat: Add support for GPU device selection in SpecDecodeBaseSampler by @AlpinDale in #629
- feat: per-tensor token epilogue kernels by @AlpinDale in #630
- chore: optimize evictor v2 performance by @AlpinDale in #631
- feat: initial encoder-decoder support with BART model by @AlpinDale in #633
- fix: default api port and attention selector by @AlpinDale in #634
- fix: clean up incorrect log in worker by @AlpinDale in #636
- bump to v0.6.0.post1 by @AlpinDale in #635
New Contributors
- @Trapper4888 made their first contribution in #627
Full Changelog: v0.6.0...v0.6.0.post1
v0.6.0
v0.6.0 - "Kept you waiting, huh?" Edition
What's Changed
- Fix quants installation on ROCM by @Naomiusearch in #469
- chore: add contribution guidelines + Code of Conduct by @AlpinDale in #507
- Remove `$` from the shell code blocks in README by @matthusby in #538
- [0.6.0] Release Candidate by @AlpinDale in #481
New Contributors
- @matthusby made their first contribution in #538
Full Changelog: v0.5.3...v0.6.0
v0.5.3
What's Changed
A new release, one that took too long again. We have some cool new features, however.
- ExllamaV2 tensor parallel: You can now run ExllamaV2 quantized models on multiple GPUs. This should be the fastest multi-gpu experience with exllamav2 models.
- Support for Command-R+
- Support for DBRX
- Support for Llama-3
- Support for Qwen 2 MoE
- `min_tokens` sampling param: You can now set a minimum number of tokens to generate.
- Fused MoE for AWQ and GPTQ quants: AWQ and GPTQ kernels have been updated with optimized fused MoE code. They should be significantly faster now.
- CMake build system: Slightly faster, much cleaner builds.
- CPU support: You can now run aphrodite on CPU only systems! Needs an AVX512-compatible CPU for now.
- Speculative Decoding: Speculative Decoding is finally here! You can either use a draft model, or use prompt lookup decoding with an ngram model (built-in).
- Chunked Prefill: Before this, Aphrodite would process prompts in chunks equal to the model's context length. Now, you can enable this option (via `--enable-chunked-prefill`) to process prompts in chunks of 768 tokens by default, massively increasing the amount of context you can fit. It does not currently work with context shift or the FP8 KV cache.
- Context Shift reworked: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and re-use them.
- FP8 E4M3 KV Cache: This is for ROCm only. Support will be extended to NVIDIA soon. E4M3 has higher quality compared to E5M2, but doesn't lead to any throughput increase.
- Auto-truncation in API: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` to truncate any prompt larger than 1024 tokens (see the example after this list).
- Support for Llava vision models: Currently 1.5 is supported. With the next release, we should have 1.6 along with a proper GPT4-V compatible API.
- LM Format Enforcer: You can now use LMFE for guided generations.
- EETQ Quantization: EETQ support has been added - a SOTA 8bit quantization method.
- Arbitrary GGUF model support: We were previously limited to Llama models for GGUF; now any GGUF model is supported. You will need to convert the model beforehand, however.
- Aphrodite CLI app: You no longer have to type `python -m aphrodite...`. Simply type `aphrodite run meta-llama/Meta-Llama-3-8B` to get started. Pass extra flags as normal.
- Sharded GGUF support: You can now load sharded GGUF models. Pre-conversion needed.
- NVIDIA P100/GP100 support: Support has been restored.
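To make the new flags concrete, here is a minimal sketch that assumes the server was launched with the CLI app and chunked prefill enabled, then sends a request using the new `min_tokens` and `truncate_prompt_tokens` parameters via the `openai` client's `extra_body`. The port and exact request fields are assumptions; adjust them to your deployment.

```python
# Hedged sketch, assuming the server was started with something like:
#   aphrodite run meta-llama/Meta-Llama-3-8B --enable-chunked-prefill
# The port and extra_body field names are assumptions based on this changelog.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="A very long prompt goes here.",
    max_tokens=256,
    extra_body={
        "min_tokens": 32,                # generate at least 32 tokens
        "truncate_prompt_tokens": 1024,  # left-truncate prompts longer than 1024 tokens
    },
)
print(resp.choices[0].text)
```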
Thanks to all the new contributors!
Full Changelog: v0.5.2...v0.5.3
v0.5.2
What's Changed
A few fixes and new additions:
- Support for CohereAI's command-r model: Currently, GGUF is unsupported. You can load the base model with `--load-in-4bit` or `--load-in-smooth` if you have an RTX 20xx series (or sm_75).
- Fix an issue where some GPU blocks were missing. This should give a significant boost to how much context you can use.
- Fix logprobs when they are -inf with some models.
Full Changelog: v0.5.1...v0.5.2
v0.5.1
What's Changed
- feat(openai): Apply chat template for GGUF loader by @drummerv in #312
- Calculate total memory usage. by @sgsdxzy in #316
- chore: add new iMatrix quants by @AlpinDale in #320
- fix: optimize AQLM dequantization by @AlpinDale in #325
Full Changelog: v0.5.0...v0.5.1
v0.5.0
Aphrodite Engine, Release v0.5.0: It's Quantin' Time Edition
It's been over a month since our last release. Below is re-written using Opus from my crude hand-written release notes.
New Features
- Exllamav2 Quantization: Exllamav2 quantization has been added, although it's currently limited to a single GPU due to kernel constraints.
- On-the-Fly Quantization: With the help of `bitsandbytes` and `smoothquant+`, we now support on-the-fly quantization of FP16 models. Use `--load-in-4bit` for lightning-fast 4-bit quantization with `smoothquant+`, `--load-in-smooth` for 8-bit quantization using `smoothquant+`, and `--load-in-8bit` for 8-bit quantization using the `bitsandbytes` library (note: this option is quite slow). `--load-in-4bit` needs Ampere GPUs and above; the other two need Turing and above.
- Marlin Quantization: Marlin quantization support has arrived, promising improved speeds at high batch sizes. Convert your GPTQ models to Marlin, but keep in mind that they must be 4-bit, with a group_size of -1 or 128, and act_order set to False.
- AQLM Quantization: We now support the state-of-the-art 2-bit quantization scheme, AQLM. Please note that both quantization and inference are extremely slow with this method. Quantizing llama-2 70b on 8x A100s reportedly takes 12 days, and on a single 3090 it takes 70 seconds to reach the prompt processing phase. Use this option with caution, as the wait may cause the engine to time out (set to 60 seconds).
- INT8 KV Cache Quantization: In addition to fp8_e5m2, we now support INT8 KV Cache. Unlike FP8, it doesn't speed up throughput (it stays the same), but it should offer higher quality due to the calibration process. Uses the `smoothquant` algorithm for the quantization.
- Implicit GGUF Model Conversion: Simply point the `--model` flag to your GGUF file, and it will work out of the box (see the sketch after this list). Be aware that this process requires a considerable amount of RAM to load the model, convert tensors to a PyTorch state_dict, and then load them. Plan accordingly or convert first if you're short on RAM.
- LoRA support in the API: The API now supports loading and inferencing LoRAs! Please refer to the wiki for detailed instructions.
- New Model Support: We've added support for a wide range of models, including OPT, Baichuan, Bloom, ChatGLM, Falcon, Gemma, GPT2, GPT Bigcode, InternLM2, MPT, OLMo, Qwen, Qwen2, and StableLM.
- Fused Mixtral MoE: Mixtral models (FP16 only) now utilize tensor parallelism with fused kernels, replacing the previous expert parallelism approach. Quantized Mixtrals still have this limitation, but we plan to address it by the next release.
- Fused Top-K Kernels for MoE: This improvement benefits Mixtral and DeepSeek-MoE models by accelerating the top-k operation using custom CUDA kernels instead of `torch.topk`.
- Enhanced OpenAI Endpoint: The OpenAI endpoint has been refactored, introducing JSON and Regex schemas, as well as a detokenization endpoint.
- LoRA Support for Mixtral Models: You can now use LoRA with Mixtral models.
- Fine-Grained Seeds: Introduce randomness to your requests with per-request seeds.
- Context Shift: We have a naive context shifting mechanism. While it's not as effective as we'd like, it's available for experimentation purposes. Enable it using the `--context-shift` flag.
- Cubic Sampling: Building upon quadratic sampling's smoothing_factor, we now support smoothing_curve.
- Navi AMD GPU Support: GPUs like the 7900 XTX are now supported, although still experimental and requiring significant compilation effort due to xformers.
- Kobold API Deprecation: The Kobold API has been deprecated and merged into the OpenAI API. Launch the OpenAI API using the `--launch-kobold-api` flag. Please note that Kobold routes are not protected with the API key.
- LoRA Support for Quantized Models: We've added LoRA support for GPTQ and AWQ quantized models.
- Logging Experience Overhaul: We've revamped the logging experience using a custom `loguru` class, inspired by tabbyAPI's recent changes.
- Informative Logging Metrics: Logging has been enhanced to display model memory usage and reduce display bloat, among other improvements.
- Ray Worker Health Check: The engine now performs health checks on Ray workers, promptly reporting any silent failures or timeouts.
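As a rough illustration of the implicit GGUF conversion and per-request seeds described above, here is a minimal sketch using the offline Python API (mirroring the vLLM-style `LLM`/`SamplingParams` interface). The GGUF path and the `seed` keyword are assumptions and may differ between versions.

```python
# Hedged sketch: load a GGUF checkpoint directly and request a seeded completion.
# The model path is a placeholder; conversion happens at load time and needs RAM.
from aphrodite import LLM, SamplingParams

llm = LLM(model="/path/to/model.gguf")  # implicit GGUF -> PyTorch conversion on load

# Per-request seed: re-running with the same seed should reproduce the output.
params = SamplingParams(temperature=0.8, max_tokens=64, seed=1234)
outputs = llm.generate(["Once upon a time,"], params)
print(outputs[0].outputs[0].text)
```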
Bug Fixes
- Resolved an issue where `smoothing_factor` would break at high batch sizes.
- Fixed a bug with LoRA vocab embeddings.
- Addressed the missing CUDA suffixes in the version number (e.g., `0.5.0+cu118`). The suffix is now appended when using a CUDA version other than 12.1.
- Dynatemp has been split into separate min/max parameters instead of a single range. The Kobold endpoint still accepts a range as input.
- Fixed worker initialization in WSL.
- Removed the accidental inclusion of FP8 kernels in the ROCm build process.
- The EOS token is now removed by default from the output, unrelated to the API.
- Resolved memory leaks caused by NCCL CUDA graphs.
- Improved garbage collection for LoRAs.
- Optimized the execution of embedded runtime scripts.
Upcoming Improvements
Here's a sneak peek at what we're working on for the next release:
- Investigating tensor parallelism with Exllamav2
- Addressing the issue of missing GPU blocks for GGUF and Exl2 (we already have a fix for FP16, GPTQ, and AWQ)
New Contributors
- @anon998 made their first contribution in #253
- @sgsdxzy made their first contribution in #256
- @SwadicalRag made their first contribution in #268
- @thomas-xin made their first contribution in #260
- @StefanDanielSchwarz made their first contribution in #264
- @Pyroserenus made their first contribution in #296
- @Autumnlight02 made their first contribution in #288
Full Changelog: v0.4.9...v0.5.0