Releases: PygmalionAI/aphrodite-engine

v0.4.9

03 Feb 16:40

Another hotfix following v0.4.8. Fixes issues with GGUF and includes the Hadamard tensors back in the wheel.

Full Changelog: v0.4.8...v0.4.9

v0.4.8

03 Feb 08:43

Quick hotfix to v0.4.7, which wasn't including LoRAs in the wheel.

Full Changelog: v0.4.7...v0.4.8

v0.4.7

03 Feb 06:28
8da2be0

What's Changed

Lots of new additions after a long time.

New features and additions

  • Dynamic Temperature. (@StefanGliga)
  • Switch from Ray to NCCL for control-plane communications; a massive speedup for parallelism.
  • Support for prefix caching. The request needs to be sent as prefix + prompt. Not in the API servers yet.
  • Support for S-LoRA: load multiple LoRAs and pick one at inference time. Not in the API servers yet.
  • Speed up AWQ throughput with new kernels; close to GPTQ speeds now.
  • Custom all-reduce kernels for parallelism. Massive throughput improvements: parallel setups are now faster than single-GPU setups even at low batch sizes.
  • Add Context-Free Grammar support. The EBNF format is currently supported.
  • Add GGUF support.
  • Add QuIP# support.
  • Add Marlin support.
  • Add a /metrics endpoint for the Kobold server.
  • Add the Quadratic Sampler.
  • Add Deepseek-MoE support with fused kernels.
  • Add Grafana + Prometheus production monitoring support.
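The prefix caching item above works by reusing the expensive computation for a shared prefix, so repeated requests only pay for their suffix. A minimal pure-Python sketch of the idea (not Aphrodite's actual API; `compute_kv_state` and `PrefixCache` are illustrative stand-ins for the GPU-side attention KV cache):

```python
def compute_kv_state(tokens):
    # Stand-in for the expensive model forward pass over these tokens.
    return tuple(t * 2 for t in tokens)

class PrefixCache:
    """Caches the expensive prefix computation, keyed by the prefix tokens."""
    def __init__(self):
        self._cache = {}
        self.hits = 0

    def get(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = compute_kv_state(key)
        return self._cache[key]

def generate(cache, prefix_tokens, prompt_tokens):
    # The request is sent as prefix + prompt; when the prefix state is
    # cached, only the prompt part needs fresh computation.
    kv = cache.get(prefix_tokens)
    return kv + compute_kv_state(prompt_tokens)
```

With two requests sharing a prefix, the second request hits the cache and recomputes only its own prompt tokens.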

Bug Fixes and Small Optimizations

  • Fix temperature always being set to 1.
  • Fix logprobs crashing the server when they contained NaN or -inf. (@miku448)
  • Switch the scheduler from lists to deques, reducing complexity from quadratic to linear.
  • Fix eager_mode performance by not excessively padding on every iteration.
  • Optimize CUDA graph memory usage by tying max_num_seqs to the captured batch size; lower the value to lower memory usage.
  • Fix both safetensors and PyTorch bins being downloaded.
  • Fix a crash with max_tokens=None.
  • Fix multi-GPU on WSL.
  • Fix some outputs returning token_id=0 at high concurrency. (@50h100a)
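The scheduler change above relies on a standard complexity fact: popping from the front of a Python list shifts every remaining element (O(n) per pop, quadratic to drain the queue), while `collections.deque.popleft()` is O(1). A small sketch with illustrative names, not the actual scheduler code:

```python
from collections import deque

# A plain list here would make every front-pop O(n).
waiting = deque()

# Enqueue some pending sequence IDs (illustrative payloads).
for seq_id in range(5):
    waiting.append(seq_id)

scheduled = []
while waiting:
    # O(1) with a deque; list.pop(0) would shift all remaining elements.
    scheduled.append(waiting.popleft())
```

Draining n items this way is O(n) total instead of O(n^2), which matters at high request counts.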

v0.4.6

14 Jan 08:03
9f77f35

Full Changelog: v0.4.5...v0.4.6

v0.4.5

19 Dec 15:42

What's Changed

Important

Version 0.4.4 was skipped.

Quite a few changes this time around, most notably:

  • Implement DeciLM by @AlpinDale in #158
  • Support prompt logprobs by @AlpinDale in #162
  • Support safetensors for Mixtral along with expert parallelism for better multi-gpu by @AlpinDale in #167
  • Implement CUDA graphs for better multi-GPU and optimizing smaller models by @AlpinDale in #172
  • Fix peak memory profiling to allow higher gmu values by @AlpinDale in #166
  • Restore compatibility with Python 3.8 and 3.9 by @g4rg in #170
  • Lazily import model classes to avoid import overhead by @AlpinDale in #165
  • Add RoPE scaling support for Mixtral models by @g4rg in #174
  • Make OpenAI API keys optional by @AlpinDale in #176

Full Changelog: v0.4.4...v0.4.5

v0.4.3

12 Dec 16:06

This is a big release! We've had many new and exciting changes.

What's New

NOTE: You'll need to run pip install megablocks if you're using the wheels.

Full Changelog: v0.4.2...v0.4.3

v0.4.2

13 Nov 16:56

Full Changelog: v0.4.1...v0.4.2

v0.4.1

03 Nov 18:19

Full Changelog: v0.4...v0.4.1

v0.4

03 Nov 12:53

Full Changelog: v0.3.7...v0.4

v0.3.7

24 Oct 08:23

What's Changed

  • fix: prompt processing overhead introduced by #66 by @AlpinDale in #71
  • fix: launch AWQ kernels on the current CUDAStream by @AlpinDale in #75
  • Added min_tokens and reimplemented ignore_eos using a new logit processor by @50h100a in #70
  • feat: add PagedAttention V2 kernels by @AlpinDale in #76
  • feat: Enable banning tokens by @StefanGliga in #80
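The min_tokens item in #70 reimplements it as a logits processor: the idea is to mask the EOS token until the minimum length is reached, and ignore_eos then falls out as the "never unmask" case. A hedged sketch with made-up names (EOS_TOKEN_ID and the function signature are illustrative, not the actual processor from the PR):

```python
import math

EOS_TOKEN_ID = 2  # illustrative; real models define their own EOS ID

def min_tokens_processor(min_tokens, generated_token_ids, logits):
    """Ban EOS (set its logit to -inf) until min_tokens have been generated."""
    if len(generated_token_ids) < min_tokens:
        logits = list(logits)  # copy so the caller's logits stay untouched
        logits[EOS_TOKEN_ID] = -math.inf
    return logits
```

Because -inf becomes probability zero after softmax, the sampler cannot pick EOS early; once the threshold is met, the logits pass through unchanged.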

Full Changelog: v0.3.6...v0.3.7