[0.6.0] Release Candidate #481

AlpinDale · 2024-05-25T22:39:26Z

Target all PRs to this branch.

Warning

Highly experimental branch. A lot of things may not work.

Currently, the following are dysfunctional in this branch:

* GGUF (feat: re-add GGUF #600)
* ExLlamaV2
* SmoothQuant+

* add utils for getting the partition offset and size for current tp rank
* disable asymmetric TP for quants and lora, handle GQA allocation
* the actual splitting work in the linear layers
* padding size for the vocab/lm_head should be optional
* cache engine and spec decode model runner (kwargs only)
* pass the tp_rank to model runners
* llama support
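The first item above (partition offset and size for the current TP rank) can be illustrated with a minimal sketch. This is not the code from this branch: the function name `partition_offset_and_size` and the `multiple_of` parameter are hypothetical, and the policy of giving leftover units to the lowest ranks is an assumption made for illustration only.

```python
from typing import Tuple


def partition_offset_and_size(
    total: int, tp_rank: int, tp_size: int, multiple_of: int = 1
) -> Tuple[int, int]:
    """Return (offset, size) of the slice owned by `tp_rank`.

    Hypothetical helper: splits `total` elements across `tp_size` ranks in
    units of `multiple_of`, handing any leftover units to the lowest ranks.
    """
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    if total % multiple_of != 0:
        raise ValueError("total must be divisible by multiple_of")

    units = total // multiple_of
    base, extra = divmod(units, tp_size)
    # Ranks below `extra` own one additional unit each.
    size_units = base + (1 if tp_rank < extra else 0)
    # Every earlier rank contributed `base` units, plus one per earlier "extra" rank.
    offset_units = tp_rank * base + min(tp_rank, extra)
    return offset_units * multiple_of, size_units * multiple_of


if __name__ == "__main__":
    for rank in range(3):
        print(rank, partition_offset_and_size(10, rank, 3))
```

With this policy the slices are contiguous and cover the whole dimension, so a linear layer could load its shard as `weight[offset:offset + size]` along the split axis.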

…618)

* add getting started page
* add debugging tips
* add openai docs
* add distributed guide
* add production metrics and model support matrix
* add guide for adding new models
* huge update
* add vlm usage docs

AlpinDale mentioned this pull request Aug 5, 2024

[Bug]: Problem loading EXL2 in rc_054 #561

Open

AlpinDale and others added 29 commits August 22, 2024 14:11

chore: skip the driver worker

23408b9

chore: bump lmfe version to 0.10.3

34fc26c

chore: some more marlin cleanups

7e9d4f3

chore: deprecation warning for beam search

bf15e1b

feat: support FP8 for DeepSeekV2 MoE

1efd0f8

feat: add fuyu vision model and persimmon language model support

e13a669

fix: turn off cutlass scaled_mm for ada lovelace cards

e7e847c

chore: allow quantizing all layers of deepseek-v2

b82c397

fix: build with pylimited api in the docker file

156a249

OpenAI API Refactor (#591)

cf381a0

* feat: massive api server refactoring
* fix: tokenizer endpoint issues
* fix: BatchResponseData body should be optional

chore: simplify pipeline parallel code in llama

497bf64

fix: convert image to RGB by default

d6bf4bc

fix: allow getting the chat template from a url

96d5b8c

chore: avoid loading the unused layers and init the VLM up to the required feature space

e26a4ac

chore: enable bias w/ FP8 layers in CUTLASS kernels

b5d23ab

chore: upgrade flashinfer to 0.0.9

f83bbc6

feat: add custom triton cache manager

c8d398a

chore: add CustomAP interface to UnquantizedFusedMoEMethod

4bbf664

chore: handle aborted requests for jamba

e76bbe7

fix: minor fix for prompt adapter config

90b2f79

feat: chat completions tokenization endpoint (#592)

8432cae

feat: optimize throughput to 1.4x by using numpy for token padding

ebf8a53

feat: MoE support with Pallas GMM kernel for TPUs

e1475fb

chore: log spec decoding metrics

fc38c74

chore: separate kv_scale into k_scale and v_scale

9d7beaa

update dockerfile

6dd6408

let's not build these for now

321a089

Revert "update dockerfile"

b71a865

This reverts commit 6dd6408.

AlpinDale and others added 28 commits September 2, 2024 03:18

chore: simplify output processing with shortcut for non-parallel sampling and non-beam search usecase (#616)

29f0478

refactor: minicpmv and port Idefix2VisionTransformer

9a50e3b

refactor: factor out code for running uvicorn again

040e5af

feat: port SiglipVisionModel from transformers

c3ee71a

chore: add proper logging for spec decoding verification

edffcec

fix: support flashinfer for draft model runner

6c2e24d

fix: use ipv4 localhost form for zmq bind

e8008f2

fix: use args.trust_remote_code

bd210a6

chore: update cutlass to 3.5.1

9d98f29

fix: specify device when loading lora and embedding tensors

2a349ca

feat: non-blocking transfer in prepare_input

6c1eab6

feat: re-add GGUF (#600)

0e6c400

* refactor gguf kernels
* fix: incorrect filename for vecdotq header
* finish up the re-impl
* add requirements

minor CI fixes

e63be8e

ci: a few more ignores

dc00aa7

ci: remove clang-format

6b1f965

ci: take one of fixing lint issues

4d4e767

ci: codespell fixes

f18eeaf

ci: remove yapf

616de67

ci: remove yapf from the formatting script

c4933b1

ci: remove isort

2424207

chore: minor cleanups

2b85ffb

fix: allow loading GGUF model without .gguf extension

2894676

fix: cpu offloading with gptq

208cd54

ci: add action for deploying docs

cc7a636

docs: fix typos

638784c

chore: refactor wheel build script

75122b2

bump version to 0.6.0

c1c37c7
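One of the commits above allows loading a GGUF model whose filename lacks the `.gguf` extension. A common way to recognize such a file is to check its leading magic bytes, since GGUF files begin with the ASCII bytes `GGUF`. The sketch below shows that idea only; it is not taken from the commit, and the helper name `looks_like_gguf` is hypothetical.

```python
from pathlib import Path

# GGUF files start with these four ASCII bytes.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path) -> bool:
    """Return True if `path` is a regular file that starts with the GGUF magic.

    Hypothetical helper: it inspects file contents rather than the filename,
    so it works regardless of the extension.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(len(GGUF_MAGIC)) == GGUF_MAGIC
```

A loader could fall back to a check like this when the quantization format is not given explicitly and the extension is missing.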

AlpinDale changed the title ~~[0.5.4] Release Candidate~~ [0.6.0] Release Candidate Sep 3, 2024

AlpinDale merged commit f1d0b77 into main Sep 3, 2024
5 checks passed
