Releases: scaleapi/llm-engine
v0.0.0beta34
What's Changed
- Hardcode llama 3 70b endpoint param by @yunfeng-scale in #524
- Don't fail checking GPU memory by @yunfeng-scale in #525
- Option to read Redis URL from AWS Secret by @seanshi-scale in #526
- Fix formatting on completions documentation guide by @saiatmakuri in #527
- Higher priority for gateway by @yunfeng-scale in #529
- Non-interactive installation during docker build by @yunfeng-scale in #533
- [Client] Add guided_grammar and other missing fields by @seanshi-scale in #532
Full Changelog: v0.0.0beta33...v0.0.0beta34
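PR #532 above adds `guided_grammar` and other missing fields to the Python client. A minimal sketch of how that field might be used with `Completion.create`; the model name and grammar string below are placeholders, not taken from this release:

```python
from llmengine import Completion

# Hypothetical grammar constraining the output to "yes" or "no".
# The exact grammar syntax accepted by the backend is an assumption here.
grammar = 'root ::= "yes" | "no"'

response = Completion.create(
    model="llama-3-70b-instruct",   # placeholder model name
    prompt="Is the sky blue? Answer yes or no: ",
    max_new_tokens=5,
    temperature=0.0,
    guided_grammar=grammar,         # client field added in #532
)
print(response.output.text)
```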
v0.0.0beta33
What's Changed
- Necessary Changes for long context llama-3-8b by @sam-scale in #516
- Increase max gpu utilization for 70b models by @dmchoiboi in #517
- Infer hardware from model name by @yunfeng-scale in #515
- Improve TensorRT-LLM Functionality by @seanshi-scale in #487
- Upgrade vLLM version for batch completion by @dmchoiboi in #518
- Revert "Upgrade vLLM version for batch completion" by @dmchoiboi in #520
- Allow H100 to be used by @yunfeng-scale in #522
- vLLM version 0.4.2 Docker image by @squeakymouse in #521
- Image cache and balloon on H100s, also temporarily stop people from using A100 by @yunfeng-scale in #523
Full Changelog: v0.0.0beta32...v0.0.0beta33
v0.0.0beta32
What's Changed
- Add emitting token count metrics to datadog statsd by @seanshi-scale in #458
- Downgrade sse-starlette version by @squeakymouse in #478
- Return 400 for botocore client errors by @yunfeng-scale in #479
- Increase Kaniko Memory by @saiatmakuri in #481
- Batch job metrics by @yunfeng-scale in #480
- Use base model name as metric tag by @yunfeng-scale in #483
- Change LLM Engine base path from global var by @squeakymouse in #482
- Remove fine-tune limit for internal users by @squeakymouse in #484
- Parallel Python execution for tool completion by @yunfeng-scale in #470
- Allow JSONL for fine-tuning datasets by @squeakymouse in #486
- Fix throughput_benchmarks ITL calculation, add option to use a json file of prompts by @seanshi-scale in #485
- Add Model.update() to Python client by @squeakymouse in #490
- Bump idna from 3.4 to 3.7 in /clients/python by @dependabot in #491
- Bump idna from 3.4 to 3.7 in /model-engine by @dependabot in #492
- Properly add mixtral 8x22b by @yunfeng-scale in #493
- support mixtral 8x22b instruct by @saiatmakuri in #495
- fix return_token_log_probs on vLLM > 0.3.3 endpoints by @saiatmakuri in #498
- Package update + more docs on dev setup by @dmchoiboi in #500
- Add Llama 3 models by @yunfeng-scale in #501
- Enforce model checkpoints existing for endpoint/bundle creation by @dmchoiboi in #503
- guided decoding with grammar by @saiatmakuri in #488
- adding asyncenginedead error catch by @ian-scale in #504
- Default include_stop_str_in_output to None by @squeakymouse in #506
- get latest inference framework tag from configmap by @saiatmakuri in #505
- integration tests for completions by @saiatmakuri in #507
- patch service config identifier by @saiatmakuri in #509
- require safetensors for LLM endpoint creation by @saiatmakuri in #510
- Add py.typed for proper typechecking support on clients by @dmchoiboi in #513
- Fix package name mapping in setup.py by @dmchoiboi in #514
New Contributors
- @dmchoiboi made their first contribution in #500
Full Changelog: v0.0.0beta28...v0.0.0beta32
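PR #498 above fixes `return_token_log_probs` on newer vLLM endpoints. A minimal sketch of requesting per-token log probabilities through the Python client; the model name is a placeholder and the shape of the returned token entries is an assumption:

```python
from llmengine import Completion

response = Completion.create(
    model="llama-3-8b-instruct",    # placeholder model name
    prompt="The capital of France is",
    max_new_tokens=4,
    return_token_log_probs=True,    # behavior fixed for vLLM > 0.3.3 in #498
)

# Assumes each returned token entry exposes `token` and `log_prob` attributes.
for tok in response.output.tokens:
    print(tok.token, tok.log_prob)
```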
v0.0.0beta28
What's Changed
- Tool completion respect num new tokens by @yunfeng-scale in #469
- Azure fixes + additional asks by @squeakymouse in #468
- Metrics for stuck async requests by @squeakymouse in #471
- Fix cacher by @yunfeng-scale in #472
- Add retries to deflake integration tests by @squeakymouse in #473
- add suffix to integration tests by @saiatmakuri in #474
- fix docs tests gateway endpoint by @saiatmakuri in #475
- Guided decoding by @yunfeng-scale in #476
Full Changelog: v0.0.0beta27...v0.0.0beta28
v0.0.0beta27
What's Changed
- Try to fix async requests getting stuck by @squeakymouse in #466
- [Client] Add num_prompt_tokens to the client's CompletionOutputs by @seanshi-scale in #467
Full Changelog: v0.0.0beta26...v0.0.0beta27
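PR #467 above adds `num_prompt_tokens` to the client's `CompletionOutput`. A minimal sketch of reading it alongside the existing completion token count (model name is a placeholder):

```python
from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",             # placeholder model name
    prompt="Summarize: LLM Engine serves open-source models.",
    max_new_tokens=32,
)

output = response.output
# num_prompt_tokens is the field added in #467; num_completion_tokens predates it.
print(f"prompt tokens: {output.num_prompt_tokens}, "
      f"completion tokens: {output.num_completion_tokens}")
```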
v0.0.0beta26
What's Changed
- [SC-836587] Pin boto3 and urllib3 versions to fix error in inference image by @edgan8 in #432
- include stop string in completions output by @saiatmakuri in #435
- Logging post inference hook implementation by @tiffzhao5 in #428
- add codellama-70b models by @saiatmakuri in #436
- Add hook validation and support logging for python client by @tiffzhao5 in #437
- Azure refactor for async endpoints by @squeakymouse in #425
- Remove post inference hook handling in main container by @tiffzhao5 in #438
- Clean up logs for logging hook by @tiffzhao5 in #439
- Fix Infra Task Gateway by @saiatmakuri in #443
- support gemma models by @saiatmakuri in #444
- Fix infra config dependency by @squeakymouse in #449
- Add emitted timestamp for logging by @tiffzhao5 in #450
- Change cache update time for async endpoint integration test by @tiffzhao5 in #451
- Bump aiohttp from 3.9.1 to 3.9.2 in /model-engine by @dependabot in #446
- Bump python-multipart from 0.0.6 to 0.0.7 in /model-engine by @dependabot in #447
- Bump gitpython from 3.1.32 to 3.1.41 in /model-engine by @dependabot in #453
- Log endpoint in sensitive_log_mode by @squeakymouse in #455
- Bump orjson from 3.8.6 to 3.9.15 in /model-engine by @dependabot in #456
- Allow the load test script to use a csv of inputs by @seanshi-scale in #440
- add some debugging to vllm docker by @yunfeng-scale in #454
- Add product label validation by @edgan8 in #442
- Add log statement for gateway sending async task by @tiffzhao5 in #459
- Some batch inference improvements by @yunfeng-scale in #460
- Fix cacher by @yunfeng-scale in #462
- Fix vllm batch docker image by @yunfeng-scale in #463
- Add tool completion to batch inference by @yunfeng-scale in #461
- fix llm-engine finetune.create failures by @ian-scale in #464
- Change back batch infer GPU util and add tool completion client changes by @yunfeng-scale in #465
New Contributors
- @edgan8 made their first contribution in #432
Full Changelog: v0.0.0beta25...v0.0.0beta26
v0.0.0beta25
What's Changed
- LLM benchmark script improvements by @seanshi-scale in #427
- Allow using pydantic v2 by @seanshi-scale in #429
- Fix helm chart nodeSelector for GPU endpoints by @squeakymouse in #430
- Allow pydantic 2 in python client requested requirements by @seanshi-scale in #433
- Fix batch job permissions by @yunfeng-scale in #431
- [Client] Add Auth headers to the python async routes by @seanshi-scale in #434
Full Changelog: v0.0.0beta22...v0.0.0beta25
v0.0.0beta22
What's Changed
- Change middleware format by @squeakymouse in #393
- Fix custom framework Dockerfile by @squeakymouse in #395
- fixing tensorrt-llm enum value (fixes #390) by @ian-scale in #396
- overriding model length for zephyr 7b alpha by @ian-scale in #398
- time completions use case by @saiatmakuri in #397
- update docs to show model len / context windows by @ian-scale in #401
- Add MultiprocessingConcurrencyLimiter to gateway by @squeakymouse in #399
- change code-llama to codellama by @ian-scale in #400
- fix completions request id by @saiatmakuri in #402
- Allow latest inference framework tag by @squeakymouse in #403
- Bump helm chart version by @seanshi-scale in #406
- 4x sqlalchemy pool size by @yunfeng-scale in #405
- bump datadog module to 0.47.0 by @saiatmakuri in #407
- Fix autoscaler node selector by @seanshi-scale in #409
- Log request sizes by @yunfeng-scale in #410
- add support for mixtral-8x7b and mixtral-8x7b-instruct by @saiatmakuri in #408
- Make sure metadata is not incorrectly wiped during endpoint update by @yunfeng-scale in #413
- Always return output for completions sync response by @yunfeng-scale in #412
- handle update endpoint errors by @saiatmakuri in #414
- [bug-fix] LLM Artifact Gateway .list_files() by @saiatmakuri in #416
- enable sensitive log mode by @song-william in #415
- Throughput benchmark script by @yunfeng-scale in #411
- Upgrade vllm to 0.2.7 by @yunfeng-scale in #417
- LLM batch completions API by @yunfeng-scale in #418
- Small update to vllm batch by @yunfeng-scale in #419
- sensitive content flag by @yunfeng-scale in #421
- Revert a broken refactoring by @yunfeng-scale in #423
- [Logging I/O] Post inference hooks as background tasks by @tiffzhao5 in #422
- Batch inference client / doc by @yunfeng-scale in #424
- Minor fixes for batch inference by @yunfeng-scale in #426
Full Changelog: v0.0.0beta20...v0.0.0beta22
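PRs #418 and #424 above introduce the LLM batch completions API and its client/docs support. A rough sketch of what a batch request might look like from the Python client; the helper class names, response fields, and S3 paths below are assumptions, not something confirmed by this changelog:

```python
from llmengine import (
    Completion,
    CreateBatchCompletionsModelConfig,      # assumed class name
    CreateBatchCompletionsRequestContent,   # assumed class name
)

# All values below are placeholders.
response = Completion.batch_create(
    output_data_path="s3://my-bucket/batch-output/",
    model_config=CreateBatchCompletionsModelConfig(
        model="mistral-7b",
        checkpoint_path="s3://my-bucket/checkpoints/mistral-7b/",
        labels={},
    ),
    content=CreateBatchCompletionsRequestContent(
        prompts=["What is deep learning?", "Explain quantization."],
        max_new_tokens=64,
        temperature=0.0,
    ),
)
print(response.job_id)  # field name assumed
```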
v0.0.0beta20
What's Changed
- Patch post_file client method by @song-william in #323
- Add pod disruption budget to all endpoints by @yunfeng-scale in #328
- create celery worker with inference worker profile by @saiatmakuri in #327
- Bump http forwarder request CPU by @yunfeng-scale in #330
- [Docs] Clarify get-events API usage by @seanshi-scale in #320
- Enable additional Datadog tagging for jobs by @song-william in #324
- fix celery worker profile for s3 access by @saiatmakuri in #333
- Hardcode number of forwarder workers by @yunfeng-scale in #334
- Standardize logging initialization by @song-william in #337
- Fix up the mammoth max length issue. by @sam-scale in #335
- Add docs for Model.create, update default values and fix per_worker concurrency by @yunfeng-scale in #332
- updating docs to add codellama models by @ian-scale in #343
- Add PodDisruptionBudget to model engine by @yunfeng-scale in #342
- Allow auth to accept API keys by @saiatmakuri in #326
- Add job_name in build logs for easier debugging by @song-william in #340
- Make PDB optional by @yunfeng-scale in #344
- Revert "fix celery worker profile for s3 access" by @yixu34 in #345
- Revert "Revert "fix celery worker profile for s3 access"" by @saiatmakuri in #346
- Pass file ID to fine-tuning script by @squeakymouse in #347
- llama should have None max length by @sam-scale in #348
- taking out codellama13b and 34b by @ian-scale in #349
- Change DATADOG_TRACE_ENABLED to DD_TRACE_ENABLED by @edwardpark97 in #350
- Allow fine-tuning hyperparameter to be Dict by @squeakymouse in #353
- adding real auth to integration tests by @ian-scale in #352
- add new llm-jp models to llm-engine by @ian-scale in #354
- Generalize SQS region by @jaisanliang in #355
- Track LLM Metrics by @saiatmakuri in #356
- Remove extra trace facet "launch.resource_name" by @saiatmakuri in #359
- Ianmacleod/add codellama instruct, refactor codellama models by @ian-scale in #360
- Various changes/bugfixes to chart/code to streamline deployment on different forms of infra by @seanshi-scale in #339
- Add PR template by @song-william in #341
- Unmount aws config from root by @song-william in #361
- Implement automated code coverage for CI by @tiffzhao5 in #362
- Download only known files by @squeakymouse in #364
- Documentation fix by @squeakymouse in #365
- Change more AWS config mount paths by @squeakymouse in #367
- Validating inference framework image tags by @tiffzhao5 in #357
- Ianmacleod/add codellama 34b by @ian-scale in #369
- Better error when model is not ready for predictions by @tiffzhao5 in #368
- Improve metrics route team tags by @saiatmakuri in #371
- Enable custom istio metric tags with Telemetry API by @song-william in #373
- Use Variable name for Telemetry Helm Resources by @song-william in #374
- Forward HTTP status code for sync requests by @yunfeng-scale in #375
- Integrate TensorRT-LLM by @yunfeng-scale in #358
- Fine-tuning e2e integration test by @tiffzhao5 in #372
- Found a bug in the codellama vllm model_len logic. by @sam-scale in #380
- Fix sample.yaml by @yunfeng-scale in #381
- count prompt tokens by @saiatmakuri in #366
- Fix integration test by @yunfeng-scale in #383
- emit metrics on token counts by @saiatmakuri in #382
- Increase llama-2 max_input_tokens by @sam-scale in #384
- Revert "Found a bug in the codellama vllm model_len logic." by @yunfeng-scale in #386
- Some updates to integration tests by @yunfeng-scale in #385
- Celery autoscaler by @squeakymouse in #378
- Don't install Celery autoscaler for test deployments by @squeakymouse in #388
- LLM update API route by @squeakymouse in #387
- adding zephyr 7b by @ian-scale in #389
- update tensor-rt llm in enum by @ian-scale in #390
- pypi version bump by @ian-scale in #391
New Contributors
- @edwardpark97 made their first contribution in #350
- @jaisanliang made their first contribution in #355
- @tiffzhao5 made their first contribution in #362
Full Changelog: v0.0.0beta19...v0.0.0beta20
v0.0.0beta19
What's Changed
- Increase graceful timeout and hardcode AWS_PROFILE by @squeakymouse in #306
- bump pypi version by @ian-scale in #303
- Ianmacleod/add mistral by @ian-scale in #307
- Ianmacleod/add falcon 180b by @ian-scale in #309
- update 180b inference framework by @ian-scale in #310
- Adding code llama to TGI by @mfagundo-scale in #311
- Add AWQ enum by @yunfeng-scale in #317
- Fix documentation to reference Files API by @squeakymouse in #312
- Return TGI errors by @yunfeng-scale in #313
- Fix streaming endpoint failure handling by @yunfeng-scale in #314
- Validate quantization by @yunfeng-scale in #315
- Properly return PENDING status for docker image batch jobs/fine tune jobs by @seanshi-scale in #318
- add user_id and team_id as log facets by @song-william in #321
- publish 0.0.0b19 by @yunfeng-scale in #322
New Contributors
- @mfagundo-scale made their first contribution in #311
Full Changelog: v0.0.0beta18...v0.0.0beta19
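PR #312 above points the documentation at the Files API. A minimal sketch of uploading a fine-tuning dataset and referencing it by file ID; the local file, model name, and response fields are placeholders or assumptions:

```python
from llmengine import File, FineTune

# Upload a local training dataset; the returned object is assumed to carry an `id`.
upload = File.upload(open("training_dataset.csv", "r"))

# Start a fine-tune against the uploaded file (model name is a placeholder).
fine_tune = FineTune.create(
    model="llama-2-7b",
    training_file=upload.id,
)
print(fine_tune.id)  # field name assumed
```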