[DO NOT MERGE] Upstream test PR #322
Commits on Sep 3, 2024
Remove mark step from static MoE loop (#231)
Removes an unnecessary mark_step call from the MoE op loop to speed up computation.
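For context, a minimal hypothetical sketch of the kind of change described above (the module path, loop shape, and names are assumptions, not the PR's code):

```python
import torch
import habana_frameworks.torch.core as htcore  # assumed available on Gaudi/HPU installs

def static_moe_loop(hidden_states, experts, routing_weights):
    # Static MoE loop: every expert runs on every token, scaled by its routing weight.
    out = torch.zeros_like(hidden_states)
    for i, expert in enumerate(experts):
        out += routing_weights[:, i].unsqueeze(-1) * expert(hidden_states)
        # htcore.mark_step()  # removed: a per-expert graph break adds launch overhead
    return out
```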
Commit: b4f6a29
Commit: 733524a
Signed-off-by: Chendi.Xue <[email protected]>
Commit: fb98cad
Commits on Sep 4, 2024
Commit: 49ffde6
Commit: 538c8f1
Enable llama-405b - w/a for memory allocation error (#184)
Workaround for an allocation error while loading llama-405b.
Commit: 691255b
[bugfix] handle large bucket minimums correctly (#235)
This bugfix addresses incorrect lower boundary handling for bucketing.

Previous behavior:
```
INFO 09-03 19:36:28 habana_model_runner.py:564] Prompt bucket config (min, step, max_warmup) bs:[64, 32, 64], seq:[768, 128, 768]
INFO 09-03 19:36:28 habana_model_runner.py:577] Generated 12 prompt buckets: [(32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768)]
INFO 09-03 19:36:28 habana_model_runner.py:582] Omitted 0 prompt buckets due to exceeded token budget (max_num_batched_tokens=131072)
INFO 09-03 19:36:28 habana_model_runner.py:590] Decode bucket config (min, step, max_warmup) bs:[64, 128, 64], seq:[768, 128, 1024]
INFO 09-03 19:36:28 habana_model_runner.py:601] Generated 8 decode buckets: [(64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024)]
INFO 09-03 19:36:28 habana_model_runner.py:606] Omitted 0 decode buckets due to exceeded token budget (max_num_batched_tokens=131072)
```
The min seq_len dimension is set to 768, but buckets with seq_len 128-768 are present.

Current behavior:
```
INFO 09-03 19:45:42 habana_model_runner.py:563] Prompt bucket config (min, step, max_warmup) bs:[64, 32, 64], seq:[768, 128, 768]
INFO 09-03 19:45:42 habana_model_runner.py:576] Generated 1 prompt buckets: [(64, 768)]
INFO 09-03 19:45:42 habana_model_runner.py:581] Omitted 0 prompt buckets due to exceeded token budget (max_num_batched_tokens=131072)
INFO 09-03 19:45:42 habana_model_runner.py:589] Decode bucket config (min, step, max_warmup) bs:[64, 128, 64], seq:[768, 128, 1024]
INFO 09-03 19:45:42 habana_model_runner.py:600] Generated 3 decode buckets: [(64, 768), (64, 896), (64, 1024)]
INFO 09-03 19:45:42 habana_model_runner.py:605] Omitted 0 decode buckets due to exceeded token budget (max_num_batched_tokens=131072)
```
No bucket with seq_len < 768 is captured.
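As an illustration only (this is not the habana_model_runner code), one way to generate warmup buckets that respect the configured minimum is to start the range at the minimum rounded up to a step multiple:

```python
def warmup_range(bmin: int, bstep: int, bmax: int) -> list:
    """Bucket sizes from bmin to bmax in bstep increments, never below bmin."""
    start = ((bmin + bstep - 1) // bstep) * bstep  # round bmin up to a step multiple
    return list(range(start, bmax + 1, bstep))

# With the seq_len config from the log above, warmup_range(768, 128, 768) -> [768],
# so no prompt bucket with seq_len < 768 is generated.
```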
Commit: a4e1d52
fix guided_decode HPU failing issue
Signed-off-by: Chendi.Xue <[email protected]>
Commit: 8046d81
Commits on Sep 5, 2024
Remove token budget from decode buckets (#241)
This PR prevents max_num_batched_tokens from limiting decode buckets, as decode buckets should be limited by number of blocks, not by max_num_batched_tokens.
Commit: 7cd226c
[habana_main bugfix] Fix min bucket boundary calculation (#239)
Ports #97 to habana_main
Commit: d0eb7d7
Mask based BGMV implementation (#223)
Refactors BGMV implementation from gather based to mask-based to optimize performance and reduce device memory usage.
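A rough, hypothetical sketch of the idea (shapes, names, and the einsum formulation are assumptions, not the PR's kernels): instead of gathering each token's LoRA-B matrix by index, a one-hot mask selects it via matmul, which keeps shapes static on HPU.

```python
import torch

def bgmv_mask_based(x, lora_b_all, lora_indices):
    # x: [tokens, rank], lora_b_all: [num_loras, rank, out], lora_indices: [tokens]
    num_loras = lora_b_all.shape[0]
    mask = torch.nn.functional.one_hot(lora_indices, num_loras).to(x.dtype)  # [tokens, num_loras]
    b_per_token = torch.einsum("tl,lro->tro", mask, lora_b_all)              # [tokens, rank, out]
    return torch.bmm(x.unsqueeze(1), b_per_token).squeeze(1)                 # [tokens, out]
```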
Commit: 05acb89
Commits on Sep 6, 2024
Commit: d2e2854
Commit: 97bd0fd
Commit: ededdaf
Commit: b507cc4
Commit: 016f343
Use all possible slot values for dummy blocks to avoid caching issues.
Commit: d9fa7cf
Use PT_COMPILE_ONLY_MODE during warmup (#227)
With the PT_COMPILE_ONLY_MODE flag, graphs can be compiled without performing synLaunch. The flag has been added to the warmup phase to decrease its execution time.
Commit: 7488c58
Do not pass warmup_mode to execute_model_kwargs (#229)
This fixes a very silly issue where mismatching values of `warmup_mode` flag could cause graph recompilations and eventually memory leaks.
Commit: 17447ed
Add error handling for PT_COMPILE_ONLY_MODE (#251)
This PR fixes crashes observed on older Synapse builds, introduced with #227. Setting PT_COMPILE_ONLY_MODE is not supported in current or older public Synapse builds; we should not crash because of it, but rather advise the user to use the latest build.

Previous behavior:
```
...
INFO 09-06 17:08:37 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
INFO 09-06 17:08:37 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -159.6 MiB of host memory (414.9 GiB/1007 GiB used)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/software/users/kzawora/vllm-utils/vllm_hpu_simple_test.py", line 9, in <module>
[rank0]:     llm = LLM(model="facebook/opt-125m")
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 456, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 266, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 378, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/executor/habana_executor.py", line 89, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 202, in initialize_cache
[rank0]:     self._warm_up_model()
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 220, in _warm_up_model
[rank0]:     self.model_runner.warmup_model(self.hpu_cache[0])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_model_runner.py", line 1412, in warmup_model
[rank0]:     with compile_only_mode_context():
[rank0]:   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]:     return next(self.gen)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/internal/bridge_config.py", line 20, in env_setting
[rank0]:     get_func = globals()['get_' + var.lower()]
[rank0]: KeyError: 'get_pt_compile_only_mode'
inc shutdown
inc shutdown
inc shutdown
inc shutdown
```

Current behavior:
```
...
INFO 09-06 17:06:42 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
INFO 09-06 17:06:43 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -143.7 MiB of host memory (415 GiB/1007 GiB used)
WARNING 09-06 17:06:43 habana_model_runner.py:1419] Cannot use PT_COMPILE_ONLY_MODE. Warmup time will be negatively impacted. Please update Gaudi Software Suite.
INFO 09-06 17:06:43 habana_model_runner.py:1336] [Warmup][Prompt][1/23] batch_size:2 seq_len:1024 free_mem:40.28 GiB
...
```
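A hedged sketch of the fallback pattern described above; the module and function names are taken from the traceback and warning in this description, and the probing logic itself is an assumption:

```python
from contextlib import nullcontext

import habana_frameworks.torch.internal.bridge_config as bc

def compile_only_mode_context_or_null(logger):
    # Probe whether this Synapse build understands PT_COMPILE_ONLY_MODE;
    # older builds raise KeyError when the setting is entered.
    try:
        with bc.env_setting("PT_COMPILE_ONLY_MODE", True):
            pass
        return bc.env_setting("PT_COMPILE_ONLY_MODE", True)
    except KeyError:
        logger.warning("Cannot use PT_COMPILE_ONLY_MODE. Warmup time will be "
                       "negatively impacted. Please update Gaudi Software Suite.")
        return nullcontext()
```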
Commit: b50aa14
Commits on Sep 9, 2024
Hardcode fastapi version due to pydantic error (#255)
Fixes a serving-mode issue caused by an error in fastapi.
Commit: 00f1333
Mask based BGMV implementation for LoRA Embedding (#247)
This PR contains a mask-based BGMV implementation for LoRA embedding instead of an index-select of LoRA-B weights. It also removes the special handling for the no-LoRA case.
Commit: b764610
Eliminate graph breaks for torch.compile mode (#202)
Eliminate two graph breaks for torch.compile mode:
1. [__graph_breaks] torch._dynamo.exc.Unsupported: builtin: eq [<class 'torch._dynamo.variables.misc.GetAttrVariable'>, <class 'torch._dynamo.variables.constant.EnumVariable'>] False
2. [__graph_breaks] torch._dynamo.exc.Unsupported: Tensor.item

Signed-off-by: yuwenzho <[email protected]>
Commit: 73af823
Commits on Sep 10, 2024
Port flat PA from habana_next to habana_main (#169)
Co-authored-by: Michal Adamczyk <[email protected]>
Co-authored-by: barak goldberg <[email protected]>
Co-authored-by: Michal Szutenberg <[email protected]>
Co-authored-by: Jan Kaniecki <[email protected]>
Commit: 5cf8441
Commit: 2fed15b
Commit: f74fe23
Commit: e2c8b5a
Commit: 4194195
Add disable_tensor_cache=True to HPUGraph capture (#252)
RuntimeErrors are not observed anymore on habana_main when disable_tensor_cache is used. This PR enables disable_tensor_cache.
Commit: 4052bdb
Commit: c9bf908
On habana_main the slots are calculated by adding an offset to the block, which breaks the check for _PAD_SLOT_ID. Reworked it so that in the case of _PAD_BLOCK_ID we automatically insert the right value.
Commit: 69df1e7
Commit: 53f96b7
Commit: d436d38
Commit: 61b6fbb
Commits on Sep 11, 2024
Commit: 2091161
Merge remote-tracking branch 'origin/habana_main' into private/kzawora/vllm_v0_6_0_rebase
Commit: c9bdcbe
Commit: 8e41fb5
Commit: 68e0f57
Commit: b776d5e
Fix LoRA test by handling mask creation inside the test (#270)
This PR handles mask creation inside the LoRA unit tests to align with the new BGMV implementation.
Commit: c0ff22f
Commits on Sep 12, 2024
Attn MetaData dtype should be same as model dtype (#271)
Attn MetaData was hard coded to bfloat16, leading to a runtime error for float32 model instantiation.
Commit: f858d43
Commit: acf7d54
Fixed ALiBi and the [MPT-7B](https://www.databricks.com/blog/mpt-7b) model. Accuracy results compared to CPU (collected using [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)):

| Tasks | CPU | HPU |
| -------------- | ------ | ------ |
| arc_challenge | 0.4224 | 0.4189 |
| arc_easy | 0.6974 | 0.6999 |
| hellaswag | 0.7603 | 0.7626 |
| lambada_openai | 0.7306 | 0.7326 |
| mmlu | 0.293 | 0.2925 |
| winogrande | 0.6851 | 0.6811 |
Commit: 6a734f4
Update gaudi-installation.rst (#279)
Fixing ENV variables' names after flat-PA merge
Commit: 543bb6d
Commit: c2c1e0f
Commit: 6b3503c
Commit: 8535d53
Commit: 27b618a
Remove hardcoded value from softmax in flat_pa (#280)
This PR removes the hardcoded value used to normalize softmax in flat_pa. The current approach is to use the global maximum, as it is very easy to compute, but it has the drawback that other samples in a batch might slightly affect numerical stability. This is a first step towards eliminating some of the INF/NaN issues we see in certain configurations, and by no means a complete solution; it will need to be revised in the future.
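A minimal sketch of the normalization idea (illustrative only, not the flat_pa kernel):

```python
import torch

def softmax_with_global_max(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract a single global maximum (cheap to compute) instead of a hardcoded
    # constant before exponentiation; per-row maxima would decouple samples in the
    # batch, but are more expensive to gather.
    global_max = scores.amax()
    exp = torch.exp(scores - global_max)
    return exp / exp.sum(dim=dim, keepdim=True)
```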
Commit: 35a4a98
Fix yapf detected format issue
Signed-off-by: Chendi.Xue <[email protected]>
Commit: 046cb25
Signed-off-by: Chendi.Xue <[email protected]>
Commit: aa4c59c
Commit: 181babf
Commits on Sep 13, 2024
Increase garbage collector's threshold (#281)
Increase the garbage collector's threshold in order to reduce its collection frequency.
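The knob in question is the standard-library one shown below; the multiplier is a placeholder, not the value chosen in the PR:

```python
import gc

gen0, gen1, gen2 = gc.get_threshold()     # CPython defaults are (700, 10, 10)
gc.set_threshold(gen0 * 100, gen1, gen2)  # placeholder factor: raise the gen-0 threshold so collections run less often
```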
Commit: 88b06c2
[Bugfix][Habana_main] fix guided_decode HPU failing issue (#236)
FIX #198
After this change, tool_calls can be returned successfully:
```bash
Compiling FSM index for all state transitions: 100%|████████████████████████████████████████████████████████████████████████| 55/55 [00:01<00:00, 32.86it/s]
INFO 09-04 02:15:34 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-04 02:15:34 logger.py:36] Received request chat-0fd03b03ae05473488d9bce566401d91: prompt: "<|im_start|>user\nWhat's the weather like in Boston today?<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [27, 91, 318, 5011, 91, 29, 882, 198, 3923, 596, 279, 9282, 1093, 304, 10406, 3432, 76514, 91, 318, 6345, 91, 397, 27, 91, 318, 5011, 91, 29, 78191, 198], lora_request: None, prompt_adapter_request: None.
INFO 09-04 02:15:34 async_llm_engine.py:173] Added request chat-0fd03b03ae05473488d9bce566401d91.
INFO 09-04 02:15:36 async_llm_engine.py:140] Finished request chat-0fd03b03ae05473488d9bce566401d91.
INFO:     127.0.0.1:40452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Message: ChatCompletionMessage(content='', refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-af3eac9372144f959ed0df7e16cf5da4', function=Function(arguments='{ "location": "Boston, MA", "unit": "fahrenheit" }', name='get_current_weather'), type='function')])
```
Commit: 54c1688
fix rotary embedding rotary_dim not equal head_size case (#245)
For models (like chatglm2/3-6b) whose `rotary_dim` is not equal to `head_size`, the current code will crash because the dimensions do not match. #212 had a fix that is not robust enough: the chatglm series could run, but chatglm2-6b results were not correct. This fix follows the vLLM rotary_embedding PyTorch-native implementation; verified on chatglm2-6b and chatglm3-6b.
Commit: 8a92591
[Bugfix][Habana_main] - dbrx model and arctic model codes fix to remove CUDA hardcode (#217)
FIX #216
Commit: ffa7174
Add Dockerfile.hpu
FIX #199
Commit: f4ac1f9
fix ruff detected format error
Signed-off-by: Chendi.Xue <[email protected]>
Commit: 1a35da2
Signed-off-by: Chendi.Xue <[email protected]>
Commit: 3b710a6
Commits on Sep 16, 2024
Commit: 5abe4d7
Commits on Sep 17, 2024
optimized topp/topk calculation (#195)
## One line description Use topk instead of sort for topp/topk calculation under certain conditions (scalar value of p and k). ## Details Instead of using `k` for topk, we use `_padded_k`, which is strictly larger than k and monotonically non decreasing. We need/use `_padded_k > k` for cases where the smallest value of the topk=k values has some values beyond k, (for example for [9,8,8,8,7,7,7], with k=3, we have [9,8,8,8], which is 4 instead of 3 values), To prevent excessive recompilations, anytime we require an expansion of `_padded_k` we increment with a fixed constant `_increment` (usually >1), to have a bucketed approach to prevent multiple shapes ### Basic outline 1. perform topk with `_padded_k` 2. find the "kth" value in each row (smallest number that will be in topk), this is variable `num_duplicates_of_smallest_of_topk` 3. find maximum of number of duplicates, this variable is `max_num_duplicates_of_smallest_of_topk` 4. check if `_padded_k` is big enough to contain `max_num_duplicates_of_smallest_of_topk`. if not, then expand `_padded_k`, and redo the topk again with expanded `_padded_k` 6. maskout the values that are extra in `_padded_k` 7. move to doing topp ## Perf benefit ### Using benchmark_throughput.py To check benefit of this PR, make following change in `benchmark_throughput.py`: ``` diff --git a/benchmarks/benchmark_throughput.py b/benchmarks/benchmark_throughput.py index ff33e3dc..3383dea8 100644 --- a/benchmarks/benchmark_throughput.py +++ b/benchmarks/benchmark_throughput.py @@ -116,8 +116,9 @@ def run_vllm( sampling_params.append( SamplingParams( n=n, - temperature=0.0 if use_beam_search else 1.0, - top_p=1.0, + temperature=1.0, #0.0 if use_beam_search else 1.0, + top_p=0.95, + top_k=20, use_beam_search=use_beam_search, ignore_eos=True, max_tokens=output_len, ``` `VLLM_SKIP_WARMUP=true VLLM_GRAPH_RESERVED_MEM=0.2 VLLM_GRAPH_PROMPT_RATIO=0.8 VLLM_DECODE_BS_BUCKET_MIN=1 VLLM_DECODE_BLOCK_BUCKET_STEP=64 VLLM_DECODE_BLOCK_BUCKET_MIN=64 python benchmark_throughput.py --model /root/sasarkar/llama3-8b/ --device hpu --seed 2024 --backend vllm --num-prompts 100 --dtype bfloat16 --input-len=256 --output-len=512` in the numbers below there is a **49%** increase in thruput in the case with warmup, and **30%** increase in thruput in the case without warmup #### with opt + warmup Processed prompts: 100%|█████████████████████████████████████████████████████████████████████| 100/100 [00:22<00:00, 4.37it/s, est. speed input: 1119.66 toks/s, output: 2239.33 toks/s] Throughput: 4.37 requests/s, 3354.58 tokens/s #### with opt + skip warmup Processed prompts: 100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:46<00:00, 2.17it/s, est. speed input: 556.32 toks/s, output: 1112.63 toks/s] Throughput: 2.17 requests/s, 1667.89 tokens/s #### without opt + warmup Processed prompts: 100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:34<00:00, 2.93it/s, est. speed input: 749.24 toks/s, output: 1498.48 toks/s] Throughput: 2.92 requests/s, 2245.74 tokens/s #### without opt + skip warmup Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 100/100 [00:59<00:00, 1.67it/s, est. 
speed input: 428.49 toks/s, output: 856.99 toks/s] Throughput: 1.67 requests/s, 1284.85 tokens/s ### Using server Client (Data collected by Peter) [baseline](https://github.com/HabanaAI/vllm-fork/commits/a7763a7a76b4531ed7907549724df2949d9225bf/) all numbers on 1.17-495 third column [branch ](https://github.com/HabanaAI/vllm-fork/commits/ae_benchmark_9_10_24/) | model | TP | baseline HPU thruput | baseline HPU + this PR thruput | baseline HPU + this PR + other opt | | -------- | ------- | ------- | ------- | ------- | | llama3 8b | 1 | 950 | 1296 | 1306 | | llama3 8b | 4 | 1347 | 1969 | 2077 | | llama3 70b | 4 | 368 | 394 | 394 | | qwen 72b | 4 | 731 | 726 | 815 | ### Without delayed sampling On habana_main f858d43 ```VLLM_GRAPH_RESERVED_MEM=0.2 VLLM_GRAPH_PROMPT_RATIO=0.8 VLLM_DECODE_BS_BUCKET_MIN=1 VLLM_DECODE_BLOCK_BUCKET_STEP=64 VLLM_DECODE_BLOCK_BUCKET_MIN=64 python benchmark_throughput.py --model /root/sasarkar/llama3-8b/ --device hpu --seed 2024 --backend vllm --num-prompts 100 --dtype bfloat16 --input-len=256 --output-len=512``` Without change Throughput: 3.32 requests/s, 2550.85 tokens/s With change: Throughput: 5.17 requests/s, 3967.58 tokens/s ## Extra Notes 1. Works only for "scalar" case, though it might be possible to extend the basic idea (topk instead of sort) for vector case as well. (Outline of this is: find max k in topk vector, then perform topk using that, etc. needs some bucketing possibly to prevent dyn shapes etc) 2. Need an additional check in `_init_sampling_tensors` to determine if its scalar case. This has a minor perf hit. ideally if someone could tell us that its a scalar from the top itself... 3. Some tradeoffs can be made, where we use a sufficiently large padded_k (which is still smaller than vocab size) from the beginning, and hope that every case lands within that bucket. Cases that wont land are expected to be very, very rare. For example if padded_k = max(2 * k, 100) is used, and k = say 50, then we need the smallest of the topk value to repeat 50 times with same probability, which is exceedingly unlikely. If we trade off this mathematical improbability, then we can do with only 1 topk op, which might be faster 4. There is a `fliplr` in the code, which could be removed, if we can compute reverse cumsum. however the formula for reverse cumsum as expressed [here ](pytorch/pytorch#33520), ` x + torch.sum(x, dim=1, keepdims=True) - torch.cumsum(x, dim=1)` is numerically unstable, because of the addition/subtraction. It works well enough on ints and large numbers, but not on small probability values. 5. The value of `k` affects the gains we might get from this. For example in the expt shown above, with k=20, thruput increases from 1284.85 to 1667.89 (30% gain). But if k = 2000, instead of 20, throughput increases from 1127.34 to 1289.26 (14% gain). Thus the gain % might decrease with increasing k, as asymptotically topk would probably converges to sort's performance for large k. However practically k is pretty small. 6. For larger models, the gains may be less, as they are more device bound probably 7. Cumsum may be taking long. Maybe try below. 
[Initial try](b392ff8)

```python
import torch

# Cumsum via a masked sum: multiply by a lower-triangular mask and reduce.
y = torch.tensor([[1, 2, 3], [4, 5, 6]])
mask1 = torch.tensor([[[1, 0, 0], [1, 1, 0], [1, 1, 1]],
                      [[1, 0, 0], [1, 1, 0], [1, 1, 1]]])
torch.sum(y.unsqueeze(1) * mask1, 2)
```

or

```python
import torch
import torch.nn.functional as F

# Cumsum as a 1D convolution with an all-ones kernel over a zero-padded input.
F.conv1d(torch.tensor([[[0, 0, 0, 0, 1, 2, 3, 4, 5]],
                       [[0, 0, 0, 0, 6, 7, 8, 9, 10.0]]]),
         torch.ones([1, 1, 5], dtype=torch.float32))
```
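For reference, here is a minimal sketch of the scalar-case top-k/top-p filtering built on `torch.topk` instead of a full sort. It is illustrative only: the function name and the renormalization details differ from the actual vLLM sampler.

```python
import torch

def top_k_top_p_scalar(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    # Scalar case: every sequence in the batch shares the same top_k/top_p.
    probs = torch.softmax(logits, dim=-1)
    # Only the k largest probabilities are needed, so a full-vocab sort is avoided.
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)   # descending order
    cum_probs = torch.cumsum(topk_probs, dim=-1)
    # Keep tokens whose preceding cumulative mass is still below p
    # (the highest-probability token is always kept).
    keep = (cum_probs - topk_probs) <= p
    kept_logits = logits.gather(-1, topk_idx).masked_fill(~keep, float("-inf"))
    # Scatter the surviving logits back into a full-vocabulary tensor.
    return torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, kept_logits)
```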
Commit 4c1ca3a
Commit 1a712d5
[Bugfix][Habana_main] fix multi-modal model inference - tested with llava-1.5 (#283)
FIX #282
Commit 44c4f93
Add fake HPU mode to Habana components with dummy habana_frameworks module. (#250)
Co-authored-by: Konrad Zawora <[email protected]>
Commit a9de5ba
Update documentation on support of fp8 (#288)
Update documentation on support of fp8
Commit d39298c
Commit ed19acd
Removed vllm.hpu directory and changed relevant imports (#291)
Moved files from vllm/hpu to another public repo: https://github.com/HabanaAI/vllm-hpu-extension. It can be installed with `pip install git+https://github.com/HabanaAI/vllm-hpu-extension.git`.
Commit 6a96d9b
Reduce default value of VLLM_GRAPH_RESERVED_MEM to 0.1 (#292)
After #252, HPUGraph capture takes much less memory, and we can reduce the memory reserved for HPUGraphs. On Llama3.1-8b-Instruct (G2), capturing 100% of prefill and decode graphs on BS=256 now takes 1.566 GB of HBM, which is far less than 40% (~30 GB) we reserve by default. This results in lots of unused (==wasted) memory, which could be used instead for more KV cache blocks.
Commit 47a89be
Commit 18d6339
Commits on Sep 18, 2024
-
Fix minor logging issue in habana_model_runner.py (#294)
The original code doesn't print the default value correctly:

INFO 09-17 00:06:07 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:_**min**_)
INFO 09-17 00:06:07 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_STEP=1 (default:_**step**_)
INFO 09-17 00:06:07 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_MAX=1 (default:_**max**_)

This change makes it print the correct default value:

INFO 09-17 21:30:51 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:_**1**_)
INFO 09-17 21:30:51 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_STEP=4 (default:_**32**_)
INFO 09-17 21:30:51 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_MAX=4 (default:_**64**_)
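A minimal sketch of the corrected pattern, assuming a small helper along these lines (the function name and logger setup are illustrative, not the actual habana_model_runner code):

```python
import logging
import os

logger = logging.getLogger(__name__)

def read_bucket_setting(name: str, default: int) -> int:
    # Log the resolved value together with the real default,
    # instead of the placeholder strings "min"/"step"/"max".
    value = int(os.environ.get(name, default))
    logger.info("%s=%d (default:%d)", name, value, default)
    return value
```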
Commit 83b54e9
Fix blocks number calculation for Flat PA (#269)
Fix blocks number calculation for Flat PA via adding empty table_block (#158)
Commit b62fba8
Commits on Sep 19, 2024
-
Commit 347f9c7
Commits on Sep 20, 2024
-
Remove dummy seq group data creation from loop (#301)
Remove dummy seq metadata from loop for Flat PA fix
Commit cd7b1c1
optimize qwen2 model on Gaudi (#233)
Add extra mark_step() on each decode layer to optimize the performance on Gaudi. Signed-off-by: Bob Zhu <[email protected]>
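A hedged sketch of the pattern (the loop and names here are illustrative, not the exact Qwen2 model code in vLLM):

```python
import habana_frameworks.torch as htorch

def run_decoder_layers(layers, hidden_states, *args, **kwargs):
    for layer in layers:
        hidden_states = layer(hidden_states, *args, **kwargs)
        # Cut the accumulated lazy-mode graph here, so the Gaudi graph compiler
        # handles one layer-sized segment at a time instead of the whole model.
        htorch.core.mark_step()
    return hidden_states
```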
Commit 12d7033
fix bug: device_str in initialize_ray_cluster requires uppercase string (#297)
Without the fix, multi-HPU runs hit "ValueError: The number of required hpus exceeds the total number of available hpus in the placement group", because device_str is not uppercase as expected, so the number of available HPUs always comes back as 0.
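A toy illustration of the failure mode, under the assumption that Ray reports Gaudi devices under the uppercase "HPU" resource name (this is not the vLLM code itself):

```python
import ray

ray.init()
resources = ray.cluster_resources()
print(resources.get("hpu", 0))  # lowercase lookup: always 0, triggers the ValueError path
print(resources.get("HPU", 0))  # uppercase lookup: the actual number of Gaudi devices
```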
Commit bc39baa
Commit b2653ab
Commit 82960d8
Commit f4d2097
Commit 9f8b8e7
Commit 346139d
Commit 6d45443
Commit 3a0ff3b
Commit 6502b91
Commit 7057da5
Commit 43df762
Commit 3134b8a
Commits on Sep 23, 2024
-
Fix calculating slots for warmup (#310)
Recent changes broke slot sparsity for warmup slots. This commit restores the functionality.
Commit f92ffc1
Removed padding block from a list of available blocks in allocators (#313)
Block 0 is used for padding. This PR removes the padding block from the list of available blocks in block allocators v1 and v2.
Commit 63fae51
Fix seq_len for padding sequences (#318)
Before the fix we used seq_len=0 for padding samples. This was later translated to an empty attention_mask (since we don't have any tokens that we should include in calculations) and in turn caused NaNs in prompt attention (0 divided by 0). Those NaNs later got propagated to kv-cache causing issues in flat_pa.
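A minimal toy reproduction of that failure mode (values are illustrative): a fully masked attention row normalizes 0/0 into NaN.

```python
import torch

# A padding sample contributes no valid tokens, so every attention score is masked out.
scores = torch.full((1, 4), float("-inf"))
weights = torch.softmax(scores, dim=-1)  # exp(-inf) / sum(exp(-inf)) -> 0 / 0
print(weights)                           # tensor([[nan, nan, nan, nan]])
```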
Commit aa507d4
Commit b70a8c2
Commit a844837
Fix lora specific conditions in profile-run (#317)
#256 breaks LoRA specific flow which was handled through `is_profile_run` flag to distinguish warmup and profile-run phase. Introduces a new flag `is_lora_profile_run` to handle this LoRA specific flow in profile-run.
Commit 084db0f
Commit a9f94be
Run with HPU graphs even when warmup was skipped (#320)
Before this PR we relied on stored information about which configurations should have HPU graphs enabled, but that set was computed during warmup, so when warmup was skipped the information was missing. This PR runs all buckets with HPU graphs enabled when warmup is skipped.
Commit 9bb65b7
Commit 2a499c7
Commit 9372734
Commit c15ddd2
Commit f5d254d
Commit e00ab5a
Commit 3bb593a
Commit f9b222e
Commit 2f23cb7
Commit 28df6fd
Commit c6d2d5a
Commit 97c398e
Commit 6a913b3
Move profilers to vllm-hpu-extension (#323)
Continuation of HabanaAI/vllm-hpu-extension#4 I've also removed is_tpu, as it got mistakenly restored in the rebase. It's not in the upstream.
Commit c64dc83
Commit f56953f
Commit c562b02
Commit cf3bbd2
Commit 09357b4
Commit 3713da8
Commit bb6564a
Commits on Sep 24, 2024
-
Restore upstream requirements-build.txt (#324)
At some point, someone added whitespaces to each entry in requirements-build.txt. Upstream does not contain it. Easy fix.
Commit c968320
Remove reminder_comment.yml workflow (#325)
This workflow never worked properly in the fork. This PR removes it.
Commit 58d5cde
Commit cf4c3e5
Merge remote-tracking branch 'origin/habana_main' into private/kzawora/pruned_habana_main
Commit aa5edcc
Commit f6ff4a7
Commit a000e62
This PR fixes all the little warnings gaudi-installation.rst introduces during documentation build ("WARNING: Title underline too short." etc.)
Commit 41217cf
typo: `platform` -> `platforms`
Commit 4eb9809
Merge remote-tracking branch 'origin/habana_main' into private/kzawora/pruned_habana_main
Commit c1232e9
Commit 20c87dd
Remove vllm.utils.is_hpu() (#331)
vllm.utils.is_hpu() was redundant for some time now and has always been problematic particularly for torch.compile mode. Now, we're fully switching to current_platform.is_hpu().
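The replacement pattern named above, as a minimal before/after sketch (it assumes the vllm-fork package is installed):

```python
# Old, removed helper:
#   from vllm.utils import is_hpu
#   if is_hpu():
#       ...

# Replacement:
from vllm.platforms import current_platform

if current_platform.is_hpu():
    ...
```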
Commit 9be37a3
Merge remote-tracking branch 'origin/habana_main' into private/kzawora/pruned_habana_main
Commit c90e153
Commit 874f3d8
Remove logger from layernorm (#332)
Upstream does not use logger in layernorm. Neither do we. No idea why it's there.
Commit e16918d
Merge remote-tracking branch 'origin/habana_main' into private/kzawora/pruned_habana_main
Commit 18b0e98
Commit 347380f
Fix INC FP8 inference after rebase (#333)
This PR fixes the "RuntimeError: HPU does not have device capability." error introduced after rebase & fixes loading weights on CPU for quantization.
Commit 73f4b48
Merge remote-tracking branch 'origin/habana_main' into private/kzawora/pruned_habana_main
Commit fc1cf5e
Commit e2f72e3
Commit b582d77
Commit b90adac
Commit d853eeb
Make weights_load_device not change EngineArgs.create_load_config() (#336)
Some backends rely on calling EngineArgs.create_load_config() directly, and we had altered that API. Altering it is not needed to enable the weights_load_device functionality; this PR fixes that.
Commit 9111a80
Commit db8dbce
Revert "fix guided_decode HPU failing issue"
This reverts commit 8046d81.
Commit c337e93
Commit e8e369f
Commits on Sep 25, 2024
-
Refine INC shutdown code (#335)
This PR removes debug printouts in INC shutdown method and covers the case where application exits before model is initialized properly.
Commit 8c6dcae
Setting enough cache_size_limit for torch.compile warmup (#238)
Fix the issue that warmup sometimes doesn't work because the default cache_size_limit is only 8.
Signed-off-by: zehao-intel <[email protected]>
Co-authored-by: Andrzej Kotłowski <[email protected]>
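A minimal sketch of the idea, assuming the number of warmup buckets is known (the helper name and the exact margin are illustrative, not the values used in the PR):

```python
import torch
import torch._dynamo

def configure_compile_cache(num_warmup_buckets: int) -> None:
    # torch._dynamo keeps at most cache_size_limit compiled variants per code object
    # (default 8); allow one variant per warmup bucket plus some headroom so graphs
    # compiled during warmup are not recompiled again at serving time.
    torch._dynamo.config.cache_size_limit = max(
        torch._dynamo.config.cache_size_limit, num_warmup_buckets + 8
    )
```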
Commit cef2f54
Change default values for decode bucket flags (#316)
Change default values for decode bucket flags
Commit 45ee586
Support loading checkpoints quantized using Autofp8 (#286)
Support loading https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127
- Skip CUDA checks
- Use scaled_fp8_quant instead of _scaled_mm
- Fix weights and weight_scale for the Gaudi2 float8_e4m3fn range

Co-authored-by: Nir David <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Commit 29fb5ed
Commits on Sep 26, 2024
-
Fix torch.compile issue of dispatch key set mismatch (#299)
### Issue:
torch.compile recompiles after warmup because `tensor 'L['input_ids']' dispatch key set mismatch. expected DispatchKeySet(HPU, BackendSelect), actual DispatchKeySet(HPU, BackendSelect, ADInplaceOrView).`

### Detail:
Run the script with `TORCH_LOGS="guards"` and the dispatch key set info differs:

- warmup:
```
TENSOR_MATCH: check_tensor(L['input_ids'], Tensor, DispatchKeySet(HPU, BackendSelect), torch.int64, device=0, requires_grad=False, size=[2, 1], stride=[1, 1])  # masked_input = input_  # ome/zyuwen/workspace/vllm/habana_main_g3_v2/vllm/model_executor/layers/vocab_parallel_embedding.py:358 in forward
```
- after warmup:
```
TENSOR_MATCH: check_tensor(L['input_ids'], Tensor, DispatchKeySet(HPU, BackendSelect, ADInplaceOrView), torch.int64, device=0, requires_grad=False, size=[2, 1], stride=[1, 1])  # masked_input = input_  # ome/zyuwen/workspace/vllm/habana_main_g3_v2/vllm/model_executor/layers/vocab_parallel_embedding.py:358 in forward
```

### Solution:
The difference in dispatch key set is caused by the `torch.inference_mode()` decoration; here is a simple example:

```python
import torch
import habana_frameworks.torch as htorch

@torch.inference_mode()
def func():
    x = torch.rand(3, 3).to("hpu")
    print(torch._C._dispatch_key_set(x))

func()  # output: DispatchKeySet(HPU, AutocastHPU)
```

```python
import torch
import habana_frameworks.torch as htorch

def func():
    x = torch.rand(3, 3).to("hpu")
    print(torch._C._dispatch_key_set(x))

func()  # output: DispatchKeySet(HPU, ADInplaceOrView, AutogradHPU, AutocastHPU)
```

In vllm-fork, the warmup phase is decorated with `torch.inference_mode()` in [habana_model_runner.py#L1487-L1488](https://github.com/HabanaAI/vllm-fork/blob/b62fba85ac03326e9f466d8d37e91ae1b14a6511/vllm/worker/habana_model_runner.py#L1487-L1488), but the after-warmup phase is not. So in this PR the decorator is added to the `prepare_input_tensors` function to keep the dispatch key set the same.

Signed-off-by: yuwenzho <[email protected]>
Commit 4c8a6c6
Chunk prefill cache writes, remove div_i32 from insert_or_update_cache (#289)
Re-implements the following PRs for current habana_main:
- #102 (Removing div_i32 operations from each layer)
- #115 (removing scatter for reshape&cache in case of prompt)

Accuracy (GSM8K on Llama3.1-8B-Instruct):

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
| gsm8k_cot_llama | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8415 | ± | 0.0101 |
| | | strict-match | 8 | exact_match | ↑ | 0.8400 | ± | 0.0101 |

I've benchmarked this change on Llama3.1-8B-Instruct: on average, a +2.50% throughput gain (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) can be observed across all prefill buckets on G2, with up to a +4.40% (+956.79 tok/s, ~25031 -> ~25988 tok/s) throughput increase in compute-bound scenarios.
Commit 1c6bada
Commit fccaca0
Commit 5ffcfa3
Commits on Sep 27, 2024
-
Fix runtime errors reported when using long input sequence lengths with LoRA (#339)
This PR has the following fixes:
- Increase the size of the indices tensors used to maintain multi-LoRA state information from max_num_batched_tokens to 3*max_num_batched_tokens. This provides a buffer for the padding done in the batch and sequence dimensions.
- Move the logic that removes padding from lora_logits from execute_model() back to class LogitsProcessorWithLoRA, to fix a race condition caused by updating multi-LoRA state information directly.

FIX #237
Commit c3577af
Commit f347a84
Enable Async output process for HPU (#342)
This PR refers to [vllm-project#7049](vllm-project#7049) to implement the Asynchronous Output Processor on HPU. It is enabled by default; to disable it, pass the `--disable_async_output_proc` flag. In a local test on the latest habana_main branch (commit 29fb5ed), throughput improves from 3847 TPS to 4011 TPS.
Commit ed85058
Commits on Sep 30, 2024
-
Port last_bucket change from v1.18.0 (#347)
Port last_bucket change from v1.18.0
Commit b611e20
Add setuptools_scm to requirements-hpu.txt (#349)
This removes the crash during installation caused by a dependency that's inside requirements-build.txt.
Commit 3010f8c
Commit 44d8173
Commit 188bd3a
Commit f59495a
Commit b0a9d02
Commits on Oct 1, 2024
-
Commit 70f544c
Commit ec34f88
Fixed lora manager tests (#315)
Added the HPU-related changes, along with the GPU ones, to the conftest.py file and test_lora_manager_hpu.py.
Commit c7b1509
Commit cafff17
Commits on Oct 2, 2024
-
Commit 25f4ed9
Commits on Oct 3, 2024
-
Lora Mask based on lora index (#348)
Changes the filling of lora mask from lora_id to lora_index. This is needed to ensure that the mask does not fail in case lora id is greater than max_loras
Commit da03d8b
Add rope_scaling support for LLama3.1 (#356)
Add support for rope scaling and FusedRoPE in LLama3.1
Commit f848d27
Commits on Oct 4, 2024
-
[Core] Support Torch profiler in Habana Worker (#357)
This PR allows profiling execution on HPU through the flag VLLM_TORCH_PROFILER_DIR, similar to how it is done for GPU. The profiling can be controlled:

1. Asynchronously, by posting requests to the server:
   a) to start collecting a profile: `curl -X POST http://localhost:8080/start_profile`
   b) to stop collecting a profile: `curl -X POST http://localhost:8080/stop_profile`
2. In a script, by instructing the LLM object to start and stop profiling:

```python
from vllm import LLM, SamplingParams

llm = LLM(...)
llm.start_profile()
llm.stop_profile()
```
Commit d8ba780
Commit 250487b
Commit eb095b3
Commit 65fa6f6
Commit 0576360
Merge remote-tracking branch 'upstream/main' into private/kzawora/habana_hpu_refactor
Commit 7f73cc9
Commit b4e26d3
[Refactor] Rename components *Habana* -> *HPU* (#359)
Refactoring Gaudi-specific components to use `hpu` name instead of `habana` (e.g. `habana_model_runner.py` -> `hpu_model_runner.py`, `habana_executor.py` -> `hpu_executor.py`, etc.), as suggested in the upstream PR.
Commit cfe231d
Commit 38e60f4
Commit 76cbbb5
Merge remote-tracking branch 'origin/habana_main' into private/kzawora/pruned_habana_main
Commit 95a7ece
Revert "Support loading checkpoints quantized using Autofp8 (#286)"
This reverts commit 29fb5ed.
Commit d7d609f
Commit c07cbc6
Commit d90bbce
Commit 84dc6c5
Commit f7288de
Commit 6899c3f
Commit e5d640e
Update vllm/model_executor/layers/logits_processor.py
Co-authored-by: Woosuk Kwon <[email protected]>
Commit 25388e2
Commit b4f7ffa
Commit 43959db
Commit b8404ad
Commit d38564f
Merge remote-tracking branch 'origin/private/kzawora/hpu_attn' into private/kzawora/pruned_habana_main
Commit eed1b05
Merge remote-tracking branch 'origin/private/kzawora/hpu_bf16_default' into private/kzawora/pruned_habana_main
Commit 5c3e29c
Commit 33c1db0
Commit 05777e0
Commits on Oct 7, 2024
-
Commit 1f6de5d
[Refactor] Rename HabanaAttention -> HPUAttention (#362)
I've missed the attention backend in #359
Commit ad08dd4
Use BF16 on HPU by default (#361)
We don't *officially* support FP16, and for the most part we use BF16 wherever we can. This removes the need to specify `--dtype bfloat16`: when `dtype` is not provided (is `auto`) and the model's default data type is `float16`, we cast it to `bfloat16` for HPU.
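A minimal sketch of that fallback logic as a standalone helper (illustrative only, not the actual vLLM config code):

```python
import torch

def resolve_hpu_dtype(requested: str, model_default: torch.dtype) -> torch.dtype:
    # "auto" with an FP16 checkpoint falls back to BF16, since FP16 is not
    # officially supported on HPU; an explicitly requested dtype is honored.
    if requested == "auto":
        return torch.bfloat16 if model_default == torch.float16 else model_default
    return getattr(torch, requested)
```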
Commit e00750e
Set vllm-hpu-extension to 36c7f9c (#365)
This includes: HabanaAI/vllm-hpu-extension#8 (BlockSoftmax: fix guard value for fp16)
Commit db5aed6
Add AliBi to supported features in README_GAUDI.md (#287)
ALiBi was fixed in #254, so it should be added to supported features in README.
Commit 902f575
Commit 27c05e1
Commit bb4c23e
Fix hpu_set_env call in load_model in vllm (#364)
Commit: 563184a
Commits on Oct 8, 2024
Update offline_inference_fakehpu.py
Beam search was removed from SamplingParams. In this example it was set to False, so this commit drops the now-removed argument.
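For context, a minimal sketch of what the example's sampling setup looks like once the removed argument is dropped; the parameter values below are illustrative placeholders, not taken from the example itself.
```python
# Minimal sketch: constructing SamplingParams without the removed
# `use_beam_search=False` argument. Values are illustrative placeholders.
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=64,
)
```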
Commit: 0e46492
Timeout adjusted in MLLMEngine (#368)
The multiprocess LLMEngine currently uses a polling timeout fixed at 10000 ms. This is not sufficient when running torch-compiled models that trigger compilation after warmup (i.e., a particular configuration/shape was not covered during the warmup phase), so there should be a way to adjust the fixed timeout. The changes discussed here replace the fixed 10000 ms timeout with the value provided via VLLM_RPC_TIMEOUT. Please suggest whether a separate environment variable should be introduced instead. Co-authored-by: Jacek Czaja <[email protected]>
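A minimal sketch of the intended behavior, assuming a zmq-style polling loop; the constant name and call site below are illustrative, not the exact vLLM internals.
```python
import os

# Sketch only: read the RPC polling timeout from VLLM_RPC_TIMEOUT, falling
# back to the previously hard-coded 10000 ms when the variable is unset.
VLLM_RPC_TIMEOUT_MS = int(os.environ.get("VLLM_RPC_TIMEOUT", "10000"))

def wait_for_reply(socket) -> bool:
    # zmq-style poll: returns True if a message arrived before the timeout,
    # so long-running torch compilation no longer trips a fixed 10 s limit.
    return bool(socket.poll(timeout=VLLM_RPC_TIMEOUT_MS))
```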
Commit: 6028354
Commit: 64369fd
Commit: 69fb91c
Commit: 1ee20c5
Make workaround for SW-204785 broader (#374)
A PT bridge bug in recent Synapse builds causes PyTest to return 0 unconditionally. The previous workaround handled the case where a comparison failed, but missed the case in which vLLM (or anything else) actually crashes during test execution. This patch broadens the workaround to catch any exception and adds an atexit callback when any test fails.
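A hypothetical sketch of the broadened pattern described above (names are illustrative): any exception escaping a test registers an atexit hook that forces a nonzero exit code, so the bridge bug cannot mask a crash.
```python
import atexit
import os

def _force_failure_exit_code() -> None:
    # os._exit bypasses normal teardown and overrides the bogus zero status.
    os._exit(1)

def run_guarded(test_fn, *args, **kwargs):
    """Run a test body; on any exception, make sure the process exits nonzero."""
    try:
        return test_fn(*args, **kwargs)
    except BaseException:
        atexit.register(_force_failure_exit_code)
        raise
```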
Commit: 388e500
Commit: 8f79b6e
Commits on Oct 9, 2024
Commit: ca98dae
Commits on Oct 10, 2024
Fix LoRA tests by handling broken import (#376)
This PR fixes the broken import in test_lora_hpu.py. Issue: https://jira.habana-labs.com/browse/SW-204811
Commit: 4030216
Commit: b70c1a5
Commits on Oct 11, 2024
Disable performance counters if profiler is not enabled (#383)
Currently, if `HabanaHighLevelProfiler` is not enabled, `HabanaProfilerCounterHelper` still collects statistics that are never used later, adding avoidable host overhead. With this change, performance statistics are collected only when the profiler is enabled. Potential gain on `prepare_model_input` (before/after profiling screenshots omitted).
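A hypothetical sketch of the gating (class and attribute names are illustrative, not the exact vLLM ones): counter bookkeeping is skipped entirely unless the profiler is enabled.
```python
class ProfilerCounterHelper:
    def __init__(self, profiler_enabled: bool) -> None:
        self.profiler_enabled = profiler_enabled
        self.counters: dict[str, float] = {}

    def maybe_record(self, name: str, value: float) -> None:
        # Skip all bookkeeping when no profiler will ever consume the data,
        # removing the host-side overhead from prepare_model_input.
        if not self.profiler_enabled:
            return
        self.counters[name] = self.counters.get(name, 0.0) + value
```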
Commit: 49444bc
Commit: d6bd375
Commit: 4f1787b
Commits on Oct 12, 2024
Remove constraints for bucket creation during warmup in LoRA (#382)
This PR removes LoRA constraints during bucket creation in warm-up. It fixes a large drop in decode throughput when LoRA is enabled for a given configuration.
Commit: 6cd4694
Commits on Oct 14, 2024
seed_everything function doesn't handle HPU (#384)
This PR adds manual seed setting for HPU in the `seed_everything` function. Previously `torch.manual_seed` was set to the given seed, but that call was removed in PR 6ffa3f3.
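A hedged sketch of the resulting helper; `torch.hpu` is only present once habana_frameworks.torch has been imported, and the real vLLM function (and exact HPU seeding call) may differ.
```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Standard CPU/CUDA seeding plus an HPU branch; the hasattr guard keeps
    # this safe on machines without the Habana bridge installed.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if hasattr(torch, "hpu"):
        torch.hpu.manual_seed(seed)  # exact HPU seeding API may differ
```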
Commit: d8f2aa7
Fixed lora_manager tests with hpu_model_runner (#386)
The lora_manager tests have been fixed to follow the recent rename of habana_model_runner to hpu_model_runner.
Commit: 03b407b
Reformat README_GAUDI.md (#389)
This PR removes the awkward line breaks in README_GAUDI.md and uses appropriate Markdown formatting instead of RST. The rendered document should look the same.
Commit: ebd42c4
Commit: 2d2bf7a
Remove workaround added to resolve multi-card stall issue (#387)
This PR removes the additional `multiprocessing.Process` object that was created as a workaround for resolving the multi-card stall issue.
Commit: 9df1d4a
Commit: 9777c9f
Commit: 5ceda69
Commit: 3e6a2d4
Commit: 9ac52ab
Commit: 57bc31d
Commits on Oct 15, 2024
Commit: 55dd07e
[CI] Temporarily increase test tolerances (#392)
This PR raises the allowed relative tolerance in GSM8K to 0.06 and moves the Llama-70B test from 2xG2 to 4xG2 until memory usage is investigated (successful run: vLLM-CI-Pipeline/206).
Commit: 401f5ae
Commit: e598f3f
Commits on Oct 16, 2024
Softmax: add weighted-sum normalization (#378)
Supporting PR for HabanaAI/vllm-hpu-extension#10
Commit: f77435d
Commit: 0783d18
Commit: 2fa46cd
Commit: 3683db6
Commit: 91af5da
Commit: d2ce468
Commit: b6428cd
Commit: 5149278
Commit: f4b356f
Commit: 3eee00d
Remove HPU changes from cache_engine.py (#400)
We were asked in the upstream PR to remove our changes from cache_engine.py. This PR does just that and introduces HPUCacheEngine, which inherits from CacheEngine and overrides only the _allocate_kv_cache method.
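An illustrative sketch of the shape of that change (the real CacheEngine signature in vLLM differs; tensor shapes and dtypes below are placeholders):
```python
import torch

class CacheEngine:
    # Stand-in for the upstream vLLM class, shown only to make the sketch runnable.
    def _allocate_kv_cache(self, num_blocks: int, device: str):
        raise NotImplementedError

class HPUCacheEngine(CacheEngine):
    def _allocate_kv_cache(self, num_blocks: int, device: str):
        # Only the allocation strategy changes; every other CacheEngine
        # behavior is inherited untouched. Shapes/dtype are placeholders.
        return [
            torch.zeros((num_blocks, 16, 128), dtype=torch.bfloat16, device=device)
            for _ in range(2)  # one tensor for keys, one for values
        ]
```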
Commit: a59fc7b
Commit: c07951b
Commit: 398c5c3
Commit: f79d454
Commit: 8b6e30d
Commits on Oct 17, 2024
[bucketing overhaul 1/n] Add padding-aware scheduling and option to limit prefill batch size (#394)
This PR adds the following functionality, which can be enabled via engine flags:
- use_padding_aware_scheduling - the vLLM scheduler will now calculate token cost considering the padded prefill shape (similar to #109).
- max_num_prefill_seqs - the padding-aware scheduler performs an additional check on prefill batch size and effectively caps it at `max_num_prefill_seqs`. If unset, the maximum prefill batch size is `max_num_seqs`.
Both features are generic and do not require HPU, although they may be specialized for a particular vendor's usage. Padding-aware scheduling includes a padding function selector which picks the HPU padding function (considering the currently used HPU buckets) if the current device is HPU; otherwise, it takes the product of batch_size x max_seq_len. A minimal sketch of this cost estimate is shown after this description.
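A minimal sketch of that cost estimate, under assumed bucket steps (the real scheduler reads these from the HPU bucketing configuration; the function names here are illustrative):
```python
import math

def round_up(value: int, step: int) -> int:
    return int(math.ceil(value / step)) * step

def padded_prefill_tokens(batch_size: int, max_seq_len: int,
                          bs_step: int = 32, seq_step: int = 128,
                          use_buckets: bool = True) -> int:
    # Fallback path: plain batch_size x max_seq_len product.
    if not use_buckets:
        return batch_size * max_seq_len
    # HPU path: round both dimensions up to their bucket boundaries first,
    # so the scheduler budgets for the shape that will actually be executed.
    return round_up(batch_size, bs_step) * round_up(max_seq_len, seq_step)
```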
Commit: 05bcdf5
Commit: c11f23a
Commit: 78a816c
Commit: 640f0be
Commit: e894746
Commit: 5bc3985
Commit: 14f8af4
Commit: 65e34f6
Commit: 4757350
Commit: 4c306cf