Port last bucket change #346

iboiko-habana · 2024-09-28T20:30:20Z

Port last bucket from v1.18.0

Support loading https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127

- Skip CUDA checks
- Use scaled_fp8_quant instead of _scaled_mm
- Fix weights and weight_scale for the Gaudi2 float8_e4m3fn range

---------

Co-authored-by: Nir David <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>

…#8811) Co-authored-by: simon-mo <[email protected]> Co-authored-by: Chang Su <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]>

### Issue:
torch.compile recompiles after warmup because `tensor 'L['input_ids']' dispatch key set mismatch. expected DispatchKeySet(HPU, BackendSelect), actual DispatchKeySet(HPU, BackendSelect, ADInplaceOrView). `

### Detail:
Run script with `TORCH_LOGS="guards"` and get different dispatch key set info:
- warmup:
```
TENSOR_MATCH: check_tensor(L['input_ids'], Tensor, DispatchKeySet(HPU, BackendSelect), torch.int64, device=0, requires_grad=False, size=[2, 1], stride=[1, 1])  # masked_input = input_  # ome/zyuwen/workspace/vllm/habana_main_g3_v2/vllm/model_executor/layers/vocab_parallel_embedding.py:358 in forward
```
- after warmup:
```
TENSOR_MATCH: check_tensor(L['input_ids'], Tensor, DispatchKeySet(HPU, BackendSelect, ADInplaceOrView), torch.int64, device=0, requires_grad=False, size=[2, 1], stride=[1, 1])  # masked_input = input_  # ome/zyuwen/workspace/vllm/habana_main_g3_v2/vllm/model_executor/layers/vocab_parallel_embedding.py:358 in forward
```

### Solution:
The difference in dispatch key set is caused by the 'torch.inference_mode()' decoration, and here is a simple example:
```python
import torch
import habana_frameworks.torch as htorch

@torch.inference_mode()
def func():
    x = torch.rand(3, 3).to("hpu")
    print(torch._C._dispatch_key_set(x))

func()
# output: DispatchKeySet(HPU, AutocastHPU)
```
```python
import torch
import habana_frameworks.torch as htorch

def func():
    x = torch.rand(3, 3).to("hpu")
    print(torch._C._dispatch_key_set(x))

func()
# output: DispatchKeySet(HPU, ADInplaceOrView, AutogradHPU, AutocastHPU)
```

In vllm-fork, the warmup phase is decorated with `torch.inference_mode()` in [habana_model_runner.py#L1487-L1488](https://github.com/HabanaAI/vllm-fork/blob/b62fba85ac03326e9f466d8d37e91ae1b14a6511/vllm/worker/habana_model_runner.py#L1487-L1488), but the after-warmup phase is not. So in this PR I add the decorator to `prepare_input_tensors` function to keep the dispatch key set the same.
Signed-off-by: yuwenzho <[email protected]>
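The recompilation described in this commit message can be pictured with a toy model. The sketch below is not torch.compile and uses no PyTorch APIs; `ToyGuardedCache` and its methods are hypothetical names invented for illustration. It only assumes that a compiled graph is reused when the guarded properties of the input (here, a dispatch key set and a shape) match those recorded at compile time.

```python
from typing import Dict, FrozenSet, Tuple

GuardKey = Tuple[FrozenSet[str], Tuple[int, ...]]


class ToyGuardedCache:
    """Toy stand-in for a guard-checked compile cache (hypothetical, not torch.compile)."""

    def __init__(self) -> None:
        self._cache: Dict[GuardKey, str] = {}
        self.compile_count = 0

    def run(self, dispatch_keys: FrozenSet[str], shape: Tuple[int, ...]) -> str:
        # The guard covers both the dispatch key set and the tensor shape.
        key = (dispatch_keys, shape)
        if key not in self._cache:
            # Guard miss: "compile" a new graph for this combination.
            self.compile_count += 1
            self._cache[key] = f"graph#{self.compile_count}"
        return self._cache[key]


# Key sets copied from the guard logs in the commit message.
WARMUP_KEYS = frozenset({"HPU", "BackendSelect"})
RUNTIME_KEYS = frozenset({"HPU", "BackendSelect", "ADInplaceOrView"})

# Without the fix: same shape, different key set, so the guard misses once more.
cache = ToyGuardedCache()
cache.run(WARMUP_KEYS, (2, 1))
cache.run(RUNTIME_KEYS, (2, 1))
recompiles_without_fix = cache.compile_count - 1

# With the fix: both phases produce the same key set, so the warmup graph is reused.
fixed = ToyGuardedCache()
fixed.run(WARMUP_KEYS, (2, 1))
fixed.run(WARMUP_KEYS, (2, 1))
recompiles_with_fix = fixed.compile_count - 1
```

In this model, making both phases create tensors under the same mode is what keeps the key identical, which mirrors the stated intent of decorating `prepare_input_tensors`.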

#289) Re-implements following PRs for current habana_main:
- #102 (Removing div_i32 operations from each layer)
- #115 (removing scatter for reshape&cache in case of prompt)

Accuracy (GSM8K on Llama3.1-8B-Instruct):

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama| 3|flexible-extract| 8|exact_match|↑ |0.8415|± |0.0101|
| | |strict-match | 8|exact_match|↑ |0.8400|± |0.0101|

I've benchmarked this change on Llama3.1-8B-Instruct and on average, +2.50% throughput gain (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) can be observed across all prefill buckets on G2, with up to +4.40% (+956.79 tok/s, ~25031 -> ~25988 tok/s) throughput increase in compute-bound scenarios.

…th LoRA (#339) This PR has following fixes,
- Increase size of indices tensors used to maintain multi-lora state information from max_num_batched_tokens to 3*max_num_batched_tokens. This increase is done to provide buffer for padding done in batch & sequence dimensions.
- Move logic to remove padding from lora_logits from execute_model() back to Class LogitsProcessorWithLoRA, this is done to fix race condition caused by updating multi-lora state information directly.

FIX #237

This PR refers to [vllm-project#7049](vllm-project#7049) to implement the Asynchronous Output Processor on HPU. It is enabled by default; to disable it, pass the `--disable_async_output_proc` flag. In my local test on the latest habana_main branch (commit 29fb5ed), throughput improved from 3847 TPS to 4011 TPS.
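As a generic illustration of why asynchronous output processing can raise throughput, the sketch below hands post-processing to a background worker so the main loop can start the next step immediately. This is an assumption-laden toy, not the mechanism used in vllm-project#7049 or on HPU; `run_sync`, `run_async`, `compute`, and `postprocess` are hypothetical names.

```python
import queue
import threading
from typing import Callable, List


def run_sync(steps: int, compute: Callable[[int], int],
             postprocess: Callable[[int], int]) -> List[int]:
    """Baseline: each step waits for its own output to be post-processed."""
    return [postprocess(compute(i)) for i in range(steps)]


def run_async(steps: int, compute: Callable[[int], int],
              postprocess: Callable[[int], int]) -> List[int]:
    """Post-process step i in a worker thread while step i+1 is being computed."""
    pending: "queue.Queue" = queue.Queue()
    results: List[int] = []

    def worker() -> None:
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more outputs
                break
            results.append(postprocess(item))

    thread = threading.Thread(target=worker)
    thread.start()
    for i in range(steps):
        # The main loop does not block on post-processing of earlier steps.
        pending.put(compute(i))
    pending.put(None)
    thread.join()
    return results


sync_out = run_sync(4, lambda i: i * 2, lambda x: x + 1)
async_out = run_async(4, lambda i: i * 2, lambda x: x + 1)
```

A single worker draining a FIFO queue keeps outputs in step order, so both variants return the same results; only the overlap differs.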

kzawora-intel and others added 30 commits July 3, 2024 19:24

fix is_prompt for mixtral

bca41a1

Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…

b309c26

…ase_v4

restore HPU autodetection

717c0ce

SiLU memory leak in fwd

8a4c5c1

add WA for model loader

c5cd04a

remove hpu model loader WA

9efb594

fix hpu autodetection (again)

def464e

fix VLM configs in hpu components

30f36f0

Merge pull request #87 from HabanaAI/michalkuligowski-patch-1

60df235

SiLU memory leak in fwd

Merge remote-tracking branch 'origin/habana_main' into private/kzawor…

0facde4

…a/rebase_v4

fix hpu autodetection

1dd8502

Remove invasive ALiBi changes

0836502

add VLLM_TARGET_DEVICE='hpu'

a2f361c

Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…

c49d033

…ase_v4

Added docstring and assertion to warmup_range

08ba388

fix api mismatches

6bed248

add assert for attn type

03dbee5

multi-hpu fixes

8c58a66

minor formatting stuff

d7afbf2

fix sampling metadata for prefill

2b2549c

bump ray version for hpu

e911fd8

Merge remote-tracking branch 'upstream/main' into private/kzawora/reb…

202d0b9

…ase_v4

Merge pull request #85 from HabanaAI/private/kzawora/rebase_v4

291bee7

habana_main rebase v4

Merge remote-tracking branch 'upstream/main' into HEAD

72f96e4

split k scale and v scale in habana attn

bf349c5

Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx (#…

8e231a5

…107)

Refactor forward_hpu of RMSNorm (#128)

f7dc554

Refactor & re-enable HPU RoPE for Gaudi1 (#129)

19993b7

* Re-enable FusedRoPE for Gaudi1 * add fallback impl of rope

formatting fixes (#132)

03e3ce3

Address upstream PR code review comments (#133)

a0646da

* formatting fixes * Upstream CR update

sohamparikh and others added 28 commits September 24, 2024 23:16

[Bugfix] load fc bias from config for eagle (vllm-project#8790)

3e073e6

[Frontend] OpenAI server: propagate usage accounting to FastAPI middl…

1ac3de0

…eware layer (vllm-project#8672)

[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (vllm-…

3368c3a

…project#8767) Signed-off-by: darthhexx <[email protected]>

[Misc] Fix minor typo in scheduler (vllm-project#8765)

8fae5ed

Refine INC shutdown code (#335)

8c6dcae

This PR removes debug printouts in INC shutdown method and covers the case where application exits before model is initialized properly.

Setting enough cache_size_limit for torch.compile warmup (#238)

cef2f54

Fix the issue that warmup sometimes doesn't work because the default cache_size_limit is only 8.

---------

Signed-off-by: zehao-intel <[email protected]>
Co-authored-by: Andrzej Kotłowski <[email protected]>
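A toy model can show why a small limit breaks warmup when there are more bucket shapes than cache slots. The sketch below does not call PyTorch; `BoundedCompileCache` and the bucket shapes are hypothetical, and the limit presumably corresponds to `torch._dynamo.config.cache_size_limit`, which the commit raises. It assumes that once the limit is reached, additional shapes are not compiled.

```python
from typing import List, Set, Tuple

Shape = Tuple[int, int]


class BoundedCompileCache:
    """Toy compile cache that stops compiling once size_limit entries exist."""

    def __init__(self, size_limit: int = 8) -> None:
        self.size_limit = size_limit
        self.compiled: Set[Shape] = set()

    def warm_up(self, shape: Shape) -> bool:
        """Return True if the shape ends up with a compiled entry."""
        if shape in self.compiled:
            return True
        if len(self.compiled) >= self.size_limit:
            # Limit reached: this shape is left uncompiled.
            return False
        self.compiled.add(shape)
        return True


# Hypothetical warmup buckets: 4 batch sizes x 3 sequence lengths = 12 shapes.
buckets: List[Shape] = [(bs, seq) for bs in (1, 2, 4, 8) for seq in (128, 256, 512)]

# With a limit of 8, the last 4 buckets miss out on warmup.
small = BoundedCompileCache(size_limit=8)
missed = [b for b in buckets if not small.warm_up(b)]

# Sizing the limit to the number of buckets lets every bucket warm up.
large = BoundedCompileCache(size_limit=len(buckets))
all_warmed = all(large.warm_up(b) for b in buckets)
```

Under these assumptions, the limit has to be at least the number of distinct warmup shapes for warmup to cover all of them.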

Change default values for decode bucket flags (#316)

45ee586

Change default values for decode bucket flags

[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 up…

1c04644

…grade (vllm-project#8777)

[Kernel] Fullgraph and opcheck tests (vllm-project#8479)

300da09

[[Misc]] Add extra deps for openai server image (vllm-project#8792)

c6f2485

[VLM][Bugfix] internvl with num_scheduler_steps > 1 (vllm-project#8614)

0c4d2ad

rename PromptInputs and inputs with backward compatibility (vllm-proj…

28e1299

…ect#8760)

[Frontend] MQLLMEngine supports profiling. (vllm-project#8761)

64840df

[Misc] Support FP8 MoE for compressed-tensors (vllm-project#8588)

873edda

Revert "rename PromptInputs and inputs with backward compatibility (v…

4f1ba08

…llm-project#8760) (vllm-project#8810)

[Doc] Update doc for Transformers 4.45 (vllm-project#8817)

e2c6e0a

[Misc] Support quantization of MllamaForCausalLM (vllm-project#8822)

7193774

[Misc] Update config loading for Qwen2-VL and remove Granite (vllm-pr…

4bb98f2

…oject#8837)

Merge remote-tracking branch 'upstream/main' into HEAD

fccaca0

Update cpu-test.yml

5ffcfa3

vLLM 0.6.2 rebase (#340)

f347a84

you know the drill

Port last_bucket change from v1.18.0

d5789f7

iboiko-habana closed this Sep 28, 2024

michalkuligowski deleted the port_last_bucket branch October 1, 2024 08:36
