
[DO NOT MERGE] Upstream test PR #322

Closed
wants to merge 431 commits
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Sep 3, 2024

  1. Remove mark step from static MoE loop (#231)

    Removes unnecessary mark step from MoE OP loop to speed up computation
    jkaniecki authored Sep 3, 2024
    Commit b4f6a29
  2. Add newline at EOF

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 3, 2024
    Commit 733524a
  3. Remove requires_grad=False

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 3, 2024
    Commit fb98cad

Commits on Sep 4, 2024

  1. Change mask to lora_mask

    hlahkar committed Sep 4, 2024
    Commit 49ffde6
  2. Commit 538c8f1
  3. Enable llama-405b - w/a for memory allocation error (#184)

    Workaround for an allocation error while loading llama-405b.
    afierka-intel authored Sep 4, 2024
    Commit 691255b
  4. [bugfix] handle large bucket minimums correctly (#235)

    This bugfix addresses incorrect lower-boundary handling for bucketing.
    
    Previous behavior:
    ```
    INFO 09-03 19:36:28 habana_model_runner.py:564] Prompt bucket config (min, step, max_warmup) bs:[64, 32, 64], seq:[768, 128, 768]
    INFO 09-03 19:36:28 habana_model_runner.py:577] Generated 12 prompt buckets: [(32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768)]
    INFO 09-03 19:36:28 habana_model_runner.py:582] Omitted 0 prompt buckets due to exceeded token budget (max_num_batched_tokens=131072)
    INFO 09-03 19:36:28 habana_model_runner.py:590] Decode bucket config (min, step, max_warmup) bs:[64, 128, 64], seq:[768, 128, 1024]
    INFO 09-03 19:36:28 habana_model_runner.py:601] Generated 8 decode buckets: [(64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024)]
    INFO 09-03 19:36:28 habana_model_runner.py:606] Omitted 0 decode buckets due to exceeded token budget (max_num_batched_tokens=131072)
    ```
    Min seq len dimension is set to 768, but buckets with seq_len=128-768
    are present
    
    Current behavior:
    
    ```
    INFO 09-03 19:45:42 habana_model_runner.py:563] Prompt bucket config (min, step, max_warmup) bs:[64, 32, 64], seq:[768, 128, 768]
    INFO 09-03 19:45:42 habana_model_runner.py:576] Generated 1 prompt buckets: [(64, 768)]
    INFO 09-03 19:45:42 habana_model_runner.py:581] Omitted 0 prompt buckets due to exceeded token budget (max_num_batched_tokens=131072)
    INFO 09-03 19:45:42 habana_model_runner.py:589] Decode bucket config (min, step, max_warmup) bs:[64, 128, 64], seq:[768, 128, 1024]
    INFO 09-03 19:45:42 habana_model_runner.py:600] Generated 3 decode buckets: [(64, 768), (64, 896), (64, 1024)]
    INFO 09-03 19:45:42 habana_model_runner.py:605] Omitted 0 decode buckets due to exceeded token budget (max_num_batched_tokens=131072)
    ```
    No bucket with seq_len < 768 is captured
    kzawora-intel authored Sep 4, 2024
    Commit a4e1d52
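    For illustration, a minimal sketch of bucket generation that respects the
    configured minimum (the helper names here are hypothetical, not the actual
    habana_model_runner code):

    ```python
    def warmup_range(min_val: int, step: int, max_val: int) -> list:
        """Bucket sizes from min_val up to max_val in `step` increments."""
        values = []
        current = min_val
        while current <= max_val:
            values.append(current)
            current += step
        return values

    def generate_buckets(bs_cfg, seq_cfg):
        bs_min, bs_step, bs_max = bs_cfg
        seq_min, seq_step, seq_max = seq_cfg
        return [(bs, seq)
                for bs in warmup_range(bs_min, bs_step, bs_max)
                for seq in warmup_range(seq_min, seq_step, seq_max)]

    # With bs:[64, 32, 64] and seq:[768, 128, 768] this yields only [(64, 768)],
    # matching the "Current behavior" log above.
    print(generate_buckets((64, 32, 64), (768, 128, 768)))
    ```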
  5. fix guided_decode HPU failing issue

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 4, 2024
    Commit 8046d81

Commits on Sep 5, 2024

  1. Remove token budget from decode buckets (#241)

    This PR prevents max_num_batched_tokens from limiting decode buckets, as
    decode buckets should be limited by number of blocks, not by
    max_num_batched_tokens.
    kzawora-intel authored Sep 5, 2024
    Commit 7cd226c
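    A hedged sketch of the intent (function and variable names are assumptions,
    not the actual code): the token-budget filter applies to prompt buckets
    only, while decode buckets are bounded by the number of available KV-cache
    blocks.

    ```python
    import math

    def filter_prompt_buckets(buckets, max_num_batched_tokens):
        # Prompt buckets process batch_size * seq_len tokens at once,
        # so the token budget applies.
        return [(bs, seq) for bs, seq in buckets
                if bs * seq <= max_num_batched_tokens]

    def filter_decode_buckets(buckets, num_hpu_blocks, block_size):
        # Decode buckets process one token per sequence per step; the real
        # constraint is whether the KV cache has enough blocks for the batch.
        return [(bs, seq) for bs, seq in buckets
                if bs * math.ceil(seq / block_size) <= num_hpu_blocks]
    ```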
  2. Commit d0eb7d7
  3. Mask based BGMV implementation (#223)

    Refactors BGMV implementation from gather based to mask-based to
    optimize performance and reduce device memory usage.
    vivekgoe authored Sep 5, 2024
    Commit 05acb89
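    A rough, hedged illustration of the mask-based idea (shapes and names are
    assumptions; the actual kernel differs): rather than gathering per-token
    LoRA weights, compute against all stacked adapters and mask out the
    contributions that do not belong to each token.

    ```python
    import torch

    def bgmv_mask_based(x, lora_a, lora_b, lora_indices):
        # x:            [num_tokens, hidden]
        # lora_a:       [num_loras, hidden, rank]
        # lora_b:       [num_loras, rank, out]
        # lora_indices: [num_tokens], adapter index used by each token
        num_tokens = x.shape[0]
        num_loras, _, rank = lora_a.shape
        # Project onto every adapter's A matrix: [num_tokens, num_loras * rank]
        proj = torch.einsum("th,lhr->tlr", x, lora_a).reshape(num_tokens, -1)
        # Mask keeps only the rank-slice of the adapter assigned to each token.
        mask = torch.zeros(num_tokens, num_loras * rank,
                           dtype=x.dtype, device=x.device)
        for t, idx in enumerate(lora_indices.tolist()):
            mask[t, idx * rank:(idx + 1) * rank] = 1.0
        # Combine with the stacked B matrices: [num_tokens, out]
        return (proj * mask) @ lora_b.reshape(num_loras * rank, -1)
    ```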

Commits on Sep 6, 2024

  1. fix rotary embedding

    jikunshang committed Sep 6, 2024
    Commit d2e2854
  2. Commit 97bd0fd
  3. Commit ededdaf
  4. Update test

    SanjuCSudhakaran committed Sep 6, 2024
    Commit b507cc4
  5. Fix formatting

    SanjuCSudhakaran committed Sep 6, 2024
    Commit 016f343
  6. Dispersed dummy slots (#243)

    Use all possible slot values for dummy blocks to avoid caching issues.
    madamczykhabana authored Sep 6, 2024
    Commit d9fa7cf
  7. Use PT_COMPILE_ONLY_MODE during warmup (#227)

    With the PT_COMPILE_ONLY_MODE flag, graphs can be compiled without
    performing synLaunch. The flag has been added to the warmup phase to
    decrease its execution time.
    mfylcek authored Sep 6, 2024
    Commit 7488c58
  8. Do not pass warmup_mode to execute_model_kwargs (#229)

    This fixes a very silly issue where mismatching values of `warmup_mode`
    flag could cause graph recompilations and eventually memory leaks.
    kzawora-intel authored Sep 6, 2024
    Commit 17447ed
  9. Add error handling for PT_COMPILE_ONLY_MODE (#251)

    This PR fixes crashes introduced with #227 that were observed on older
    Synapse builds. Setting PT_COMPILE_ONLY_MODE is not supported in current
    or older public Synapse builds, but we should not crash because of it;
    instead, we should advise the user to use the latest build.
    
    Previous behavior:
    ```
    ...
    INFO 09-06 17:08:37 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
    INFO 09-06 17:08:37 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -159.6 MiB of host memory (414.9 GiB/1007 GiB used)
    [rank0]: Traceback (most recent call last):
    [rank0]:   File "/software/users/kzawora/vllm-utils/vllm_hpu_simple_test.py", line 9, in <module>
    [rank0]:     llm = LLM(model="facebook/opt-125m")
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/entrypoints/llm.py", line 155, in __init__
    [rank0]:     self.llm_engine = LLMEngine.from_engine_args(
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 456, in from_engine_args
    [rank0]:     engine = cls(
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 266, in __init__
    [rank0]:     self._initialize_kv_caches()
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 378, in _initialize_kv_caches
    [rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/executor/habana_executor.py", line 89, in initialize_cache
    [rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 202, in initialize_cache
    [rank0]:     self._warm_up_model()
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 220, in _warm_up_model
    [rank0]:     self.model_runner.warmup_model(self.hpu_cache[0])
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    [rank0]:     return func(*args, **kwargs)
    [rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_model_runner.py", line 1412, in warmup_model
    [rank0]:     with compile_only_mode_context():
    [rank0]:   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    [rank0]:     return next(self.gen)
    [rank0]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/internal/bridge_config.py", line 20, in env_setting
    [rank0]:     get_func = globals()['get_' + var.lower()]
    [rank0]: KeyError: 'get_pt_compile_only_mode'
    inc shutdown
    inc shutdown
    inc shutdown
    inc shutdown
    ```
    
    Current behavior:
    
    ```
    ...
    INFO 09-06 17:06:42 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
    INFO 09-06 17:06:43 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -143.7 MiB of host memory (415 GiB/1007 GiB used)
    WARNING 09-06 17:06:43 habana_model_runner.py:1419] Cannot use PT_COMPILE_ONLY_MODE. Warmup time will be negatively impacted. Please update Gaudi Software Suite.
    INFO 09-06 17:06:43 habana_model_runner.py:1336] [Warmup][Prompt][1/23] batch_size:2 seq_len:1024 free_mem:40.28 GiB
    ...
    ```
    kzawora-intel authored Sep 6, 2024
    Commit b50aa14
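    A hedged sketch of the fallback pattern described above, based on the
    traceback; `compile_only_mode_context` stands in for the bridge-config
    context manager and is passed in purely for illustration.

    ```python
    from contextlib import nullcontext

    def warmup_context(compile_only_mode_context, logger):
        # Older Synapse builds do not know the PT_COMPILE_ONLY_MODE flag and
        # raise KeyError('get_pt_compile_only_mode'); fall back to a no-op
        # context and warn instead of crashing.
        try:
            with compile_only_mode_context():
                pass
            return compile_only_mode_context()
        except KeyError:
            logger.warning(
                "Cannot use PT_COMPILE_ONLY_MODE. Warmup time will be "
                "negatively impacted. Please update Gaudi Software Suite.")
            return nullcontext()
    ```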

Commits on Sep 9, 2024

  1. Hardcode fastapi version due to pydantic error (#255)

    Fixes a serving-mode issue caused by an error in fastapi.
    hlahkar authored Sep 9, 2024
    Commit 00f1333
  2. Mask based BGMV implementation for LoRA Embedding (#247)

    This PR contains a mask-based BGMV implementation for LoRA embedding
    instead of index-select of LoRA-B weights.
    
    It also removes the special handling for the no-LoRA case.
    vivekgoe authored Sep 9, 2024
    Commit b764610
  3. Eliminate graph breaks for torch.compile mode (#202)

    Eliminate two graph breaks for torch.compile mode:
    1. [__graph_breaks] torch._dynamo.exc.Unsupported: builtin: eq [<class
    'torch._dynamo.variables.misc.GetAttrVariable'>, <class
    'torch._dynamo.variables.constant.EnumVariable'>] False
    2. [__graph_breaks] torch._dynamo.exc.Unsupported: Tensor.item
    
    Signed-off-by: yuwenzho <[email protected]>
    yuwenzho authored Sep 9, 2024
    Commit 73af823
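    For context, a generic (non-vLLM) illustration of the second break: calling
    `.item()` on a tensor forces dynamo to leave the compiled graph, which can
    be avoided by keeping the value as a tensor.

    ```python
    import torch

    def scale_with_item(x: torch.Tensor) -> torch.Tensor:
        # .item() pulls the value into Python and causes a graph break.
        return x * x.max().item()

    def scale_without_item(x: torch.Tensor) -> torch.Tensor:
        # Keeping the maximum as a 0-d tensor lets dynamo trace straight through.
        return x * x.max()

    compiled = torch.compile(scale_without_item)
    print(compiled(torch.randn(4)))
    ```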

Commits on Sep 10, 2024

  1. Port flat PA from habana_next to habana_main (#169)

    Co-authored-by: Michal Adamczyk <[email protected]>
    Co-authored-by: barak goldberg <[email protected]>
    Co-authored-by: Michal Szutenberg <[email protected]>
    Co-authored-by: Jan Kaniecki <[email protected]>
    5 people authored Sep 10, 2024
    Commit 5cf8441
  2. Commit 2fed15b
  3. Commit f74fe23
  4. format.sh

    kzawora-intel committed Sep 10, 2024
    Commit e2c8b5a
  5. Commit 4194195
  6. Add disable_tensor_cache=True to HPUGraph capture (#252)

    RuntimeErrors are not observed anymore on habana_main when
    disable_tensor_cache is used. This PR enables disable_tensor_cache.
    kzawora-intel authored Sep 10, 2024
    Commit 4052bdb
  7. Commit c9bf908
  8. Fix dispersed slots (#261)

    On habana_main the slots are calculated by adding an offset to the block,
    which breaks the check for _PAD_SLOT_ID. Reworked it so that in the case
    of _PAD_BLOCK_ID we automatically insert the right value.
    madamczykhabana authored Sep 10, 2024
    Commit 69df1e7
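    A minimal, hypothetical sketch of the reworked slot computation (the
    constant values below are placeholders, not the real ones):

    ```python
    _PAD_BLOCK_ID = 0   # placeholder value for illustration only
    _PAD_SLOT_ID = -1   # placeholder value for illustration only

    def compute_slot(block_id: int, offset: int, block_size: int) -> int:
        # Adding an offset to a pad block would no longer equal _PAD_SLOT_ID
        # and would break the padding check, so handle the pad block explicitly.
        if block_id == _PAD_BLOCK_ID:
            return _PAD_SLOT_ID
        return block_id * block_size + offset
    ```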
  9. Commit 53f96b7
  10. fix tensor parallelism

    kzawora-intel committed Sep 10, 2024
    Commit d436d38
  11. add missing functions

    kzawora-intel committed Sep 10, 2024
    Commit 61b6fbb

Commits on Sep 11, 2024

  1. Port PT Profiler to habana_main (#256)

    Porting PT Profiler from:
    
    81a23a7
    and
    
    e805b88
    adobrzyniewicz-habana authored Sep 11, 2024
    Commit 2091161
  2. Commit c9bdcbe
  3. Commit 8e41fb5
  4. Commit 68e0f57
  5. Commit b776d5e
  6. Fix LoRA test by handling mask creation inside the test (#270)

    This PR handles mask creation inside lora unit tests to align with new
    BGMV implementation
    vivekgoe authored Sep 11, 2024
    Commit c0ff22f

Commits on Sep 12, 2024

  1. Attn MetaData dtype should be same as model dtype (#271)

    Attn MetaData was hard-coded to bfloat16, leading to a runtime error for
    float32 model instantiation.
    hlahkar authored Sep 12, 2024
    Commit f858d43
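    Illustrative only (the helper and its arguments are assumptions): the fix
    amounts to deriving the metadata dtype from the model's dtype instead of
    hard-coding torch.bfloat16.

    ```python
    import torch

    def make_attn_bias(seq_lens, max_len, model_dtype: torch.dtype):
        # Before: masks were built with dtype=torch.bfloat16 regardless of the
        # model, which breaks float32 models. After: use the model's dtype.
        bias = torch.zeros(len(seq_lens), max_len, dtype=model_dtype)
        for i, n in enumerate(seq_lens):
            bias[i, n:] = torch.finfo(model_dtype).min
        return bias
    ```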
  2. Commit acf7d54
  3. Fixed ALiBi (#254)

    Fixed ALiBi and the [MPT-7B](https://www.databricks.com/blog/mpt-7b) model.
    Accuracy results compared to CPU (collected using the
    [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)):
    
    | Tasks          | CPU    | HPU    |
    | -------------- | ------ | ------ |
    | arc_challenge  | 0.4224 | 0.4189 |
    | arc_easy       | 0.6974 | 0.6999 |
    | hellaswag      | 0.7603 | 0.7626 |
    | lambada_openai | 0.7306 | 0.7326 |
    | mmlu           | 0.293  | 0.2925 |
    | winogrande     | 0.6851 | 0.6811 |
    itaraban authored Sep 12, 2024
    Commit 6a734f4
  4. Update gaudi-installation.rst (#279)

    Fixes environment variable names after the flat-PA merge.
    dolszewska authored Sep 12, 2024
    Commit 543bb6d
  5. Commit c2c1e0f
  6. Fix mypy issues

    kwisniewski98 committed Sep 12, 2024
    Commit 6b3503c
  7. Fix line too long

    kwisniewski98 committed Sep 12, 2024
    Commit 8535d53
  8. Format files

    kwisniewski98 committed Sep 12, 2024
    Commit 27b618a
  9. Remove hardcoded value from softmax in flat_pa (#280)

    This PR removes the hardcoded value used to normalize softmax in flat_pa.
    The current approach is to use the global maximum, as it is very easy to
    compute, but it has the drawback that other samples in a batch might
    slightly affect numerical stability.
    
    This is a first step toward eliminating some of the INF/NaN issues we see
    in certain configurations and is by no means a complete solution; it needs
    to be revised in the future.
    madamczykhabana authored Sep 12, 2024
    Commit 35a4a98
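    A hedged sketch of the normalization choice (not the actual flat_pa
    kernel): subtracting the global maximum before exponentiating is cheap but
    couples all samples in the batch, whereas a per-row maximum isolates them.

    ```python
    import torch

    def softmax_global_max(scores: torch.Tensor) -> torch.Tensor:
        # Current approach: one scalar shift shared by the whole batch.
        exp = (scores - scores.max()).exp()
        return exp / exp.sum(dim=-1, keepdim=True)

    def softmax_per_row_max(scores: torch.Tensor) -> torch.Tensor:
        # Classic numerically stable form; a possible future refinement.
        exp = (scores - scores.max(dim=-1, keepdim=True).values).exp()
        return exp / exp.sum(dim=-1, keepdim=True)
    ```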
  10. Fix yapf detected format issue

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 12, 2024
    Commit 046cb25
  11. some update to vision model

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 12, 2024
    Commit aa4c59c
  12. resolve conflicts

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 12, 2024
    Commit 181babf

Commits on Sep 13, 2024

  1. Increase garbage collector's threshold (#281)

    Increase the garbage collector's threshold in order to reduce its collection frequency.
    kwisniewski98 authored Sep 13, 2024
    Commit 88b06c2
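    For reference, raising the collection thresholds via the standard library
    (the factor below is only an example, not the value used in the PR):

    ```python
    import gc

    # Higher thresholds make generation-0 collections, and the cascading
    # older-generation collections, run less frequently.
    t0, t1, t2 = gc.get_threshold()   # defaults are typically (700, 10, 10)
    gc.set_threshold(t0 * 100, t1, t2)
    ```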
  2. [Bugfix][Habana_main] fix guided_decode HPU failing issue (#236)

    FIX #198
    
    After this change, tool_calls are returned successfully:
    
    ``` bash
    Compiling FSM index for all state transitions: 100%|████████████████████████████████████████████████████████████████████████| 55/55 [00:01<00:00, 32.86it/s]INFO 09-04 02:15:34 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 09-04 02:15:34 logger.py:36] Received request chat-0fd03b03ae05473488d9bce566401d91: prompt: "<|im_start|>user\nWhat's the weather like in Boston today?<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [27, 91, 318, 5011, 91, 29, 882, 198, 3923, 596, 279, 9282, 1093, 304, 10406, 3432, 76514, 91, 318, 6345, 91, 397, 27, 91, 318, 5011, 91, 29, 78191, 198], lora_request: None, prompt_adapter_request: None.
    INFO 09-04 02:15:34 async_llm_engine.py:173] Added request chat-0fd03b03ae05473488d9bce566401d91.
    INFO 09-04 02:15:36 async_llm_engine.py:140] Finished request chat-0fd03b03ae05473488d9bce566401d91.
    INFO:     127.0.0.1:40452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    Message: ChatCompletionMessage(content='', refusal=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-af3eac9372144f959ed0df7e16cf5da4', function=Function(arguments='{ "location": "Boston, MA", "unit": "fahrenheit" }', name='get_current_weather'), type='function')])
    ```
    
    michalkuligowski authored Sep 13, 2024
    Commit 54c1688
  3. fix rotary embedding rotary_dim not equal head_size case (#245)

    For models (like chatglm2/3-6b) whose `rotary_dim` is not equal to
    `head_size`, the current code will crash because the dimensions do not
    match. #212 has a fix that is not robust enough: the chatglm series could
    run, but the chatglm2-6b results were not correct. This fix follows the
    vLLM rotary_embedding PyTorch-native implementation and was verified on
    chatglm2-6b and chatglm3-6b.
    
    michalkuligowski authored Sep 13, 2024
    Commit 8a92591
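    A hedged sketch of the PyTorch-native behavior being followed (names are
    assumed): only the first rotary_dim channels of each head are rotated; the
    remaining head_size - rotary_dim channels pass through unchanged.

    ```python
    import torch

    def apply_partial_rotary(x, cos, sin, rotary_dim: int):
        # x: [..., head_size]; cos/sin: [..., rotary_dim // 2]
        x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
        x1, x2 = x_rot[..., :rotary_dim // 2], x_rot[..., rotary_dim // 2:]
        rotated = torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
        # Channels beyond rotary_dim (e.g. for chatglm2/3-6b) stay untouched.
        return torch.cat((rotated, x_pass), dim=-1)
    ```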
  4. [Bugfix][Habana_main] - dbrx model and arctic model codes fix to remove CUDA hardcode (#217)
    
    FIX #216
    
    michalkuligowski authored Sep 13, 2024
    Commit ffa7174
  5. Add Dockerfile.hpu (#200)

    Add Dockerfile.hpu
    
    FIX #199
    
    michalkuligowski authored Sep 13, 2024
    Commit f4ac1f9
  6. fix ruff detected format error

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 13, 2024
    Commit 1a35da2
  7. fix mypy format error

    Signed-off-by: Chendi.Xue <[email protected]>
    xuechendi committed Sep 13, 2024
    Commit 3b710a6

Commits on Sep 16, 2024

  1. Commit 5abe4d7

Commits on Sep 17, 2024

  1. optimized topp/topk calculation (#195)

    ## One line description
    
    Use topk instead of sort for topp/topk calculation under certain
    conditions (scalar value of p and k).
    
    ## Details
    
    Instead of using `k` for topk, we use `_padded_k`, which is strictly
    larger than `k` and monotonically non-decreasing.
    
    We need `_padded_k > k` for cases where the smallest of the top-k values
    is duplicated beyond position k (for example, for [9,8,8,8,7,7,7] with
    k=3 we get [9,8,8,8], i.e. 4 values instead of 3).
    
    To prevent excessive recompilations, any time we require an expansion of
    `_padded_k` we increment it by a fixed constant `_increment` (usually >1);
    this bucketed approach prevents a proliferation of shapes.
    
    
    ### Basic outline
    
    1. perform topk with `_padded_k`
    2. find the "kth" value in each row (the smallest number that will be in
    the topk); this is the variable `num_duplicates_of_smallest_of_topk`
    3. find the maximum number of duplicates; this variable is
    `max_num_duplicates_of_smallest_of_topk`
    4. check if `_padded_k` is big enough to contain
    `max_num_duplicates_of_smallest_of_topk`; if not, expand `_padded_k` and
    redo the topk with the expanded `_padded_k`
    5. mask out the values that are extra in `_padded_k`
    6. move on to top-p
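    A simplified, hedged sketch of the outline above (variable names follow
    the description; this is not the production sampler code):

    ```python
    import torch

    def topk_with_padded_k(logits, k, padded_k, increment):
        # top-k with a padded k (>= k) instead of a full sort
        vals, idx = torch.topk(logits, padded_k, dim=-1)
        kth = vals[:, k - 1:k]          # smallest value guaranteed to be kept
        # If the padded window still ends on a duplicate of the k-th value,
        # duplicates may extend past it: grow padded_k by a fixed increment
        # (bucketed growth keeps the number of distinct shapes small) and retry.
        while bool((vals[:, -1:] == kth).any()) and padded_k < logits.shape[-1]:
            padded_k = min(padded_k + increment, logits.shape[-1])
            vals, idx = torch.topk(logits, padded_k, dim=-1)
            kth = vals[:, k - 1:k]
        # Mask out entries strictly below the k-th value (ties are kept),
        # then continue with the top-p step on the masked values.
        vals = vals.masked_fill(vals < kth, float("-inf"))
        return vals, idx, padded_k
    ```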
    
    
    ## Perf benefit
    
    ### Using benchmark_throughput.py
    
    To check benefit of this PR, make following change in
    `benchmark_throughput.py`:
    ```
    diff --git a/benchmarks/benchmark_throughput.py b/benchmarks/benchmark_throughput.py
    index ff33e3dc..3383dea8 100644
    --- a/benchmarks/benchmark_throughput.py
    +++ b/benchmarks/benchmark_throughput.py
    @@ -116,8 +116,9 @@ def run_vllm(
             sampling_params.append(
                 SamplingParams(
                     n=n,
    -                temperature=0.0 if use_beam_search else 1.0,
    -                top_p=1.0,
    +                temperature=1.0,  #0.0 if use_beam_search else 1.0,
    +                top_p=0.95,
    +                top_k=20,
                     use_beam_search=use_beam_search,
                     ignore_eos=True,
                     max_tokens=output_len,
    
     ```
    
    
    `VLLM_SKIP_WARMUP=true VLLM_GRAPH_RESERVED_MEM=0.2 VLLM_GRAPH_PROMPT_RATIO=0.8 VLLM_DECODE_BS_BUCKET_MIN=1 VLLM_DECODE_BLOCK_BUCKET_STEP=64 VLLM_DECODE_BLOCK_BUCKET_MIN=64 python benchmark_throughput.py --model /root/sasarkar/llama3-8b/ --device hpu --seed 2024 --backend vllm --num-prompts 100 --dtype bfloat16 --input-len=256 --output-len=512`
    
    In the numbers below there is a **49%** increase in throughput with warmup and a **30%** increase in throughput without warmup.
    
    
    #### with opt + warmup
    
    Processed prompts: 100%|█████████████████████████████████████████████████████████████████████| 100/100 [00:22<00:00,  4.37it/s, est. speed input: 1119.66 toks/s, output: 2239.33 toks/s]
    Throughput: 4.37 requests/s, 3354.58 tokens/s
    
    
    #### with opt + skip warmup
    
    Processed prompts: 100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:46<00:00,  2.17it/s, est. speed input: 556.32 toks/s, output: 1112.63 toks/s]
    Throughput: 2.17 requests/s, 1667.89 tokens/s
    
    
    #### without opt + warmup
    
    Processed prompts: 100%|██████████████████████████████████████████████████████████████████████| 100/100 [00:34<00:00,  2.93it/s, est. speed input: 749.24 toks/s, output: 1498.48 toks/s]
    Throughput: 2.92 requests/s, 2245.74 tokens/s
    
    
    #### without opt + skip warmup
    
    Processed prompts: 100%|███████████████████████████████████████████████████████████████████████| 100/100 [00:59<00:00,  1.67it/s, est. speed input: 428.49 toks/s, output: 856.99 toks/s]
    Throughput: 1.67 requests/s, 1284.85 tokens/s
    
### Using server client
(Data collected by Peter)
Baseline: [baseline](https://github.com/HabanaAI/vllm-fork/commits/a7763a7a76b4531ed7907549724df2949d9225bf/)
All numbers collected on 1.17-495.
Third data column: [branch](https://github.com/HabanaAI/vllm-fork/commits/ae_benchmark_9_10_24/)
    
| model | TP | baseline HPU throughput | baseline HPU + this PR throughput | baseline HPU + this PR + other opt |
    | -------- | ------- | ------- | ------- | ------- |
    | llama3 8b | 1 | 950  | 1296    | 1306 | 
    | llama3 8b | 4 | 1347  | 1969    | 2077 | 
    | llama3 70b | 4 | 368  | 394    | 394 | 
    | qwen 72b | 4 | 731  | 726    | 815 |
    
    
### Without delayed sampling
On habana_main f858d43:
```
VLLM_GRAPH_RESERVED_MEM=0.2 VLLM_GRAPH_PROMPT_RATIO=0.8 VLLM_DECODE_BS_BUCKET_MIN=1 VLLM_DECODE_BLOCK_BUCKET_STEP=64 VLLM_DECODE_BLOCK_BUCKET_MIN=64 python benchmark_throughput.py --model /root/sasarkar/llama3-8b/ --device hpu --seed 2024 --backend vllm --num-prompts 100 --dtype bfloat16 --input-len=256 --output-len=512
```
    
    Without change
    Throughput: 3.32 requests/s, 2550.85 tokens/s
    
    With change:
    Throughput: 5.17 requests/s, 3967.58 tokens/s
    
    
    
    
    ## Extra Notes
1. Works only for the "scalar" case, though it might be possible to extend
the basic idea (topk instead of sort) to the vector case as well. (The
outline would be: find the max k in the topk vector, then perform topk
using that, etc.; this likely needs some bucketing to prevent dynamic
shapes.)
2. An additional check is needed in `_init_sampling_tensors` to determine
whether it is the scalar case. This has a minor perf hit. Ideally, the
caller would indicate from the top that k is a scalar.
3. Some tradeoffs can be made, where we use a sufficiently large
padded_k (which is still smaller than vocab size) from the beginning,
and hope that every case lands within that bucket. Cases that won't land
are expected to be very, very rare. For example, if padded_k = max(2 * k,
100) is used and k = 50, then the smallest of the topk values would have
to repeat 50 times with the same probability, which is exceedingly
unlikely. If we trade off this mathematical improbability, we can do
with only a single topk op, which might be faster.
4. There is a `fliplr` in the code, which could be removed if we could
compute a reverse cumsum. However, the formula for reverse cumsum as
expressed [here](pytorch/pytorch#33520), `x
+ torch.sum(x, dim=1, keepdims=True) - torch.cumsum(x, dim=1)`, is
numerically unstable because of the addition/subtraction. It works well
enough on ints and large numbers, but not on small probability values.
(A small comparison sketch appears after the snippets below.)
5. The value of `k` affects the gains we might get from this. For
example, in the experiment shown above, with k=20, throughput increases
from 1284.85 to 1667.89 (30% gain). But if k = 2000 instead of 20,
throughput increases from 1127.34 to 1289.26 (14% gain). Thus the gain %
may decrease with increasing k, as topk would probably converge
asymptotically to sort's performance for large k. In practice, however,
k is pretty small.
6. For larger models the gains may be smaller, as they are likely more
device bound.
7. Cumsum may be taking a long time; two possible alternatives to try are
below. [Initial
try](b392ff8)
```python
import torch

# cumsum expressed as a masked sum: row i of the result multiplies y by the
# i-th row of a lower-triangular mask and reduces, giving the prefix sum up to i
y = torch.tensor([[1, 2, 3], [4, 5, 6]])
mask1 = torch.tensor([[[1, 0, 0], [1, 1, 0], [1, 1, 1]],
                      [[1, 0, 0], [1, 1, 0], [1, 1, 1]]])
torch.sum(y.unsqueeze(1) * mask1, 2)  # equals torch.cumsum(y, dim=1)
```
    or
```python
import torch.nn.functional as F  # torch imported in the snippet above

# cumsum as a 1D convolution: 4 leading zeros plus a ones kernel of size 5
# turn each sliding-window sum into the prefix sum at that position
F.conv1d(torch.tensor([[[0,0,0,0,1,2,3,4,5]], [[0,0,0,0,6,7,8,9,10.0]]]), torch.ones([1,1,5], dtype=torch.float32))
```
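
As referenced in note 4, here is a small, illustrative comparison of the flip-based reverse cumsum and the closed-form formula; this is not code from this PR.

```python
# Illustrative only: reverse (suffix) cumsum computed two ways. The flip-based
# version is what fliplr + cumsum effectively does; the closed-form version
# from pytorch/pytorch#33520 avoids the flips but suffers cancellation on
# small probability values.
import torch

x = torch.rand(2, 8).softmax(dim=1)  # small probabilities, as in sampling

rev_flip = torch.cumsum(x.flip(1), dim=1).flip(1)
rev_formula = x + torch.sum(x, dim=1, keepdim=True) - torch.cumsum(x, dim=1)

print((rev_flip - rev_formula).abs().max())  # any nonzero difference is cancellation error
```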
    FIX #xxxx (*link existing issues this PR will resolve*)
    
    michalkuligowski authored Sep 17, 2024
    Configuration menu
    Copy the full SHA
    4c1ca3a View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    1a712d5 View commit details
    Browse the repository at this point in the history
  3. [Bugfix][Habana_main] fix multi-modal model inference - tested with l…

    …lava-1.5 (#283)
    
    
    FIX #282  (*link existing issues this PR will resolve*)
    
    michalkuligowski authored Sep 17, 2024
    Configuration menu
    Copy the full SHA
    44c4f93 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    a9de5ba View commit details
    Browse the repository at this point in the history
  5. Update documentation on support of fp8 (#288)

    Update documentation on support of fp8
    michalkuligowski authored Sep 17, 2024
    Configuration menu
    Copy the full SHA
    d39298c View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    ed19acd View commit details
    Browse the repository at this point in the history
  7. Removed vllm.hpu directory and changed relevant imports (#291)

    Moved files from vllm/hpu to another public repo:
    https://github.com/HabanaAI/vllm-hpu-extension
    It can be installed with pip install
    git+https://github.com/HabanaAI/vllm-hpu-extension.git
    tzielinski-habana authored Sep 17, 2024
    Configuration menu
    Copy the full SHA
    6a96d9b View commit details
    Browse the repository at this point in the history
  8. Reduce default value of VLLM_GRAPH_RESERVED_MEM to 0.1 (#292)

    After #252, HPUGraph capture
    takes much less memory, and we can reduce the memory reserved for
    HPUGraphs. On Llama3.1-8b-Instruct (G2), capturing 100% of prefill and
    decode graphs on BS=256 now takes 1.566 GB of HBM, which is far less
    than 40% (~30 GB) we reserve by default. This results in lots of unused
    (==wasted) memory, which could be used instead for more KV cache blocks.
    michalkuligowski authored Sep 17, 2024
    Configuration menu
    Copy the full SHA
    47a89be View commit details
    Browse the repository at this point in the history
  9. fix minor logging issue

    schoi-habana committed Sep 17, 2024
    Configuration menu
    Copy the full SHA
    18d6339 View commit details
    Browse the repository at this point in the history

Commits on Sep 18, 2024

  1. Fix minor logging issue in habana_model_runner.py (#294)

    The original code doesn't print the default value correctly
    
    INFO 09-17 00:06:07 habana_model_runner.py:95]
    VLLM_PROMPT_BS_BUCKET_MIN=1 (default:_**min**_)
    INFO 09-17 00:06:07 habana_model_runner.py:95]
    VLLM_PROMPT_BS_BUCKET_STEP=1 (default:_**step**_)
    INFO 09-17 00:06:07 habana_model_runner.py:95]
    VLLM_PROMPT_BS_BUCKET_MAX=1 (default:_**max**_)
    
This change makes it print the correct default value:
    INFO 09-17 21:30:51 habana_model_runner.py:95]
    VLLM_PROMPT_BS_BUCKET_MIN=1 (default:_**1**_)
    INFO 09-17 21:30:51 habana_model_runner.py:95]
    VLLM_PROMPT_BS_BUCKET_STEP=4 (default:_**32**_)
    INFO 09-17 21:30:51 habana_model_runner.py:95]
    VLLM_PROMPT_BS_BUCKET_MAX=4 (default:_**64**_)
    michalkuligowski authored Sep 18, 2024
    Configuration menu
    Copy the full SHA
    83b54e9 View commit details
    Browse the repository at this point in the history
  2. Fix blocks number calculation for Flat PA (#269)

    Fix blocks number calculation for Flat PA via adding empty table_block
    (#158)
    iboiko-habana authored Sep 18, 2024
    Configuration menu
    Copy the full SHA
    b62fba8 View commit details
    Browse the repository at this point in the history

Commits on Sep 19, 2024

  1. Configuration menu
    Copy the full SHA
    347f9c7 View commit details
    Browse the repository at this point in the history

Commits on Sep 20, 2024

  1. Remove dummy seq group data creation from loop (#301)

    Remove dummy seq metadata from loop for Flat PA fix
    iboiko-habana authored Sep 20, 2024
    Configuration menu
    Copy the full SHA
    cd7b1c1 View commit details
    Browse the repository at this point in the history
  2. optimize qwen2 model on Gaudi (#233)

    Add extra mark_step() on each decode layer to optimize the performance
    on Gaudi.
    
    Signed-off-by: Bob Zhu <[email protected]>
    czhu15 authored Sep 20, 2024
    Configuration menu
    Copy the full SHA
    12d7033 View commit details
    Browse the repository at this point in the history
  3. fix bug: device_str in initialize_ray_cluster requires uppercase stri…

    …ng (#297)
    
    fix bug: device_str in initialize_ray_cluster requires uppercase string
    
Without the bug fix, multi-HPU runs encounter a "ValueError: The number of
required hpus exceeds the total number of available hpus in the
placement group" error: device_str is not uppercase as expected, so the
number of available hpus always comes back as 0.
    hlin99 authored Sep 20, 2024
    Configuration menu
    Copy the full SHA
    bc39baa View commit details
    Browse the repository at this point in the history
  4. Fix Lora Rebase (#290)

    Fixes Lora Related issues in vllm Rebase
    hlahkar authored Sep 20, 2024
    Configuration menu
    Copy the full SHA
    b2653ab View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    82960d8 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    f4d2097 View commit details
    Browse the repository at this point in the history
  7. add missing files

    kzawora-intel committed Sep 20, 2024
    Configuration menu
    Copy the full SHA
    9f8b8e7 View commit details
    Browse the repository at this point in the history
  8. format.sh

    kzawora-intel committed Sep 20, 2024
    Configuration menu
    Copy the full SHA
    346139d View commit details
    Browse the repository at this point in the history
  9. more format.sh

    kzawora-intel committed Sep 20, 2024
    Configuration menu
    Copy the full SHA
    6d45443 View commit details
    Browse the repository at this point in the history
  10. gha update

    kzawora-intel committed Sep 20, 2024
    Configuration menu
    Copy the full SHA
    3a0ff3b View commit details
    Browse the repository at this point in the history
  11. Configuration menu
    Copy the full SHA
    6502b91 View commit details
    Browse the repository at this point in the history
  12. Configuration menu
    Copy the full SHA
    7057da5 View commit details
    Browse the repository at this point in the history
  13. oh come on now

    kzawora-intel committed Sep 20, 2024
    Configuration menu
    Copy the full SHA
    43df762 View commit details
    Browse the repository at this point in the history
  14. fix fakehpu mode

    kzawora-intel committed Sep 20, 2024
    Configuration menu
    Copy the full SHA
    3134b8a View commit details
    Browse the repository at this point in the history

Commits on Sep 23, 2024

  1. Fix calculating slots for warmup (#310)

    Recent changes broke slot sparsity for warmup slots. This commit
    restores the functionality.
    madamczykhabana authored Sep 23, 2024
    Configuration menu
    Copy the full SHA
    f92ffc1 View commit details
    Browse the repository at this point in the history
  2. Removed padding block from a list of available blocks in allocators (#…

    …313)
    
    Block 0 is used for padding. This PR removes the padding block from a
    list of available blocks in block allocators v1 and v2
    tzielinski-habana authored Sep 23, 2024
    Configuration menu
    Copy the full SHA
    63fae51 View commit details
    Browse the repository at this point in the history
  3. Fix seq_len for padding sequences (#318)

    Before the fix we used seq_len=0 for padding samples. This was later
    translated to an empty attention_mask (since we don't have any tokens
    that we should include in calculations) and in turn caused NaNs in
    prompt attention (0 divided by 0). Those NaNs later got propagated to
    kv-cache causing issues in flat_pa.
    madamczykhabana authored Sep 23, 2024
    Configuration menu
    Copy the full SHA
    aa507d4 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    b70a8c2 View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    a844837 View commit details
    Browse the repository at this point in the history
  6. Fix lora specific conditions in profile-run (#317)

#256 breaks the LoRA-specific flow
that was handled through the `is_profile_run` flag to distinguish the
warmup and profile-run phases.

Introduces a new flag `is_lora_profile_run` to handle this LoRA-specific
flow in the profile run.
    vivekgoe authored Sep 23, 2024
    Configuration menu
    Copy the full SHA
    084db0f View commit details
    Browse the repository at this point in the history
  7. TP fixes

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    a9f94be View commit details
    Browse the repository at this point in the history
  8. Run with HPU graphs even when warmup was skipped (#320)

Before this PR we relied on stored information about which configurations
should have HPU graphs enabled. Unfortunately that set was computed
during warmup, so if warmup was skipped we didn't have that information.
This PR allows all buckets to run with HPU graphs enabled when warmup is
skipped.
    madamczykhabana authored Sep 23, 2024
    Configuration menu
    Copy the full SHA
    9bb65b7 View commit details
    Browse the repository at this point in the history
  9. mixtral api fixes

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    2a499c7 View commit details
    Browse the repository at this point in the history
  10. revert debug prints

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    9372734 View commit details
    Browse the repository at this point in the history
  11. format.sh

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    c15ddd2 View commit details
    Browse the repository at this point in the history
  12. Configuration menu
    Copy the full SHA
    f5d254d View commit details
    Browse the repository at this point in the history
  13. Configuration menu
    Copy the full SHA
    e00ab5a View commit details
    Browse the repository at this point in the history
  14. Configuration menu
    Copy the full SHA
    3bb593a View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    f9b222e View commit details
    Browse the repository at this point in the history
  16. prune the easy parts

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    2f23cb7 View commit details
    Browse the repository at this point in the history
  17. prune more easy parts

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    28df6fd View commit details
    Browse the repository at this point in the history
  18. prune lora files

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    c6d2d5a View commit details
    Browse the repository at this point in the history
  19. prune unnecessary docs

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    97c398e View commit details
    Browse the repository at this point in the history
  20. Configuration menu
    Copy the full SHA
    6a913b3 View commit details
    Browse the repository at this point in the history
  21. Move profilers to vllm-hpu-extension (#323)

    Continuation of HabanaAI/vllm-hpu-extension#4
    
    I've also removed is_tpu, as it got mistakenly restored in the rebase.
    It's not in the upstream.
    kzawora-intel authored Sep 23, 2024
    Configuration menu
    Copy the full SHA
    c64dc83 View commit details
    Browse the repository at this point in the history
  22. Configuration menu
    Copy the full SHA
    f56953f View commit details
    Browse the repository at this point in the history
  23. Revert "Add fake HPU mode to Habana components with dummy habana_fram…

    …eworks module. (#250)"
    
    This reverts commit a9de5ba.
    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    c562b02 View commit details
    Browse the repository at this point in the history
  24. fix revert

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    cf3bbd2 View commit details
    Browse the repository at this point in the history
  25. Revert "Initial commit"

    This reverts commit 2ab316d.
    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    09357b4 View commit details
    Browse the repository at this point in the history
  26. cleanup

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    3713da8 View commit details
    Browse the repository at this point in the history
  27. remove redundant import

    kzawora-intel committed Sep 23, 2024
    Configuration menu
    Copy the full SHA
    bb6564a View commit details
    Browse the repository at this point in the history

Commits on Sep 24, 2024

  1. Restore upstream requirements-build.txt (#324)

    At some point, someone added whitespaces to each entry in
    requirements-build.txt. Upstream does not contain it. Easy fix.
    kzawora-intel authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    c968320 View commit details
    Browse the repository at this point in the history
  2. Remove reminder_comment.yml workflow (#325)

    This workflow never worked properly in the fork. This PR removes it.
    kzawora-intel authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    58d5cde View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    cf4c3e5 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    aa5edcc View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    f6ff4a7 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    a000e62 View commit details
    Browse the repository at this point in the history
  7. Fix doc build warnings (#330)

    This PR fixes all the little warnings gaudi-installation.rst introduces
    during documentation build ("WARNING: Title underline too short." etc.)
    kzawora-intel authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    41217cf View commit details
    Browse the repository at this point in the history
  8. fix qwen2 model issue (#329)

    typo: `platform` -> `platforms`
    
    FIX #xxxx (*link existing issues this PR will resolve*)
    
    jikunshang authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    4eb9809 View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    c1232e9 View commit details
    Browse the repository at this point in the history
  10. update docs

    kzawora-intel committed Sep 24, 2024
    Configuration menu
    Copy the full SHA
    20c87dd View commit details
    Browse the repository at this point in the history
  11. Remove vllm.utils.is_hpu() (#331)

vllm.utils.is_hpu() has been redundant for some time and has always been
problematic, particularly for torch.compile mode. Now we're fully
switching to current_platform.is_hpu().
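
For reference, a minimal sketch of the switch described above; the import path is assumed from upstream vLLM.

```python
# Sketch only: replacing the removed helper with the platform check.
from vllm.platforms import current_platform  # assumed import path

# before: vllm.utils.is_hpu()
if current_platform.is_hpu():
    pass  # HPU-specific path
```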
    kzawora-intel authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    9be37a3 View commit details
    Browse the repository at this point in the history
  12. Configuration menu
    Copy the full SHA
    c90e153 View commit details
    Browse the repository at this point in the history
  13. remove get_device

    kzawora-intel committed Sep 24, 2024
    Configuration menu
    Copy the full SHA
    874f3d8 View commit details
    Browse the repository at this point in the history
  14. Remove logger from layernorm (#332)

    Upstream does not use logger in layernorm. Neither do we. No idea why
    it's there.
    kzawora-intel authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    e16918d View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    18b0e98 View commit details
    Browse the repository at this point in the history
  16. Configuration menu
    Copy the full SHA
    347380f View commit details
    Browse the repository at this point in the history
  17. Fix INC FP8 inference after rebase (#333)

    This PR fixes the "RuntimeError: HPU does not have device capability."
    error introduced after rebase & fixes loading weights on CPU for
    quantization.
    kzawora-intel authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    73f4b48 View commit details
    Browse the repository at this point in the history
  18. Configuration menu
    Copy the full SHA
    fc1cf5e View commit details
    Browse the repository at this point in the history
  19. Configuration menu
    Copy the full SHA
    e2f72e3 View commit details
    Browse the repository at this point in the history
  20. Configuration menu
    Copy the full SHA
    b582d77 View commit details
    Browse the repository at this point in the history
  21. Configuration menu
    Copy the full SHA
    b90adac View commit details
    Browse the repository at this point in the history
  22. WA for none load device

    kzawora-intel committed Sep 24, 2024
    Configuration menu
    Copy the full SHA
    d853eeb View commit details
    Browse the repository at this point in the history
  23. Make weights_load_device not change EngineArgs.create_load_config() (#…

    …336)
    
    Some backends rely on calling EngineArgs.create_load_config() directly,
    for which we've altered the API. We don't need to alter it to enable
    weight load device functionality. This PR fixes it.
    kzawora-intel authored Sep 24, 2024
    Configuration menu
    Copy the full SHA
    9111a80 View commit details
    Browse the repository at this point in the history
  24. device type

    kzawora-intel committed Sep 24, 2024
    Configuration menu
    Copy the full SHA
    db8dbce View commit details
    Browse the repository at this point in the history
  25. Revert "fix guided_decode HPU failing issue"

    This reverts commit 8046d81.
    kzawora-intel committed Sep 24, 2024
    Configuration menu
    Copy the full SHA
    c337e93 View commit details
    Browse the repository at this point in the history
  26. load device fix

    kzawora-intel committed Sep 24, 2024
    Configuration menu
    Copy the full SHA
    e8e369f View commit details
    Browse the repository at this point in the history

Commits on Sep 25, 2024

  1. Refine INC shutdown code (#335)

    This PR removes debug printouts in INC shutdown method and covers the
    case where application exits before model is initialized properly.
    kzawora-intel authored Sep 25, 2024
    Configuration menu
    Copy the full SHA
    8c6dcae View commit details
    Browse the repository at this point in the history
  2. Setting enough cache_size_limit for torch.compile warmup (#238)

Fix the issue that warmup sometimes doesn't work because the default
cache_size_limit is only 8 (a sketch of raising it follows below).
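
For illustration, a minimal sketch of raising the limit before warmup, assuming the commit refers to `torch._dynamo.config.cache_size_limit` (whose stock default is 8); the value used here is an arbitrary example.

```python
# Illustrative sketch (not the commit's code): raise dynamo's recompile cache
# limit so the many warmup bucket shapes do not exhaust the default of 8.
import torch._dynamo as dynamo

dynamo.config.cache_size_limit = 128  # example value; size to the number of warmup buckets
```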
    
    ---------
    
    Signed-off-by: zehao-intel <[email protected]>
    Co-authored-by: Andrzej Kotłowski <[email protected]>
    zehao-intel and anko-intel authored Sep 25, 2024
    Configuration menu
    Copy the full SHA
    cef2f54 View commit details
    Browse the repository at this point in the history
  3. Change default values for decode bucket flags (#316)

    Change default values for decode bucket flags
    iboiko-habana authored Sep 25, 2024
    Configuration menu
    Copy the full SHA
    45ee586 View commit details
    Browse the repository at this point in the history
  4. Support loading checkpoints quantized using Autofp8 (#286)

    Support loading
    https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127
    
Skip CUDA checks.
Use scaled_fp8_quant instead of _scaled_mm.
Fix weights and weight_scale for the Gaudi2 float8_e4m3fn range.
    
    ---------
    
    Co-authored-by: Nir David <[email protected]>
    Co-authored-by: Konrad Zawora <[email protected]>
    3 people authored Sep 25, 2024
    Configuration menu
    Copy the full SHA
    29fb5ed View commit details
    Browse the repository at this point in the history

Commits on Sep 26, 2024

  1. Fix torch.compile issue of dispatch key set mismatch (#299)

    ### Issue:
    torch.compile recompiles after warmup because `tensor 'L['input_ids']'
    dispatch key set mismatch. expected DispatchKeySet(HPU, BackendSelect),
    actual DispatchKeySet(HPU, BackendSelect, ADInplaceOrView). `
    
    ### Detail:
    Run script with `TORCH_LOGS="guards"` and get different dispatch key set
    info:
    - warmup:
    ```
    TENSOR_MATCH: check_tensor(L['input_ids'], Tensor, DispatchKeySet(HPU, BackendSelect), torch.int64, device=0, requires_grad=False, size=[2, 1], stride=[1, 1])  # masked_input = input_  # ome/zyuwen/workspace/vllm/habana_main_g3_v2/vllm/model_executor/layers/vocab_parallel_embedding.py:358 in forward
    ```
    - after warmup:
    ```
    TENSOR_MATCH: check_tensor(L['input_ids'], Tensor, DispatchKeySet(HPU, BackendSelect, ADInplaceOrView), torch.int64, device=0, requires_grad=False, size=[2, 1], stride=[1, 1])  # masked_input = input_  # ome/zyuwen/workspace/vllm/habana_main_g3_v2/vllm/model_executor/layers/vocab_parallel_embedding.py:358 in forward 
    ```
    ### Solution:
    The difference in dispatch key set is caused by the
    'torch.inference_mode()' decoration, and here is a simple example:
    ```python
    import torch
    import habana_frameworks.torch as htorch
    
    @torch.inference_mode()
    def func():    
        x = torch.rand(3, 3).to("hpu")    
        print(torch._C._dispatch_key_set(x))
    func() 
    # output: DispatchKeySet(HPU, AutocastHPU)
    ```
    ```python
    import torch
    import habana_frameworks.torch as htorch 
    
    def func():    
        x = torch.rand(3, 3).to("hpu")    
        print(torch._C._dispatch_key_set(x)) 
    func() 
    # output: DispatchKeySet(HPU, ADInplaceOrView, AutogradHPU, AutocastHPU) 
    ```
    
    In vllm-fork, the warmup phase is decorated with
    `torch.inference_mode()` in
    [habana_model_runner.py#L1487-L1488](https://github.com/HabanaAI/vllm-fork/blob/b62fba85ac03326e9f466d8d37e91ae1b14a6511/vllm/worker/habana_model_runner.py#L1487-L1488),
    but the after-warmup phase is not.
    
So in this PR I add the decorator to the `prepare_input_tensors` function
to keep the dispatch key set the same, as illustrated below.
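
A minimal illustration of the fix described above (not the actual diff; the body is elided):

```python
# Illustration only: decorate the after-warmup input-preparation path with
# torch.inference_mode(), matching the warmup phase, so its tensors carry the
# same dispatch key set.
import torch


@torch.inference_mode()
def prepare_input_tensors(seq_group_metadata_list):
    # ... build input_ids, positions and attention metadata for HPU here ...
    ...
```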
    
    
    
    
    Signed-off-by: yuwenzho <[email protected]>
    yuwenzho authored Sep 26, 2024
    Configuration menu
    Copy the full SHA
    4c8a6c6 View commit details
    Browse the repository at this point in the history
  2. Chunk prefill cache writes, remove div_i32 from insert_or_update_cache (

    #289)
    
    Re-implements following PRs for current habana_main:
    #102 (Removing div_i32
    operations from each layer)
    #115 (removing scatter for
    reshape&cache in case of prompt)
    
    Accuracy (GSM8K on Llama3.1-8B-Instruct):
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
| gsm8k_cot_llama | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8415 | ± | 0.0101 |
|  |  | strict-match | 8 | exact_match | ↑ | 0.8400 | ± | 0.0101 |
    
    I've benchmarked this change on Llama3.1-8B-Instruct and on average,
    +2.50% throughput gain (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) can
    be observed across all prefill buckets on G2, with up to +4.40% (+956.79
    tok/s, ~25031 -> ~25988 tok/s) throughput increase in compute-bound
    scenarios.
    kzawora-intel authored Sep 26, 2024
    Configuration menu
    Copy the full SHA
    1c6bada View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    fccaca0 View commit details
    Browse the repository at this point in the history
  4. Update cpu-test.yml

    kzawora-intel authored Sep 26, 2024
    Configuration menu
    Copy the full SHA
    5ffcfa3 View commit details
    Browse the repository at this point in the history

Commits on Sep 27, 2024

  1. Fix runtime errors reported when using long input sequence lengths wi…

    …th LoRA (#339)
    
This PR has the following fixes:
    - Increase size of indices tensors used to maintain multi-lora state
    information from max_num_batched_tokens to 3*max_num_batched_tokens.
    This increase is done to provide buffer for padding done in batch &
    sequence dimensions.
    - Move logic to remove padding from lora_logits from execute_model()
    back to Class LogitsProcessorWithLoRA, this is done to fix race
    condition caused by updating multi-lora state information directly.
    
    FIX #237
    vivekgoe authored Sep 27, 2024
    Configuration menu
    Copy the full SHA
    c3577af View commit details
    Browse the repository at this point in the history
  2. vLLM 0.6.2 rebase (#340)

    you know the drill
    kzawora-intel authored Sep 27, 2024
    Configuration menu
    Copy the full SHA
    f347a84 View commit details
    Browse the repository at this point in the history
  3. Enable Async output process for HPU (#342)

    
This PR refers to [vllm-project#7049](vllm-project#7049)
to implement the Asynchronous Output Processor on HPU. It is enabled by
default; to disable it, pass the `--disable_async_output_proc`
flag.
    
    From my local test on latest habana_main branch(commit
    29fb5ed), the throughput improves from
    3847 TPS to 4011 TPS.
    
    zhouyu5 authored Sep 27, 2024
    Configuration menu
    Copy the full SHA
    ed85058 View commit details
    Browse the repository at this point in the history

Commits on Sep 30, 2024

  1. Port last_bucket change from v1.18.0 (#347)

    Port last_bucket change from v1.18.0
    iboiko-habana authored Sep 30, 2024
    Configuration menu
    Copy the full SHA
    b611e20 View commit details
    Browse the repository at this point in the history
  2. Add setuptools_scm to requirements-hpu.txt (#349)

    This removes the installation crash caused by a dependency that is
    listed in requirements-build.txt.
    kzawora-intel authored Sep 30, 2024
    Configuration menu
    Copy the full SHA
    3010f8c View commit details
    Browse the repository at this point in the history
  3. test_lora_manager fix

    rsshaik1 committed Sep 30, 2024
    Configuration menu
    Copy the full SHA
    44d8173 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    188bd3a View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    f59495a View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    b0a9d02 View commit details
    Browse the repository at this point in the history

Commits on Oct 1, 2024

  1. Configuration menu
    Copy the full SHA
    70f544c View commit details
    Browse the repository at this point in the history
  2. Added changes of HPU flags

    rsshaik1 committed Oct 1, 2024
    Configuration menu
    Copy the full SHA
    ec34f88 View commit details
    Browse the repository at this point in the history
  3. Fixed lora manager tests (#315)

    Added the HPU-related changes, alongside the GPU ones, to the
    conftest.py file and test_lora_manager_hpu.py.
    vivekgoe authored Oct 1, 2024
    Configuration menu
    Copy the full SHA
    c7b1509 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    cafff17 View commit details
    Browse the repository at this point in the history

Commits on Oct 2, 2024

  1. Oct 01 rebase (#353)

    kzawora-intel authored Oct 2, 2024
    Configuration menu
    Copy the full SHA
    25f4ed9 View commit details
    Browse the repository at this point in the history

Commits on Oct 3, 2024

  1. Lora Mask based on lora index (#348)

    Changes the filling of the LoRA mask from lora_id to lora_index. This is
    needed to ensure that the mask does not fail when a LoRA id is greater
    than max_loras.
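    
    As an illustrative sketch only (the function name, shapes, and the use of
    -1 for "no LoRA" are hypothetical, not vLLM's actual implementation), the
    point is that a mask keyed by slot index always stays within
    [0, max_loras):
    ```python
    import torch

    def make_lora_mask(token_lora_indices: torch.Tensor,
                       max_loras: int) -> torch.Tensor:
        """One-hot mask keyed by LoRA slot index (0..max_loras-1).

        Keying by slot index is always in range, whereas keying by the raw
        lora_id could exceed max_loras and index out of bounds.
        """
        num_tokens = token_lora_indices.shape[0]
        mask = torch.zeros(num_tokens, max_loras, dtype=torch.bool)
        valid = token_lora_indices >= 0  # -1 marks "no LoRA" in this sketch
        mask[valid, token_lora_indices[valid]] = True
        return mask

    # 4 tokens mapped to LoRA slots 0, 2, none, 1 with max_loras=4.
    print(make_lora_mask(torch.tensor([0, 2, -1, 1]), max_loras=4))
    ```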
    hlahkar authored Oct 3, 2024
    Configuration menu
    Copy the full SHA
    da03d8b View commit details
    Browse the repository at this point in the history
  2. Add rope_scaling support for LLama3.1 (#356)

    Add support for RoPE scaling and FusedRoPE in Llama 3.1.
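    
    For context, Llama 3.1 checkpoints carry their RoPE scaling parameters in
    the HF config, so on the user side loading the model is enough; a hedged
    sketch (the model name is just an example):
    ```python
    from vllm import LLM, SamplingParams

    # Llama 3.1 ships rope_scaling (rope_type="llama3", factor=8.0, ...) in
    # its config.json; with this change the HPU FusedRoPE path honors it too.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=8192)
    out = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=8))
    print(out[0].outputs[0].text)
    ```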
    kdamaszk authored Oct 3, 2024
    Configuration menu
    Copy the full SHA
    f848d27 View commit details
    Browse the repository at this point in the history

Commits on Oct 4, 2024

  1. [Core] Support Torch profiler in Habana Worker (#357)

    This PR allows profiling execution on HPU through the
    VLLM_TORCH_PROFILER_DIR flag, similar to how it is done for GPU.
    The profiling can be controlled:
    1. Asynchronously, by posting requests to the server:
    a) to start collecting a profile:
    `curl -X POST http://localhost:8080/start_profile`
    b) to stop collecting a profile:
    `curl -X POST http://localhost:8080/stop_profile`
    2. In a script, by instructing the LLM object to start and stop profiling:
    ```python
    from vllm import LLM, SamplingParams
    llm = LLM(...)
    llm.start_profile()
    llm.stop_profile()
    ```
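    
    A minimal sketch of the full in-script flow, assuming the profiler output
    directory is set via the environment before the LLM object is created
    (the directory and model name below are placeholders):
    ```python
    import os

    # Hypothetical setup: point the profiler at an output directory before
    # constructing the engine, then bracket a generate call with profiling.
    os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")
    llm.start_profile()
    llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    llm.stop_profile()
    ```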
    mswiniarsk authored Oct 4, 2024
    Configuration menu
    Copy the full SHA
    d8ba780 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    250487b View commit details
    Browse the repository at this point in the history
  3. oopsie

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    eb095b3 View commit details
    Browse the repository at this point in the history
  4. format.sh

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    65fa6f6 View commit details
    Browse the repository at this point in the history
  5. make yapf happy

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    0576360 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    7f73cc9 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    b4e26d3 View commit details
    Browse the repository at this point in the history
  8. [Refactor] Rename components *Habana* -> *HPU* (#359)

    Refactoring Gaudi-specific components to use `hpu` name instead of
    `habana` (e.g. `habana_model_runner.py` -> `hpu_model_runner.py`,
    `habana_executor.py` -> `hpu_executor.py`, etc.), as suggested in the
    upstream PR.
    kzawora-intel authored Oct 4, 2024
    Configuration menu
    Copy the full SHA
    cfe231d View commit details
    Browse the repository at this point in the history
  9. Oct 04 rebase (#360)

    kzawora-intel authored Oct 4, 2024
    Configuration menu
    Copy the full SHA
    38e60f4 View commit details
    Browse the repository at this point in the history
  10. Configuration menu
    Copy the full SHA
    76cbbb5 View commit details
    Browse the repository at this point in the history
  11. Configuration menu
    Copy the full SHA
    95a7ece View commit details
    Browse the repository at this point in the history
  12. Configuration menu
    Copy the full SHA
    d7d609f View commit details
    Browse the repository at this point in the history
  13. remove lora test

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    c07cbc6 View commit details
    Browse the repository at this point in the history
  14. revert FP8 changes

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    d90bbce View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    84dc6c5 View commit details
    Browse the repository at this point in the history
  16. Configuration menu
    Copy the full SHA
    f7288de View commit details
    Browse the repository at this point in the history
  17. Configuration menu
    Copy the full SHA
    6899c3f View commit details
    Browse the repository at this point in the history
  18. fp8 leftovers

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    e5d640e View commit details
    Browse the repository at this point in the history
  19. Configuration menu
    Copy the full SHA
    25388e2 View commit details
    Browse the repository at this point in the history
  20. Configuration menu
    Copy the full SHA
    b4f7ffa View commit details
    Browse the repository at this point in the history
  21. oopsie

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    43959db View commit details
    Browse the repository at this point in the history
  22. format.sh

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    b8404ad View commit details
    Browse the repository at this point in the history
  23. fix comment length

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    d38564f View commit details
    Browse the repository at this point in the history
  24. Merge remote-tracking branch 'origin/private/kzawora/hpu_attn' into p…

    …rivate/kzawora/pruned_habana_main
    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    eed1b05 View commit details
    Browse the repository at this point in the history
  25. Merge remote-tracking branch 'origin/private/kzawora/hpu_bf16_default…

    …' into private/kzawora/pruned_habana_main
    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    5c3e29c View commit details
    Browse the repository at this point in the history
  26. fix comment

    kzawora-intel committed Oct 4, 2024
    Configuration menu
    Copy the full SHA
    33c1db0 View commit details
    Browse the repository at this point in the history
  27. Configuration menu
    Copy the full SHA
    05777e0 View commit details
    Browse the repository at this point in the history

Commits on Oct 7, 2024

  1. Configuration menu
    Copy the full SHA
    1f6de5d View commit details
    Browse the repository at this point in the history
  2. [Refactor] Rename HabanaAttention -> HPUAttention (#362)

    I've missed the attention backend in
    #359
    kzawora-intel authored Oct 7, 2024
    Configuration menu
    Copy the full SHA
    ad08dd4 View commit details
    Browse the repository at this point in the history
  3. Use BF16 on HPU by default (#361)

    We don't *officially* support FP16, and for the most part, we use BF16
    wherever we can. This removes the need to specify `--dtype bfloat16`:
    when `dtype` is not provided (i.e. it is `auto`) and the model's default
    data type is `float16`, we cast it to `bfloat16` for HPU.
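    
    A hedged usage sketch of what this means in practice (the model name is
    only an example; the dtype attribute path may differ between versions):
    ```python
    from vllm import LLM

    # With dtype left as "auto", an FP16-default checkpoint is now loaded as
    # bfloat16 on HPU; no explicit --dtype bfloat16 / dtype="bfloat16" needed.
    llm = LLM(model="meta-llama/Llama-2-7b-hf")  # dtype defaults to "auto"
    print(llm.llm_engine.model_config.dtype)  # expected: torch.bfloat16 on HPU
    ```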
    kzawora-intel authored Oct 7, 2024
    Configuration menu
    Copy the full SHA
    e00750e View commit details
    Browse the repository at this point in the history
  4. Set vllm-hpu-extension to 36c7f9c (#365)

    This includes: HabanaAI/vllm-hpu-extension#8
    (BlockSoftmax: fix guard value for fp16)
    madamczykhabana authored Oct 7, 2024
    Configuration menu
    Copy the full SHA
    db5aed6 View commit details
    Browse the repository at this point in the history
  5. Add AliBi to supported features in README_GAUDI.md (#287)

    ALiBi was fixed in #254, so it
    should be added to supported features in README.
    kzawora-intel authored Oct 7, 2024
    Configuration menu
    Copy the full SHA
    902f575 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    27c05e1 View commit details
    Browse the repository at this point in the history
  7. format.sh

    kzawora-intel committed Oct 7, 2024
    Configuration menu
    Copy the full SHA
    bb4c23e View commit details
    Browse the repository at this point in the history
  8. Fix hpu_set_env call in load_model in vllm (#364)

    Yantom1 authored Oct 7, 2024
    Configuration menu
    Copy the full SHA
    563184a View commit details
    Browse the repository at this point in the history

Commits on Oct 8, 2024

  1. Update offline_inference_fakehpu.py

    Beam search was removed from SamplingParams. In this example it was set to False; this commit removes that setting.
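    
    For reference, a sketch of what the example's sampling setup looks like
    with the removed flag simply dropped (values are illustrative):
    ```python
    from vllm import SamplingParams

    # use_beam_search is no longer a SamplingParams field, so it is omitted.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    ```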
    michalkuligowski authored Oct 8, 2024
    Configuration menu
    Copy the full SHA
    0e46492 View commit details
    Browse the repository at this point in the history
  2. Timeout adjusted in MLLMEngine (#368)

    Currently the multiprocess LLMEngine uses a polling timeout fixed to
    10000 ms. This is a problem when running torch-compiled models that
    happen to compile after warmup (i.e. a particular configuration -- shape
    -- was not warmed up during the warmup phase); in that case 10000 ms is
    not enough, so there should be a way to modify the fixed timeout.
    
    The changes discussed here replace the fixed timeout of 10000 ms with
    the value provided via VLLM_RPC_TIMEOUT.
    
    Please suggest if a separate env var should be made.
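    
    A minimal sketch of the idea (the default and the polling snippet are
    illustrative, not the exact code):
    ```python
    import os

    # Poll timeout for the multiprocess engine, overridable via
    # VLLM_RPC_TIMEOUT (in milliseconds) instead of hard-coded 10000 ms.
    VLLM_RPC_TIMEOUT = int(os.getenv("VLLM_RPC_TIMEOUT", "10000"))

    # ... later, in the polling loop (illustrative):
    # if socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
    #     raise TimeoutError("No response from the engine process")
    ```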
    
    Co-authored-by: Jacek Czaja <[email protected]>
    jczaja and Jacek Czaja authored Oct 8, 2024
    Configuration menu
    Copy the full SHA
    6028354 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    64369fd View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    69fb91c View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    1ee20c5 View commit details
    Browse the repository at this point in the history
  6. Make workaround for SW-204785 broader (#374)

    A PT bridge bug in recent Synapse builds causes PyTest to return 0
    unconditionally. The previous workaround fixed that issue when a
    comparison failed, but left out the case in which vLLM (or anything
    else) actually crashes during test execution. This patch broadens the
    workaround to catch any exception and adds an atexit callback when any
    test fails.
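    
    A hedged, conftest-style sketch of the broadened workaround (the hooks
    are standard pytest; the exit logic is illustrative, not the exact vLLM
    patch):
    ```python
    # conftest.py - illustrative sketch
    import atexit
    import os

    import pytest

    def _fail_hard():
        # Bypass the broken exit-code path in the PT bridge by exiting
        # the interpreter directly with a nonzero status.
        os._exit(1)

    @pytest.hookimpl(hookwrapper=True)
    def pytest_runtest_makereport(item, call):
        outcome = yield
        report = outcome.get_result()
        # Trigger on any failure, including crashes/exceptions during the
        # test, not only on failed comparisons.
        if report.failed:
            atexit.register(_fail_hard)
    ```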
    kzawora-intel authored Oct 8, 2024
    Configuration menu
    Copy the full SHA
    388e500 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    8f79b6e View commit details
    Browse the repository at this point in the history

Commits on Oct 9, 2024

  1. Configuration menu
    Copy the full SHA
    ca98dae View commit details
    Browse the repository at this point in the history

Commits on Oct 10, 2024

  1. Fix LoRA tests by handling broken import (#376)

    This PR fixes the broken import in test_lora_hpu.py 
    
    Issue: https://jira.habana-labs.com/browse/SW-204811
    vivekgoe authored Oct 10, 2024
    Configuration menu
    Copy the full SHA
    4030216 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    b70c1a5 View commit details
    Browse the repository at this point in the history

Commits on Oct 11, 2024

  1. Disable performance counters if profiler is not enabled (#383)

    Currently, if `HabanaHighLevelProfiler` is not enabled,
    `HabanaProfilerCounterHelper` still collects statistics that will never
    be used. This creates additional host overhead that can be removed.
    With this change, performance statistics are collected only when the
    profiler is enabled; a simplified sketch follows the screenshots below.
    
    Potential gain on `prepare_model_input`:
    - before
    <img width="437" alt="image"
    src="https://github.com/user-attachments/assets/c351c6be-2757-455d-a005-b34e97d47fd6">
    
    - after
    <img width="401" alt="image"
    src="https://github.com/user-attachments/assets/80b7c1d1-051e-4a64-9e7c-eff9cc8d9558">
    kdamaszk authored Oct 11, 2024
    Configuration menu
    Copy the full SHA
    49444bc View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    d6bd375 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    4f1787b View commit details
    Browse the repository at this point in the history

Commits on Oct 12, 2024

  1. Remove constraints for bucket creation during warmup in LoRA (#382)

    This PR removes LoRA constraints during bucket creation in warm-up.
    Fixes a high drop in decode throughput when LoRA is enabled for a given
    configuration.
    vivekgoe authored Oct 12, 2024
    Configuration menu
    Copy the full SHA
    6cd4694 View commit details
    Browse the repository at this point in the history

Commits on Oct 14, 2024

  1. seed_everything function doesn't handle HPU (#384)

    This PR adds manual seed setting for HPU in the function
    `seed_everything`.
    
    Previously torch.manual_seed was set to the given seed; that call was
    removed in the following PR:
    6ffa3f3
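    
    A minimal sketch of the idea, assuming the Habana PyTorch bridge exposes
    a torch.hpu seeding API analogous to torch.cuda (the guard used in vLLM
    may differ):
    ```python
    import random

    import numpy as np
    import torch

    def seed_everything_sketch(seed: int) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        # HPU path: only attempt when the Habana bridge provides torch.hpu.
        if hasattr(torch, "hpu") and torch.hpu.is_available():
            torch.hpu.manual_seed_all(seed)
    ```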
    SanjuCSudhakaran authored Oct 14, 2024
    Configuration menu
    Copy the full SHA
    d8f2aa7 View commit details
    Browse the repository at this point in the history
  2. Fixed lora_manager tests with hpu_model_runner (#386)

    The lora_manager tests have been fixed to follow the recent rename of
    habana_model_runner to hpu_model_runner.
    rsshaik1 authored Oct 14, 2024
    Configuration menu
    Copy the full SHA
    03b407b View commit details
    Browse the repository at this point in the history
  3. Reformat README_GAUDI.md (#389)

    This PR removes the awkward line breaks in README_GAUDI.md and uses
    appropriate markdown formatting instead of RST. Rendered document should
    look the same.
    kzawora-intel authored Oct 14, 2024
    Configuration menu
    Copy the full SHA
    ebd42c4 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    2d2bf7a View commit details
    Browse the repository at this point in the history
  5. Remove workaround added to resolve multi-card stall issue (#387)

    This PR removes the additional `multiprocessing.Process` object created
    as a workaround for resolving a multi-card stall issue.
    SanjuCSudhakaran authored Oct 14, 2024
    Configuration menu
    Copy the full SHA
    9df1d4a View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    9777c9f View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    5ceda69 View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    3e6a2d4 View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    9ac52ab View commit details
    Browse the repository at this point in the history
  10. Oct 7 rebase (#367)

    kzawora-intel authored Oct 14, 2024
    Configuration menu
    Copy the full SHA
    57bc31d View commit details
    Browse the repository at this point in the history

Commits on Oct 15, 2024

  1. Configuration menu
    Copy the full SHA
    55dd07e View commit details
    Browse the repository at this point in the history
  2. [CI] Temporarily increase test tolerances (#392)

    This PR raises the allowed relative tolerance in GSM8K to 0.06 and moves
    the Llama-70B test from 2xG2 to 4xG2 until memory usage is investigated
    (successful run: vLLM-CI-Pipeline/206).
    kzawora-intel authored Oct 15, 2024
    Configuration menu
    Copy the full SHA
    401f5ae View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    e598f3f View commit details
    Browse the repository at this point in the history

Commits on Oct 16, 2024

  1. Configuration menu
    Copy the full SHA
    f77435d View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    0783d18 View commit details
    Browse the repository at this point in the history
  3. remove jenkins files

    kzawora-intel committed Oct 16, 2024
    Configuration menu
    Copy the full SHA
    2fa46cd View commit details
    Browse the repository at this point in the history
  4. restore README.md

    kzawora-intel committed Oct 16, 2024
    Configuration menu
    Copy the full SHA
    3683db6 View commit details
    Browse the repository at this point in the history
  5. remove fakehpu

    kzawora-intel committed Oct 16, 2024
    Configuration menu
    Copy the full SHA
    91af5da View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    d2ce468 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    b6428cd View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    5149278 View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    f4b356f View commit details
    Browse the repository at this point in the history
  10. remove hpu fused_moe

    kzawora-intel committed Oct 16, 2024
    Configuration menu
    Copy the full SHA
    3eee00d View commit details
    Browse the repository at this point in the history
  11. Remove HPU changes from cache_engine.py (#400)

    We were asked on the upstream PR to remove our changes from
    cache_engine.py. This PR does just that and creates HPUCacheEngine,
    which inherits from CacheEngine and overrides only the
    _allocate_kv_cache method.
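    
    A hedged structural sketch of that inheritance (the real
    _allocate_kv_cache body is HPU-specific and omitted here; the import path
    reflects the layout at the time of this PR):
    ```python
    from typing import List

    import torch
    from vllm.worker.cache_engine import CacheEngine

    class HPUCacheEngine(CacheEngine):
        """Keep upstream CacheEngine untouched; only KV-cache allocation differs."""

        def _allocate_kv_cache(self, num_blocks: int,
                               device: str) -> List[torch.Tensor]:
            # HPU-specific allocation (e.g. a different tensor layout) would
            # go here; this sketch simply defers to the parent implementation.
            return super()._allocate_kv_cache(num_blocks, device)
    ```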
    kzawora-intel authored Oct 16, 2024
    Configuration menu
    Copy the full SHA
    a59fc7b View commit details
    Browse the repository at this point in the history
  12. Configuration menu
    Copy the full SHA
    c07951b View commit details
    Browse the repository at this point in the history
  13. Configuration menu
    Copy the full SHA
    398c5c3 View commit details
    Browse the repository at this point in the history
  14. Configuration menu
    Copy the full SHA
    f79d454 View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    8b6e30d View commit details
    Browse the repository at this point in the history

Commits on Oct 17, 2024

  1. [bucketing overhaul 1/n] Add padding-aware scheduling and option to l…

    …imit prefill batch size (#394)
    
    This PR adds the following functionality that can be enabled via engine
    flags:
    - use_padding_aware_scheduling - the vLLM scheduler will now calculate
    token cost considering the padded prefill shape (similar to
    #109).
    - max_num_prefill_seqs - the padding-aware scheduler will perform an
    additional check on the prefill batch size and will effectively limit
    it to a maximum of `max_num_prefill_seqs`. If unset, the max prefill
    batch size will be `max_num_seqs`.
    Both features are generic and do not require HPU, although they may be
    specialized for a particular vendor's usage. Padding-aware scheduling
    includes a padding function selector which selects the HPU padding
    function (considering the currently used HPU buckets) if the current
    device is HPU. Otherwise, it takes the product batch_size x max_seq_len.
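    
    As a hedged sketch of the non-HPU fallback described above (bucket-aware
    HPU padding is omitted; the function name is illustrative):
    ```python
    def padded_prefill_tokens(batch_size: int, seq_lens: list[int]) -> int:
        """Token cost of a prefill batch when every sequence is padded to the
        longest one, i.e. the product batch_size x max_seq_len above."""
        return batch_size * max(seq_lens)

    # Example: 4 prompts of lengths 17, 100, 512, 30 are budgeted as 4 * 512.
    assert padded_prefill_tokens(4, [17, 100, 512, 30]) == 2048
    ```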
    kzawora-intel authored Oct 17, 2024
    Configuration menu
    Copy the full SHA
    05bcdf5 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    c11f23a View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    78a816c View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    640f0be View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    e894746 View commit details
    Browse the repository at this point in the history
  6. cleanup

    kzawora-intel committed Oct 17, 2024
    Configuration menu
    Copy the full SHA
    5bc3985 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    14f8af4 View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    65e34f6 View commit details
    Browse the repository at this point in the history
  9. doc fixes

    kzawora-intel committed Oct 17, 2024
    Configuration menu
    Copy the full SHA
    4757350 View commit details
    Browse the repository at this point in the history
  10. Configuration menu
    Copy the full SHA
    4c306cf View commit details
    Browse the repository at this point in the history