[Usage]: How to run FP8 inference #453

warlock135 opened this issue Nov 3, 2024 · 1 comment

warlock135 commented Nov 3, 2024

Your current environment

Version: v0.5.3.post1+Gaudi-1.18.0
Models: [Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)
Hardware: 8xHL-225

How would you like to use vllm

I'm trying to run FP8 inference on Meta-Llama-3-70B-Instruct using vLLM. I successfully launched the server with the following command:

QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu
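
Inference is then triggered through the server's OpenAI-compatible HTTP API. The exact client call isn't reproduced here; the request below is only a representative sketch (the prompt and sampling parameters are placeholders):

curl http://localhost:9002/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3-70B-Instruct",
        "prompt": "Explain FP8 quantization in one sentence.",
        "max_tokens": 64
      }'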

However, when I started inference, vLLM reported the following error:

ERROR 11-03 05:27:49 async_llm_engine.py:671] Engine iteration timed out. This should never happen!
ERROR 11-03 05:27:49 async_llm_engine.py:56] Engine background task failed
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     done, _ = await asyncio.wait(
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return await _wait(fs, timeout, return_when, loop)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 11-03 05:27:49 async_llm_engine.py:56]     await waiter
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 11-03 05:27:49 async_llm_engine.py:56]
ERROR 11-03 05:27:49 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 11-03 05:27:49 async_llm_engine.py:56]     return_value = task.result()
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
ERROR 11-03 05:27:49 async_llm_engine.py:56]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 11-03 05:27:49 async_llm_engine.py:56]     self._do_exit(exc_type)
ERROR 11-03 05:27:49 async_llm_engine.py:56]   File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 11-03 05:27:49 async_llm_engine.py:56]     raise asyncio.TimeoutError
ERROR 11-03 05:27:49 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7476881fcd90>>)(<Task finishe...imeoutError()>) at /vllm-fork/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 644, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 643, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/vllm-fork/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/vllm-fork/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/vllm-fork/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause

In addition, the warm-up phase with this setup took about 10 hours to complete.
What is the correct way to run FP8 inference with this vLLM fork?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@afierka-intel

Hello @warlock135.

Thank you for the very detailed description! One detail that is missing is which branch you used; please use the habana_main branch, and then you can set the following environment variables:

export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600  # the timeout you are hitting right now; value is in seconds
export VLLM_RPC_TIMEOUT=600000  # a timeout you may hit later; value is in milliseconds

You can test your server while skipping the warmup stage via this environment variable:

export VLLM_SKIP_WARMUP=true

This can save you a lot of warmup time.
NOTE: We do not recommend running the vLLM server without warmup in a production environment, but this option is useful for development and testing.

To summarize, this command should help you quickly verify that the configuration is working (a sample smoke-test request follows the command):

VLLM_SKIP_WARMUP=true \
VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu
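
Once the server is up, a quick smoke test along these lines should confirm the endpoint responds (the model name matches the --model argument above; the prompt and token count are arbitrary placeholders):

# List the served models, then request a short chat completion.
curl http://localhost:9002/v1/models

curl http://localhost:9002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16
      }'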

And this one should work fine in production:

VLLM_ENGINE_ITERATION_TIMEOUT_S=3600 \
VLLM_RPC_TIMEOUT=600000 \
QUANT_CONFIG=/work/Meta-Llama-3-70B-Instruct-FP8-Inc/meta-llama-3-70b-instruct/maxabs_quant_g2.json \
VLLM_DECODE_BLOCK_BUCKET_MAX=6144 \
VLLM_PROMPT_SEQ_BUCKET_MAX=6144 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
python3 -m vllm.entrypoints.openai.api_server --model Meta-Llama-3-70B-Instruct \
--port 9002 --gpu-memory-utilization 0.94 --tensor-parallel-size 8 \
--disable-log-requests --block-size 128 --quantization inc \
--kv-cache-dtype fp8_inc --device hpu --weights-load-device hpu

Once the recommended solution works for you, please close the issue; otherwise, I'm open to further discussion.
