Your current environment

How would you like to use vllm

I am testing the offline performance using benchmark_latency.py, and I found that E2E throughput neither increases nor decreases when I change the prompt bucket shape, even when I use (1, 1).
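For reference, prompt bucket shapes like the two configs below are presumably set through the Gaudi fork's bucketing environment variables before the engine starts; a sketch assuming the VLLM_PROMPT_* knobs documented for HabanaAI/vllm-fork:

```python
import os

# Hypothetical launch setup: pin prompt bucketing to a single (1, 1) bucket.
# Env var names follow the HabanaAI/vllm-fork docs; they must be set before
# the engine is constructed (e.g. before running benchmark_latency.py).
os.environ["VLLM_PROMPT_BS_BUCKET_MIN"] = "1"
os.environ["VLLM_PROMPT_BS_BUCKET_STEP"] = "1"
os.environ["VLLM_PROMPT_BS_BUCKET_MAX"] = "1"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MIN"] = "1"
os.environ["VLLM_PROMPT_SEQ_BUCKET_STEP"] = "1"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MAX"] = "1"
```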
Prompt bucket config (min, step, max_warmup) bs:[1, 1, 1], seq:[1, 1, 1]:
6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Prompt captured:1 (100.0%) used_mem:0 B buckets:[(1, 1)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:12:47 habana_model_runner.py:1206] Warmup finished in 45 secs, allocated 3.451 GiB of device memory
6: INFO 08-28 11:12:47 habana_executor.py:91] init_cache_engine took 46.45 GiB of device memory (61.43 GiB/94.62 GiB used) and 2.484 GiB of host memory (61.33 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:08<00:00, 13.71s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.64s/it]
6: E2E Throughput: 1200.877 tokens/sec.
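For context, the bucket lists in the warmup logs above are consistent with expanding each (min, step, max_warmup) triple by ramping up in powers of two while below the step size, then stepping linearly up to the max; the warmed-up buckets are then the cross product of the bs and seq ranges. A minimal sketch of that expansion (my reconstruction, not necessarily the exact habana_model_runner.py code):

```python
import itertools
import operator

def warmup_range(config):
    """Expand a (min, step, max) bucket config into concrete sizes:
    powers of two from min while below step, then multiples of step up to max."""
    bmin, bstep, bmax = config
    ramp_up = itertools.accumulate(itertools.repeat(2), func=operator.mul, initial=bmin)
    ramp_up = itertools.takewhile(lambda x: x < bstep and x <= bmax, ramp_up)
    stable = range(bstep, bmax + 1, bstep)
    return list(ramp_up) + list(stable)

print(warmup_range((1, 1, 1)))         # [1] -> the single (1, 1) prompt bucket above
print(warmup_range((1, 128, 128)))     # [1, 2, 4, ..., 128] -> decode bs buckets
print(warmup_range((128, 128, 1152)))  # [128, 256, ..., 1152] -> decode seq buckets
```

Under this expansion, the decode buckets form the 8 x 9 = 72 (bs, seq) combinations reported as "Graph/Decode captured:72" in both runs.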
Prompt bucket config (min, step, max_warmup) bs:[1, 64, 128], seq:[128, 128, 1024]:
6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Prompt captured:18 (28.1%) used_mem:19.79 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (4, 128), (4, 256), (4, 384), (8, 128)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1128] Graph/Decode captured:72 (100.0%) used_mem:3.239 GiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (32, 1152), (64, 128), (64, 256), (64, 384), (64, 512), (64, 640), (64, 768), (64, 896), (64, 1024), (64, 1152), (128, 128), (128, 256), (128, 384), (128, 512), (128, 640), (128, 768), (128, 896), (128, 1024), (128, 1152)]
6: INFO 08-28 11:25:56 habana_model_runner.py:1206] Warmup finished in 146 secs, allocated 23.03 GiB of device memory
6: INFO 08-28 11:25:56 habana_executor.py:91] init_cache_engine took 61.18 GiB of device memory (85.15 GiB/94.62 GiB used) and 2.813 GiB of host memory (61.55 GiB/1007 GiB used)
6: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
6: Warming up...
Warmup iterations: 100%|██████████| 5/5 [01:07<00:00, 13.57s/it]
Profiling iterations: 100%|██████████| 10/10 [02:16<00:00, 13.61s/it]
6: E2E Throughput: 1203.730 tokens/sec.
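One plausible reason the two runs land on essentially the same number: with ignore_eos=True and max_tokens=128, every profiling iteration generates a fixed amount of decode work, and the single prefill per request contributes little to the end-to-end time. A back-of-the-envelope check, assuming a batch size of 128 (the batch size is not shown in the logs, so this is a guess):

```python
# Assumed settings: batch_size=128 (hypothetical), output_len=128 (max_tokens from the logs).
batch_size, output_len = 128, 128
seconds_per_iter = 13.64  # from the profiling progress bars above

# If throughput counts generated tokens only, this reproduces ~1201 tokens/sec,
# matching both runs -- i.e. the benchmark is decode-dominated, so the prompt
# bucket shape would barely move E2E throughput.
print(batch_size * output_len / seconds_per_iter)  # ~1201.2
```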