Run continuous batch size LLM benchmark using the python script (#418)
* Support continuous batch size LLM benchmark

* Update doc to use python script

* Update command and output

* Fix minor bug

* Update doc

* Fix typo
nv-hwoo authored Oct 14, 2023
1 parent 05c4741 commit 3d90fcf
Showing 2 changed files with 49 additions and 49 deletions.
36 changes: 28 additions & 8 deletions src/c++/perf_analyzer/docs/examples/profile.py
@@ -76,15 +76,23 @@ def profile(args, input_data_file):
f"perf_analyzer -m {args.model} -i grpc --async --streaming "
f"--input-data={input_data_file} "
"--profile-export-file=profile_export.json "
"--measurement-mode=count_windows "
"--measurement-request-count=10 "
"--stability-percentage=999"
)
ret = subprocess.run(args=[command], shell=True)
ret.check_returncode()
if args.periodic_concurrency_range:
start, end, step = args.periodic_concurrency_range
command += (
f"--periodic-concurrency-range={start}:{end}:{step} "
f"--request-period={args.request_period}"
)
else:
command += (
"--measurement-mode=count_windows "
"--measurement-request-count=10 "
"--stability-percentage=999"
)
subprocess.run(args=[command], shell=True)


def generate_input_data(args, filename):
def generate_input_data(args, prompt_size, filename):
request_parameters = f"""
{{
"max_tokens": {args.max_tokens},
@@ -118,6 +126,19 @@ def generate_input_data(args, filename):
default=[10, 10, 1],
help="The range of prompt sizes '<[START, END], STEP>' where END is inclusive.",
)
parser.add_argument(
"--periodic-concurrency-range",
type=int,
nargs=3,
metavar=("START", "END", "STEP"),
help="The range of concurrency level that periodically increases until it reaches END.",
)
parser.add_argument(
"--request-period",
type=int,
default=10,
help="The number of responses that each request must receive before launching new requests.",
)
parser.add_argument(
"--max-tokens",
type=int,
@@ -132,7 +153,6 @@ def generate_input_data(args, filename):
parser.add_argument(
"--input-data",
type=str,
default=None,
help="The input data file to be used for inference request.",
)
args = parser.parse_args()
@@ -149,7 +169,7 @@ def generate_input_data(args, filename):
start, end, step = args.prompt_size_range
for prompt_size in range(start, end + 1, step):
if not args.input_data:
generate_input_data(args, TEMP_INPUT_FILE)
generate_input_data(args, prompt_size, TEMP_INPUT_FILE)

profile(args, args.input_data if args.input_data else TEMP_INPUT_FILE)
avg_first_token_latency, avg_token_to_token_latency = calculate_avg_latencies()
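
With these changes, `profile()` builds one of two Perf Analyzer command lines depending on whether `--periodic-concurrency-range` was given. Below is a minimal standalone sketch of that selection logic; the model name and input file are placeholder values, not taken from the diff:

```python
# Sketch only: mirrors the branching added to profile() above.
# MODEL and INPUT_FILE are placeholders, not values from the diff.
MODEL = "vllm"
INPUT_FILE = "input_data.json"


def build_command(periodic_concurrency_range=None, request_period=10):
    command = (
        f"perf_analyzer -m {MODEL} -i grpc --async --streaming "
        f"--input-data={INPUT_FILE} "
        "--profile-export-file=profile_export.json "
    )
    if periodic_concurrency_range:
        # Periodic concurrency mode: ramp concurrency from START to END by STEP,
        # launching new requests every `request_period` responses.
        start, end, step = periodic_concurrency_range
        command += (
            f"--periodic-concurrency-range={start}:{end}:{step} "
            f"--request-period={request_period}"
        )
    else:
        # Previous behavior: count-windows measurement mode.
        command += (
            "--measurement-mode=count_windows "
            "--measurement-request-count=10 "
            "--stability-percentage=999"
        )
    return command


print(build_command(periodic_concurrency_range=(1, 30, 1), request_period=50))
```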
62 changes: 21 additions & 41 deletions src/c++/perf_analyzer/docs/llm.md
@@ -127,55 +127,35 @@ python profile.py -m vllm --prompt-size-range 100 500 200 --max-tokens 256 --ign
# Prompt size: 500, Average first-token latency: 0.0400 sec, Average token-token latency: 0.0070 sec
```
### Benchmark 3: Profiling Continuous Batch Size
## Benchmark 3: Profiling Continuous Batch Size

> **Note**
>
> This benchmark relies on a feature that will be available in the `23.10` release,
> which is on its way soon. You can either wait until the `23.10` container
> is ready or build Perf Analyzer from the latest `main` branch (see the [build from source instructions](install.md#build-from-source)).
In this benchmarking scenario, we want to measure the effect of continuous
batch size on token-to-token latency. We systematically issue requests of fixed
input sizes to the server and ask the model to generate a fixed number of
tokens in order to increase the continuous batching size over time.

#### 1. Generate prompts input data JSON

```bash
# open a new shell in the same directory you were in when running the above command
echo '
{
"data": [
{
"PROMPT": [
"Hello, my name is"
],
"STREAM": [
true
],
"SAMPLING_PARAMETERS": [
"{\"max_tokens\":16,\"ignore_eos\":true}"
]
}
]
}
' > prompts.json
```
#### Example

#### 2. Run Perf Analyzer
In this benchmark, we are interested in how continuous batch size affects token-to-token latency
as we increase the number of concurrent requests to the model.
Perf Analyzer will run in [periodic concurrency mode](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/docs/inference_load_modes.md#periodic-concurrency-mode),
which periodically launches new concurrent requests to the model using the `--periodic-concurrency-range START END STEP` option.
In this example, Perf Analyzer starts with a single request and keeps launching new ones until the total number reaches 30.
You can also control the timing of the new requests: for example, setting `--request-period` to 50 will make
Perf Analyzer wait for all requests to receive 50 responses before it launches new requests.
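
To make the ramp-up concrete, here is a small illustrative sketch (not Perf Analyzer code) of how the concurrency level grows under `--periodic-concurrency-range 1 30 1` with `--request-period 50`, as described above:

```python
# Illustration of the periodic concurrency schedule described above.
start, end, step = 1, 30, 1  # --periodic-concurrency-range START END STEP
request_period = 50          # --request-period

concurrency = start
while concurrency < end:
    print(
        f"concurrency={concurrency}: wait until every in-flight request "
        f"has received {request_period} responses, then launch {step} more"
    )
    concurrency += step
print(f"concurrency={concurrency}: final concurrency level reached")
```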

```bash
perf_analyzer \
-m vllm \
-i grpc \
--async \
--streaming \
--input-data=prompts.json \
--profile-export-file=profile_export.json \
--periodic-concurrency-range=1:20:1
--request-period=10
```

#### 3. Calculate average token-to-token latency
python profile.py -m vllm --prompt-size-range 100 500 200 --periodic-concurrency-range 1 30 1 --request-period 50 --max-tokens 256 --ignore-eos

```bash
python3 examples/calculate_avg_token_to_token_latency.py
# Average token-to-token latency: 0.003090155677419355 s
# Sample output
# [ Benchmark Summary ]
# Prompt size: 100, Average first-token latency: 0.0381 sec, Average token-token latency: 0.0106 sec
# Prompt size: 300, Average first-token latency: 0.0347 sec, Average token-token latency: 0.0109 sec
# Prompt size: 500, Average first-token latency: 0.0336 sec, Average token-token latency: 0.0101 sec
```

#### 4. Repeat steps 1-3 with a different periodic concurrency range start/end/step and a different request period to measure the effects of continuous batch size on token-to-token (generation) latency.
