Run continuous batch size LLM benchmark using the python script (#418)
* Support continuous batch size LLM benchmark

* Update doc to use python script

* Update command and output

* Fix minor bug

* Update doc

* Fix typo
nv-hwoo authored Oct 14, 2023
1 parent 05c4741 commit 3d90fcf
Showing 2 changed files with 49 additions and 49 deletions.
36 changes: 28 additions & 8 deletions src/c++/perf_analyzer/docs/examples/profile.py
@@ -76,15 +76,23 @@ def profile(args, input_data_file):
f"perf_analyzer -m {args.model} -i grpc --async --streaming "
f"--input-data={input_data_file} "
"--profile-export-file=profile_export.json "
"--measurement-mode=count_windows "
"--measurement-request-count=10 "
"--stability-percentage=999"
)
ret = subprocess.run(args=[command], shell=True)
ret.check_returncode()
if args.periodic_concurrency_range:
start, end, step = args.periodic_concurrency_range
command += (
f"--periodic-concurrency-range={start}:{end}:{step} "
f"--request-period={args.request_period}"
)
else:
command += (
"--measurement-mode=count_windows "
"--measurement-request-count=10 "
"--stability-percentage=999"
)
subprocess.run(args=[command], shell=True)


def generate_input_data(args, filename):
def generate_input_data(args, prompt_size, filename):
request_parameters = f"""
{{
"max_tokens": {args.max_tokens},
@@ -118,6 +126,19 @@ def generate_input_data(args, filename):
default=[10, 10, 1],
help="The range of prompt sizes '<[START, END], STEP>' where END is inclusive.",
)
parser.add_argument(
"--periodic-concurrency-range",
type=int,
nargs=3,
metavar=("START", "END", "STEP"),
help="The range of concurrency level that periodically increases until it reaches END.",
)
parser.add_argument(
"--request-period",
type=int,
default=10,
help="The number of responses that each request must receive before launching new requests.",
)
parser.add_argument(
"--max-tokens",
type=int,
@@ -132,7 +153,6 @@ def generate_input_data(args, filename):
parser.add_argument(
"--input-data",
type=str,
default=None,
help="The input data file to be used for inference request.",
)
args = parser.parse_args()
@@ -149,7 +169,7 @@ def generate_input_data(args, filename):
start, end, step = args.prompt_size_range
for prompt_size in range(start, end + 1, step):
if not args.input_data:
generate_input_data(args, TEMP_INPUT_FILE)
generate_input_data(args, prompt_size, TEMP_INPUT_FILE)

profile(args, args.input_data if args.input_data else TEMP_INPUT_FILE)
avg_first_token_latency, avg_token_to_token_latency = calculate_avg_latencies()
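
With these changes, `profile()` builds one of two Perf Analyzer command lines depending on whether `--periodic-concurrency-range` was given. Below is a minimal standalone sketch of that selection logic; the model name and input file are placeholder values, not taken from the diff:

```python
# Sketch only: mirrors the branching added to profile() above.
# MODEL and INPUT_FILE are placeholders, not values from the diff.
MODEL = "vllm"
INPUT_FILE = "input_data.json"


def build_command(periodic_concurrency_range=None, request_period=10):
    command = (
        f"perf_analyzer -m {MODEL} -i grpc --async --streaming "
        f"--input-data={INPUT_FILE} "
        "--profile-export-file=profile_export.json "
    )
    if periodic_concurrency_range:
        # Periodic concurrency mode: ramp concurrency from START to END by STEP,
        # launching new requests every `request_period` responses.
        start, end, step = periodic_concurrency_range
        command += (
            f"--periodic-concurrency-range={start}:{end}:{step} "
            f"--request-period={request_period}"
        )
    else:
        # Previous behavior: count-windows measurement mode.
        command += (
            "--measurement-mode=count_windows "
            "--measurement-request-count=10 "
            "--stability-percentage=999"
        )
    return command


print(build_command(periodic_concurrency_range=(1, 30, 1), request_period=50))
```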
62 changes: 21 additions & 41 deletions src/c++/perf_analyzer/docs/llm.md
@@ -127,55 +127,35 @@ python profile.py -m vllm --prompt-size-range 100 500 200 --max-tokens 256 --ign
# Prompt size: 500, Average first-token latency: 0.0400 sec, Average token-token latency: 0.0070 sec
```
### Benchmark 3: Profiling Continuous Batch Size
## Benchmark 3: Profiling Continuous Batch Size

> **Note**
>
> This benchmark relies on a feature that will be available in the `23.10` release,
> which is on its way soon. You can either wait until the `23.10` container
> is ready or build Perf Analyzer from the latest `main` branch (see the [build from source instructions](install.md#build-from-source)).
In this benchmarking scenario, we want to measure the effect of continuous
batch size on token-to-token latency. We systematically issue requests of fixed
input sizes to the server and ask the model to generate a fixed number of
tokens in order to increase the continuous batching size over time.

#### 1. Generate prompts input data JSON

```bash
# open a new shell in the same directory you were in when running the above command
echo '
{
"data": [
{
"PROMPT": [
"Hello, my name is"
],
"STREAM": [
true
],
"SAMPLING_PARAMETERS": [
"{\"max_tokens\":16,\"ignore_eos\":true}"
]
}
]
}
' > prompts.json
```
#### Example

#### 2. Run Perf Analyzer
In this benchmark, we are interested in how continuous batch size affects token-to-token latency
as we increase the number of concurrent requests to the model.
Perf Analyzer will run in [periodic concurrency mode](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/docs/inference_load_modes.md#periodic-concurrency-mode),
which periodically launches new concurrent requests to the model using the `--periodic-concurrency-range START END STEP` option.
In this example, Perf Analyzer starts with a single request and keeps launching new ones until the total number reaches 30.
You can also control the timing of the new requests: for example, setting `--request-period` to 50 will make
Perf Analyzer wait for all requests to receive 50 responses before it launches new requests.
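
To make the ramp-up concrete, here is a small illustrative sketch (not Perf Analyzer code) of how the concurrency level grows under `--periodic-concurrency-range 1 30 1` with `--request-period 50`, as described above:

```python
# Illustration of the periodic concurrency schedule described above.
start, end, step = 1, 30, 1  # --periodic-concurrency-range START END STEP
request_period = 50          # --request-period

concurrency = start
while concurrency < end:
    print(
        f"concurrency={concurrency}: wait until every in-flight request "
        f"has received {request_period} responses, then launch {step} more"
    )
    concurrency += step
print(f"concurrency={concurrency}: final concurrency level reached")
```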

```bash
perf_analyzer \
-m vllm \
-i grpc \
--async \
--streaming \
--input-data=prompts.json \
--profile-export-file=profile_export.json \
--periodic-concurrency-range=1:20:1
--request-period=10
```

#### 3. Calculate average token-to-token latency
python profile.py -m vllm --prompt-size-range 100 500 200 --periodic-concurrency-range 1 30 1 --request-period 50 --max-tokens 256 --ignore-eos

```bash
python3 examples/calculate_avg_token_to_token_latency.py
# Average token-to-token latency: 0.003090155677419355 s
# Sample output
# [ Benchmark Summary ]
# Prompt size: 100, Average first-token latency: 0.0381 sec, Average token-token latency: 0.0106 sec
# Prompt size: 300, Average first-token latency: 0.0347 sec, Average token-token latency: 0.0109 sec
# Prompt size: 500, Average first-token latency: 0.0336 sec, Average token-token latency: 0.0101 sec
```

#### 4. Repeat steps 1-3 with a different periodic concurrency range start/end/step and a different request period to measure the effects of continuous batch size on token-to-token (generation) latency.
