
Commit 815d2d1
Update llm.md
matthewkotila authored Oct 10, 2023
1 parent 781939d commit 815d2d1
Showing 1 changed file with 22 additions and 21 deletions.
src/c++/perf_analyzer/docs/llm.md (43 changes: 22 additions & 21 deletions)
@@ -58,38 +58,39 @@

## Benchmark 1: Profiling the Prefill Phase

In this benchmarking scenario, we want to measure the effect of text input
size on first-token latency. We issue single requests of fixed input sizes to
the server and request the model to compute at most one new token. This
essentially means a single pass through the model.

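To make the metric concrete: first-token latency is the time from sending a
request to receiving the first streamed response. The snippet below is a
minimal illustration of that definition only (it is not part of `profile.py`;
the timestamps are made-up values in nanoseconds):

```python
# Illustration only: first-token latency from request/response timestamps.
# The numbers below are made-up nanosecond timestamps, not real measurements.
request_send_ns = 1_000_000_000          # when the request was sent
response_arrival_ns = [1_045_900_000,    # first token arrives here
                       1_052_500_000,
                       1_059_100_000]

first_token_latency_sec = (response_arrival_ns[0] - request_send_ns) / 1e9
print(f"First-token latency: {first_token_latency_sec:.4f} sec")  # 0.0459
```
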
#### Example

Inside the client container, run the following command to generate dummy text
inputs of sizes 100, 300, and 500 and receive a single token from the model
for each text input.

```bash
python profile.py -m vllm --text-input-size-range 100 500 200 --max-tokens 1

# Sample output
# [ Benchmark Summary ]
# Text input size: 100, Average first-token latency: 0.0459 sec
# Text input size: 300, Average first-token latency: 0.0415 sec
# Text input size: 500, Average first-token latency: 0.0451 sec
```
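
The three values passed to `--text-input-size-range` appear to act as a
start/end/step triple, which is why the run above produces text input sizes
100, 300, and 500. A quick sketch of that expansion (an assumption about the
option's semantics, not a quote from `profile.py`):

```python
# Assumed expansion of "--text-input-size-range 100 500 200" into input sizes.
start, end, step = 100, 500, 200
text_input_sizes = list(range(start, end + 1, step))
print(text_input_sizes)  # [100, 300, 500]
```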

> **Note**
>
> In order to provide a specific text input (instead of the dummy text input
> generated by default), the user can provide an input data JSON file using
> the `--input-data` option. Note, however, that this will *ignore* any
> parameters specified through the command line.
> ```bash
> $ echo '
> {
> "data": [
> {
> "text_input": [
> "Hello, my name is" // user-provided prompt
> "Hello, my name is" // user-provided text input
> ],
> "stream": [
> true
@@ -108,23 +108,23 @@
## Benchmark 2: Profiling the Generation Phase

In this benchmarking scenario, we want to measure the effect of text input
size on token-to-token latency. We issue single requests of fixed input sizes
to the server and request the model to compute a fixed number of tokens.
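
Token-to-token latency is reported alongside first-token latency here: it is
the average gap between consecutive streamed responses after the first one. A
minimal, self-contained illustration of that calculation (made-up timestamps,
not `profile.py` internals):

```python
# Illustration only: average token-to-token latency from response timestamps.
response_arrival_ns = [1_038_800_000, 1_045_400_000, 1_052_000_000, 1_058_600_000]

gaps_ns = [b - a for a, b in zip(response_arrival_ns, response_arrival_ns[1:])]
token_to_token_latency_sec = sum(gaps_ns) / len(gaps_ns) / 1e9
print(f"Average token-token latency: {token_to_token_latency_sec:.4f} sec")  # 0.0066
```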

#### Example

Inside the client container, run the following command to generate dummy text
inputs of sizes 100, 300, and 500 and receive a total of 256 tokens from the
model for each text input.

```bash
python profile.py -m vllm --text-input-size-range 100 500 200 --max-tokens 256 --ignore-eos

# Sample output
# [ Benchmark Summary ]
# Text input size: 100, Average first-token latency: 0.0388 sec, Average token-token latency: 0.0066 sec
# Text input size: 300, Average first-token latency: 0.0431 sec, Average token-token latency: 0.0071 sec
# Text input size: 500, Average first-token latency: 0.0400 sec, Average token-token latency: 0.0070 sec
```

### Benchmark 3: Profiling Continuous Batch Size

In this benchmarking scenario, we want to measure the effect of continuous
batch size on token-to-token latency. We systematically issue requests of
fixed input sizes to the server and request the model to compute a fixed
number of tokens in order to increase the continuous batching size over time.

#### 1. Generate input data JSON

```bash
# open a new shell in the same directory you were in when running the above command
@@ -154,7 +155,7 @@
echo '
        }
    ]
}
' > text_inputs.json
```
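
The middle of the `echo` command above is elided in this hunk, but the overall
shape of the file matches the input data JSON shown in the note for
Benchmark 1. As an alternative to the shell heredoc, the same kind of file can
be written with a short Python script; the `text_input` and `stream` fields
below mirror that earlier example, and any additional fields your setup needs
(for example, sampling parameters) would have to be added:

```python
import json

# Write a minimal Perf Analyzer input-data file equivalent in shape to the
# note's example: one entry with a text input and streaming enabled.
input_data = {
    "data": [
        {
            "text_input": ["Hello, my name is"],
            "stream": [True],
        }
    ]
}

with open("text_inputs.json", "w") as f:
    json.dump(input_data, f, indent=2)
```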

#### 2. Run Perf Analyzer

@@ -165,7 +166,7 @@

```bash
perf_analyzer \
    -i grpc \
    --async \
    --streaming \
    --input-data=text_inputs.json \
    --profile-export-file=profile_export.json \
    --periodic-concurrency-range=1:20:1 \
    --request-period=10
```
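
The two options at the end drive the ramp-up described above: as we understand
them, `--periodic-concurrency-range=1:20:1` starts with one concurrent request
and periodically launches one more until 20 are in flight, and
`--request-period=10` is the number of responses to wait for before each new
launch. The sketch below only simulates that ramp to show why the continuous
batch size grows over time; it does not call Perf Analyzer:

```python
# Simulated ramp-up for --periodic-concurrency-range=START:END:STEP with a
# fixed --request-period. This mimics the intended behavior for illustration;
# it is not how Perf Analyzer is implemented.
start, end, step = 1, 20, 1
request_period = 10  # responses to receive before launching the next request

concurrency = start
period = 0
while concurrency < end:
    period += 1
    concurrency = min(concurrency + step, end)
    print(f"after request period {period}: {concurrency} concurrent requests")
```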
