
Commit 815d2d1
Update llm.md
matthewkotila authored Oct 10, 2023
1 parent 781939d commit 815d2d1
Showing 1 changed file with 22 additions and 21 deletions.
src/c++/perf_analyzer/docs/llm.md (43 changes: 22 additions & 21 deletions)
@@ -58,38 +58,39 @@

## Benchmark 1: Profiling the Prefill Phase

In this benchmarking scenario, we want to measure the effect of text input
size on first-token latency. We issue single requests of fixed input sizes to
the server and request the model to compute at most one new token. This
essentially means a single pass through the model.

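To make the metric concrete: first-token latency is the time from sending a
request to receiving the first streamed response. The snippet below is a
minimal illustration of that definition only (it is not part of `profile.py`;
the timestamps are made-up values in nanoseconds):

```python
# Illustration only: first-token latency from request/response timestamps.
# The numbers below are made-up nanosecond timestamps, not real measurements.
request_send_ns = 1_000_000_000          # when the request was sent
response_arrival_ns = [1_045_900_000,    # first token arrives here
                       1_052_500_000,
                       1_059_100_000]

first_token_latency_sec = (response_arrival_ns[0] - request_send_ns) / 1e9
print(f"First-token latency: {first_token_latency_sec:.4f} sec")  # 0.0459
```
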
#### Example

Inside the client container, run the following command to generate dummy text
inputs of sizes 100, 300, and 500 and receive a single token from the model
for each text input.

```bash
python profile.py -m vllm --text-input-size-range 100 500 200 --max-tokens 1

# Sample output
# [ Benchmark Summary ]
# Text input size: 100, Average first-token latency: 0.0459 sec
# Text input size: 300, Average first-token latency: 0.0415 sec
# Text input size: 500, Average first-token latency: 0.0451 sec
```
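
The three values passed to `--text-input-size-range` appear to act as a
start/end/step triple, which is why the run above produces text input sizes
100, 300, and 500. A quick sketch of that expansion (an assumption about the
option's semantics, not a quote from `profile.py`):

```python
# Assumed expansion of "--text-input-size-range 100 500 200" into input sizes.
start, end, step = 100, 500, 200
text_input_sizes = list(range(start, end + 1, step))
print(text_input_sizes)  # [100, 300, 500]
```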

> **Note**
>
> In order to provide a specific text input (instead of the dummy text input
> generated by default), the user can provide an input data JSON file using
> the `--input-data` option. Note, however, that this will *ignore* any
> parameters specified through the command line.
> ```bash
> $ echo '
> {
> "data": [
> {
> "text_input": [
> "Hello, my name is" // user-provided prompt
> "Hello, my name is" // user-provided text input
> ],
> "stream": [
> true
@@ -108,23 +108,23 @@
## Benchmark 2: Profiling the Generation Phase

In this benchmarking scenario, we want to measure the effect of text input
size on token-to-token latency. We issue single requests of fixed input sizes
to the server and request the model to compute a fixed number of tokens.
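
Token-to-token latency is reported alongside first-token latency here: it is
the average gap between consecutive streamed responses after the first one. A
minimal, self-contained illustration of that calculation (made-up timestamps,
not `profile.py` internals):

```python
# Illustration only: average token-to-token latency from response timestamps.
response_arrival_ns = [1_038_800_000, 1_045_400_000, 1_052_000_000, 1_058_600_000]

gaps_ns = [b - a for a, b in zip(response_arrival_ns, response_arrival_ns[1:])]
token_to_token_latency_sec = sum(gaps_ns) / len(gaps_ns) / 1e9
print(f"Average token-token latency: {token_to_token_latency_sec:.4f} sec")  # 0.0066
```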

#### Example

Inside the client container, run the following command to generate dummy text
inputs of sizes 100, 300, and 500 and receive a total of 256 tokens from the
model for each text input.

```bash
python profile.py -m vllm --text-input-size-range 100 500 200 --max-tokens 256 --ignore-eos

# Sample output
# [ Benchmark Summary ]
# Text input size: 100, Average first-token latency: 0.0388 sec, Average token-token latency: 0.0066 sec
# Text input size: 300, Average first-token latency: 0.0431 sec, Average token-token latency: 0.0071 sec
# Text input size: 500, Average first-token latency: 0.0400 sec, Average token-token latency: 0.0070 sec
```

### Benchmark 3: Profiling Continuous Batch Size

In this benchmarking scenario, we want to measure the effect of continuous
batch size on token-to-token latency. We systematically issue requests of
fixed input sizes to the server and request the model to compute a fixed
number of tokens in order to increase the continuous batching size over time.

#### 1. Generate input data JSON

```bash
# open a new shell in the same directory you were in when running the above command
@@ -154,7 +155,7 @@
echo '
        }
    ]
}
' > text_inputs.json
```
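
The middle of the `echo` command above is elided in this hunk, but the overall
shape of the file matches the input data JSON shown in the note for
Benchmark 1. As an alternative to the shell heredoc, the same kind of file can
be written with a short Python script; the `text_input` and `stream` fields
below mirror that earlier example, and any additional fields your setup needs
(for example, sampling parameters) would have to be added:

```python
import json

# Write a minimal Perf Analyzer input-data file equivalent in shape to the
# note's example: one entry with a text input and streaming enabled.
input_data = {
    "data": [
        {
            "text_input": ["Hello, my name is"],
            "stream": [True],
        }
    ]
}

with open("text_inputs.json", "w") as f:
    json.dump(input_data, f, indent=2)
```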

#### 2. Run Perf Analyzer

@@ -165,7 +166,7 @@

```bash
perf_analyzer \
    -i grpc \
    --async \
    --streaming \
    --input-data=text_inputs.json \
    --profile-export-file=profile_export.json \
    --periodic-concurrency-range=1:20:1 \
    --request-period=10
```
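
The two options at the end drive the ramp-up described above: as we understand
them, `--periodic-concurrency-range=1:20:1` starts with one concurrent request
and periodically launches one more until 20 are in flight, and
`--request-period=10` is the number of responses to wait for before each new
launch. The sketch below only simulates that ramp to show why the continuous
batch size grows over time; it does not call Perf Analyzer:

```python
# Simulated ramp-up for --periodic-concurrency-range=START:END:STEP with a
# fixed --request-period. This mimics the intended behavior for illustration;
# it is not how Perf Analyzer is implemented.
start, end, step = 1, 20, 1
request_period = 10  # responses to receive before launching the next request

concurrency = start
period = 0
while concurrency < end:
    period += 1
    concurrency = min(concurrency + step, end)
    print(f"after request period {period}: {concurrency} concurrent requests")
```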
