This code provides a benchmarking runner, AIBench-LLM, for evaluating the performance of large language model (LLM) inference endpoints. The benchmark measures metrics such as Time to First Token (TTFT), End-to-End Latency, Inter-Token Latency (ITL), Output Tokens per Second, and more.
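As a rough illustration of how these timings can be taken, the sketch below streams a single request from an OpenAI-compatible endpoint and derives TTFT, inter-token latency, and end-to-end latency from chunk arrival times. The endpoint URL, model name, and function are placeholders for illustration only and are not the runner's actual code.

```python
"""Minimal sketch of timing a single streamed request.
Assumes the openai>=1.0 Python client and an OpenAI-compatible endpoint;
all names below are illustrative, not part of the AIBench runner."""
import time
from openai import OpenAI

client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")


def time_single_request(prompt: str, model: str = "example-model") -> dict:
    start = time.perf_counter()
    first_token_at = None
    chunk_times = []  # arrival time of every non-empty content chunk

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if not delta:
            continue
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # first generated content -> TTFT
        chunk_times.append(now)

    end = time.perf_counter()
    ttft = (first_token_at if first_token_at is not None else end) - start
    e2e_latency = end - start
    # Approximate inter-token latency as the mean gap between streamed
    # chunks (chunks are used here as a proxy for tokens).
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft": ttft,
        "e2e_latency": e2e_latency,
        "itl": itl,
        "output_tks_per_sec": len(chunk_times) / e2e_latency if e2e_latency else 0.0,
    }
```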
The AIBench Runner is responsible for collecting these metrics from LLM inference endpoints for the Unify Hub. More information about the full methodology is available here 📑
Contributions and discussions around the methodology and the runner are very welcome; if this sounds interesting, join the Unify Discord!
The benchmark runner collects the following metrics (an illustrative record layout follows the list):

- `load`: Number of concurrent requests.
- `input_policy`: Input policy used (short or long).
- `ttft`: Time-to-first-token for each request.
- `e2e_latency`: End-to-end latency for each request.
- `itl`: Inter-token latency.
- `cold_start`: Cold start time (if applicable).
- `prompt_tokens`: Number of tokens in the input prompt.
- `output_tokens`: Number of tokens in the LLM output.
- `total_tokens`: Total number of tokens (input + output).
- `output_tks_per_sec`: Output tokens per second.
- `failed_queries`: Number of failed queries.
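For reference, a per-request record covering these fields could look roughly like the sketch below. The class and its comments are illustrative assumptions that mirror the metric names above, not the runner's actual schema.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRecord:
    """Illustrative container for one benchmarked request."""
    load: int                  # number of concurrent requests
    input_policy: str          # "short" or "long"
    ttft: float                # seconds until the first token arrives
    e2e_latency: float         # seconds for the full response
    itl: float                 # mean inter-token latency, seconds
    cold_start: float | None   # cold start time, if applicable
    prompt_tokens: int         # tokens in the input prompt
    output_tokens: int         # tokens in the LLM output
    total_tokens: int          # prompt_tokens + output_tokens
    output_tks_per_sec: float  # e.g. output_tokens / e2e_latency
    failed_queries: int        # number of failed queries in the run
```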
To be added this week!