# 3. Engine Options
Below is the full list of available options for the Aphrodite Engine API server.

Flag | Description |
---|---|
--model MODEL | Name or path of the Hugging Face model to use. |
--tokenizer TOKENIZER | Name or path of the Hugging Face tokenizer to use. Defaults to the model. |
--revision REVISION | The specific model version to use. Can be a branch, tag, or commit ID. Defaults to main. |
--code-revision REVISION | The code revision to use for models with remote code. |
--tokenizer-revision REVISION | The tokenizer revision to use. |
--tokenizer-mode {auto,slow} | Tokenizer mode. "auto" will use the fast tokenizer if available. |
--trust-remote-code | Trust the model's remote code from Hugging Face. |
--download-dir DIRECTORY | The directory to download the model to. |
--load-format {auto,pt,safetensors,npcache,dummy} | The format of the model weights. Defaults to auto. |
--dtype {auto,float16,bfloat16,float32} | The data type to use. "auto" will use FP16 for FP32/FP16 models. |
--max-model-len LENGTH | The model context size. Defaults to the model's original context length. If set to a higher value, automatic RoPE scaling is used. |
--guided-decoding-backend {outlines,lm-format-enforcer} | The engine to use for guided decoding (JSON Schema, regex, etc.). |
--enforce-eager {true,false} | If True, disable CUDA graphs. Defaults to True. |
--max-context-len-to-capture LENGTH | Maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, we fall back to eager mode. |
--max-log-probs NUM | The maximum number of logprobs to output for requests. Defaults to 10. |

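For reference, a typical launch command combines several of these flags. The sketch below assumes the usual `python -m aphrodite.endpoints.openai.api_server` entry point and uses a placeholder model name; adjust both for your setup.

```sh
# Minimal sketch: serve a model with FP16 weights, a capped context length,
# and CUDA graphs enabled (enforce-eager defaults to true, i.e. graphs disabled).
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --dtype float16 \
  --max-model-len 8192 \
  --enforce-eager false
```
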
Flag | Description |
---|---|
--worker-use-ray | Use Ray for distributed serving. Will be automatically set if using > 1 GPU. |
--tensor-parallel-size (-tp) SIZE | The number of GPUs to use for loading the model. |
--pipeline-parallel-size (-pp) SIZE | The number of pipeline stages to use for loading the model. Currently unsupported. |
--ray-workers-use-nsight | Use nsight for profiling Ray workers. |
--disable-custom-all-reduce | If set, disables the custom all-reduce kernels. Set this if you experience instabilities with multi-GPU setups. |

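For multi-GPU serving, only the tensor parallel size usually needs to be set, since Ray is enabled automatically. A hedged sketch (entry point and model name are placeholders):

```sh
# Sketch: shard a model across 2 GPUs with tensor parallelism.
# Add --disable-custom-all-reduce only if you hit multi-GPU instabilities.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2
```
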
Flag | Description |
---|---|
--kv-cache-dtype {auto,fp8} | The data type for the KV cache. If "auto", will use the model's data type. "fp8" will quantize it to 8 bits, offering lower memory usage and improved throughput. |
--quantization (-q) {aqlm,awq,bnb,eetq,exl2,gguf,gptq,quip,squeezellm,marlin} | The method used to quantize the weights you're loading. Will try to automatically infer it from the model. If unsuccessful, pass this flag manually. |
--load-in-4bit | Load the 32/16-bit or AWQ model in 4-bit, using SmoothQuant+. |
--load-in-smooth | Load the 32/16-bit model in 8-bit, using SmoothQuant+. |
--load-in-8bit | Load the 32/16-bit model in 8-bit, using BitsAndBytes. |
--quantization-param-path PATH | Path to the JSON file containing the KV cache scaling factors. Applicable for FP8 quantization on AMD GPUs only. |

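As an illustration, loading a pre-quantized checkpoint together with an FP8 KV cache might look like the following. The model repository is a placeholder, and the quantization method is normally inferred automatically, so the explicit flag is only a fallback.

```sh
# Sketch: GPTQ weights plus an 8-bit KV cache to reduce VRAM usage.
python -m aphrodite.endpoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-GPTQ \
  --quantization gptq \
  --kv-cache-dtype fp8
```
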
Flag | Description |
---|---|
--block-size {8,16,32} | The token block size. Defaults to 16. |
--context-shift | Enable context shifting. Caches previously processed prompts to be reused later. |
--num-gpu-blocks-override NUM | If specified, ignore the profiling result and use this number of GPU blocks. |
--swap-space SIZE | The amount of CPU swap space to use, in GiB. |
--gpu-memory-utilization (-gmu) FRACTION | The fraction of VRAM to use per GPU. Defaults to 0.9 (90%). |

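A hedged example of tuning the memory-related flags (the values are illustrative, not recommendations):

```sh
# Sketch: leave 20% of VRAM free for other processes and allow 8 GiB of CPU swap.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --gpu-memory-utilization 0.8 \
  --swap-space 8
```
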
Flag | Description |
---|---|
--tokenizer-pool-size SIZE | Size of the tokenizer pool to use for async tokenization. If 0, will use synchronous tokenization. |
--tokenizer-pool-type {ray} | The type of tokenizer pool to use for async tokenization. Only "ray" is supported for now. |
--tokenizer-pool-extra-config CONFIG | Extra config for the tokenizer pool. Should be a JSON string that will be parsed into a dictionary. |

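If tokenization becomes a bottleneck, the pool can be enabled as sketched below (the pool size is an arbitrary example):

```sh
# Sketch: offload tokenization to a pool of 4 Ray tokenizer workers.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tokenizer-pool-size 4 \
  --tokenizer-pool-type ray
```
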
Flag | Description |
---|---|
--max-num-batched-tokens NUM | The maximum number of tokens to be processed in a single iteration. |
--max-num-seqs NUM | The maximum number of sequences to be processed in a single iteration. Defaults to 256. |
--use-v2-block-manager | Whether to use the BlockSpaceManagerV2 or not. |
--delay-factor FACTOR | Apply a delay (of the delay factor multiplied by the previous prompt latency) before scheduling the next prompt. |
--policy {fcfs} | The scheduling policy to use. |
--enable-chunked-prefill | If True, prefill requests can be chunked based on the remaining max_num_batched_tokens. Greatly reduces memory usage for GQA models. |

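A sketch of adjusting the scheduler for many concurrent requests (the numbers are illustrative):

```sh
# Sketch: allow more sequences per iteration and chunk long prefills.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill
```
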
Flag | Description |
---|---|
--device {auto,cuda,neuron,cpu} | The device to use for the engine. |

Flag | Description |
---|---|
--num-lookahead-slots NUM | The number of slots to allocate per sequence per step, beyond the known token IDs. Used to store KV activations of tokens which may or may not be accepted. |
--speculative-model {MODEL,"[ngram]"} | The name or path of the draft model to be used. This can either be a Hugging Face model, or just "[ngram]" to use ngram prompt lookup decoding. |
--num-speculative-tokens NUM | The number of speculative tokens to sample from the draft model. |
--speculative-max-model-len LEN | The maximum sequence length supported by the draft model. Sequences over this length will skip speculation. |
--ngram-prompt-lookup-max NUM | Max size of the window for ngram prompt lookup. |
--ngram-prompt-lookup-min NUM | Min size of the window for ngram prompt lookup. |

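For example, ngram prompt-lookup speculation needs no separate draft model; a hedged sketch:

```sh
# Sketch: ngram prompt-lookup speculative decoding, proposing up to 5 tokens per step.
# Depending on the build, --use-v2-block-manager may also be required.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4
```
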
Flag | Description |
---|---|
--enable-lora | Enable loading LoRA adapter weights. |
--max-loras NUM | The maximum number of LoRAs in a single batch. |
--max-lora-rank {8,16,32,64} | The maximum LoRA rank. Defaults to 16. |
--lora-extra-vocab-size NUM | Maximum size of extra vocabulary that can be present in a LoRA adapter (added to the base model vocab). |
--lora-dtype {auto,float16,bfloat16,float32} | Data type for LoRA. If auto, defaults to the base model dtype. |
--max-cpu-loras NUM | Maximum number of LoRAs to store in CPU memory. |

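A sketch of enabling LoRA support (the limits are illustrative):

```sh
# Sketch: allow up to 4 LoRA adapters per batch, with ranks up to 32.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 32
```
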
Flag | Description |
---|---|
--host HOST | The host name. Defaults to localhost. |
--port PORT | The port number. Defaults to 2242. |
--allowed-credentials STR | The allowed credentials. |
--allowed-origins ALLOWED_ORIGINS | The allowed origins. |
--allowed-methods ALLOWED_METHODS | The allowed methods. |
--allowed-headers ALLOWED_HEADERS | The allowed headers. |
--api-keys API_KEY | The API key to use, for securing the endpoint. |
--admin-key ADMIN_KEY | The admin API key to use, for admin operations. |
--launch-kobold-api | Launch the Kobold API server in addition to the OpenAI one. |
--max-length LENGTH | The maximum output length. Used for Horde; has no effect otherwise. |
--served-model-name NAME | The model name to use in the API. If unspecified, uses --model. |
--lora-modules LORA_MODULES [LORA_MODULES ...] | The individual LoRA modules to load for the API server. Can also be handled on-the-fly using the /v1/lora endpoint. |
--chat-template PATH | The file path to the chat template. By default, attempts to extract it from the model if available. |
--response-role | The role name to return if add_generation_prompt=True. |
--ssl-keyfile PATH | The file path to the SSL key file. |
--ssl-certfile PATH | The file path to the SSL cert file. |
--root-path PATH | FastAPI root_path when the app is behind a path-based routing proxy. |
--middleware MIDDLEWARE | Additional ASGI middleware to apply to the app. Multiple --middleware arguments are accepted. The value should be an import path. If a function is provided, Aphrodite will add it to the server using @app.middleware('http'). If a class is provided, it will be added with app.add_middleware(). |

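Putting a few of the server flags together, a hedged sketch of a network-exposed deployment (host, port, key, and served name are placeholders):

```sh
# Sketch: listen on all interfaces, use a non-default port, require an API key,
# and advertise the model under a friendly name.
python -m aphrodite.endpoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --host 0.0.0.0 \
  --port 5000 \
  --api-keys sk-example-key \
  --served-model-name my-model
```
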
Flag | Description |
---|---|
--image-input-type {pixel_values,image_features} | The image input type passed to Aphrodite. |
--image-token-id ID | Input ID for the image token. |
--image-input-shape SHAPE | The biggest image input shape (worst case for memory footprint) given an input type. Only used for Aphrodite's profile_run. |
--image-feature-size SIZE | The image feature size along the context dimension. |

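These flags are model-specific. The values below follow the LLaVA-1.5 example commonly used upstream and are illustrative assumptions; verify the token ID, input shape, and feature size against your model's configuration.

```sh
# Sketch: serving a LLaVA-1.5-style vision-language model (values are assumed).
python -m aphrodite.endpoints.openai.api_server \
  --model llava-hf/llava-1.5-7b-hf \
  --image-input-type pixel_values \
  --image-token-id 32000 \
  --image-input-shape 1,3,336,336 \
  --image-feature-size 576
```
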
Below are the sampling parameters available in API requests.

Parameter | Description |
---|---|
n | The number of output sequences to return for a prompt. |
best_of | Number of output sequences to generate from the prompt. From these best_of sequences, the top n sequences are returned. By default, it's set equal to n. If use_beam_search is True, it's treated as the beam width. |
seed | The random seed to use for generation. |
presence_penalty | Penalize new tokens based on whether they appear in the generated text so far. Values higher than 0 encourage the model to use new tokens, while values lower than 0 encourage the model to repeat tokens. Disabled: 0. |
frequency_penalty | Penalize new tokens based on their frequency in the generated text so far. Values higher than 0 encourage the model to use new tokens, while values lower than 0 encourage the model to repeat tokens. Disabled: 0. |
repetition_penalty | Penalize new tokens based on whether they appear in the prompt and the generated text so far. Values higher than 1 encourage the model to use new tokens, while values lower than 1 encourage the model to repeat tokens. Disabled: 1. |
temperature | Control the randomness of the output. Lower values make the model more deterministic, while higher values make it more random. Disabled: 1. |
dynatemp_range | Enables Dynamic Temperature, which scales the temperature based on the entropy of token probabilities (normalized by the maximum possible entropy for a distribution so it scales well across different K values). Controls the variability of token probabilities. Dynamic Temperature takes minimum and maximum temperature values; the minimum temperature is calculated as temperature - dynatemp_range, and the maximum temperature as temperature + dynatemp_range. Disabled: 0. |
dynatemp_exponent | The exponent value for dynamic temperature. Defaults to 1. Higher values trend towards lower temperatures, lower values trend towards higher temperatures. |
smoothing_factor | The smoothing factor to use for Quadratic Sampling. Disabled: 0.0. |
smoothing_curve | The smoothing curve to use for Cubic Sampling. Disabled: 1.0. |
top_p | Control the cumulative probability of the top tokens to consider. Disabled: 1. |
top_k | Control the number of top tokens to consider. Disabled: -1. |
top_a | Controls the threshold probability for tokens, reducing randomness when the model's certainty is high. Does not significantly affect output creativity. Disabled: 0. |
min_p | Controls the minimum probability for a token to be considered, relative to the probability of the most likely token. Disabled: 0. |
tfs | Tail-Free Sampling. Eliminates low-probability tokens after identifying a plateau in sorted token probabilities. It minimally affects the creativity of the output and is best used for longer texts. Disabled: 1. |
eta_cutoff | Used in Eta sampling, it adapts the cutoff threshold based on the entropy of the token probabilities, optimizing token selection. Value is in units of 1e-4. Disabled: 0. |
epsilon_cutoff | Used in Epsilon sampling, it sets a simple probability threshold for token selection. Value is in units of 1e-4. Disabled: 0. |
typical_p | Regulates the information content in the generated text by sorting tokens based on the sum of entropy and the natural logarithm of token probability. It has a strong effect on output content but still maintains creativity even at low settings. Disabled: 1. |
mirostat_mode | The Mirostat mode to use. Only 2 is currently supported. Mirostat is an adaptive decoding algorithm that generates text with a predetermined perplexity value, providing control over repetitions and thus ensuring high-quality, coherent, and fluent text. Disabled: 0. |
mirostat_tau | The target "surprise" value that Mirostat works towards. Range is from 0 to infinity. |
mirostat_eta | The learning rate at which Mirostat updates its internal surprise value. Range is from 0 to infinity. |
use_beam_search | Whether to use beam search instead of normal sampling. |
length_penalty | Penalize sequences based on their length. Used in beam search. |
early_stopping | Controls the stopping condition for beam search. It accepts the following values: True, where generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm). |
stop | List of strings (words) that stop the generation when they are generated. The returned output will not contain the stop strings. |
stop_token_ids | List of token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens (e.g. EOS). |
include_stop_str_in_output | Whether to include the stop strings in the output text. Default: False. |
ignore_eos | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
max_tokens | The maximum number of tokens to generate per output sequence. |
logprobs | Number of log probabilities to return per output token. The implementation follows the OpenAI API: the result includes the log probabilities of the logprobs most likely tokens, as well as the chosen tokens. The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response. |
prompt_logprobs | Number of log probabilities to return per prompt token. |
custom_token_bans | List of token IDs to ban from being generated. |
skip_special_tokens | Whether to skip special tokens in the output. Default: True. |
spaces_between_special_tokens | Whether to add spaces between special tokens in the output. Default: True. |
logits_processors | List of LogitsProcessors to change the probability of token prediction at runtime. Aliased to logit_bias in the API request body. |

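These parameters are sent in the JSON body of a request to the OpenAI-compatible endpoints. A minimal sketch against the default port (the model name and values are placeholders):

```sh
# Sketch: a completion request exercising a few of the sampling parameters above.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Once upon a time",
    "max_tokens": 128,
    "temperature": 0.8,
    "min_p": 0.05,
    "repetition_penalty": 1.1
  }'
```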