The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of tensors and a `uint64_t` `requestId`.
The mandatory tensors required to create a valid `InferenceRequest` object are described below. Sampling Config parameters are documented in more detail here, so their descriptions are omitted from the tables:
Name | Shape | Type | Description |
---|---|---|---|
`request_output_len` | [1,1] | `int32_t` | Max number of output tokens |
`input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
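To make the layout concrete, the following is a minimal sketch of how such a request could be assembled as a map of named tensors plus a `requestId`. The `Int32Tensor` struct and the token values are illustrative stand-ins, not the library's actual tensor class or API:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-in for a named tensor: row-major data plus a shape.
// The real InferenceRequest uses the library's own tensor type; this struct
// only mirrors the shapes and dtypes listed in the table above.
struct Int32Tensor
{
    std::vector<int64_t> shape;
    std::vector<int32_t> data;
};

int main()
{
    uint64_t const requestId = 42;  // unique identifier for this request

    std::map<std::string, Int32Tensor> request;

    // input_ids: shape [1, num_input_tokens], the tokenized prompt (placeholder ids).
    std::vector<int32_t> promptTokens{1, 8774, 48, 658};
    request["input_ids"] = {{1, static_cast<int64_t>(promptTokens.size())}, promptTokens};

    // request_output_len: shape [1, 1], maximum number of output tokens.
    request["request_output_len"] = {{1, 1}, {128}};

    // Together, requestId and these two tensors form the mandatory part
    // of a valid request.
    (void)requestId;
    return 0;
}
```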
Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values are specified where applicable. A sketch that extends the minimal request above with some of these optional tensors follows the table:
Name | Shape | Type | Description |
---|---|---|---|
`streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated. When `false`, return only when the full generation has completed. |
`beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
`temperature` | [1] | `float` | Sampling Config param: `temperature` |
`runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
`runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
`len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
`early_stopping` | [1] | `int` | Sampling Config param: `earlyStopping` |
`repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
`min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
`presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
`frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
`random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
`end_id` | [1] | `int32_t` | End token ID. If not specified, defaults to -1 |
`pad_id` | [1] | `int32_t` | Pad token ID |
`embedding_bias` | [1] | `float` | Embedding bias |
`bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
`stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
`prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
`prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
`lora_task_id` | [1] | `uint64_t` | Task ID for the given `lora_weights`. This ID is expected to be globally unique. To perform inference with a specific LoRA for the first time, `lora_task_id`, `lora_weights`, and `lora_config` must all be given. The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached. |
`lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter. See the LoRA docs for more details. |
`lora_config` | [num_lora_modules_layers, 3] | `int32_t` | LoRA configuration tensor: `[module_id, layer_idx, adapter_size (D aka R value)]`. See the LoRA docs for more details. |
`return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
`return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
`return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
`draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one inflight batching iteration |
`draft_logits` | [num_draft_tokens, vocab_size] | `float` | Draft logits associated with `draft_input_ids`, to be leveraged in the generation phase to potentially generate multiple output tokens in one inflight batching iteration |
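Continuing the sketch above, a few of the optional scalar tensors could be attached to the same request as shown below. The `NamedTensor` struct is again an illustrative stand-in (using `std::variant` so entries can hold the different element types listed in the table), and the concrete values are arbitrary examples, not defaults:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Illustrative stand-in for a named tensor whose element type varies per entry
// (bool, int32_t, float, uint64_t cover the optional scalars in the table above).
struct NamedTensor
{
    std::vector<int64_t> shape;
    std::variant<std::vector<bool>, std::vector<int32_t>,
                 std::vector<float>, std::vector<uint64_t>> data;
};

int main()
{
    std::map<std::string, NamedTensor> request;

    // Mandatory tensors (see the previous sketch); token ids are placeholders.
    request["input_ids"]          = {{1, 4}, std::vector<int32_t>{1, 8774, 48, 658}};
    request["request_output_len"] = {{1, 1}, std::vector<int32_t>{128}};

    // Optional per-request settings; each is a shape [1] scalar tensor.
    request["streaming"]     = {{1}, std::vector<bool>{true}};      // stream tokens as generated
    request["beam_width"]    = {{1}, std::vector<int32_t>{1}};      // 1 => greedy sampling
    request["temperature"]   = {{1}, std::vector<float>{0.7f}};     // Sampling Config: temperature
    request["runtime_top_p"] = {{1}, std::vector<float>{0.9f}};     // Sampling Config: topP
    request["random_seed"]   = {{1}, std::vector<uint64_t>{1234}};  // Sampling Config: randomSeed
    request["end_id"]        = {{1}, std::vector<int32_t>{2}};      // end token id (model-specific)

    return 0;
}
```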