The backend Qwen model does not enable decoupled mode (streaming), yet I found that the OpenAI-compatible API response does not report token usage. Below is an example:

ChatCompletion(id='cmpl-45f33530-2dcc-4352-8d97-1dd056efb2e0', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='System: 你是一个知识百科全书助手,可以回答各种问题。\nUser: 什么是牛顿第一定律?\nASSISTANT: 牛顿第一定律,也被称为惯性定律,认为如果一个物体', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1728957474, model='ensemble', object='text_completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=0, prompt_tokens=0, total_tokens=0, completion_tokens_details=None, prompt_tokens_details=None))

Reply: We simply cannot return token usage, since the TensorRT-LLM backend doesn't report it to us. To enable it, you may need to customize the Triton side along with the proxy here.
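To illustrate the suggested customization, here is a minimal, hypothetical sketch of what the proxy-side change could look like, assuming the Triton/TensorRT-LLM backend is first customized to emit per-request token counts (the field names `input_length` and `output_length` below are illustrative, not an actual backend API):

```python
# Hypothetical sketch: filling the OpenAI-style "usage" field in the proxy,
# assuming a customized backend reports token counts per request.

def build_usage(prompt_token_count: int, completion_token_count: int) -> dict:
    """Build an OpenAI-compatible usage object from backend-reported counts."""
    return {
        "prompt_tokens": prompt_token_count,
        "completion_tokens": completion_token_count,
        "total_tokens": prompt_token_count + completion_token_count,
    }

def attach_usage(response: dict, backend_stats: dict) -> dict:
    """Attach a usage object to the proxy response.

    backend_stats is assumed to carry the counts a customized backend emits;
    without that customization the counts stay 0, as in the issue's example.
    """
    response["usage"] = build_usage(
        backend_stats.get("input_length", 0),
        backend_stats.get("output_length", 0),
    )
    return response

resp = attach_usage({"object": "chat.completion"},
                    {"input_length": 42, "output_length": 17})
print(resp["usage"]["total_tokens"])  # 59
```

This only shows the proxy half; the harder part is making the Triton side report the counts in the first place.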