
Response parameter CompletionUsage returns 0 token consumption #56

Open
moyerlee opened this issue Oct 16, 2024 · 1 comment

@moyerlee

The backend Qwen model does not have decoupled mode (streaming) enabled, yet the OpenAI-compatible response does not report token usage. Below is an example:

ChatCompletion(id='cmpl-45f33530-2dcc-4352-8d97-1dd056efb2e0', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='System: 你是一个知识百科全书助手,可以回答各种问题。\nUser: 什么是牛顿第一定律?\nASSISTANT: 牛顿第一定律,也被称为惯性定律,认为如果一个物体', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1728957474, model='ensemble', object='text_completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=0, prompt_tokens=0, total_tokens=0, completion_tokens_details=None, prompt_tokens_details=None))

@npuichigo
Owner

We simply cannot return token usage, since the TRT-LLM backend doesn't report it to us. To enable that, you may need to customize the Triton side along with the proxy here.
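One possible workaround, sketched below, is to compute the usage on the proxy side itself. This is a minimal sketch, not part of this project: the `attach_usage` helper and its whitespace tokenizer are hypothetical stand-ins, and a real deployment would count tokens with the model's actual tokenizer (e.g. the Qwen tokenizer loaded via `transformers.AutoTokenizer`).

```python
# Hedged sketch: fill in the `usage` field of an OpenAI-style response
# when the backend does not report token counts. The default tokenizer
# here is a whitespace split, used only as a stand-in; swap in the
# model's real tokenizer for accurate counts.

def count_tokens(text, tokenize=lambda s: s.split()):
    """Count tokens in `text` using a pluggable tokenizer callable."""
    return len(tokenize(text))

def attach_usage(response, prompt_text, tokenize=lambda s: s.split()):
    """Populate `response["usage"]` from the prompt and completion text.

    `response` is an OpenAI-style chat completion dict with
    choices[0].message.content holding the generated text.
    """
    completion_text = response["choices"][0]["message"]["content"]
    prompt_tokens = count_tokens(prompt_text, tokenize)
    completion_tokens = count_tokens(completion_text, tokenize)
    response["usage"] = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    return response
```

The proxy would call `attach_usage` on each non-streaming response before returning it to the client; for streaming, the same counts could be emitted in the final chunk.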
