
Response parameter CompletionUsage returns 0 token consumption #56

Open
moyerlee opened this issue Oct 16, 2024 · 1 comment

@moyerlee

The backend Qwen model does not have decoupled mode (streaming) enabled, yet the OpenAI-compatible response does not report token usage. Below is an example:

ChatCompletion(id='cmpl-45f33530-2dcc-4352-8d97-1dd056efb2e0', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='System: 你是一个知识百科全书助手,可以回答各种问题。\nUser: 什么是牛顿第一定律?\nASSISTANT: 牛顿第一定律,也被称为惯性定律,认为如果一个物体', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1728957474, model='ensemble', object='text_completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=0, prompt_tokens=0, total_tokens=0, completion_tokens_details=None, prompt_tokens_details=None))

@npuichigo
Owner

We simply cannot return token usage, since the TRT-LLM backend doesn't report it to us. To enable that, you may need to customize the Triton side along with the proxy here.
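One possible workaround, sketched below, is to compute the usage on the proxy side itself. This is a minimal sketch, not part of this project: the `attach_usage` helper and its whitespace tokenizer are hypothetical stand-ins, and a real deployment would count tokens with the model's actual tokenizer (e.g. the Qwen tokenizer loaded via `transformers.AutoTokenizer`).

```python
# Hedged sketch: fill in the `usage` field of an OpenAI-style response
# when the backend does not report token counts. The default tokenizer
# here is a whitespace split, used only as a stand-in; swap in the
# model's real tokenizer for accurate counts.

def count_tokens(text, tokenize=lambda s: s.split()):
    """Count tokens in `text` using a pluggable tokenizer callable."""
    return len(tokenize(text))

def attach_usage(response, prompt_text, tokenize=lambda s: s.split()):
    """Populate `response["usage"]` from the prompt and completion text.

    `response` is an OpenAI-style chat completion dict with
    choices[0].message.content holding the generated text.
    """
    completion_text = response["choices"][0]["message"]["content"]
    prompt_tokens = count_tokens(prompt_text, tokenize)
    completion_tokens = count_tokens(completion_text, tokenize)
    response["usage"] = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    return response
```

The proxy would call `attach_usage` on each non-streaming response before returning it to the client; for streaming, the same counts could be emitted in the final chunk.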
