Why does streaming inference on an A800 take 2 seconds per output token? #88

Open
anliyuan opened this issue Nov 12, 2024 · 4 comments

Comments

@anliyuan

No description provided.

@sixsixcoder

It may be related to your hardware and software environment.

@anliyuan
Author

Under normal conditions, how long should each token take?
And roughly how long does it take for the first token to be returned?

@sixsixcoder

sixsixcoder commented Nov 13, 2024

My hardware and software environment:

GPU A800-SXM4-80GB
cuda 12.1
torch 2.4.0
torchaudio 2.4.0
transformers 4.45.2
python 3.10
GPU memory 80 GB
Precision BF16
Number of GPUs 1
top_p = 1.0
temperature = 1.0
max_new_tokens = 256
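
To compare two setups line by line, it can help to print the same fields from a script instead of typing them by hand. The following is a minimal sketch, not something from this thread: the function name `collect_environment` and the dictionary keys are hypothetical, and it only assumes that `torch` may or may not be installed.

```python
import platform
from importlib import metadata


def collect_environment(packages=("torch", "torchaudio", "transformers")) -> dict:
    """Gather version and GPU details similar to the list above."""
    info = {"python": platform.python_version()}

    # Read installed package versions without importing the packages.
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"

    # GPU details are only available when torch can be imported.
    try:
        import torch
    except ImportError:
        return info

    info["cuda_torch_build"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        info["gpu_count"] = torch.cuda.device_count()
        info["gpu_name"] = torch.cuda.get_device_name(0)
        info["gpu_memory_gb"] = round(props.total_memory / 1024**3, 1)
        info["bf16_supported"] = torch.cuda.is_bf16_supported()
    else:
        info["gpu_count"] = 0
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Posting this output from both machines makes differences in CUDA, torch, or transformers versions easy to spot.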

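My test results: I ran 3 iterations and computed the average first-token latency and the average decode latency (personal test only, not official benchmark data).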

Average First Token Time over 3 iterations: 0.0907 seconds
Average Decode Time per Token over 3 iterations: 22.7574 tokens/second
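
The script that produced these numbers is not shown in the thread. As a rough sketch of how first-token latency and decode throughput could be measured on any streaming output, assuming only an iterable that yields tokens as they are generated, something like the following may work. The names `measure_stream` and `fake_stream` are hypothetical, and `fake_stream` merely stands in for a real model stream.

```python
import time
from typing import Dict, Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> Dict[str, float]:
    """Return first-token latency and decode throughput for a token stream."""
    start = time.perf_counter()
    first_arrival = None
    last_arrival = start
    count = 0

    for _ in tokens:
        last_arrival = time.perf_counter()
        if first_arrival is None:
            first_arrival = last_arrival  # arrival time of the first token
        count += 1

    if first_arrival is None:
        raise ValueError("stream produced no tokens")

    # Decode phase: every token after the first one.
    decode_tokens = count - 1
    decode_time = last_arrival - first_arrival
    has_decode = decode_tokens > 0 and decode_time > 0
    return {
        "tokens": count,
        "first_token_s": first_arrival - start,
        "tokens_per_s": decode_tokens / decode_time if has_decode else float("nan"),
        "seconds_per_token": decode_time / decode_tokens if has_decode else float("nan"),
    }


def fake_stream(n: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields n tokens, one every delay_s seconds."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(20, 0.01)))
```

Note that the second line of the output above is labeled as a time per token but is reported in tokens/second. Read as a throughput, 22.7574 tokens/second is roughly 0.044 seconds per token, far below the 2 seconds per token in the issue title.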

@anliyuan
Author

Thanks. I'll try again.
