Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
==========
== CUDA ==
==========

CUDA Version 11.7.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00, 1.35s/it]
Running on local URL: http://0.0.0.0:80
To create a public link, set `share=True` in `launch()`.
User: 柔柔弱弱
Traceback (most recent call last):
File "", line 21, in rotary_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-14de7de5c4da5794c8ca14e7e41a122d-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float32, torch.float32, torch.float32, torch.float32, None, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, False, False, False, False, 4), (True, True, True, True, (False,), (True, False), (False, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), (True, False), (False, True)))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/gradio/queueing.py", line 536, in process_events
response = await route_utils.call_process_api(
File "/usr/local/lib/python3.8/dist-packages/gradio/route_utils.py", line 322, in call_process_api
output = await app.get_blocks().process_api(
File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1935, in process_api
result = await self.call_function(
File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1532, in call_function
prediction = await utils.async_iteration(iterator)
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 671, in async_iteration
return await iterator.__anext__()
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 664, in __anext__
return await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 647, in run_sync_iterator_async
return next(iterator)
File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 809, in gen_wrapper
response = next(iterator)
File "web_demo.py", line 126, in predict
for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1216, in stream_generator
for token in self.generate_stream(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/usr/local/lib/python3.8/dist-packages/transformers_stream_generator/main.py", line 931, in sample_stream
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1045, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 893, in forward
outputs = block(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 612, in forward
attn_outputs = self.attn(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 432, in forward
query = apply_rotary_pos_emb(query, q_pos_emb)
File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1344, in apply_rotary_pos_emb
return apply_rotary_emb_func(t_float, cos, sin).type_as(t)
File "/usr/local/lib/python3.8/dist-packages/flash_attn/layers/rotary.py", line 122, in apply_rotary_emb
return ApplyRotaryEmb.apply(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/flash_attn/layers/rotary.py", line 48, in forward
out = apply_rotary(
File "/usr/local/lib/python3.8/dist-packages/flash_attn/ops/triton/rotary.py", line 213, in apply_rotary
rotary_kernel[grid](
File "", line 41, in rotary_kernel
File "/usr/local/lib/python3.8/dist-packages/triton/compiler.py", line 1629, in compile
metadata["name"] = ptx_get_kernel_name(next_module)
File "/usr/local/lib/python3.8/dist-packages/triton/compiler.py", line 1040, in ptx_get_kernel_name
assert ptx
AssertionError
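Reading the chain above: the flash-attn rotary path launches a Triton-compiled `rotary_kernel`, the `KeyError` is a JIT cache miss that triggers compilation, and the compile then dies at `assert ptx`, i.e. Triton produced no PTX in this environment. A minimal workaround sketch, taking the precision hint from the startup log and assuming Qwen's remote code honors a `use_flash_attn` override (both kwargs are assumptions for this checkpoint, not a confirmed fix):

```python
# Hedged workaround sketch, not a confirmed fix. Two ideas from the log above:
#   1. pass an explicit precision flag, as the startup message suggests;
#   2. set use_flash_attn=False (assumption: Qwen's remote code reads this
#      config override) so rotary embeddings fall back to the pure-PyTorch
#      path instead of the Triton kernel that fails to compile here.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Qwen/Qwen-14B-Chat"  # assumption: whichever checkpoint web_demo.py loads

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
    bf16=True,             # explicit precision, per the startup message
    use_flash_attn=False,  # skip the flash-attn / Triton rotary path
).eval()
```

Disabling the flash-attn path trades some inference speed for skipping the Triton JIT entirely, which is where this traceback dies.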
Expected Behavior
No response
Steps To Reproduce
No response
Environment
Anything else?
No response