Issue with Flash Attention on V100 GPU for Llama-3-VILA1.5-8B Model #109
Comments
In lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py, swap the annotation at lines 608-609 (comment out the FlashAttention branch in favor of the standard attention path).
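Rather than editing the installed `transformers` sources, the usual workaround is to pass `attn_implementation="eager"` when loading the model, which skips the FlashAttention-2 kernels entirely. The sketch below is illustrative, not taken from the VILA codebase: the model path is hypothetical, and the helper simply mirrors the compute-capability check behind the `RuntimeError` (FlashAttention-2 needs SM 8.0 / Ampere or newer; the V100 is SM 7.0).

```python
# Sketch: choose a non-FlashAttention backend on pre-Ampere GPUs.
# FlashAttention-2 requires compute capability >= 8.0; V100 is SM 7.0.

def flash_attention_supported(major: int, minor: int) -> bool:
    """Mirror the capability check that raises the RuntimeError."""
    return (major, minor) >= (8, 0)

# On a V100 (SM 7.0) this evaluates to "eager":
impl = "flash_attention_2" if flash_attention_supported(7, 0) else "eager"

# Hypothetical usage (checkpoint path is illustrative):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "path/to/Llama-3-VILA1.5-8B",
#     attn_implementation=impl,  # "eager" avoids the FlashAttention kernels
# )
```

On a real system the `(major, minor)` pair would come from `torch.cuda.get_device_capability()`.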
Thanks, it works on my GPU now! However, the output is garbage: a meaningless string of empty spaces and commas. I hit the same issue with another vision-language model, while some others work fine, so I suspect it is tied to the transformers library version. I also tried running VILA on the CPU, and in that case it worked correctly.
Me too. I've had similar issues with redundant commas and spaces. Oddly, when I feed the VILA1.5-3B model a video along with some questions, it actually performs better than the 8B model: sometimes it generates coherent responses, but other times it only replies with one to three words.
Hi, I also ran into this problem and got weird empty outputs. Could you please share how to solve this problem if you find a way out? |
Same problem: redundant commas and spaces.
I gave up and switched to an AMD card. (Translated from Chinese.)
Hi,
I am encountering an issue when running inference on the Llama-3-VILA1.5-8B model. The error message I receive is:
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
I am using a V100 GPU, which is not an Ampere GPU. Could you please provide guidance on how to disable Flash Attention for this model?
Thank you!