
[Usage]: GGUFed models on AMD GPUs #632

Open
TuzelKO opened this issue Sep 5, 2024 · 1 comment

TuzelKO commented Sep 5, 2024

Hello! Having studied the documentation provided, I still could not tell whether GGUF-quantized models are supported on AMD GPUs. I would like to use a Q8 or even Q4 quantization of Mistral NeMo 12B in my project, trading a little quality for generation speed. We are planning to build a server with 4-6 Radeon 7900 XTX graphics cards.

AMD's solutions look more attractive than Nvidia's in terms of performance/cost and performance/power consumption, especially for small startups.

I would also like to know whether it is possible to run one small model (for example, Mistral NeMo 12B) in parallel on several graphics cards. I don't mean splitting the model across several cards, but running the same model, fully loaded into VRAM, on each card. Or will I need to run a separate container for each graphics card?

In our project we are considering the Magnum v2 12B model (https://huggingface.co/anthracite-org/magnum-v2-12b-gguf). We are currently running it through llama.cpp, but it does not seem to be well suited to handling parallel requests from multiple users.

@TuzelKO TuzelKO changed the title [New Model]: GGUFed models on AMD GPUs [Usage]: GGUFed models on AMD GPUs Sep 5, 2024
@AlpinDale
Member

Hi. GGUF kernels should theoretically work on AMD, but they're untested since I don't have regular access to AMD compute.

Multi-GPU should work fine on AMD. Tensor parallelism splits the model's tensors evenly across the GPUs; you simply need to launch the model with --tensor-parallel-size X, where X is the number of GPUs. I don't really recommend GGUF for this, though, because it doesn't seem to scale well at the moment. For AMD, you may want to use either GPTQ or FP8 W8A8 (through llm-compressor) instead.
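
For illustration, a minimal launch sketch on a 4-GPU box. Only the --tensor-parallel-size flag comes from the comment above; the OpenAI-compatible server entry point and the local model path are assumptions, so adjust them to your install and checkpoint:

```bash
# Hedged sketch: the entry point and model path below are assumptions, not confirmed
# by this thread. --tensor-parallel-size is the flag named above; set it to the
# number of GPUs the model should be sharded across.
python -m aphrodite.endpoints.openai.api_server \
    --model /models/magnum-v2-12b-Q8_0.gguf \
    --tensor-parallel-size 4
```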
