guidance load models as int8? #110
Replies: 4 comments 3 replies
-
I know this doesn't answer your question about int8, but it might still be helpful: I got guidance working with GPTQ, which lets you use 4-bit quantized models. It was pretty easy; I just took the model-loading code from GPTQ and wrote a small subclass to hook it into guidance. Here's the raw code. It's quite hacky and can certainly be improved, but for experimenting it does the job. I wrote more about it here.
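For context, a rough sketch of the idea (this is not the code linked above; `load_quant` stands in for GPTQ-for-LLaMa's loader, whose name, import path, and signature vary by fork, and the paths are placeholders):

```python
# Rough sketch only: load a 4-bit GPTQ checkpoint with GPTQ-for-LLaMa's own
# loading code, then hand the model/tokenizer to guidance's Transformers wrapper.
import guidance
from transformers import AutoTokenizer

# Hypothetical import: GPTQ-for-LLaMa is usually vendored/copied, not pip-installed.
from gptq_for_llama import load_quant

class QuantizedTransformers(guidance.llms.Transformers):
    def __init__(self, model_path, checkpoint_path, wbits=4, **kwargs):
        # Build the quantized model ourselves instead of letting guidance load it.
        model = load_quant(model_path, checkpoint_path, wbits)
        tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
        super().__init__(model=model, tokenizer=tokenizer, **kwargs)

guidance.llm = QuantizedTransformers("/path/to/model", "/path/to/4bit.safetensors")
```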
-
To reduce VRAM usage, you can use GPTQ-for-LLaMa.
You can check my code for loading wizard-mega-13B-GPTQ. Hope this helps!
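Not the code linked above, but roughly the shape such a loader can take, assuming the AutoGPTQ library and TheBloke's wizard-mega-13B-GPTQ checkpoint (exact arguments, such as the safetensors basename, may need adjusting):

```python
# Sketch: load a 4-bit GPTQ checkpoint with AutoGPTQ and plug it into guidance.
import guidance
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "TheBloke/wizard-mega-13B-GPTQ"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_safetensors=True,  # the quantized weights ship as .safetensors
)

# Hand the already-loaded model/tokenizer to guidance, as in the other replies.
guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)
```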
-
You can simply initialize the model and tokenizer yourself, even with peft:

```python
import torch
from peft import PeftModelForCausalLM as PeftCls
from transformers import AutoModelForCausalLM as ModelCls
from transformers import AutoTokenizer as TkCls
import guidance

model_path = "/path/to/model"
peft_path = "/path/to/peft"
use_peft = True

# Load the base model in int8 (requires bitsandbytes) and spread it across
# the available devices.
model: ModelCls = ModelCls.from_pretrained(
    model_path,
    device_map="auto",
    load_in_8bit=True,
    torch_dtype=torch.float16,
)

# Optionally wrap the base model with a PEFT adapter (e.g. a LoRA).
if use_peft:
    model: PeftCls = PeftCls.from_pretrained(model, peft_path)

tokenizer: TkCls = TkCls.from_pretrained(model_path)

# Hand the already-initialized model and tokenizer to guidance.
guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)
```
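For completeness, a minimal usage sketch once `guidance.llm` is set this way (the prompt is illustrative and assumes the `{{gen}}`-style template syntax of the guidance version used above):

```python
# Run a simple program against the int8 model configured above.
program = guidance("""Question: {{query}}
Answer: {{gen 'answer' max_tokens=64}}""")

result = program(query="What does load_in_8bit do?")
print(result["answer"])
```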
-
@andysalerno Hey, thanks for sharing your work on the GPTQ integration. How did things go with this? Are you still using Guidance, or have you moved on? It seems to be a dead project, but I'm struggling to see what people are using instead. LMQL might be a viable alternative, but its syntax seems overly complex compared to Guidance. It would be good to know what you thought of Guidance overall. (I still haven't gotten it working locally on a 4090, and I'm starting to wonder if I'm wasting my time.) Cheers :)
-
How do I call guidance but load the models as int8, so I can fit them even on an 80 GB GPU?