[IDEA] Support other quantizations #654
-
Never mind, for the first one I got it: I symlinked the file from my model folder and renamed mistral-7b-instruct-v0.2.Q5_K_M.gguf to mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf. Still curious about the second one though!
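For anyone else trying this, here's roughly what the symlink step looks like; the source and target paths below are placeholders for wherever your GGUF files and the app's model folder actually live:

```python
# Rough sketch of the symlink workaround; both paths are placeholders.
from pathlib import Path

source = Path.home() / "models" / "mistral-7b-instruct-v0.2.Q5_K_M.gguf"
# The model only loaded once the filename carried the extra ".gguf3.gguf" suffix.
target = Path.home() / "model-folder" / "mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf"

target.parent.mkdir(parents=True, exist_ok=True)
if not target.exists():
    # Symlink instead of copying the multi-gigabyte model file.
    target.symlink_to(source)
```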
-
Nice! The symlink was to allow mistral-7b-instruct-v0.2 with the Q5_K_M quantization to work? What's the response quality like? Maybe also try some of the other higher-quality Mistral fine-tunes like OpenChat-0106.
Try the docs on setting up an OpenAI-compatible proxy server to use whatever model you want. Let me know if that doesn't work? PS: Converting this issue into a GitHub discussion for now.
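In case it helps, here's a minimal sketch of pointing the standard openai Python client at a local OpenAI-compatible proxy; the base_url, api_key, and model name are placeholders for whatever your proxy actually exposes:

```python
# Minimal sketch: chat completion against a local OpenAI-compatible proxy.
# base_url, api_key, and model are placeholders; adjust to your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your proxy's OpenAI-compatible endpoint
    api_key="not-needed-locally",         # most local proxies ignore the key value
)

response = client.chat.completions.create(
    model="mistral-7b-instruct-v0.2.Q5_K_M",  # whatever model name the proxy serves
    messages=[{"role": "user", "content": "Summarize this article in three bullet points."}],
)
print(response.choices[0].message.content)
```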
-
Thank you! I generally use Q5_K_M or Q6 if available, and anecdotally they give more coherent/sensible responses than Q4. I'm using it to chat with a bunch of articles and raise questions to improve my skills. I also use 13B/34B models at times, so being able to switch models is a boon. The symlink was there just so I needn't make a copy of the file, but for some reason the name "mistral-7b-instruct-v0.2.Q5_K_M.gguf" gave me an internal server error, while "mistral-7b-instruct-v0.2.Q5_K_M.gguf3.gguf" worked. Thank you for the docs link! I'm not sure how I missed it!
-
Hi!
Maybe I overlooked the documentation, but is there a way to: