For about half a year I stuck with using 7B models and got a strong 4 bit quantisation on them, because I had very bad experiences with an old qwen 0.5B model.
But recently I tried running a smaller model like llama3.2 3B
with 8bit quant and qwen2.5-1.5B-coder
on full 16bit floating point quants, and those performed super good aswell on my 6GB VRAM gpu (gtx1060).
So now I am wondering: Should I pull strong quants of big models, or low quants/raw 16bit fp versions of smaller models?
What are your experiences with strong quants? I saw a video by that technovangelist guy on youtube and he said that sometimes even 2bit quants can be perfectly fine.
UPDATE: Woah I just tried llama3.1 8B Q4 on ollama again, and what a WORLD of difference to a llama3.2 3B 16fp!
The difference is super massive. The 3B and 1B llama3.2 models seem to be mostly good at summarizing text and maybe generating some JSON based on previous input. But the bigger 3.1 8B model can actually be used in a chat environment! It has a good response length (about 3 lines per message) and it doesn’t stretch out its answer. It seems like a really good model and I will now use it for more complex tasks.
I used to have a 6GB GPU, and around 7B is the sweetspot. This is still the case with newer models, you just have to pick the right model.
Try a IQ4 quantization of Qwen 2.5 7B coder.
Below 3bpw is where its starts to not be worth it, since we have so many open weights availible these days. A lot of people are really stubbern and run 2-3bpw 70B quants, but they are objectively worse than a similarly trained 32B model in the same space, even with exotic, expensive quantization like VPTQ or AQLM: https://huggingface.co/VPTQ-community
Is this VPTQ similar to that 1.58Q I’ve heard about? Where they quantized the Llama 8B down to just 1.5 Bits and it somehow still was rather comprehensive?
No, from what I’ve seen it falls off below 4bpw (just less slowly than other models) and makes ~2.25 bit quants somewhat usable instead of totally impractical, largely like AQLM.
You are thinking of bitnet, which (so far, though not after many tries) requires models to be trained from scratch that way to be effective.