
Yes! 3-bit, and maybe even 4-bit, can also fit! llama.cpp supports MoE offloading, so your GPU holds the active experts and the non-MoE layers — you only need 16 GB to 24 GB of VRAM. I wrote about how to do this in this section: https://docs.unsloth.ai/basics/qwen3-coder#improving-generat...
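A minimal sketch of the setup described above, using llama.cpp's `--override-tensor` flag to keep MoE expert weights in CPU RAM while everything else stays on the GPU (the model filename, context size, and exact tensor regex here are illustrative assumptions, not taken from the linked docs):

```shell
# Put all layers on the GPU (--n-gpu-layers 99), then override the MoE
# expert weight tensors (names matching ".ffn_.*_exps.") back to CPU RAM,
# so VRAM only needs to hold the attention and other non-MoE weights.
llama-server \
  --model Qwen3-Coder-Q3_K_M.gguf \
  --n-gpu-layers 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  --ctx-size 16384
```

Since only a few experts are active per token, the expert weights streamed from CPU RAM cost far less than keeping the whole model in VRAM.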


Awesome documentation, I'll try this. Thank you!



