Oh, you can run the Q8_0 / Q8_K_XL, which is nearly equivalent to FP8 (maybe off by 0.01% or less) -> you will need 500GB of VRAM + RAM + disk space combined. Via MoE layer offloading it should run OK.
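
For a rough sense of where the ~500GB figure could come from, here's a back-of-envelope sketch. The ~470B parameter count is a hypothetical placeholder, not from the comment; the 8.5 bits/weight comes from Q8_0's GGUF layout (blocks of 32 int8 weights plus one fp16 scale).

    # Back-of-envelope size estimate for a Q8_0 GGUF quant.
    # The parameter count below is a hypothetical placeholder.

    def q8_0_gigabytes(n_params: float, bits_per_weight: float = 8.5) -> float:
        """Approximate size of a Q8_0 quant in GB (8-bit weights + block scales)."""
        return n_params * bits_per_weight / 8 / 1e9

    params = 470e9  # assumed parameter count, not the actual model config
    print(f"~{q8_0_gigabytes(params):.0f} GB of weights")  # ~499 GB

    # With MoE layer offloading, only the attention/dense layers and the
    # currently active experts need to sit in VRAM; the rest can live in
    # system RAM or be memory-mapped from disk, which is why VRAM + RAM +
    # disk can be pooled to cover the full ~500 GB.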


This should work well with MLX Distributed. The low-activation MoE is great for multi-node inference; a quick sketch of why is below.
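
With expert parallelism, a token's hidden state only has to travel to the nodes hosting its top-k activated experts, so cross-node traffic scales with activated rather than total parameters. A minimal sketch of the arithmetic; every number here is a hypothetical placeholder, not the actual model config:

    # Why a low-activation MoE suits multi-node inference: per-token
    # cross-node traffic scales with the number of *activated* experts,
    # not the total. All numbers below are assumed placeholders.

    hidden_size   = 7168  # model dim (assumed)
    total_experts = 256   # experts per MoE layer (assumed)
    top_k         = 8     # experts activated per token (assumed)
    moe_layers    = 60    # number of MoE layers (assumed)
    bytes_per_val = 2     # fp16 activations

    # Worst case: every activated expert lives on a remote node, so each
    # token ships its hidden state out and back once per activated expert.
    per_token_bytes = moe_layers * top_k * 2 * hidden_size * bytes_per_val
    print(f"~{per_token_bytes / 1e6:.1f} MB of traffic per token")  # ~13.8 MB

    # Fraction of expert weights touched per token:
    print(f"active experts: {top_k}/{total_experts} = {top_k / total_experts:.1%}")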


1. What hardware would you need for that? 2. Can you run a benchmark?



