Seeing how impressive the 30b-a3b model already is on my aging desktop at home, a RAM upgrade might be worth it...
Is there a source with recommendations for what hardware to use for each model?
The sparse (mixture-of-experts) models only use a small number of active parameters for each generated token, while the total number of available parameters can be quite large.
This means that they also run well on systems with limited compute resources (but lots of RAM).
So you can basically run them entirely on the CPU of off-the-shelf PCs, without a high-end GPU, and still easily achieve double-digit tokens/s.
I'd add that memory bandwidth is still a relevant factor, so the faster the RAM, the faster the inference will be. I think this model would be a perfect fit for the Strix Halo or a >= 64GB Apple Silicon machine when aiming for CPU-only inference. But keep in mind that llama.cpp does not yet support the qwen3-next architecture.
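To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bound (every active weight is read from RAM once per token) and ignores KV cache reads, compute limits and overhead, so it gives an optimistic ceiling, not a benchmark. The function names and the example figures (4-bit quantization, about 50 GB/s for dual-channel DDR4-3200) are my own assumptions, not measurements from this thread.

```python
def model_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """RAM needed just to hold the weights, in GB (decimal)."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9


def estimate_tokens_per_second(
    active_params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on decode speed if memory bandwidth is the only limit.

    Each generated token requires reading all *active* weights once,
    so the ceiling is bandwidth divided by bytes read per token.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Assumed example: 30B total / 3B active, 4-bit weights, ~50 GB/s RAM.
    print(f"weights in RAM: {model_size_gb(30, 4):.1f} GB")
    print(f"ceiling: {estimate_tokens_per_second(3, 4, 50):.1f} tokens/s")
    # A dense 30B model on the same machine would read 10x more per token.
    print(f"dense 30B ceiling: {estimate_tokens_per_second(30, 4, 50):.1f} tokens/s")
```

Under these assumptions the total parameter count decides how much RAM you need, while only the active parameter count (together with bandwidth) decides the speed ceiling. That is why the sparse models are so forgiving on CPU.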
Can confirm that from my setup. Increasing the parallelization beyond 3-4 concurrent threads doesn't significantly increase the inference speed any more.
This is a telltale sign that some of the cores are starving because data doesn't arrive fast enough any more...
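A toy model of that saturation, as a sketch only: speed grows linearly with threads until it hits a ceiling set by memory bandwidth. The per-thread rate (5 tokens/s) and the bandwidth ceiling (18 tokens/s) are made-up illustrative values chosen so the curve flattens around 4 threads. They are not measurements from my machine.

```python
def modeled_tokens_per_second(
    threads: int,
    per_thread_tok_s: float,
    bandwidth_cap_tok_s: float,
) -> float:
    """Roofline-style toy model: compute scales with threads,
    but total speed can never exceed what RAM can feed."""
    return min(threads * per_thread_tok_s, bandwidth_cap_tok_s)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    for threads in range(1, 9):
        speed = modeled_tokens_per_second(threads, 5.0, 18.0)
        print(f"{threads} threads: {speed:.1f} tokens/s")
```

Past the point where the curve flattens, extra threads just wait on memory, which matches the behaviour described above.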
LocalLLaMA