I'm considering this rig. What kind of tok/s are you getting with Qwen 3.6?
On the coding front: Bijan Bowen likes to test various LLMs with coding challenges...Qwen 3.6 is the first small model I've seen actually one-shot a few of them (browser OS test, 3D flight sim).
Sadly, it borked the third (C++ skateboarding game), but it shows definite progress.
The way it failed is pathognomonic - CoT collapse / looping. They need to figure that shit out.
Keep going, Qwen.
https://bannedbyanthropic.com/
I believe the word is capricious. Everything cloud based is at the whim of someone else.
There are ways to mitigate that, but ultimately if it's not yours...it's not yours.
No.
Clankers were trained on the writing style of individuals such as myself (ASD). While I do consciously (and subconsciously) code switch, I'm aware I have a default "sounds like ChatGPT" voice when dealing with technical discussions, especially if I'm trying to be precise or guard myself from accusation or attack.
I'm ok with it, but I'm now going to autism at you / over explain it, partially because I think it might help you parse the difference between human and machine when reading these things.
You asked in apparently good faith and you deserve a full explanation.
The underlying pattern comes from a perversion of the "measure twice, cut once" mentality; I create the crux of the argument, forecast likely objections, rewrite to close off said objections, sand the edges off, check if I was unintentionally offensive, check if I presented the facts to the best of my ability, check for logical fallacies, then finally sweep to see if there is any ambiguity or obvious attack surfaces left. Then I read it out loud to myself.
That mode flattens everything into a "safe, palatable, high signal to noise ratio, use dot points so people don't lose you, don't write like yourself" style.
(And I still miss typos sometimes. That actually really, really irks me).
Anyhow, when I said the clankers copy us, I didn't mean just vocabulary. Expand the CoT (chain of thought) the next time you use ChatGPT; you'll see they made it do this exact same process.
PS: You're not the first to point it out either. It's one of the reasons I dubbed my blog Clanker Adjacent.
Oh, that was the other tedious part.
I iterated the questions with Claude, ChatGPT, GLM and Mimo (use Open Router with $10 of credit; more than enough).
Slow and tedious...but I knew what I wanted to ask and knew the answer broadly. I formed the question, got Claude to respond and tighten the question, then passed it on to GPT to do the same. Then GLM. Then Mimo. Each round, I would note the similar and different points and extract them as part of the model answer. Then, feed the iterated question into a fresh Claude and say "here is the question, here is what I think the answer should contain; push back?". Seeing as I was trying to measure against Claude-like, that felt OK to me.
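The loop above is simple enough to sketch. This is a toy outline of the control flow only, assuming a stand-in `ask_model` function for whatever API you'd actually call (e.g. OpenRouter) - the function name and return shape are illustrative, not a real client:

```python
# Sketch of the Delphi-style question-refinement loop described above.
# ask_model(model, question) is a hypothetical stand-in that returns a
# tightened question plus the key points that model thinks belong in
# the answer. The point here is the round-robin flow, not the plumbing.

def delphi_refine(question, models, ask_model, rounds=1):
    """Pass a question through each model in turn, collecting answer points."""
    model_answer_points = []
    for _ in range(rounds):
        for model in models:
            reply = ask_model(model, question)
            question = reply["tightened_question"]   # each model refines the wording
            model_answer_points.append((model, reply["key_points"]))
    # The human then distils model_answer_points into the model answer
    # and feeds (question, model answer) back to a fresh session for pushback.
    return question, model_answer_points
```

The human stays in the loop at the end: the extracted points are raw material for the rubric, not the rubric itself.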
I don't think you need to be a domain expert. You just need to pay attention, extract data and ask questions. Between you and four LLMs you'll surely be able to come up with 10 questions that actually matter / mirror what's important to you. It's more project management than anything else.
Professionally? No. I used to be a uni lecturer, so this sort of rubric design by expert consensus (Delphi) is pretty familiar to me.
You're right that the LLM benchmarks are opaque. I have no idea of the normative values. Hell, even finding the raw test banks is tricky. So, I made my own. They skew heavily to the domains I care about, so probably not generalisable. OTOH, the methodology should work.
I think I design my stuff from a very different school of thought than CS people. My first and most guiding principle is "I don't trust the LLM. It needs to earn my trust by showing its work". If you take any of the SOTA cloud models and point them at llama-conductor and ask them to inspect the code base, they will show you what I mean. Hell, point them at this thread.
Interesting: Sonnet 4.6's performance has dropped significantly today and the status page is showing corresponding service issues. Probably the 4.7 rollout?
Random degradation in performance is one of the other annoyances of cloud-based inference / another reason to roll your own.
If you mean "keep as much local as you can, route to cloud for heavy tasks?", then we probably agree.
IMHO there's stuff that's just out of reach for most consumer-grade local rigs.
Even the bigger open weight models are now in the 1T range. The new Kimi 2.6 can technically be installed at home....if your home happened to have a $300K server.
I think maybe the real benefit of these large, competitive, open weight models is that they stop the established players (eg: OAI, Anthropic) from just cranking the costs upwards too far / running roughshod. If you can pay $10 a month to access something that is very nearly as good as SOTA, then....why would you pay $100 - $300?
I think you'd be surprised at how common that use case is. Sentiment analysis, advice, help and general chat.
Technical coding stuff makes up maybe 5%.
Split as needed.
You would think that the premier cloud-based service would have that on lock. Their own internal polling shows the trend.
https://www.anthropic.com/research/economic-index-march-2026-report
What I think will end up happening is that the online providers will crater their good will as one stop shops and trend towards business use cases / API. Maybe that's just me being a cynic.
It looks very, very promising. I don't have anything like the rig needed to run it locally, but I can see it was added to the Opencode Go ($10/month) bundle.
That's outstanding!
It's really hard for me to answer this question without pointing to my project, because the project is sort of directly in response to this very problem. So, gauche as it may be, fuck it:
https://codeberg.org/BobbyLLM/llama-conductor
I mention this because 1) I am NOT trying to get you to install my shit but 2) my shit answers this directly. I note the conflict of interest, but OTOH you did ask me, and I sort of solved it in my way so...fuck. (It's FOSS / I'm not trying to sell you anything etc etc).
With that out of the way, I will answer from where I am sitting and then generically (if I understand your question right).
Basically -
Small models have problems with how much they can hold internally. There's a finite meta-cognitive "headspace" for them to work with...and the lower the quant, the fuzzier that gets. Sadly, with a weaker GPU, you're almost forced to use lower quants.
If you can't upgrade the LLM (due to hardware), what you need to do is augment it with stuff that takes on some of the heavy lifting.
What I did was this: I wrapped a small, powerful, well-benchmarking LLM in an infrastructure that takes the things it's bad at outside of its immediate concern.
Bad inbuilt model priors / knowledge base? No problem; force answers to go thru a tiered cascade.
Inbuilt quick responses that you define yourself as grounding (cheatsheets) --> self-populating wiki-like structure (you drop in .md into one folder, hit >>summ and it cross-updates everywhere) --> wikipedia short lookup (800 character open box: most wiki articles are structured with the TL;DR in that section) --> web search (using trusted domains) or web synth (using trusted domains plus cross-verification) --> finally....model pre-baked priors.
In my set up, the whole thing cascades from highest trust to lowest trust (as defined by the human), stops when it gathers the info it needs and tells you where the answer came from.
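In skeleton form, that cascade is just "try tiers in trust order, stop on first hit". This is a minimal sketch of the idea, assuming each tier is a lookup function you've pre-ordered by trust - the names and return shape are illustrative, not llama-conductor's actual API:

```python
# Toy sketch of a highest-trust-first answer cascade. Each tier is a
# (name, lookup) pair where lookup(query) returns an answer or None.
# The cascade stops at the first tier that has the info, and the result
# always says where the answer came from.

def cascade(query, tiers):
    """Try each tier in trust order; stop on the first hit."""
    for name, lookup in tiers:
        answer = lookup(query)
        if answer is not None:
            return {"answer": answer, "source": name}
    # Nothing grounded found: fall through to the model's pre-baked priors.
    return {"answer": None, "source": "model_priors"}

# Illustrative tiers (stubs): cheatsheets hit, the rest miss.
tiers = [
    ("cheatsheets",  lambda q: {"speed of light": "299,792,458 m/s"}.get(q)),
    ("wiki_summary", lambda q: None),   # stub: nothing cached
    ("web_search",   lambda q: None),   # stub: offline
]
print(cascade("speed of light", tiers))  # answered by the cheatsheets tier
```

The important design choice is that trust ordering is defined by the human, and "source" travels with every answer so you can audit where it came from.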
Outside of that, sidecars that do specific things (maths solvers, currency lookup tools, weather lookup, >>judge comparators...tricks on tricks on tricks).
Based on my tests, with my corpus (shit I care about) I can confidently say my little 4B can go toe to toe with any naked 100B on my stuff. That's a big claim, and I don't expect you to take it at face value. It's a bespoke system with opinions...but I have poked it to death and it refuses to die. So...shrug. I'm sanguine.
Understand: I assume the human in the middle is the ultimate arbiter of what the LLM reasons over. This is a different school of thought to "just add more parameters, bro" or "just get a better rig, bro", but it was my solution to constrained hardware and hallucinations.
There are other schools of thought. Hell, others use things like MCP tool calls. The model pings cloud or self-host services (like farfalle or Perplexica), calls them when it decides it needs to, and the results land in context. But that's a different locus of control; the model's still driving...and I'm not a fan of that on principle. Because LLMs are beautiful liars and I don't trust them.
The other half of the problem isn't knowledge - it's behaviour.
Small models drift. They go off-piste, ignore your instructions halfway through a long response, or confidently make shit up when they hit the edge of what they know. So the other thing I built was a behavioural shaping layer that keeps the model constrained at inference time - no weight changes, just harness-level incentive structure. Hallucination = retry loop = cost. Refusal = path of least resistance. You're not fixing the model; you're making compliance (mathematically) cheaper than non-compliance.
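The incentive structure is easier to see as a toy loop. This is a sketch of the shape of it only, assuming stand-in `generate` and `verifies` functions - it is not the actual harness, just the "hallucination = retry = cost, refusal = cheap exit" logic:

```python
# Toy sketch of inference-time behavioural shaping: an unverifiable
# answer triggers a retry and accrues cost, so the cheapest paths for
# the model are a grounded answer or an honest refusal. `generate` and
# `verifies` are hypothetical stand-ins for the model call and checker.

def shaped_answer(prompt, generate, verifies, max_retries=3):
    cost = 0
    for attempt in range(max_retries):
        answer = generate(prompt, attempt)
        if answer == "I don't know":     # refusal: path of least resistance
            return answer, cost
        if verifies(answer):             # grounded answer: accepted
            return answer, cost
        cost += 1                        # hallucination: pay the toll, retry
    return "I don't know", cost          # forced refusal after retries
```

No weights change anywhere in this; the constraint lives entirely in the harness around the model.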
That's how I solved it for me. YMMV.
On 16GB VRAM: honestly, that's decent - don't let GPU envy get to you. You can comfortably run a Q4_K_M of a 14B model entirely in VRAM at usable speeds - something like Qwen3-14B or Mistral-Small. Those are genuinely capable; not frontier, but not a toy either. The painful zone is 4-8GB (hello!), where you're either running small models natively or offloading layers to RAM and watching your tokens-per-second crater. You can do some good stuff with a 14B, augmented with the right tools.
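The back-of-envelope maths on why a Q4_K_M 14B fits comfortably in 16GB, assuming roughly 4.8 bits per weight for Q4_K_M (an approximate average; it varies by model and quant mix) and a rough allowance for KV cache and runtime overhead:

```python
# Rough VRAM estimate for a 14B model at Q4_K_M. The bits-per-weight
# and overhead figures are ballpark assumptions, not exact values.

params_b = 14e9            # 14 billion parameters
bits_per_weight = 4.8      # approx. average for Q4_K_M quantisation
weights_gb = params_b * bits_per_weight / 8 / 1e9   # ~8.4 GB of weights
overhead_gb = 2.5          # rough allowance: KV cache + activations + runtime
total_gb = weights_gb + overhead_gb

print(f"~{total_gb:.1f} GB of 16 GB")  # leaves headroom for longer context
```

That headroom is why 16GB sits on the right side of the line, while 4-8GB cards end up offloading to RAM.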
Where to start the rabbit hole: Do you mean generally? Either Jan.ai or LM Studio is the easiest on-ramp - drag and drop models, built-in chat UI, handles GGUF out of the box. IIRC, Jan has direct MCP tooling as well.
Once you want more control, drop into llama.cpp directly. It's just...better. Faster. Fiddlier, yes...but worth it.
For finding good models, Unsloth's HuggingFace page is consistently one of the better curators of well-quantised GGUFs. After that it's just... digging through LocalLLaMA and benchmarking stuff yourself.
There's no substitute for running your own evals on your own hardware for your own use case - published benchmarks will lie to you. If you're insane enough to do that, see my above "rubric" post.
Not sure...have I answered your question?
PS: for anyone that hits the repo and reads the 1.9.5 commit message - enjoy :) Twas a mighty fine bork indeed, worthy of the full "Bart Simpson writes on chalkboard x 1000" hall of shame message. Fucking VSCodium, man....I don't know how sandbox mode got triggered, but it did, and it ate half my frikken hard drive and repo before I could stop it. Rookie shit.
If you do, please write back and let us all know how it fares. I think a lot of us are pinning our hopes on the Qwen line as "Claude at Home". I don't think 3.6 is it...but 3.7? 3.8?