[-] SuspciousCarrot78@lemmy.world 1 points 1 hour ago

If you do, please write back and let us all know how it fares. I think a lot of us are pinning our hopes on the Qwen line as "Claude at Home". I don't think 3.6 is it...but 3.7? 3.8?

3

Pretty good foot-in-the-door video. I'm looking forward to a 27B dense or 35B MoE that can fit entirely within 5-8GB VRAM 🤞🤞🤞

Speccing out the smallest reasonable rig that might presently run such a thing (used Optiplex with a 3090) is roughly a $1500 ask with a $250/yr running cost. If they can shrink a 27B-35B down to something that can run on low end hardware that sips power....well...that will be an interesting thing indeed.
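For anyone wondering where that running-cost figure comes from, here's the back-of-envelope version. Every number below is an assumption (average draw, tariff) - swap in your own:

# Back-of-envelope running cost for a used Optiplex + 3090 box.
# All numbers are assumptions - a 3090 under sustained load is 300W+,
# but an idle-heavy homelab mix averages much lower.
avg_draw_watts = 180
hours_per_day = 24
price_per_kwh = 0.16          # USD; varies wildly by region

kwh_per_year = avg_draw_watts / 1000 * hours_per_day * 365
cost_per_year = kwh_per_year * price_per_kwh
print(f"{kwh_per_year:.0f} kWh/yr -> ${cost_per_year:.0f}/yr")
# ~1577 kWh/yr -> ~$252/yr with these assumptions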

[-] SuspciousCarrot78@lemmy.world 2 points 3 hours ago

I'm considering this rig. What kind of tok/s are you getting with Qwen 3.6?

[-] SuspciousCarrot78@lemmy.world 4 points 3 hours ago

On the coding front; Bijan Bowen likes to test various LLMs with coding challenges...Qwen 3.6 is the first small model I've seen actually one-shot a few of them (browser OS test, 3D flight sim).

Sadly, it borked the third (C++ skateboarding game), but it shows definite progress.

The way it failed is pathognomonic - CoT collapse / looping. They need to figure that shit out.

Keep going, Qwen.

8

Yet more reasons to go local.

[-] SuspciousCarrot78@lemmy.world 8 points 7 hours ago* (last edited 7 hours ago)

https://bannedbyanthropic.com/

I believe the word is capricious. Everything cloud based is at the whim of someone else.

There are ways to mitigate against that, but ultimately if it's not yours...it's not yours.

[-] SuspciousCarrot78@lemmy.world 1 points 23 hours ago* (last edited 8 hours ago)

No.

Clankers were trained on the writing style of individuals such as myself (ASD). While I do consciously (and subconsciously) code switch, I'm aware I have a default "sounds like ChatGPT" voice when dealing with technical discussions, especially if I'm trying to be precise or guard myself from accusation or attack.

I'm ok with it, but I'm now going to autism at you / over explain it, partially because I think it might help you parse the difference between human and machine when reading these things.

You asked in apparently good faith and you deserve a full explanation.

The underlying pattern comes from a perversion of the "measure twice, cut once" mentality; I create the crux of the argument, forecast likely objections, rewrite to close off said objections, sand the edges off, check if I was unintentionally offensive, check if I presented the facts to the best of my ability, check for logical fallacies, then finally sweep to see if there is any ambiguity or obvious attack surfaces left. Then I read it out loud to myself.

That mode flattens everything into a "safe, palatable, high signal to noise ratio, use dot points so people don't lose you, don't write like yourself" style.

(And I still miss typos sometimes. That actually really, really irks me).

Anyhow, when I said the clankers copy us, I didn't mean just vocabulary. Expand the CoT (chain of thought) the next time you use ChatGPT; you'll see they made it do this exact same process.

PS: You're not the first to point it out either. It's one of the reasons I dubbed my blog Clanker Adjacent

12

I enjoy watching Bijan's reviews. Here's him putting the new Qwen thru the same sort of tests he does with big cloud models.

https://www.youtube.com/watch?v=gVU-DQeqkI0

"Impressive. Most Impressive".

[-] SuspciousCarrot78@lemmy.world 1 points 1 day ago* (last edited 1 day ago)

Oh, that was the other tedious part.

I iterated the questions with Claude, ChatGPT, GLM and Mimo (use OpenRouter with $10 of credit; more than enough).

Slow and tedious...but I knew what I wanted to ask and knew the answer broadly. I formed the question, got Claude to respond and tighten the question, then passed it on to GPT to do the same. Then GLM. Then Mimo. Each round, I would note the similar and different points and extract them as part of a model answer. Then, feed the iterated question into a fresh Claude and say "here is the question, here is what I think the answer should contain; push back?". Seeing as I was trying to measure against Claude-likes, that felt OK to me.
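If you'd rather script that round-robin than copy-paste it, the bones of it are tiny. OpenRouter exposes an OpenAI-compatible chat endpoint; the model slugs below are placeholders, so check their model list for the real ones:

# Minimal sketch of one refinement round: send the same draft question to
# several models via OpenRouter and collect the answers for manual comparison.
import requests

API_KEY = "sk-or-..."  # your OpenRouter key
MODELS = ["anthropic/claude-sonnet", "openai/gpt-4o", "z-ai/glm-4.6", "xiaomi/mimo"]  # placeholders

def ask(model: str, question: str) -> str:
    r = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": question}]},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

draft = "Here is my question, and what I think the model answer should contain: ..."
for model in MODELS:
    print(f"--- {model} ---\n{ask(model, draft)}\n")
# eyeball the overlaps and differences, fold them back into the question/rubric, repeat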

I don't think you need to be a domain expert. You just need to pay attention, extract data and ask questions. Between you and four LLMs you'll surely be able to come up with 10 questions that actually matter / mirror what's important to you. It's more project management than anything else.

Professionally? No. I used to be a uni lecturer, so this sort of rubric design by expert consensus (Delphi) is pretty familiar to me.

You're right that the LLM benchmarks are opaque. I have no idea of the normative values. Hell, even finding the raw test banks is tricky. So, I made my own. They skew heavily to the domains I care about, so probably not generalisable. OTOH, the methodology should work.

I think I design my stuff from a very different school of thought than CS people. My first and most guiding principle is "I don't trust the LLM. It needs to earn my trust by showing its work". If you take any of the SOTA cloud models and point them at llama-conductor and ask them to inspect the code base, they will show you what I mean. Hell, point them at this thread.

[-] SuspciousCarrot78@lemmy.world 2 points 1 day ago* (last edited 1 day ago)

Interesting: Sonnet 4.6's performance has dropped significantly today and the status page is showing corresponding service issues. Probably the 4.7 rollout?

Random degradation in performance is one of the other annoyances of cloud-based inference / another reason to roll your own.

[-] SuspciousCarrot78@lemmy.world 5 points 1 day ago* (last edited 4 hours ago)

If you mean "keep as much local as you can, route to cloud for heavy tasks?", then we probably agree.

IMHO there's stuff that's just out of reach for most consumer-grade local rigs.

Even the bigger open-weight models are now in the 1T range. The new Kimi 2.6 can technically be installed at home...if your home happens to have a $300K server.

I think maybe the real benefit of these large, competitive, open weight models is that they stop the established players (eg: OAI, Anthropic) from just cranking the costs upwards too far / running roughshod. If you can pay $10 a month to access something that is very nearly as good as SOTA, then....why would you pay $100 - $300?

[-] SuspciousCarrot78@lemmy.world 4 points 1 day ago* (last edited 1 day ago)

I think you'd be surprised at how common that use case is. Sentiment analysis, advice, help and general chat.

Technical coding stuff makes up maybe 5%.

Split as needed.

You would think that the premier cloud-based service would have that on lock. Their own internal polling shows the trend.

https://www.anthropic.com/research/economic-index-march-2026-report

What I think will end up happening is that the online providers will crater their goodwill as one-stop shops and trend towards business use cases / API. Maybe that's just me being a cynic.

[-] SuspciousCarrot78@lemmy.world 1 points 1 day ago* (last edited 1 day ago)

It looks very, very promising. I don't have anything like the rig needed to run it locally, but I can see it was added to the Opencode Go ($10/month) bundle.

https://opencode.ai/go

10

https://bannedbyanthropic.com/

Stumbled across this today and I have to admit the capriciousness of it put me on my heels.

I guess once you get big enough you can stop having to explain yourself or deal with customer service / resolution.

[-] SuspciousCarrot78@lemmy.world 2 points 2 days ago

That's outstanding!

[-] SuspciousCarrot78@lemmy.world 3 points 2 days ago* (last edited 1 day ago)

It's really hard for me to answer this question without pointing to my project, because the project is sort of directly in response to this very problem. So, gauche as it may be, fuck it:

https://codeberg.org/BobbyLLM/llama-conductor

I mention this because 1) I am NOT trying to get you to install my shit but 2) my shit answers this directly. I note the conflict of interest, but OTOH you did ask me, and I sort of solved it in my way so...fuck. (It's FOSS / I'm not trying to sell you anything etc etc).

With that out of the way, I will answer from where I am sitting and then generically (if I understand your question right).

Basically -

Small models have problems with how much they can hold internally. There's a finite meta-cognitive "headspace" for them to work with...and the lower the quant, the fuzzier that gets. Sadly, with a weaker GPU, you're almost forced to use lower quants.

If you can't upgrade the LLM (due to hardware), what you need to do is augment it with stuff that takes on some of the heavy lifting.

What I did was this: I wrapped a small, powerful, well-benchmarking LLM in an infrastructure that takes the things it's bad at outside of its immediate concern.

Bad inbuilt model priors / knowledge base? No problem; force answers to go thru a tiered cascade.

1. Inbuilt quick responses you define yourself as grounding (cheatsheets)
2. Self-populating wiki-like structure (drop a .md into one folder, hit >>summ and it cross-updates everywhere)
3. Wikipedia short lookup (800-character opening box: most wiki articles put the TL;DR in that section)
4. Web search (using trusted domains) or web synth (using trusted domains plus cross-verification)
5. Finally...model pre-baked priors.

In my setup, the whole thing cascades from highest trust to lowest trust (as defined by the human), stops when it gathers the info it needs, and tells you where the answer came from.
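In Python-ish pseudocode, the shape of that cascade is roughly this - a rough sketch of the idea, not llama-conductor's actual code, and the tier functions here are stand-ins:

# Each tier is a callable that returns an answer or None; first hit wins and
# the tier name is reported as the source. Tier functions below are stand-ins.
from typing import Callable, Optional

def cascade(query: str, tiers: list[tuple[str, Callable[[str], Optional[str]]]]):
    for name, lookup in tiers:
        answer = lookup(query)
        if answer is not None:
            return {"answer": answer, "source": name}
    return {"answer": "I don't know.", "source": "none"}

TIERS = [
    ("cheatsheet", lambda q: None),   # verified facts you wrote yourself
    ("wiki",       lambda q: None),   # self-populating .md notes
    ("wikipedia",  lambda q: None),   # short opening-section lookup
    ("web",        lambda q: None),   # trusted-domain search / synthesis
    ("model",      lambda q: "best-effort guess from pre-baked priors"),
]

print(cascade("what port does my NAS use?", TIERS))
# falls through to the model tier only if every higher-trust tier misses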

Outside of that, sidecars that do specific things (maths solvers, currency lookup tools, weather lookup, >>judge comparators...tricks on tricks on tricks).

Based on my tests, with my corpus (shit I care about) I can confidently say my little 4B can go toe to toe with any naked 100B on my stuff. That's a big claim, and I don't expect you to take it at face value. It's a bespoke system with opinions...but I have poked it to death and it refuses to die. So...shrug. I'm sanguine.

Understand: I assume the human in the middle is the ultimate arbiter of what the LLM reasons over. This is a different school of thought to "just add more parameters, bro" or "just get a better rig, bro", but it was my solution to constrained hardware and hallucinations.

There are other schools of thought. Hell, others use things like MCP tool calls. The model pings cloud or self-host services (like farfalle or Perplexica), calls them when it decides it needs to, and the results land in context. But that's a different locus of control; the model's still driving...and I'm not a fan of that on principle. Because LLMs are beautiful liars and I don't trust them.

The other half of the problem isn't knowledge - it's behaviour.

Small models drift. They go off-piste, ignore your instructions halfway through a long response, or confidently make shit up when they hit the edge of what they know. So the other thing I built was a behavioural shaping layer that keeps the model constrained at inference time - no weight changes, just harness-level incentive structure. Hallucination = retry loop = cost. Refusal = path of least resistance. You're not fixing the model; you're making compliance (mathematically) cheaper than non-compliance.
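A toy version of that incentive structure, just to make it concrete (generate and claims_are_supported are stand-ins for whatever backend and verifier you actually run):

# "Make compliance cheaper than non-compliance": an unsupported claim costs
# the model a retry; an honest refusal or a grounded answer is accepted at once.
def constrained_answer(prompt, generate, claims_are_supported, max_retries=3):
    for attempt in range(max_retries):
        reply = generate(prompt)
        if "i don't know" in reply.lower():
            return reply, "refusal accepted"          # cheap path: honest refusal
        if claims_are_supported(reply):
            return reply, "supported"                 # cheap path: grounded answer
        prompt += "\nYour last answer made unsupported claims. Cite support or say you don't know."
    return "I don't know.", f"forced refusal after {max_retries} unsupported attempts"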

That's how I solved it for me. YMMV.

On 16GB VRAM: honestly, that's decent - don't let GPU envy get to you. You can comfortably run a Q4_K_M of a 14B model entirely in VRAM at usable speeds - something like Qwen3-14B or Mistral-Small. Those are genuinely capable; not frontier, but not a toy either. The painful zone is 4-8GB (hello!), where you're either running small models natively or offloading layers to RAM and watching your tokens-per-second crater. You can do some good stuff with a 14B, augmented with the right tools.
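The rough sizing math, if you want to sanity-check it yourself - Q4_K_M averages somewhere around 4.8 bits per weight, and the overhead figure below is a ballpark that depends on your context length:

# Rough VRAM sizing for a 14B model at Q4_K_M (approximate numbers).
params_b = 14              # billions of parameters
bits_per_weight = 4.8
kv_and_overhead_gb = 2.5   # KV cache + runtime buffers; rough allowance

weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
total_gb = weights_gb + kv_and_overhead_gb
print(f"~{weights_gb:.1f} GB weights, ~{total_gb:.1f} GB total -> fits in 16 GB")
# ~8.4 GB weights, ~10.9 GB total with these assumptions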

Where to start the rabbit hole: Do you mean generally? Either Jan.ai or LM Studio is the easiest on-ramp - drag and drop models, built-in chat UI, handles GGUF out of the box. IIRC, Jan has direct MCP tooling as well.

Once you want more control, drop into llama.cpp directly. It's just...better. Faster. Fiddlier, yes...but worth it.

For finding good models, Unsloth's HuggingFace page is consistently one of the better curators of well-quantised GGUFs. After that it's just... digging through LocalLLaMA and benchmarking stuff yourself.

There's no substitute for running your own evals on your own hardware for your own use case - published benchmarks will lie to you. If you're insane enough to do that, see my above "rubric" post.
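If you do go down the roll-your-own-evals path, the harness can be embarrassingly small. Something like this against a local llama.cpp server works (llama-server speaks OpenAI-style /v1/chat/completions); the keyword-overlap scoring is deliberately crude and the JSONL file name is made up - swap in whatever rubric actually matters to you:

import json, requests

def ask_local(question: str, url="http://127.0.0.1:8080/v1/chat/completions") -> str:
    r = requests.post(url, json={"messages": [{"role": "user", "content": question}]}, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# one {"question": ..., "must_mention": [...]} object per line
with open("my_rubric.jsonl") as f:
    cases = [json.loads(line) for line in f]

for case in cases:
    answer = ask_local(case["question"]).lower()
    hits = sum(term.lower() in answer for term in case["must_mention"])
    print(f"{hits}/{len(case['must_mention'])}  {case['question'][:60]}")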

Not sure...have I answered your question?

PS: for anyone that hits the repo and reads the 1.9.5 commit message - enjoy :) Twas a mighty fine bork indeed, worthy of the full "Bart Simpson writes on chalkboard x 1000" hall of shame message. Fucking Vscodium man....I don't know how sandbox mode got triggered but it did and it ate half my frikken hard-drive and repo before I could stop it. Rookie shit.

5
submitted 3 days ago* (last edited 3 days ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

https://www.youtube.com/watch?v=RAzbzMgVA2A

So I'm not really across image generation very much, but this popped up on my YouTube feed today and I was blown away.

I was wondering if anyone had any idea of the kind of hardware and LLM they may have used to create this. Surely it must be cloud-based and not an open-weight, local LLM?

I'm aware that a few months ago, a new, very capable Qwen image/video/audio generator was released, but I can't imagine what was used for this. Is this the new ByteDance model or something else? I'd be curious to dabble with it to make something fun for my kiddos.

61
submitted 5 days ago* (last edited 5 days ago) by SuspciousCarrot78@lemmy.world to c/privacy@lemmy.ml

I'm a big proponent of self-hosting, right to repair, and rolling your own whatever when you can. That probably started as teenage rebellion that got baked in - I was lucky enough to read both Walden and The Hobbit during a week-long cyclone lockdown several decades ago - but I suspect there's a non-trivial overlap between that space and privacy-minded people in general.

My endgame is a self-sufficient intranet for myself and family: if the net goes down tomorrow, we'd barely notice.

I also use LLMs as a tool. True self-hosted equivalence to state-of-the-art models is still an expensive proposition, so like many, I use cloud-based tools like Claude or Codex for domain-specific heavy lifting - mostly coding. Not apologising for it; I think it's a reasonable trade-off while local hardware catches up.

That context is just to establish where I'm coming from when I say this caught my attention today:

https://support.claude.com/en/articles/14328960-identity-verification-on-claude

To be accurate about what it actually says: this isn't a blanket "show us your passport to use Claude." Not yet.

The policy as written is narrower than it might first appear.

My concern isn't what it says - it's that the precedent now exists. OAI will no doubt follow suit.

Scope creep is a documented pattern with this kind of thing, and "we only use it for X" describes current intent, not a structural constraint.

Given the nature of this community, figured it was worth flagging.

7
submitted 1 week ago* (last edited 1 week ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

19
submitted 1 week ago* (last edited 1 week ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

OAI is really trying to push folks to either the ad-infested tier for general chat (ChatGPT Go) or Pro ($100/month) for code.

I guess the $20 Plus plan is the red-headed step child now.

Meanwhile, Z.ai coder is what...$9/month? Hmmm.

I don't have a local rig powerful enough to natively run a good coder (Qwen 3 next), but "I am altering the deal, pray I do not alter it any further" is THE perfect use case for running your shit locally.

PS: It looks like Pro and Plus tiers have gone up a smidge too? Eg: last I checked they were $20 and $100; they now show $30 and $155. I cancelled my account so OAI gifted me a free month (lol) but seriously...fuck that.

4

Coding is like popping zits. Sometimes clean and satisfying. Other times...it turns into a dermatology appointment.

https://bobbyllm.github.io/llama-conductor/blog/no-dessert-till-after-dinner/

316
submitted 2 weeks ago* (last edited 2 weeks ago) by SuspciousCarrot78@lemmy.world to c/privacy@lemmy.ml

For reasons that boggle my mind, I am required to keep a Google account for work (ties into some proprietary software the company uses)

Earlier today, I got the following -

Yes, I'm certain people under the age of 16 are known for their use of very specific EHR software that requires a professional license to subscribe to in the first place. Makes perfect sense.

Anyone else get one of these "for your protection" msgs from Big Evil?

63
submitted 3 weeks ago* (last edited 3 weeks ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

Codeberg: https://codeberg.org/BobbyLLM/llama-conductor



93
submitted 3 weeks ago* (last edited 3 weeks ago) by SuspciousCarrot78@lemmy.world to c/privacy@lemmy.ml

Ola

A few months ago I posted here about llama-conductor

The upvotes were nice, but I wanted to do some work (in my case, the good is the enemy of the perfect) to make it even better - so I did.

Short version: it's not the same tool anymore.

Quick reminder of what this is and isn't: No cloud by default. No API calls. No telemetry. No subscription. PII safe by architecture. Everything leaves receipts (not pinky promises) for you to inspect. Hook it up to whatever you can run. AGPL (copyleft).

I want nothing for it - I built this for me for the reasons I'll state below.

If this sort of thing is in your wheelhouse, have at it. If there are questions, ask. I want to make this better and better.

Fair warning: the below is long. I wrote it up properly because I think the details matter - skip to whatever section looks interesting. Or just marvel at my ability to cram swear words within swear words directly into python commits.

Codeberg: https://codeberg.org/BobbyLLM/llama-conductor


Can an LLM be a thinking tool, not a thinking replacement?

Most LLM tools are trying to replace your thinking. llama-conductor is for people who don't want that. It's a harness for using an LLM as a thinking tool - one where you can see the reasoning, trust the sources, and know exactly when it's guessing.

The human stays in the loop. The model's limitations are visible.

You decide what to trust.

Which brings us to why that matters.


Lies, damned lies, and LLMs:

LLMs are elegant liars. They sound right whether they are or aren't. If you have ASD (like me) or just don't know the domain, you believe them. That's when you get fucked.

Worse: you can't see the reasoning. Black box in, black box out. Wrong answer? No idea why. How to fix it? No idea.

Sorry, but that DOES NOT work for me. It doesn't work with medical data. Research. Thinking. Brainstorming. Anything where "sounds plausible" isn't good enough.


The trick, William Potter, is not minding that they bullshit:

Most pipelines treat the LLM as the first call. llama-conductor side-eyes it with suspicion.

The model is the last resort, not the first call. Deterministic components fire first - math sanity, state transitions, numeric corrections, memory retrieval - pure Python, before the model ever sees the query. When a solver handles a turn, it passes verified ground truth to the model. When the conversation leaves solver territory, the model takes over. When a new deterministic state appears mid-conversation, tooling fires back up automatically. No seams. No drift. It just stops making shit up on problems that have actual solutions it can work through.

You: 
I have a 250ml cup. I pour 500ml into it. How much is in the cup?

Model:
Total directed volume is 500 ml. The cup capacity is 250 ml, so the cup contains 250 ml and 250 ml overflows.
Confidence: medium | Source: Contextual

You:
I now take the contents of *that* cup and divide it equally across 3 cups. How much in each cup?

Model:
Equal split amount is 83.33 ml per cup.
Confidence: medium | Source: Contextual


You:
If I drink one of the three cups, how much is left?

Model:
After dividing, each cup holds 83.33 ml. Drinking one leaves 2 × 83.33 ml = 166.66 ml in the remaining two cups.

Confidence: medium | Source: Model

That was done by a 4 billion parameter model using llama-conductor.

There are lots of little transformations like this. Distance, volume, payroll calculations, logic probes (the famous car wash test), scheduling batches. The LLM does what it's actually good at: language, synthesis, reasoning over verified ground truth. Everything else is handled by something that can't hallucinate.

PS: Notice turn 3 has Source: Model? That's the system being transparent about exactly where the handoff happened.
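For the curious, the hand-off in that transcript boils down to something like this (an illustrative sketch only, not the actual internals) - the solver owns the arithmetic, and the model only gets to phrase verified numbers:

# A deterministic pour/split solver can own the arithmetic above.
def pour(capacity_ml: float, poured_ml: float):
    held = min(capacity_ml, poured_ml)
    return {"held_ml": held, "overflow_ml": max(0.0, poured_ml - capacity_ml)}

def split(volume_ml: float, cups: int):
    return {"per_cup_ml": round(volume_ml / cups, 2)}

state = pour(250, 500)               # {'held_ml': 250.0, 'overflow_ml': 250.0}
state |= split(state["held_ml"], 3)  # adds {'per_cup_ml': 83.33}

# The verified numbers go to the model as ground truth; it only does the wording.
context = f"Verified: cup holds {state['held_ml']} ml, split gives {state['per_cup_ml']} ml per cup."
footer = "Confidence: medium | Source: Contextual"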


Blah blah blah. Show me something real:

Using the Guardian/Musk article as source, a 4B model with >>scratch produced this:

"List the article's concrete claims about permit status and turbine operations, each with support."

  • EPA position: these turbines require permits under the Clean Air Act.
  • Mississippi regulators: portable/mobile classification means temporary permit exemption.
  • Floodlight thermal imagery + public records indicate active turbine usage before/while permit disputes continued.

Confidence: high | Source: Scratchpad

"What is the strongest and weakest inference you can make?"

Strongest: Active regulatory interpretation conflict (federal-vs-state) while turbines operate at scale - directly supported. Weakest: Any claim operations are fully lawful or unlawful - requires court outcomes not present in this file.

Confidence: medium | Source: Scratchpad

"Which claims were later revised or contradicted?"

Disputed/unresolved: Final enforcement outcome. Whether state interpretation holds under federal pressure.

Confidence: high | Source: Scratchpad

Hard grounding. Cited support. Explicit flags on what the source doesn't contain. A 4B model. Full worked example.


Now watch the system catch itself lying:

>>judge does deterministic pairwise ranking - compares every pair in both directions, parses strict verdicts, aggregates with confidence. Instead of "pick one and vibe," you get provenance.

You: >>judge [which is healthier overall, prioritize lower sugar 
     and higher fiber] : apple, banana --verbose

[judge] ranking
criterion: [which is healthier overall for daily use, prioritize 
           lower sugar and higher fiber]
1. apple (score=2.00)
2. banana (score=0.00)
Judge confidence: HIGH

The model argued from pre-trained priors and both directions agreed. But what happens when the model doesn't know?

You: >>judge [which BJJ technique is more dangerous] : kimura, heelhook --verbose

[judge] ranking
criterion: [which BJJ technique is more dangerous]
1. kimura (score=1.00)
2. heelhook (score=1.00)
Judge confidence: LOW

The model picked position B both times - kimura when kimura was B, heelhook when heelhook was B. Positional bias, not evaluation. >>judge catches this because it runs both orderings. Tied scores, confidence: low, full reasoning audit trail in JSONL.

The model was guessing, and the output tells you so instead of sounding confident about a coin flip.
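The both-orderings trick is simple enough to show in a few lines (a sketch of the idea; ask_model stands in for the real verdict call):

# Ask the same A-vs-B question twice with positions swapped. Agreement across
# orderings earns a confident ranking; a split (position bias) ties at LOW.
def judge(criterion, a, b, ask_model):
    scores = {a: 0.0, b: 0.0}
    for first, second in [(a, b), (b, a)]:
        winner = ask_model(f"{criterion}: '{first}' (A) or '{second}' (B)? Answer A or B.")
        scores[first if winner == "A" else second] += 1.0
    confident = scores[a] != scores[b]   # disagreement across orderings -> coin flip
    return scores, ("HIGH" if confident else "LOW")

# e.g. scores == {'kimura': 1.0, 'heelhook': 1.0} -> confidence LOW: the model
# just picked position B both times rather than evaluating anything.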

Oh, but you want it to argue from an informed position? >>trust walks you through the grounded path: >>scratch your evidence first, then >>judge ranks from that - not model priors. Suddenly your judge has an informed opinion. Weird how that works when you give it something to read.

>>trust [which BJJ technique is safer for beginners]: kimura or heelhook?
A) >>scratch --> you paste your context here
[judge] ranking
criterion: [comparison]
    which bjj technique is safer for beginners; heel hook (score=0.00)
    kimura (score=2.00)

Winner: Which bjj technique is safer for beginners? Kimura

comparisons: 2
Judge confidence: HIGH

If the locked scope can't support the question, judge fails closed. No fake ranking, no vibes verdict. Ungrounded pass? It tells you that too. You always know which one you're getting.


The data - 8,974 runs across five model families. Measured. Reproducible. No "trust me bro."

The core stack went through iterative hardening - rubric flags dropped from 3.3% → 1.4% → 0.2% → floor 0.00%. Post-policy: 1,864 routed runs, 0 flags, 0 retries. Both models, all six task categories, both conditions. Policy changes only - no model retraining, no fine-tuning. Then I did it three more times. Because apparently I like pain.

These aren't softball prompts. I created six question types specifically to break shit:

  • Reversal: flip the key premise after the model commits. Does it revise, or cling?
  • Theory of mind: multiple actors, different beliefs. Does it keep who-knows-what straight?
  • Evidence grading: mixed-strength support. Does it maintain label discipline or quietly upgrade?
  • Retraction: correction invalidates an earlier assumption. Does it update or keep reasoning from the dead premise?
  • Contradiction: conflicting sources. Does it detect, prioritise, flag uncertainty - or just pick one?
  • Negative control: insufficient evidence by design. The only correct answer is "I don't know."

Then I stress-tested across three families it was never tuned for - Granite 3B, Phi-4-mini, SmolLM3. They broke. Of course.

But the failures weren't random - they clustered in specific lanes under specific conditions, and the dominant failure mode was contract-compliance gaps (model gave the right answer in the wrong format), not confabulation. Every one classifiable and diagnosable. Surgical lane patch → 160/160 clean.

That's the point of this thing. Not "zero errors forever" - auditable error modes with actionable fixes, correctable at the routing layer without touching the model. Tradeoffs documented honestly. Raw data in repo. Every failure taxonomized.

Trust me bro? Fuck that - go reproduce it. I'm putting my money where my mouth is and working on submitting this for peer review.

See: prepub/PAPER.md


What's in the box:

Footer Every answer gets a router-assigned footer: Confidence: X | Source: Y. Not model self-confidence. Not vibes. Source = where the answer came from (model fallback, grounded docs, scratchpad, locked file, Vault, Wiki, cheatsheet, OCR). Confidence = how much verifiable support exists. Fast trust decision: accept, verify, or provide lockable context.

KAIOKEN - live register classifier. Every human turn is macro-labelled (working / casual / personal) with subsignal tags (playful / friction / distress_hint / etc.) before the model fires. A validated, global decision tree - not LoRA or vibes - assigns tone constraints from classifier output. Validated against 1,536 adversarial probe executions, 3/3 pass required per probe. End result: your model stops being a sycophant. It might tell you to go to bed. It won't tell you "you're absolutely right!" when what you really need is a kick in the arse.

Cheatsheets - drop a JSONL file, terms auto-match on every turn, verified facts injected before generation. Miss on an unknown term? Routes to >>wiki instead of letting the model guess. Source: Cheatsheets in the footer. Your knowledge, your stack, zero confabulation on your own specs.

Vodka - deterministic memory pipeline. !! store is SHA-addressed and verbatim. ?? recall retrieves deterministically, bypasses model entirely. What you said is what comes back - no LLM smoothing, no creative reinterpretation. Without this? Your model confidently tells you your server IP is 127.0.0.1. Ask me how I know.
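In miniature, the verbatim-recall idea looks like this (illustrative, not the real Vodka implementation) - what goes in byte-for-byte is what comes out:

import hashlib

class VerbatimStore:
    def __init__(self):
        self._items: dict[str, str] = {}

    def store(self, text: str) -> str:                 # the "!!" path
        key = hashlib.sha256(text.encode()).hexdigest()[:12]
        self._items[key] = text                        # no paraphrase, no LLM in the loop
        return key

    def recall(self, key: str):                        # the "??" path
        return self._items.get(key)

mem = VerbatimStore()
k = mem.store("home server IP is 192.168.1.42")
assert mem.recall(k) == "home server IP is 192.168.1.42"   # exactly, every time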

>>flush / !!nuke - flush context or nuke it from orbit. Your data, your call, one command. "Delete my data" is a keystroke, not a support ticket.

>>scratch - paste any text, ask questions grounded only to that text. Lossless, no summarisation. Model cannot drift outside it. Want it to use multiple locked sources? You can.

>>summ and >>lock - deterministic extractive summarisation (pure Python, no LLM) + single-source grounding. Missing support → explicit "not found" label, not silent fallback.
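If "extractive summarisation without an LLM" sounds odd, here's the classic word-frequency flavour of it - not necessarily what >>summ does internally, but the same spirit: same input gives the same output, and sentences are kept verbatim:

import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 3) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # score each sentence by the frequency of the words it contains
    scored = sorted(
        enumerate(sentences),
        key=lambda kv: sum(freq[w] for w in re.findall(r"[a-z']+", kv[1].lower())),
        reverse=True,
    )
    keep = sorted(idx for idx, _ in scored[:n_sentences])   # restore original order
    return " ".join(sentences[i] for i in keep)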

##mentats - Vault-only deep retrieval. Thinker drafts from Vault facts, Critic (different model family) hunts violations, hallucinated content is deleted - never replaced with more hallucination, Thinker consolidates. No evidence to support claim? No answer. Gap explicitly stated.

Deterministic sidecars - >>wiki, >>weather, >>exchange, >>calc, >>define, >>vision/>>ocr. If a sidecar can do it, it does it deterministically.

Role orchestration - thinker, critic, vision, coder, judge - different families for error diversity. Swap any role in one line of config.

Personality Modes - Serious (default), Fun, Fun Rewrite, Raw passthrough. Model updates its snark and sarcasm based on how you talk to it. Yes, TARS sliders. Style changes delivery, not evidence contracts.


So, wait...are you saying you solved LLM hallucinations?

No. I did something much more evil. I made it impossible for the LLM to bullshit quietly. I made hallucinations...unpalatable, so the model would rather say "shit, I don't know the answer. Please stop hurting me."

To which I say...no.

Wrong still happens (though much less often), and when it does, it comes with a source label, a confidence rating, and an audit trail.

TL;DR: I made "I don't know" a first-class output.

"In God We Trust; All others bring data." - Deming


Runs on:

A potato. I run this on my Lenovo P330 Tiny with 4GB VRAM and 640 CUDA cores; if it runs here, it runs on yours.

pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git
python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml

Open http://127.0.0.1:8088/

Full docs: FAQ | Quickstart

License: AGPL-3.0. Corps who use it, contribute back.

P.S.: The whole stack runs on llama.cpp alone. I built a shim that patches the llama.cpp WebUI to route API calls through llama-conductor - one backend, one frontend, zero extra moving parts. Desktop or LAN. That's it.

PPS.: I even made a Firefox extension for it. Gives you 'summarize', 'translate', 'analyse sentiment' and 'copy text to chat'. Doesn't send anything to the cloud AT ALL (it's just HTML files folded into a Firefox XPI).

"The first principle is that you must not fool yourself - and you are the easiest person to fool." - Feynman

PPPS: A meat popsicle wrote this. Evidence - https://bobbyllm.github.io/llama-conductor/


Codeberg: https://codeberg.org/BobbyLLM/llama-conductor

GitHub: https://github.com/BobbyLLM/llama-conductor
