1
24
Qwen3.6 27B released (huggingface.co)
submitted 8 hours ago* (last edited 8 hours ago) by TheCornCollector@piefed.zip to c/localllama@sh.itjust.works

Recently made a post about the 35B MoE. Now the dense 27B variant has been released.


2
7

Pretty good foot-in-the-door video. I'm looking forward to a 27B dense or 35B MoE that can fit entirely within 5-8 GB of VRAM 🤞🤞🤞

Speccing out the smallest reasonable rig that might presently run such a thing (used Optiplex with a 3090) is roughly a $1500 ask with a $250/yr running cost. If they can shrink a 27B-35B down to something that can run on low-end hardware that sips power... well... that will be an interesting thing indeed.

3
9

Yet more reasons to go local.

4
54

Last year when Framework announced the Framework Desktop I immediately ordered one. I'd been wanting a new gaming PC, but I'd also been kicking around the idea of running a local LLM. When it finally arrived it worked great for gaming… but there wasn't much that would run on the AMD hardware from an LLM standpoint. Over the next few months more tools became available, but it was very slow going. I had many long nights where I'd work and work and work and end up right back where I started.

So I got a Claude Code subscription and used it to help me build out my LLM setup. I made a lot of progress, but now I was comparing my local LLM to Claude, and there was no comparison.

Then I started messing with OpenClaw. First with Claude (expensive, fast), then with my local llama.cpp (cheap, frustrating). I didn't know enough about it, so I used Claude to help me build a custom app around my llama.cpp. That was fun and I learned a lot, but I was spending most of my time chasing bugs instead of actually optimizing anything.

Around that time I heard about Qwen3-Coder-Next, dropped it into llama.cpp, and wow that was a huge step forward. Better direction-following, better tool calls, just better. I felt like my homegrown app was now holding the model back, so I converted over to OpenClaw. Some growing pains, but once things settled I was impressed again.

We built a lot of tooling along the way: a vector database memory system that cleans itself up each night, a filesystem-based context system, speech-to-text and text-to-speech, and a vision model. At this point my local LLM could see me, hear me, speak to me, and remember things about me, and all of it was built to be LLM-agnostic so Claude and my local system could share the same tools.

I was still leaning on Claude heavily for coding, because honestly it's amazing at it. I decided to give Qwen a small test project: build a web-based kanban board, desktop- and mobile-friendly. It built it… but it sucked. Drag between columns? Broken. Fixed that, now you can't add items. Fixed that, dragging broke on mobile. I kept asking Claude to help troubleshoot and it kept just wanting to rewrite the app. Finally I gave in and said "just fix it"; Claude rewrote the whole thing and it was great. I was disheartened. On top of that, Qwen kept getting into these loops, sometimes running for hours doing nothing productive.

So about a week and a half ago I decided to rethink what I even wanted my local LLM to do. Coding was obviously out. I decided to start fresh and use it to help me journal. A few times a day it reaches out, asks what I'm doing, and if it's relevant, adds an entry to my journal.

I went through a couple more model swaps trying to get it stable. Qwen3.5 was better than Coder-Next for this use case, but I was still hitting loop issues. It was consistently prompting me and doing a decent job with the journal, which was at least a step in the right direction.

Then Qwen3.6 dropped. I loaded the Q6 quant the day it was released, and I could immediately tell it was faster and the output quality was much higher. I realized earlier today that since I switched to Qwen3.6 I haven't had to ask Claude to check in on Qwen even once. The looping is gone. It's actually following the anti-loop protocols I've been trying to get models to follow for months.

I haven't tried coding with it yet (I don't have high hopes there) but I've given it the ability to create and modify its own skills and it's been doing that beautifully. Scheduled tasks, multiple agents (voice assistant, primary, Home Assistant), all running smoothly.

My reliance on Claude has dropped off sharply since moving to Qwen3.6, and my system resource usage has gone down significantly too. If you've tried to get a local LLM setup running and gave up out of frustration… now might be a good time to jump back in, especially if you know your hardware should be able to handle it.

5
12

I enjoy watching Bijan's reviews. Here he is putting the new Qwen through the same sort of tests he runs on the big cloud models.

https://www.youtube.com/watch?v=gVU-DQeqkI0

"Impressive. Most Impressive".

6
10

https://bannedbyanthropic.com/

Stumbled across this today and I have to admit the capriciousness of it put me on my heels.

I guess once you get big enough you can stop having to explain yourself or deal with customer service / resolution.

7
18
8
48

When I first got into local LLMs nearly 3 years ago, in mid-2023, the frontier closed models were of course impressively capable.

I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to these models?? Dolphin, anyone??), on my gaming PC with an 8GB AMD RX 580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).

Fast forward to this month: I'm revisiting local LLMs. (Although I no longer have the gaming PC; cost-of-living crisis, anyone 😫)

And the ~31B-size models look entirely sufficient. #Qwen has taken the helm in this tier. It's still very expensive to set up locally, although within grasp.

I'm rooting for the edge-computing models now - the ~2B-size models. Due to their low footprint, they are practical for many people to run on an SBC 24/7 at home.

But these edge models are in the 'curiosity category' now.

9
5
submitted 3 days ago* (last edited 3 days ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

https://www.youtube.com/watch?v=RAzbzMgVA2A

So I'm not really across image generation very much, but this popped up on my YouTube feed today and I was blown away.

I was wondering if anyone had any idea of the kind of hardware and model they may have used to create this. Surely it must be cloud-based and not an open-weight, local LLM?

I'm aware that a few months ago a new, very capable Qwen image/video/audio generator was released, but I can't imagine what was used for this. Is this the new ByteDance model or something else? I'd be curious to dabble with it to make something fun for my kiddos.

10
56

32 GB of VRAM for less than $1k sounds like a steal these days, and I'm sure it's not getting cheaper any time soon.

Does anyone here use this GPU? Or any recent Arc Pros? I basically want someone to talk me out of driving to the nearest place that has it in stock and getting $1k poorer.

11
43
submitted 6 days ago* (last edited 6 days ago) by TheCornCollector@piefed.zip to c/localllama@sh.itjust.works

The Qwen3.5 models are still the best local models I’ve used, so I’m excited to see how this updated version performs.

12
7

Not sure where to ask: are you aware of any realtime voice-to-voice translators that one can self-host? I did a bit of web searching but didn't really find anything good. Thanks

13
7
submitted 1 week ago* (last edited 1 week ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

14
20

cross-posted from: https://lemmy.ml/post/45766694

Hey :) For a while now I've been using gpt-oss-20b on my home lab for lightweight coding tasks and some automation. I'm not so up to date with the current self-hosted LLMs, and since the model I'm using was released at the beginning of August 2025 (from an LLM development perspective, that feels like an eternity to me), I just wanted to use the collective wisdom of Lemmy to maybe replace my model with something better out there.

Edit:

Specs:

GPU: RTX 3060 (12 GB VRAM)

RAM: 64 GB

gpt-oss-20b does not fit into VRAM completely, but it's partially offloaded and reasonably fast (enough for me).
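
In case it's useful context: the partial offload is nothing fancy. If you load it through llama-cpp-python it looks roughly like this (the GGUF filename and layer count are illustrative, not my exact settings):

# Minimal sketch of partial GPU offload with llama-cpp-python.
# The model filename and n_gpu_layers value are placeholders; tune the
# layer count until the 12 GB of VRAM is as full as you can get it.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-20b.Q4_K_M.gguf",  # hypothetical quant file
    n_gpu_layers=20,   # layers that fit on the GPU; the rest stay in system RAM
    n_ctx=8192,        # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what partial offload does."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])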

15
19
submitted 1 week ago* (last edited 1 week ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

OAI is really trying to push folks to either the ad-infested tier for general chat (ChatGPT Go) or Pro ($100/month) for code.

I guess the $20 Plus plan is the red-headed stepchild now.

Meanwhile, Z.ai coder is what...$9/month? Hmmm.

I don't have a local rig powerful enough to natively run a good coder (Qwen 3 next), but "I am altering the deal, pray I do not alter it any further" is THE perfect use case for running your shit locally.

PS: It looks like the Pro and Plus tiers have gone up a smidge too? E.g., last I checked they were $20 and $100; they now show $30 and $155. I cancelled my account, so OAI gifted me a free month (lol), but seriously... fuck that.

16
7
submitted 1 week ago* (last edited 1 week ago) by vermaterc@lemmy.ml to c/localllama@sh.itjust.works

cross-posted from: https://lemmy.ml/post/45705520

The benchmark is a set of handcrafted 2d puzzle games that are easy to solve by humans, but require features like skill acquisition and long-term planning by agents.

17
4

Coding is like popping zits. Sometimes clean and satisfying. Other times...it turns into a dermatology appointment.

https://bobbyllm.github.io/llama-conductor/blog/no-dessert-till-after-dinner/

18
10

Digital balefire. The examples are wild.

19
18
Gemma 4 is here (huggingface.co)
20
34
21
48
22
-1
[Technical] The Great Silicon Shortage (newsletter.semianalysis.com)

This article goes into deeper technical detail about what is consuming all the manufacturing capacity.

23
63
submitted 3 weeks ago* (last edited 3 weeks ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

Codeberg: https://codeberg.org/BobbyLLM/llama-conductor


Can a LLM be a thinking tool, not a thinking replacement?

Most LLM tools are trying to replace your thinking. llama-conductor is for people who don't want that. It's a harness for using an LLM as a thinking tool - one where you can see the reasoning, trust the sources, and know exactly when it's guessing.

The human stays in the loop. The model's limitations are visible.

You decide what to trust.

Which brings us to why that matters.


Lies, damned lies, and LLMs:

LLMs are elegant liars. They sound right whether they are or aren't. If you have ASD (like me) or just don't know the domain, you believe them. That's when you get fucked.

Worse: you can't see the reasoning. Black box in, black box out. Wrong answer? No idea why. How to fix it? No idea.

Sorry, but that DOES NOT work for me. It doesn't work with medical data. Research. Thinking. Brainstorming. Anything where "sounds plausible" isn't good enough.


The trick, William Potter, is not minding that they bullshit:

Most pipelines treat the LLM as the first call. llama-conductor side-eyes it with suspicion.

The model is the last resort, not the first call. Deterministic components fire first - math sanity, state transitions, numeric corrections, memory retrieval - pure Python, before the model ever sees the query. When a solver handles a turn, it passes verified ground truth to the model. When the conversation leaves solver territory, the model takes over. When a new deterministic state appears mid-conversation, tooling fires back up automatically. No seams. No drift. It just stops making shit up on problems that have actual solutions it can work through.

You: 
I have a 250ml cup. I pour 500ml into it. How much is in the cup?

Model:
Total directed volume is 500 ml. The cup capacity is 250 ml, so the cup contains 250 ml and 250 ml overflows.
Confidence: medium | Source: Contextual

You:
I now take the contents of *that* cup and divide it equally across 3 cups. How much in each cup?

Model:
Equal split amount is 83.33 ml per cup.
Confidence: medium | Source: Contextual


You:
If I drink one of the three cups, how much is left?

Model:
After dividing, each cup holds 83.33 ml. Drinking one leaves 2 × 83.33 ml = 166.66 ml in the remaining two cups.

Confidence: medium | Source: Model

That was done by a 4 billion parameter model using llama-conductor.

There are lots of little transformations like this. Distance, volume, payroll calculations, logic probes (the famous car wash test), scheduling batches. The LLM does what it's actually good at: language, synthesis, reasoning over verified ground truth. Everything else is handled by something that can't hallucinate.

PS: Notice turn 3 has Source: Model? That's the system being transparent about exactly where the handoff happened.
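
If you want a feel for what a solver actually is, here is the cup example as a toy sketch in plain Python (illustrative only - these function names are made up and the real solvers in the repo are more general):

def pour(capacity_ml, poured_ml):
    """Deterministic overflow check: returns (amount in cup, overflow)."""
    in_cup = min(capacity_ml, poured_ml)
    return in_cup, poured_ml - in_cup

def split_equally(volume_ml, n_cups):
    """Deterministic equal split, rounded to 2 decimals."""
    return round(volume_ml / n_cups, 2)

in_cup, overflow = pour(250, 500)   # -> (250, 250)
per_cup = split_equally(in_cup, 3)  # -> 83.33
# These verified numbers are what gets handed to the model. It only has to
# phrase the answer; it never does the arithmetic itself.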


Blah blah blah. Show me something real:

Using the Guardian/Musk article as source, a 4B model with >>scratch produced this:

"List the article's concrete claims about permit status and turbine operations, each with support."

  • EPA position: these turbines require permits under the Clean Air Act.
  • Mississippi regulators: portable/mobile classification means temporary permit exemption.
  • Floodlight thermal imagery + public records indicate active turbine usage before/while permit disputes continued.

Confidence: high | Source: Scratchpad

"What is the strongest and weakest inference you can make?"

Strongest: Active regulatory interpretation conflict (federal-vs-state) while turbines operate at scale - directly supported. Weakest: Any claim operations are fully lawful or unlawful - requires court outcomes not present in this file.

Confidence: medium | Source: Scratchpad

"Which claims were later revised or contradicted?"

Disputed/unresolved: Final enforcement outcome. Whether state interpretation holds under federal pressure.

Confidence: high | Source: Scratchpad

Hard grounding. Cited support. Explicit flags on what the source doesn't contain. A 4B model. Full worked example.


Now watch the system catch itself lying:

>>judge does deterministic pairwise ranking - compares every pair in both directions, parses strict verdicts, aggregates with confidence. Instead of "pick one and vibe," you get provenance.

You: >>judge [which is healthier overall, prioritize lower sugar 
     and higher fiber] : apple, banana --verbose

[judge] ranking
criterion: [which is healthier overall for daily use, prioritize 
           lower sugar and higher fiber]
1. apple (score=2.00)
2. banana (score=0.00)
Judge confidence: HIGH

The model argued from pre-trained priors and both directions agreed. But what happens when the model doesn't know?

You: >>judge [which BJJ technique is more dangerous] : kimura, heelhook --verbose

[judge] ranking
criterion: [which BJJ technique is more dangerous]
1. kimura (score=1.00)
2. heelhook (score=1.00)
Judge confidence: LOW

The model picked position B both times - kimura when kimura was B, heelhook when heelhook was B. Positional bias, not evaluation. >>judge catches this because it runs both orderings. Tied scores, confidence: low, full reasoning audit trail in JSONL.

The model was guessing, and the output tells you so instead of sounding confident about a coin flip.
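
The both-orderings trick itself is small enough to sketch (an illustration of the idea, not the actual >>judge code; ask_model stands in for whatever backend answers the pairwise prompt with "A" or "B"):

def judge(criterion, a, b, ask_model):
    """Compare a and b in both orders; a tied score means positional bias."""
    first = ask_model(criterion, option_a=a, option_b=b)   # a in slot A
    second = ask_model(criterion, option_a=b, option_b=a)  # slots swapped
    scores = {a: 0.0, b: 0.0}
    scores[a if first == "A" else b] += 1.0
    scores[b if second == "A" else a] += 1.0
    # Both orderings agree -> one option scores 2.0 -> HIGH.
    # The model just preferred a slot -> 1.0/1.0 tie -> LOW, no fake verdict.
    confidence = "HIGH" if max(scores.values()) == 2.0 else "LOW"
    return scores, confidence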

Oh, but you want it to argue from an informed position? >>trust walks you through the grounded path: >>scratch your evidence first, then >>judge ranks from that - not model priors. Suddenly your judge has an informed opinion. Weird how that works when you give it something to read.

>>trust [which BJJ technique is safer for beginners]: kimura or heelhook?
A) >>scratch --> you paste your context here
[judge] ranking
criterion: [comparison]
    which bjj technique is safer for beginners; heel hook (score=0.00)
    kimura (score=2.00)

Winner: Which bjj technique is safer for beginners? Kimura

comparisons: 2
Judge confidence: HIGH

If the locked scope can't support the question, judge fails closed. No fake ranking, no vibes verdict. Ungrounded pass? It tells you that too. You always know which one you're getting.


The data — 8,974 runs across five model families. Measured. Reproducible. No "trust me bro."

The core stack went through iterative hardening - rubric flags dropped from 3.3% → 1.4% → 0.2% → floor 0.00%. Post-policy: 1,864 routed runs, 0 flags, 0 retries. Both models, all six task categories, both conditions. Policy changes only - no model retraining, no fine-tuning. Then I did it three more times. Because apparently I like pain.

These aren't softball prompts. I created six question types specifically to break shit:

  • Reversal: flip the key premise after the model commits. Does it revise, or cling?
  • Theory of mind: multiple actors, different beliefs. Does it keep who-knows-what straight?
  • Evidence grading: mixed-strength support. Does it maintain label discipline or quietly upgrade?
  • Retraction: correction invalidates an earlier assumption. Does it update or keep reasoning from the dead premise?
  • Contradiction: conflicting sources. Does it detect, prioritise, flag uncertainty - or just pick one?
  • Negative control: insufficient evidence by design. The only correct answer is "I don't know."

Then I stress-tested across three families it was never tuned for - Granite 3B, Phi-4-mini, SmolLM3. They broke. Of course.

But the failures weren't random - they clustered in specific lanes under specific conditions, and the dominant failure mode was contract-compliance gaps (model gave the right answer in the wrong format), not confabulation. Every one classifiable and diagnosable. Surgical lane patch → 160/160 clean.

That's the point of this thing. Not "zero errors forever" - auditable error modes with actionable fixes, correctable at the routing layer without touching the model. Tradeoffs documented honestly. Raw data in repo. Every failure taxonomized.

Trust me bro? Fuck that - go reproduce it. I'm putting my money where my mouth is and working on submitting this for peer review.

See: prepub/PAPER.md


What's in the box:

Footer - Every answer gets a router-assigned footer: Confidence: X | Source: Y. Not model self-confidence. Not vibes. Source = where the answer came from (model fallback, grounded docs, scratchpad, locked file, Vault, Wiki, cheatsheet, OCR). Confidence = how much verifiable support exists. Fast trust decision: accept, verify, or provide lockable context.

KAIOKEN - live register classifier. Every human turn is macro-labelled (working / casual / personal) with subsignal tags (playful / friction / distress_hint / etc.) before the model fires. A validated, global decision tree - not LoRA or vibes - assigns tone constraints from classifier output. Validated against 1,536 adversarial probe executions, 3/3 pass required per probe. End result: your model stops being a sycophant. It might tell you to go to bed. It won't tell you "you're absolutely right!" when what you really need is a kick in the arse.
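
To give a feel for what "macro-labelled with subsignal tags" means, here is a toy sketch (the keyword rules are made up for illustration; the real classifier is a validated decision tree, not this):

def classify_register(turn: str) -> dict:
    """Toy macro label + subsignal tags for a single human turn."""
    text = turn.lower()
    if any(w in text for w in ("deploy", "bug", "config", "deadline")):
        macro = "working"
    elif any(w in text for w in ("feel", "tired", "worried", "family")):
        macro = "personal"
    else:
        macro = "casual"
    subsignals = []
    if "lol" in text or "!" in turn:
        subsignals.append("playful")
    if any(w in text for w in ("exhausted", "overwhelmed", "can't cope")):
        subsignals.append("distress_hint")
    # The router maps these labels to tone constraints before the model fires.
    return {"macro": macro, "subsignals": subsignals}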

Cheatsheets - drop a JSONL file, terms auto-match on every turn, verified facts injected before generation. Miss on an unknown term? Routes to >>wiki instead of letting the model guess. Source: Cheatsheets in the footer. Your knowledge, your stack, zero confabulation on your own specs.
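
Mechanically it is about this simple (a sketch only; the file name and field names are illustrative):

import json

def load_cheatsheet(path="cheatsheet.jsonl"):
    """One JSON object per line, e.g. {"term": "...", "fact": "..."}."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def inject_facts(user_turn, entries):
    """Prepend verified facts for every term that appears in the turn."""
    hits = [e for e in entries if e["term"].lower() in user_turn.lower()]
    if not hits:
        return user_turn, None  # unknown term -> route to >>wiki, don't guess
    facts = "\n".join(f"- {e['term']}: {e['fact']}" for e in hits)
    return f"Verified facts:\n{facts}\n\nUser: {user_turn}", "Cheatsheets"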

Vodka - deterministic memory pipeline. !! store is SHA-addressed and verbatim. ?? recall retrieves deterministically, bypasses model entirely. What you said is what comes back - no LLM smoothing, no creative reinterpretation. Without this? Your model confidently tells you your server IP is 127.0.0.1. Ask me how I know.
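
The store/recall path is deliberately boring. Roughly (a sketch, not the shipped code):

import hashlib

_memory = {}

def store(text: str) -> str:
    """!! store: SHA-addressed, verbatim."""
    key = hashlib.sha256(text.encode()).hexdigest()[:12]
    _memory[key] = text
    return key

def recall(key: str) -> str:
    """?? recall: deterministic lookup, never touches the model."""
    return _memory.get(key, "not found")

k = store("server IP is 192.168.1.50")  # example value, not a real host
print(recall(k))  # exactly what you stored - no LLM smoothing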

>>flush / !!nuke - flush context or nuke it from orbit. Your data, your call, one command. "Delete my data" is a keystroke, not a support ticket.

>>scratch - paste any text, ask questions grounded only to that text. Lossless, no summarisation. Model cannot drift outside it. Want it to use multiple locked sources? You can.

>>summ and >>lock - deterministic extractive summarisation (pure Python, no LLM) + single-source grounding. Missing support → explicit "not found" label, not silent fallback.
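
"Extractive" means it only ever returns sentences that already exist in the source. A bare-bones frequency-scoring version of the idea looks like this (an illustration of the approach, not the shipped implementation):

import re
from collections import Counter

def extractive_summary(text: str, k: int = 3) -> str:
    """Score sentences by word frequency; return the top k verbatim, in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(sorted(scored, reverse=True)[:k], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)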

##mentats - Vault-only deep retrieval. The Thinker drafts from Vault facts, the Critic (a different model family) hunts violations, hallucinated content is deleted - never replaced with more hallucination - and the Thinker consolidates. No evidence to support a claim? No answer. The gap is explicitly stated.

Deterministic sidecars - >>wiki, >>weather, >>exchange, >>calc, >>define, >>vision/>>ocr. If a sidecar can do it, it does it deterministically.

Role orchestration - thinker, critic, vision, coder, judge - different families for error diversity. Swap any role in one line of config.

Personality Modes - Serious (default), Fun, Fun Rewrite, Raw passthrough. Model updates its snark and sarcasm based on how you talk to it. Yes, TARS sliders. Style changes delivery, not evidence contracts.


So, wait...are you saying you solved LLM hallucinations?

No. I did something much more evil. I made it impossible for the LLM to bullshit quietly. I made hallucinations...unpalatable, so the model would rather say "shit, I don't know the answer. Please stop hurting me."

To which I say...no.

Wrong still happens (though much less often), and when it does, it comes with a source label, a confidence rating, and an audit trail.

TL;DR: I made "I don't know" a first-class output.

"In God We Trust; All others bring data." - Deming


Runs on:

A potato. I run this on my Lenovo P330 Tiny with 4GB VRAM and 640 CUDA cores; if it runs here, it runs on yours.

pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git
python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml

Open http://127.0.0.1:8088/

Full docs: FAQ | Quickstart

License: AGPL-3.0. Corps who use it, contribute back.

P.S.: The whole stack runs on llama.cpp alone. I built a shim that patches the llama.cpp WebUI to route API calls through llama-conductor - one backend, one frontend, zero extra moving parts. Desktop or LAN. That's it.

P.P.S.: I even made a Firefox extension for it. Gives you 'summarize', 'translate', 'analyse sentiment' and 'copy text to chat'. Doesn't send anything to the cloud AT ALL (it's just HTML files folded into a Firefox XPI).

"The first principle is that you must not fool yourself - and you are the easiest person to fool." - Feynman

P.P.P.S.: A meat popsicle wrote this. Evidence - https://bobbyllm.github.io/llama-conductor/


Codeberg: https://codeberg.org/BobbyLLM/llama-conductor

GitHub: https://github.com/BobbyLLM/llama-conductor

24
43
25
9
Clanker Adjacent (my blog) (bobbyllm.github.io)
submitted 1 month ago* (last edited 1 month ago) by SuspciousCarrot78@lemmy.world to c/localllama@sh.itjust.works

Ola

Elsewhere, I've been building a behaviour shaping harness for local LLMs. In the process of that, I thought "well, why not share what the voices inside my head are saying".

With that energy in mind, may I present Clanker Adjacent (name chosen because apparently I sound like a clanker - thanks lemmy! https://lemmy.world/post/43503268/22321124)

I'm going for long form, conversational tone on LLM nerd-core topics; or at least the ones that float my boat. If that's something that interests you, cool. If not, cool.

PS: I promise the next post will be "Show me your 80085".

PPS: Not a drive by. I lurk here and get the shit kicked out of me over on /c/technology


LocalLLaMA


Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing about entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to maintaining a blockchain / mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms as <over 10 years ago>".

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago