Another reason to self host your own AI (aussie.zone)

submitted 1 month ago* (last edited 1 month ago) by SuspiciousCarrot78@aussie.zone to c/selfhosted@lemmy.world


Recent post re: AI as utility

https://www.tomsguide.com/ai/people-will-buy-intelligence-from-us-on-a-meter-chatgpts-ceo-sam-altman-has-critics-worried-with-his-ai-vision

Myself, I'm a fan of local LLM / self hosted ML... but if you ever needed a clarion call that a hard pivot is coming (soon) for online / cloud based AI, Altman et al. are making some concerning mouth noises (to say nothing of broader concerns with OAI, Anthropic etc).

Right now, I'm sketching out a plan where my Raspberry Pi (always on, 2-3 W) uses a magic packet to wake up my modest AI server (Lenovo P330 with Tesla P4) if/when needed (Qwen 3.6-35B-A3B); no point in chugging down 80-100 W, 24/7, for no good reason.
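A rough back-of-envelope for the wake-on-demand plan above. This is only a sketch: it assumes the server draws about 90 W while awake (the midpoint of the range quoted), the Pi draws about 2.5 W around the clock, and the server is needed two hours a day. That duty cycle is a made-up number, and the server's draw while asleep is ignored.

```python
# Back-of-envelope daily energy use: always-on server vs. Pi + wake-on-demand.
# All figures are illustrative assumptions, not measurements.

SERVER_WATTS = 90.0      # assumed midpoint of the 80-100 W range
PI_WATTS = 2.5           # assumed midpoint of the 2-3 W range
HOURS_AWAKE_PER_DAY = 2  # hypothetical duty cycle


def daily_kwh(watts: float, hours: float) -> float:
    """Energy in kWh for a constant load running for `hours`."""
    return watts * hours / 1000.0


always_on = daily_kwh(SERVER_WATTS, 24)
on_demand = daily_kwh(PI_WATTS, 24) + daily_kwh(SERVER_WATTS, HOURS_AWAKE_PER_DAY)

print(f"always on: {always_on:.2f} kWh/day")
print(f"on demand: {on_demand:.2f} kWh/day")
print(f"saving:    {100 * (1 - on_demand / always_on):.0f}%")
```

Swap in your own wattages and hours; the saving shrinks quickly as the server spends more of the day awake.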

If the trend continues in the direction it appears to be heading (increasing costs, environmental impacts etc) then I'd feel a lot better hosting my own as first port of call and replacing simpler tasks with more traditional programs. YMMV.


[-] brucethemoose@lemmy.world 57 points 1 month ago* (last edited 1 month ago)

Yeah.

It’s not even about efficiency, really, but independence from corporations, privacy, and principle. Kind of like Lemmy.

[-] irmadlad@lemmy.world 23 points 1 month ago

'People will buy intelligence from us on a meter'

We have governmental surveillance and we have surveillance capitalism. Surveillance capitalism works so well that governments are now very interested in the data it collects, which is alarming. Unfounded conspiracy theory: it's probably one of the reasons that governments don't seem interested in regulating AI. If I had the proper equipment to run AI entirely locally, and efficiently enough to justify the expenditure, I would.

[-] SuspiciousCarrot78@aussie.zone 5 points 1 month ago

You probably could. A Tesla P4 or P40 (old data centre cards) is more than up to the job. My Lenovo Tiny hosts a P4 (card cost $100 on eBay; the Lenovo itself was $200ish) and runs Qwen3.5-35B-A3B at about 20 tok/s. Smaller models are even faster.

https://www.youtube.com/watch?v=8F_5pdcD3HY

If you're not bound by the one liter shoebox design, then the P40 is still a great and inexpensive card.

I think I mentioned elsewhere but right now I'm trying to figure out if I can use a magic packet from the Raspberry Pi to wake up the Lenovo as needed rather than leaving it on all the time.

[-] klangcola@reddthat.com 3 points 1 month ago* (last edited 1 month ago)

If you're already using node-red, the Wake On Lan node works well, and with node-red it's easy to trigger the magic packet based on whatever trigger condition you want.

The only limitation I know of is that WOL doesn't work after a power outage, because the switch and RPi don't know where to find the target machine

Thanks for the tips on reusable enterprise cards btw

[-] WhyJiffie@sh.itjust.works 2 points 1 month ago* (last edited 1 month ago)

The only limitation I know of is that WOL doesn't work after a power outage, because the switch and RPi don't know where to find the target machine

maybe, but the Pi does not need to know that, only the MAC address and the interface. the switch doesn't need to know either, because it's a broadcast frame: it's forwarded out of every port. the problem sometimes is that if you configure WOL from Linux, the network adapter will probably forget on power cycling that it is supposed to react to magic packets. I think not all hardware is susceptible to that, but even then it could help to configure WOL in the BIOS
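For anyone who wants to see what the Pi actually has to send, here is a minimal sketch in Python. The payload layout (six 0xFF bytes followed by the target MAC repeated sixteen times) is the standard magic packet; sending it as a UDP broadcast to port 9 is a common convention rather than a requirement, and the function names and the default broadcast address are my own assumptions. On a Pi with more than one interface you may need to pass the subnet's broadcast address instead.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the target MAC repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast_ip: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast_ip, port))
```

Usage would be something like `send_magic_packet("00:11:22:33:44:55")` with the sleeping machine's real MAC in place of that placeholder. Note that the sender never needs the target's IP address, only its MAC.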

@SuspiciousCarrot78@aussie.zone

[-] klangcola@reddthat.com 1 points 1 month ago

Maybe something else is going on then, but I've never gotten WOL to work after a blackout when there are two switches between sender and receiver. After powering up the receiver once, WOL works again

[-] WhyJiffie@sh.itjust.works 2 points 1 month ago

that's probably the BIOS only loading the configuration on the first boot. you could try enabling fast boot or disabling the right energy saving settings in the BIOS and see if that fixes it.

[-] homik@slrpnk.net 1 points 1 month ago

Switches probably need to figure out which way a particular MAC is (unlike a hub, which just repeats everything everywhere). That's the switching part. If they power off, the tables will be empty.

[-] klangcola@reddthat.com 2 points 1 month ago

Yeah that was my assumption. But I hadn't considered WOL being broadcast, so now I'm not so sure. I would assume it's broadcast on both IP and Ethernet layer. It's time to do some wiresharking :)

[-] homik@slrpnk.net 2 points 1 month ago

I don't think WoL works over IP. In my mind it's lower (LAN, e.g. ethernet) level. But if it used IP, you'd need to get ARP going before it routes. An "offline" network chip could probably manage that, though.

I'm curious to know what you find. Wireshark is always fun and enlightening. :)

[-] SuspiciousCarrot78@aussie.zone 0 points 1 month ago* (last edited 1 month ago)

Good tips - thanks!

PS: sad to report the 24GB Tesla P40s are now around $250 USD on eBay, so not quite as cheap as I remembered. P4s are still cheap, though frankly if you're going that end of town, a 1080 is about on par, less fussy and probably cheaper - it just won't fit in a uSFF.

[-] irmadlad@lemmy.world 2 points 1 month ago

Thing is, if I were going to do in house AI, I'd want to do it up right and from what I can gather, a system like that is going to cost me some jack.

[-] pogmommy@lemmy.ml 10 points 1 month ago

My issue with the orphan-crushing machine isn't only that it's not in my children's bedroom

[-] sobchak@programming.dev 9 points 1 month ago

I think they know it's a somewhat viable option and is part of the reason they're doing the hardware cartel/circlejerk thing.

[-] noxypaws@pawb.social 8 points 1 month ago

not gonna self host bullshit that wastes resources and makes me dumber.

[-] toor@lemmy.world 52 points 1 month ago

Me, looking at my Jellyfin server…

Oh. Ok.

[-] noxypaws@pawb.social 10 points 1 month ago

NO that makes you dumber in a GOOD WAY THO.

[-] Auli@lemmy.ca 5 points 1 month ago

Sure, but all these self hosted AIs are still made by companies who used massive amounts of power and water to train them.

[-] KatherinaReichelt@feddit.org 18 points 1 month ago

Which is an interesting dilemma: Those AIs are already trained. That power and water was used. If you use them, you will not pollute anything. But you may encourage those companies to train another AI

[-] brucethemoose@lemmy.world 17 points 1 month ago* (last edited 1 month ago)

No.

Even the biggest open weights models are trained on pennies compared to OpenAI and Claude. They just don’t have the hardware to be so wasteful.

In fact, the Nvidia GPU ban was the best thing to ever happen to “small” AI devs. It made them thrifty.

[-] superglue@lemmy.dbzer0.com 5 points 1 month ago

Does anyone have a recommendation for a local model that can run well on a 5070 12GB? It pretty much would only get used for help with homelabbing and simple scripts.

[-] monoboy@lemmy.zip 7 points 1 month ago

Qwen 3.6-35B-A3B (which OP mentioned) would work great as long as you have some system RAM to offload it.

[-] SuspiciousCarrot78@aussie.zone 6 points 1 month ago

There's an argument to be had regarding a MoE versus a small dense model. I guess it depends on what exactly you need doing with it. I would be tempted to run a smaller dense model (like a Qwen 3-14B or a Qwen 3.5 9B) as at a reasonable quant, it might fit mostly or entirely on the GPU, thereby giving you excellent speeds.

PS: I'm actually in the process of designing an expert system (not a LLM) for pretty much the task you described. The intention is that you would still interact with it like a large language model, but the actual brains underneath it would be something more traditional.

[-] brucethemoose@lemmy.world 1 points 1 month ago* (last edited 1 month ago)

MoEs can be very fast with hybrid inference. I run Xiaomi Mimo 2.5 (a 310B model, 116GB weights) on my single 3090 + 7800 CPU, and it outputs faster than I can read it.

It's also easier to fit long context, if you need that.

It's best to use the ik_llama.cpp fork for that, though. It gives a huge boost to hybrid MoE speeds.

[-] brucethemoose@lemmy.world 4 points 1 month ago* (last edited 1 month ago)

Depends on how much CPU RAM you have, and how fast it is.

As others said, Qwen 35B at the very least. But you can get better models with more CPU RAM.

[-] superglue@lemmy.dbzer0.com 1 points 1 month ago

I've got 32GB of DDR5-6000

[-] brucethemoose@lemmy.world 2 points 1 month ago* (last edited 1 month ago)

Probably Qwen 35B then. ~9GB free VRAM + (let's say) ~16GB of free CPU RAM is a good size for that, and squeezing bigger models in would be hard unless it's a headless linux server.
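The sizing logic here can be sketched as simple arithmetic. The numbers below are assumptions for illustration: roughly 4.5 bits per weight stands in for a typical 4-bit-class quant, and the flat 2 GB overhead is a placeholder for context cache and runtime buffers, which in practice depend on context length and backend.

```python
# Rough fit check for a quantized model split across GPU and system RAM.
# Bits-per-weight and overhead values are assumptions for illustration.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, ram_gb, overhead_gb=2.0):
    """True if weights plus a flat overhead allowance fit in VRAM + RAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb + ram_gb


size = weights_gb(35, 4.5)  # hypothetical 4-bit-class quant of a 35B model
print(f"35B at 4.5 bpw: {size:.1f} GB of weights")
print("fits in 9 GB VRAM + 16 GB RAM:", fits(35, 4.5, vram_gb=9, ram_gb=16))
```

Under those assumptions a 35B model lands just under 20 GB, which is why ~25 GB of combined free memory is comfortable and anything much larger gets tight.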

[-] commander@lemmy.world 5 points 1 month ago

Altman can try to hype up how everyone's going to subscribe to them someday, all the while their subscriber base is being eaten up by competitors.

https://www.wheresyoured.at/openai-projects-chatgpt-plus-subscriptions-to-drop-by-80-from-44-million-in-2025-to-9-million-in-2026-made-up-using-cheaper-subscriptions-somehow/

Local stuff. I still believe the small-parameter (~1B), free local ones will suffice for the vast majority of how people use LLMs, and there's still going to be a few years of improvements there until investments dry up. Eventually I bet more and more phone companies will include one of these small ones out of the box. Pretty much like a nice search engine that works offline, like if you're out on a major hike. Cloud stuff: there'll be stuff like Proton's Lumo, where they're taking free open weight stuff and piecing it together for users.

OpenAI's thing is they'll make up for falling subscribers with advertising. So pretty much we're fast-forwarding through the search engine race of the 90s/early aughts. We'll at least have Gemini. ChatGPT maybe ends up crashing in value someday and gets bought up by Microsoft or some other company. Deepseek, Qwen, Kimi. Claude, like ChatGPT, maybe survives, or crashes and gets absorbed by another company. Proton continues to exist as the company making AI products out of free stuff. Eventually the pace of improvements slows to a crawl and it's pointless to be paying for the best paywalled stuff. Just use the free stuff, like how everyone mostly uses free search engines.

[-] SuspiciousCarrot78@aussie.zone 4 points 1 month ago* (last edited 1 month ago)

Agree. And re small models - very agree. In fact I made an ablated version of Qwen 3.5-2B for use with my Pi, before thinking a bit harder and realising I can probably code something bespoke that doesn't need a stochastic parrot as a squawk box at all.

https://huggingface.co/BobbyLLM/polaris-heretic-Q4_K_M-GGUF

Still, as a SLM, it's perfectly cromulent and does well with tool calling etc which is what I wanted it for.

[-] somegeek@programming.dev 3 points 1 month ago* (last edited 1 month ago)

I started working toward self hosting an LLM for my small company, using ollama and opencode as the agent. But I realized a good model like GLM 5 requires 250GB of RAM and 24GB VRAM with a 4080?? I don't know, this is what the LLM told me itself.

I ended up using qwen-code2.7-7b-16k.

Currently the best thing I have is my laptop, 16GB ram, i7 9750H gtx1650

How do you guys selfhost? What models do you use that are actually good?

[-] SuspiciousCarrot78@aussie.zone 3 points 1 month ago* (last edited 1 month ago)

I mean...that entirely depends on your use case - and I hate saying that. For me and what I do, Qwen SLM (esp Qwen3-4B 2507 instruct and Qwen3.5-2B) are exceptional. But I'm not trying to do Claude at home.

Best bet? Spend $10 on OpenRouter and try different models. In a head to head with ChatGPT 5.4 mini (excellent for coding BTW), I've found Qwen 3.5 27B more than able to hold its own for coding tasks... IF you narrowly gate it/confine it. The last batch of Qwens really are something. Dunno about the 3.7 series.

Having said ALL that, I'm really tempted to go back in time and code myself a deterministic expert system, with user updatable knowledge cascade, tool calling and a minimal amount of Markov chain word garnish for flavour. I think we used to just call that "a program" lol.

Really tempted actually, because if 50% of the LLM use case is basically Super Google but not shit... well, I can make that myself. I just need to point my autism at it.

PS: this might help

https://www.youtube.com/watch?v=0AqpaFm11oI

[-] somegeek@programming.dev 1 points 1 month ago

Qwen 3.5 24B is way too large for my specs. I'm barely running qwen2.5 7B

[-] SuspiciousCarrot78@aussie.zone 3 points 1 month ago* (last edited 1 month ago)

Hmm....it runs on a 1060...it's a MoE not a dense. 24B is even lighter. Worth a shot.

https://www.youtube.com/watch?v=8F_5pdcD3HY

Else, if you're looking for a coding model (??) something like Sara or Fara might suit

https://huggingface.co/microsoft/Fara-7B

[-] somegeek@programming.dev 1 points 1 month ago

Thanks. I will look into it.

[-] sturmblast@lemmy.world 3 points 1 month ago

P100s are dirt cheap on ebay fyi

[-] brucethemoose@lemmy.world 3 points 1 month ago* (last edited 1 month ago)

In practice, they’re not very good because of broken FP16, broken kernels, high idle usage and a bunch of other things.

Same with the AMD MI50 and MI100. Looks great on paper, not practical IRL, unless you want to pay a whole team of software devs to fix them for you.

Better to just save up for a 2080 TI or 3090, sadly.

[-] sturmblast@lemmy.world 1 points 1 month ago

Not having issues

[-] SuspiciousCarrot78@aussie.zone 2 points 1 month ago* (last edited 1 month ago)

Huh - cheaper than the P40s (though less VRAM) but higher bandwidth due to HBM2. Good looking out

[-] sturmblast@lemmy.world 1 points 1 month ago

They rip

[-] surewhynotlem@lemmy.world 1 points 1 month ago

I was looking at that. Does it end up faster than something like a 1080?

[-] SuspiciousCarrot78@aussie.zone 2 points 1 month ago* (last edited 1 month ago)

On paper, a bit over 2x on memory bandwidth. The 16GB P100 is rated at about 732 GB/s; the 1080 is about 320 GB/s. HBM2 simply has more bandwidth. The 1080 was a gaming card; the P100 is a server / number cruncher.

[-] Decronym@lemmy.decronym.xyz 3 points 1 month ago* (last edited 1 month ago)

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:

Fewer Letters	More Letters
ARP	Address Resolution Protocol, translates IPs to MAC addresses
IP	Internet Protocol
RPi	Raspberry Pi brand of SBC
SBC	Single-Board Computer

[Thread #321 for this comm, first seen 30th May 2026, 09:50]

[-] heartSagan5@lemmy.zip 1 points 1 month ago

And are you sure you’re self-hosting or is it a plugin (that you’re self-hosting)? Also, I don’t invite SkyNet into my perimeter.

[-] GreenKnight23@lemmy.world 0 points 1 month ago

if you're selfhosting AI, make sure you at least firewall it off from the internet. many providers still send metrics back home that include usage and content.

[-] SuspiciousCarrot78@aussie.zone 4 points 1 month ago* (last edited 1 month ago)

Respectfully, that's not really how local LLMs work.

A GGUF model sitting on my hard drive has no ability to "send content back home" any more than a PDF or a JPEG does. If you're running something like llama.cpp or Ollama entirely locally, the model weights are just data files.

The real privacy concerns are cloud APIs, telemetry in front-ends, browser extensions, analytics, update services, or accidentally exposing a service to the public internet.

"Self-hosted AI" isn't one thing. There's a huge difference between:

Running ChatGPT through an API
Running a commercial AI appliance
Running a local Qwen/Mistral/Llama model on your own hardware

Firewalling internet-facing services is good advice. Assuming every local model is secretly uploading prompts is not.
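To make the "just data files" point concrete, here is a small sketch that reads the fixed header at the start of a GGUF file. It assumes the v2+ layout as I understand it from the published format (4-byte magic, then a little-endian uint32 version and two uint64 counts); the function name and the returned dict keys are my own. It says nothing about the runtime or front-end that loads the file, which is where any telemetry would live.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size header at the start of a GGUF file (v2+ layout assumed)."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_entries": kv_count}
```

Pointing it at a downloaded model, e.g. `read_gguf_header("model.gguf")` (hypothetical filename), just returns three integers: a version, a tensor count, and a metadata entry count. Everything after that header is key/value metadata and tensor data.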

EDIT: for the record, I didn't down vote you - that was someone else.

[-] Hiro8811@lemmy.world -2 points 1 month ago* (last edited 1 month ago)

You're still paying for electricity and a big part of the world is in an electricity crisis. "AI" has few real uses and LLMs are not one of them.

[-] brucethemoose@lemmy.world 22 points 1 month ago* (last edited 1 month ago)

This is a “feel guilty about missing recycling” kind of complaint.

Having a server run for an hour or two (?) a day is negligible. You use more energy running a fridge, or leaving a few lights on, or browsing Lemmy for a while. Or running a docker container for other services. You release more greenhouse gases eating beef, or driving anywhere, or even opening your front door a few times, and individual industries are going to use vastly more electricity than a few self hosters ever would. If you own an EV, charging it probably uses more than every self hoster in your zip code combined.

…But if it still bothers you, you can find an ewaste smartphone(s) and host on that. This is actually a very neat use case IMO.

However, if you get to the homelab scale of “an EPYC + 3090s running all the time” that electricity use does start to add up. But that’s quite a rare hobbyist tier, I’d say, and it really shouldn’t be running 24/7.

this post was submitted on 28 May 2026

97 points (76.8% liked)
