1
9
submitted 11 hours ago by [email protected] to c/[email protected]

What’s new in Le Chat.

Deep Research mode: Lightning fast, structured research reports on even the most complex topics.

Voice mode: Talk to Le Chat instead of typing with our new Voxtral model.

Natively multilingual reasoning: Tap into thoughtful answers, powered by our reasoning model — Magistral.

Projects: Organize your conversations into context-rich folders.

Advanced image editing directly in Le Chat, in partnership with Black Forest Labs.

2
19
submitted 1 day ago by [email protected] to c/[email protected]
3
17
submitted 1 day ago by [email protected] to c/[email protected]
4
25
submitted 2 days ago by [email protected] to c/[email protected]
5
12
submitted 3 days ago by [email protected] to c/[email protected]
6
6
submitted 3 days ago by [email protected] to c/[email protected]

cross-posted from: https://ani.social/post/16779655

| GPU | VRAM | Price (€) | Bandwidth (TB/s) | TFLOP16 | €/GB | €/TB/s | €/TFLOP16 |
|---|---|---|---|---|---|---|---|
| NVIDIA H200 NVL | 141GB | 36284 | 4.89 | 1671 | 257 | 7423 | 21 |
| NVIDIA RTX PRO 6000 Blackwell | 96GB | 8450 | 1.79 | 126.0 | 88 | 4720 | 67 |
| NVIDIA RTX 5090 | 32GB | 2299 | 1.79 | 104.8 | 71 | 1284 | 22 |
| AMD RADEON 9070XT | 16GB | 665 | 0.6446 | 97.32 | 41 | 1031 | 7 |
| AMD RADEON 9070 | 16GB | 619 | 0.6446 | 72.25 | 38 | 960 | 8.5 |
| AMD RADEON 9060XT | 16GB | 382 | 0.3223 | 51.28 | 23 | 1186 | 7.45 |

This post is part "hear me out" and part asking for advice.

Looking at the table above, AI GPUs are a pure scam, and it would make much more sense (at least on paper) to use gaming GPUs instead, either through a Frankenstein setup of PCIe switches or a high-bandwidth network.

So my question is whether somebody has built a similar setup, what their experience has been, what the expected overhead/performance hit is, and whether it can be made up for by simply having far more raw performance for the same price.

7
31
submitted 6 days ago by [email protected] to c/[email protected]

In brief

  • In late summer 2025, a publicly developed large language model (LLM) will be released — co-created by researchers at EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS).

  • This LLM will be fully open: This openness is designed to support broad adoption and foster innovation across science, society, and industry.

  • A defining feature of the model is its multilingual fluency in over 1,000 languages.

8
2
submitted 4 days ago by [email protected] to c/[email protected]
9
30
submitted 1 week ago* (last edited 1 week ago) by [email protected] to c/[email protected]

Recently I've been experimenting with Claude and feeling the burn on the premium API usage. I wanted to know how much cheaper my local llm was in terms of cost-per-token output.

Claude Sonnet is a good reference at $15 per 1 million output tokens, so I wanted to know, comparatively, how many tokens $15 worth of electricity powering my rig would generate.

(These calculations cover simple raw token generation only, by the way. In the real world there's the cost of the initial hardware, ongoing maintenance as parts fail, and the human time to set everything up, which is much harder to factor into the equation.)

So how does one even calculate such a thing? Well, you need to know

  1. how many watts your inference rig consumes at load
  2. how many tokens on average it can generate per second while inferencing (with context relatively filled up, we want conservative estimates)
  3. the electricity rate you pay on your utility bill, in cost per kilowatt-hour

Once you have those constants, you can extrapolate how many kilowatt-hours of runtime $15 in electricity buys, then figure out the total number of tokens you would expect to generate over that time given the TPS.
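
If it helps, here's a rough sketch of that math in Python; the wattage, TPS, and electricity rate below are placeholder assumptions, not measurements from my rig:

```python
# Placeholder constants, swap in your own measurements.
WATTS_AT_LOAD = 350        # wall draw while inferencing (assumption)
TOKENS_PER_SECOND = 20.0   # average TPS with context mostly filled (assumption)
PRICE_PER_KWH = 0.15       # utility rate in $/kWh (assumption)
BUDGET = 15.00             # the $15 reference budget

kwh_bought = BUDGET / PRICE_PER_KWH                      # kWh that $15 buys
hours_of_runtime = kwh_bought / (WATTS_AT_LOAD / 1000)   # kWh / kW = hours
total_tokens = hours_of_runtime * 3600 * TOKENS_PER_SECOND

print(f"{kwh_bought:.1f} kWh -> {hours_of_runtime:.1f} h of runtime "
      f"-> {total_tokens / 1e6:.1f}M tokens")
```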

The numbers shown in the screenshot are for a model fully loaded into VRAM on the ol' GTX 1070 Ti 8GB. But even with partially offloaded numbers for 22-32B models at 1-3 TPS, it's still a better deal overall.

I plan to offer the calculator as a tool on my site and release it as open source under the GPL if anyone is interested.

10
9
submitted 1 week ago* (last edited 1 week ago) by [email protected] to c/[email protected]

I have an unused Dell OptiPlex 7010 I wanted to use as a base for an inference rig.

My idea was to get a 3060, a PCIe riser, and a 500W power supply just for the GPU. Mechanically speaking, I had the idea of making a backpack of sorts on the side panel to fit both the GPU and the extra power supply, since unfortunately it's an SFF machine.

What's making me wary of going through with it is the specs of the 7010 itself: it's a DDR3 system with a 3rd-gen i7-3770. I have the feeling that as soon as it ends up offloading some of the model into system RAM it's going to slow down to a crawl. (Using koboldcpp, if that matters.)

Do you think it's even worth going through with?

Edit: I may have found a ThinkCentre that uses DDR4 and that I can buy if I manage to sell the 7010, though I still don't know if it will be good enough.

11
15
submitted 1 week ago by [email protected] to c/[email protected]

Tencent recently released a new MoE model with ~80b parameters, 13b of which are active at inference. Seems very promising for people with access to 64 gigs of VRAM.

12
31
submitted 3 weeks ago by [email protected] to c/[email protected]
13
28
Homelab upgrade WIP (lemmy.world)
submitted 3 weeks ago* (last edited 3 weeks ago) by [email protected] to c/[email protected]

There's a lot more to this stuff than I thought there would be when starting out. I spent the day familiarizing myself with how to take apart my PC and swap GPUs, trying to piece everything together.

Apparently, in order for the PC to start up properly it needs a graphics output. I thought the existence of an HDMI port on the motherboard implied the existence of onboard graphics, but apparently only CPUs with integrated graphics have that capability, and my Ryzen 5 2600 doesn't. The Tesla P100 has no display outputs either. So I've hit a snag where the PC won't start up because it can't find a graphics device to output to.

I'm going to try to run multiple GPU cards together over PCIe. I hope I can mix an AMD RX 580 and an NVIDIA Tesla on the same board. Fingers crossed, please work.

My motherboard thankfully supports 4x4x4x4 PCIe x16 bifurcation, which is a very lucky break I didn't know about going into this 🙏

Strangely, other configs for splitting the x16 lanes, like x8/x8 or x8/x4/x4, aren't in my BIOS for some reason. So I'm planning to get a 4-way bifurcation board, plug both cards in, and hope that the AMD one is recognized!

According to one source, the performance loss from running GPUs at x4 lanes for the kind of compute I'm doing is 10-15%, which is surprisingly tolerable actually.

I never really had to think about how pcie lanes work or how to allocate them properly before.

For now I'm using two power supplies: the one built into the desktop and the new Corsair 850e PSU. I chose this one as it should handle 2-3 GPUs while staying in my price range.

Also, the new 12V-2x6 port supports something like 600W, enough for the Tesla, and comes with a dual PCIe split, which the Tesla's power cable adapter required. So it all worked out nicely for a clean wiring solution.

Sadly I fucked up a little. The plastic PCIe release latch on the motherboard was brittle, and I fat-thumbed it too hard while having trouble removing the GPU initially, so it snapped off. I don't know if that's something fixable. Fortunately it doesn't seem to affect the security of the connection too badly. I intend to get a PCIe riser extension cable so there won't be much force on the now slightly loosened PCIe connection. I'll have the GPU and bifurcation board laid out nicely on the homelab table while testing, and get them mounted somewhere properly once I get it all working.

I need to figure out an external GPU mounting system. I see people use server racks or nut-and-bolt metal chassis. Maybe I could get a thin plate of copper the size of the desktop's glass window as a base/heatsink?

14
24
submitted 3 weeks ago by [email protected] to c/[email protected]

Hey fellow llama enthusiasts! Great to see that not all of Lemmy is AI-sceptical.

I'm in the process of upgrading my server with a bunch of GPUs. I'm really excited about the new Mistral / Magistral Small 3.2 models and would love to serve them for myself and a couple of friends. My research led me to vLLM, with which I was able to double inference speed compared to Ollama, at least for qwen3-32b-awq.

Now sadly, the most common quantization formats (GGUF, EXL, BNB) are either not fully supported in vLLM (GGUF), not supported at all (EXL), or don't support multi-GPU inference through tensor parallelism (BNB). And especially for new models, it's hard to find pre-quantized models in the other, more broadly supported formats (AWQ, GPTQ).

Do any of you face a similar problem? Do you quantize models yourself? Are there any up-to-date guides you would recommend? Or did I completely overlook another, obvious solution?
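
In case it helps frame the question: from what I've gathered, self-quantizing to AWQ looks roughly like the sketch below (this assumes the AutoAWQ library and uses placeholder paths; I haven't verified it end to end):

```python
# Sketch only: assumes the AutoAWQ library (pip install autoawq); paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Magistral-Small-2506"   # placeholder model id
quant_path = "./magistral-small-awq"            # output directory

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit AWQ with the common group size; uses the library's default calibration data
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"},
)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Afterwards, serving with tensor parallelism would be something like:
#   vllm serve ./magistral-small-awq --quantization awq --tensor-parallel-size 2
```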

It feels like whatever I researched yesterday is already outdated again today, since the landscape is evolving so rapidly.

Anyways, thank you for reading and sharing your thoughts or experience if you feel like it.

15
17
submitted 3 weeks ago* (last edited 3 weeks ago) by [email protected] to c/[email protected]

I've recently been writing fiction and using an AI as a critic/editor to help me tighten things up (as I'm not a particularly skilled prose writer myself). Currently the two approaches I've tried are writing text in a basic editor and then either saving files to upload to a hosted LLM or copy-pasting into a local one, or using PyCharm and AI integration plugins for it.

Neither is particularly satisfactory, and I'm wondering if anyone knows of a good setup for this (preferably open source, but not necessary); integration with at least one of Ollama or OpenRouter would be needed.
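
(For reference, the sort of minimal glue I'm hoping to avoid writing myself: a sketch against Ollama's chat endpoint, with placeholder model name, file path, and prompt.)

```python
# Rough sketch: send a chapter to a local Ollama model for critique.
# Assumes Ollama on its default port; model name and file path are placeholders.
import json
import pathlib
import urllib.request

chapter = pathlib.Path("chapter_01.txt").read_text(encoding="utf-8")
payload = {
    "model": "mistral-small",
    "stream": False,
    "messages": [
        {"role": "system", "content": "You are a blunt fiction editor. Point out weak prose, pacing problems, and redundancies."},
        {"role": "user", "content": chapter},
    ],
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])
```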

Edit: Thanks for the recommendations everyone, lots of things for me to check out when I get the time!

16
26
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]

I'm looking to locally generate voiceovers from text and also try to generate audiobooks. Does anyone have experience with sherpa-onnx? There also appear to be two separate frontends for Kokoro specifically dedicated to audiobook creation, but both appear to be abandoned. Or am I barking up the completely wrong tree?
Thanks!

17
34
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]

It seems Mistral finally released their own version of Small 3.1 2503 with a CoT reasoning pattern baked in. Before this, the best CoT finetune of Small was DeepHermes, using DeepSeek's R1 distill patterns. According to the technical report, Mistral trained in their own reasoning patterns for this one, so it's not just another DeepSeek distill finetune.

HuggingFace

Blog

Magistral technical report

18
13
submitted 1 month ago by [email protected] to c/[email protected]

I'm limited to 24GB of VRAM, and I need pretty large context for my use-case (20k+). I tried "Qwen3-14B-GGUF:Q6_K_XL," but it doesn't seem to like calling tools more than a couple times, no matter how I prompt it.

Tried using "SuperThoughts-CoT-14B-16k-o1-QwQ-i1-GGUF:Q6_K" and "DeepSeek-R1-Distill-Qwen-14B-GGUF:Q6_K_L," but Ollama or LangGraph gives me an error saying these don't support tool calling.

19
12
submitted 1 month ago by [email protected] to c/[email protected]
  • It seems like it'll be the best local model that can be run fast if you have a lot of RAM and medium VRAM.
  • It uses a shared expert (like deepseek and llama4) so it'll be even faster on partial offloaded setups.
  • There are a ton of options for fine-tuning or training from one of their many partially trained checkpoints.
  • I'm hoping for a good reasoning finetune. Hoping Nous does it.
  • It has a unique voice because it has very little synthetic data in it.

llama.cpp support is in the works, and hopefully won't take too long since its architecture is reused from other models llama.cpp already supports.
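
Once that lands, a partially offloaded launch should look like any other GGUF model; a sketch (assumes llama-server is on your PATH, and the quant filename and layer count are placeholders):

```python
# Sketch: launching llama-server with partial GPU offload once support lands.
# The GGUF filename and layer count are placeholders.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "hunyuan-a13b-instruct-q4_k_m.gguf",  # hypothetical quant file
    "-ngl", "24",      # offload as many layers as fit in VRAM; the rest stays in RAM
    "-c", "16384",     # context size
    "--port", "8080",
])
```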

Are y'all as excited as I am? Also is there any other upcoming release that you're excited for?

20
15
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]

I just set up a new dedicated AI server that is quite fast by my standards. I have it running with OpenWebUI and would like to integrate it with other services. I think it would be cool to have something like Copilot, where I can be writing code in a text editor and have it add a readme function or something like that. I have also used some RAG stuff and like it, but I think it would be cool to have a RAG setup that can access live data, like having the most up-to-date Docker Compose files and nginx configs for when I ask it about server stuff. So, what are you integrating your AI stuff with, and how can I get started?
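
The simplest version of the "live data" idea I can picture is re-reading the file at request time and stuffing it into the prompt through the OpenAI-compatible endpoint most local stacks expose; a sketch (the base URL, model name, and file path are placeholders for whatever your server uses):

```python
# Poor man's "live data" RAG: reload the config on every request and prepend it.
# Base URL, model name, and file path are placeholders.
import pathlib
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
compose = pathlib.Path("docker-compose.yml").read_text(encoding="utf-8")

resp = client.chat.completions.create(
    model="qwen3:14b",
    messages=[
        {"role": "system", "content": "Answer using this compose file:\n\n" + compose},
        {"role": "user", "content": "Which services expose ports on the host?"},
    ],
)
print(resp.choices[0].message.content)
```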

21
27
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]

Hello. Our community, c/localllama, has always been and continues to be a safe haven for those who wish to learn about the creation and local usage of 'artificial intelligence' machine learning models to enrich their daily lives and provide a fun hobby to dabble in. We come together to apply this new computational technology in ways that protect our privacy and build upon a collective effort to better understand how this can help humanity as an open source technology stack.

Unfortunately, we have recently been receiving an uptick in negative interactions from those outside our community. This is largely due to the current political tensions caused by our association with the popular and powerful tech companies who pioneered modern machine learning models for business and profit, as well as unsavory techbro individuals who care more about money than ethics. These users of models continue to create animosity toward the entire field of machine learning and everyone associated with it, through their illegal harvesting of private data to train base models and very real threats to disrupt the economy by destroying jobs through automation.

There are legitimate criticisms to be had: the cost of creating models, how the art they produce is devoid of the soulful touch of human creativity, and how corporations are attempting to disrupt lives for profit instead of enriching them.

I did not want to be heavy handed with censorship/mod actions prior to this post because I believe that echo chambers are bad and genuine understanding requires discussion between multiple conflicting perspectives.

However, a lot of the negative comments we've received lately aren't made in good faith, with valid criticisms of the corporations or technologies grounded in an intimate understanding of them. No, instead it's base-level mudslinging by people with emotionally charged vendettas making nasty comments of no substance. Common examples are comparing models to NFTs, name-calling our community members as blind zealots for thinking models could ever be used to help people, and spreading misinformation with cherry-picked, unreliable sources to manipulatively exaggerate environmental impact and resource consumption.

While I am against echo chambers, I am also against our community being harassed and dragged down by bad actors who just don't understand what we do or how this works. You guys shouldn't have to be subjected to the same brain rot antagonism with every post made here.

So I'm updating the guidelines by adding some rules I intend to enforce. I'm still debating whether or not to retroactively remove infringing comments from previous posts, but rest assured any new posts and comments will be moderated based on the following guidelines.

RULES:

Rule: No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Reason: More or less self explanatory, personal character attacks and childish mudslinging against community members are toxic.

Rule: No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Reason: This is a piss-poor whataboutism argument. It claims something that is blatantly untrue while attempting to discredit the entire field by stapling the animosity everyone has toward crypto/NFTs onto ML. Models already do more than cryptocurrency ever has. Models can generate text, pictures, and audio. Models can view/read/hear text, pictures, and audio. Models may simulate aspects of cognitive thought patterns to attempt to speculate or reason through a given problem. Once trained, they can be copied and locally hosted for many thousands of years, which factors into the initial-energy-cost versus power-consumed-over-time equation.

Rule: No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Reason: There are grains of truth to the reductionist statement that LLMs rely on mathematical statistics and probability for their outputs. The same can be said for humans, the statistical patterns in our own language, and the way our neurons come together to predict the next word in the sentence we type out. It's the intricate complexity of the process and the way information is processed that makes all the difference. ML models involve an entire college course's worth of advanced mathematics and STEM concepts: hyperdimensional matrices that plot the relationships between pieces of information, and intricate hidden layers made of perceptrons connecting billions of parameters into vast abstraction mappings. There were also major innovations and discoveries made in the 2000s which made modern model training possible and which we didn't have in the early days of computing. All of that is a little more complicated than what your phone's autocorrect does, and the people who make the lazy reductionist comparison just don't care about the nuances.

Rule: No implying that models are devoid of purpose or potential for enriching people's lives.

Reason: Models are tools with great potential for helping people, from the creation of accessibility software for the disabled to enabling doctors to better heal the sick through advanced medical diagnostic techniques. The perceived harm models are capable of causing, such as job displacement, is rooted in our flawed late-stage capitalist society's pressure for increased profit margins at the expense of everyone and everything.

If you have any proposals for rule additions or wording changes I will hear you out in the comments. Thank you for choosing to browse and contribute to this space.

22
32
submitted 1 month ago by [email protected] to c/[email protected]

It looks like AI has followed crypto chip-wise in going CPU > GPU > ASIC.

GPUs, while dominant in training large models, are often too power-hungry and costly for efficient inference at scale. This is opening new opportunities for specialized inference hardware, a market where startups like Untether AI were early pioneers.

In April, then-CEO Chris Walker had highlighted rising demand for Untether’s chips as enterprises sought alternatives to high-power GPUs. “There’s a strong appetite for processors that don’t consume as much energy as Nvidia’s energy-hungry GPUs that are pushing racks to 120 kilowatts,” Walker told CRN. Walker left Untether AI in May.

Hopefully the training part of AI moves to ASICs to reduce costs and energy use, while GPUs continue to improve at inference and increase VRAM sizes to the point that AI requires nothing special to run locally.

23
9
submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]

Sorry team, flipped the URLs around to prevent overflow from lemmy.world users.

https://fly.io/blog/youre-all-nuts/

24
17
submitted 1 month ago by [email protected] to c/[email protected]

This seems like it's a less-than-positive development for running AI on consumer-grade hardware.

25
24
submitted 1 month ago by [email protected] to c/[email protected]

Hey everybody. I'm just getting into LLMs. Total noob. I started using llama-server's web interface, but I'm experimenting with a frontend called SillyTavern. It looks much more powerful, but there's still a lot I don't understand about it, and some design choices I found confusing.

I'm trying the Harbinger-24B model to act as a D&D-style DM, and to run one party character while I control another. I tried several general purpose models too, but I felt the Harbinger purpose-built adventure model was noticeably superior for this.

I'll write a little about my experience with it, and then some thoughts about LLMs and D&D. (Or D&D-ish. I'm not fussy about the exact thing, I just want that flavour of experience).

General Experience

I've run two scenarios. My first try was a 4/10 for my personal satisfaction, and the 2nd was 8/10. I made no changes to the prompts or anything between, so that's all due to the story the model settled into. I'm trying not to give the model any story details, so it makes everything up, and I won't know about it in advance. The first story the model invented was so-so. The second was surprisingly fun. It had historical intrigue, a tie-in to a dark family secret from ancestors of the AI-controlled char, and the dungeon-diving mattered to the overarching story. Solid marks.

My suggestion for others trying this is, if you don't get a story you like out of the model, try a few more times. You might land something much better.

The Good

Harbinger provided a nice mixture of combat and non-combat. I enjoy combat, but I also like solving mysteries and advancing the plot by talking to NPCs or finding a book in the town library, as long as it feels meaningful.

It writes fairly nice descriptions of areas you encounter, and thoughts for the AI-run character.

It seems to know D&D spells and abilities. It lets you use them in creative but very reasonable ways you could do in a pen and paper game, but can't do in a standard CRPG engine. It might let you get away with too much, so you have to keep yourself honest.

The Bad

You may have to try multiple times until the RNG gives you a nice story. You could also inject a story into the base prompt, but I want the LLM to act as a DM for me, where I'm going in completely blind. Also, in my first 4/10 game, the LLM forced really bad "main character syndrome" on me. The whole thing was about me, me, me, I'm special! I found that off-putting, but the second 8/10 attempt wasn't like that at all.

As an LLM, it's loosey-goosey about things like inventory, spells, rules, and character progression.

I had a difficult time giving the model OOC instructions. OOC tended to be "heard" by other characters.

Thoughts about fantasy-adventure RP and LLMs

I feel like the LLM is very good at providing descriptions, situations, and locations. It's also very good at understanding how you're trying to be creative with abilities and items, and it lets you solve problems in creative ways. It's more satisfying than a normal CRPG engine in this way.

As an LLM though, it lets you steer things in ways you shouldn't be able to in an RPG with fixed rules: it won't reliably disallow a spell you don't know, or remember how many feet of rope you're carrying. I enjoy the character leveling and crunchy-stats part of pen-and-paper games or CRPGs, and I haven't found a good way to get the LLM to do that without just handling everything manually and whacking it into the context.

That leads me to think that using an LLM for creativity inside a non-LLM framework to enforce rules, stats, spells, inventory, and abilities might be phenomenal. Maybe AI-dungeon does that? Never tried, and anyway I want local. A hybrid system like that might be scriptable somehow, but I'm too much of a noob to know.
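
To make that concrete, the hybrid could be as simple as keeping the crunchy state in ordinary code and only feeding the LLM the adjudicated result; a toy sketch (everything here is hypothetical scaffolding, not an existing tool):

```python
# Toy sketch of the hybrid idea: rules and state live outside the LLM,
# and only the adjudicated result gets appended to the model's context.
import random
from dataclasses import dataclass, field

@dataclass
class Character:
    name: str
    known_spells: set = field(default_factory=set)
    inventory: dict = field(default_factory=dict)

    def cast(self, spell: str) -> str:
        if spell not in self.known_spells:
            return f"[RULES] {self.name} does not know {spell}; narrate a fizzle."
        roll = random.randint(1, 20)
        return f"[RULES] {self.name} casts {spell} (d20 roll: {roll})."

pc = Character("Mira", known_spells={"Mage Hand"}, inventory={"rope (ft)": 50})
# This string would be appended to the LLM context so the narration must respect it.
print(pc.cast("Fireball"))
```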


LocalLLaMA

3372 readers
24 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
MODERATORS