submitted 2 weeks ago by anzo@programming.dev to c/degoogle@lemmy.ml

cross-posted from: https://programming.dev/post/51407459

Check which models you can run, and at how many tokens per second... It has examples for many models and quantization levels. Huge resource!
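The first question such a resource answers is whether a model fits in your GPU memory at a given quantization level. Here is a rough sketch under stated assumptions. The weights dominate memory use. Real quantized formats store a little more than their nominal bit width because they also carry scale factors. A single hypothetical fixed `overhead_gb` stands in for the KV cache and runtime buffers. The helper names are illustrative and do not come from any particular tool.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    billions of params * bits per weight / 8 bits per byte = GB of weights.
    """
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Rough fit check: weights plus a hypothetical fixed overhead
    (KV cache, activations, runtime buffers) must not exceed VRAM."""
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical example: a 7B-parameter model on an 8 GB GPU.
    for bits in (16, 8, 4):
        gb = weight_memory_gb(7, bits)
        print(f"7B @ {bits}-bit: {gb:.1f} GB weights, fits in 8 GB: {fits_in_vram(7, bits, 8)}")
```

With these assumptions, only the 4-bit variant fits: 3.5 GB of weights plus 1.5 GB of overhead. The 8-bit variant needs 7.0 GB plus overhead, which exceeds 8 GB. Real overhead grows with context length, so treat the fixed constant as a placeholder.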

SamuelEllis@lemmy.world 1 point 9 hours ago

Benchmarking token throughput is useful, but whether self-hosting is viable often depends on memory bandwidth rather than raw compute, especially for quantized models. Have you evaluated how different quantization levels affect inference latency on consumer-grade GPUs, compared with the reported tokens-per-second figures?
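The bandwidth point can be turned into a back-of-the-envelope ceiling. When generating one token at a time (batch size 1), each new token requires reading roughly every weight from memory once. That gives tokens/s ≤ memory bandwidth ÷ weight size in bytes. The sketch below rests on those simplifying assumptions and ignores KV-cache reads, compute time, and dequantization cost. The 448 GB/s figure is a hypothetical GPU bandwidth, and the function name is illustrative.

```python
def max_decode_tokens_per_second(params_billions: float,
                                 bits_per_weight: float,
                                 bandwidth_gb_per_s: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed.

    Assumes every weight is read once per generated token and that
    memory traffic, not compute, is the bottleneck.
    """
    weight_gb = params_billions * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_per_s / weight_gb


if __name__ == "__main__":
    bandwidth = 448.0  # hypothetical consumer GPU memory bandwidth, GB/s
    for bits in (16, 8, 4):
        ceiling = max_decode_tokens_per_second(7, bits, bandwidth)
        print(f"7B @ {bits}-bit: <= {ceiling:.0f} tokens/s")
```

Under these assumptions, a 7B model tops out around 32, 64, and 128 tokens/s at 16, 8, and 4 bits. Lower bit widths raise the ceiling because fewer bytes move per token. Measured figures usually fall below this bound, and the gap is one way to see how much dequantization and other overheads cost at each quantization level.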

this post was submitted on 03 Jun 2026
-12 points (26.9% liked)

DeGoogle Yourself

17100 readers
152 users here now

A community for those that would like to get away from Google.

Here you may post anything related to DeGoogling: why we should do it, or good software alternatives!

Rules

  1. Be respectful even in disagreement

  2. No advertising unless it is very relevant and justified. Do not do this excessively.

  3. No low-value posts or memes. Posts should give us or you something to learn or discuss.

Related communities

!privacyguides@lemmy.one !privacy@lemmy.ml !privatelife@lemmy.ml !linuxphones@lemmy.ml !fossdroid@social.fossware.space !fdroid@lemmy.ml

founded 6 years ago