submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/[email protected]

Linked is the new repo. It's still in relatively early stages, but it does work.

I'm using it in oobabooga's text-generation-webui with the OLD GPTQ format, so not even the new stuff, and on my 3060 I see a genuine >200% increase in speed (roughly 3.5x the tokens/s):

Exllama v1

Output generated in 21.84 seconds (9.16 tokens/s, 200 tokens, context 135, seed 1891621432)

Exllama v2

Output generated in 6.23 seconds (32.10 tokens/s, 200 tokens, context 135, seed 313599079)

Absolutely crazy, and all settings are the same. It's not just a burst at the start either; it holds up over a longer generation:

Output generated in 22.40 seconds (31.92 tokens/s, 715 tokens, context 135, seed 717231733)
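
If you want to sanity-check the numbers, here is a minimal Python sketch that parses the three log lines above and works out the speedup. The regex and helper names are my own for illustration, not anything from the webui or exllama.

```python
import re

# Log lines copied verbatim from the runs above.
LOGS = {
    "v1": "Output generated in 21.84 seconds (9.16 tokens/s, 200 tokens, context 135, seed 1891621432)",
    "v2": "Output generated in 6.23 seconds (32.10 tokens/s, 200 tokens, context 135, seed 313599079)",
    "v2_long": "Output generated in 22.40 seconds (31.92 tokens/s, 715 tokens, context 135, seed 717231733)",
}

# Hypothetical pattern written for this log shape only.
PATTERN = re.compile(r"in ([\d.]+) seconds \(([\d.]+) tokens/s, (\d+) tokens")


def parse(line):
    """Return (seconds, reported tokens/s, token count) for one log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return float(match.group(1)), float(match.group(2)), int(match.group(3))


def speedup(old_line, new_line):
    """Return (ratio, percent increase) of the reported tokens/s."""
    _, old_rate, _ = parse(old_line)
    _, new_rate, _ = parse(new_line)
    ratio = new_rate / old_rate
    return ratio, (ratio - 1) * 100


ratio, pct = speedup(LOGS["v1"], LOGS["v2"])
print(f"{ratio:.2f}x the tokens/s, a {pct:.0f}% increase")

# Compare the reported rate with tokens / seconds for every run.
for name, line in LOGS.items():
    seconds, reported, tokens = parse(line)
    print(f"{name}: reported {reported:.2f} tokens/s, recomputed {tokens / seconds:.2f}")
```

The reported rates line up with tokens divided by seconds, and the 715-token run stays at about the same rate as the 200-token one.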

And that's all with the old format. ExLlamaV2 also includes a new quantization method that allows much more granular bitrates.

Turbo went with a really cool approach here: you set a target bits per weight, say 3.5, and it automatically assigns different quant levels to different weights to hit that average. It keeps more precision in the important weights and sacrifices more on the unimportant ones, so quality is preserved where it counts. Very cool stuff!
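
To make the "target bits per weight" idea concrete, here is a toy Python sketch. To be clear, this is NOT exllama v2's actual algorithm: the layer names, sizes, importance scores, candidate bit widths, and the greedy rule are all made up for illustration. It only shows how a mix of quant levels can average out to a budget while favoring the more sensitive weights.

```python
from dataclasses import dataclass

# Hypothetical candidate bit widths (an assumption, not the real format's levels).
BIT_OPTIONS = [2, 3, 4, 5, 6, 8]


@dataclass
class Layer:
    name: str
    n_weights: int
    importance: float  # made-up sensitivity score; higher means errors hurt more


def allocate_bits(layers, target_bpw):
    """Greedily pick a bit width per layer without exceeding the average budget."""
    total_weights = sum(layer.n_weights for layer in layers)
    budget = target_bpw * total_weights
    bits = {layer.name: BIT_OPTIONS[0] for layer in layers}
    used = BIT_OPTIONS[0] * total_weights
    if used > budget:
        raise ValueError("target is below the smallest available bit width")

    while True:
        best, best_score, best_cost = None, 0.0, 0
        for layer in layers:
            idx = BIT_OPTIONS.index(bits[layer.name])
            if idx + 1 == len(BIT_OPTIONS):
                continue  # already at the highest level
            cost = (BIT_OPTIONS[idx + 1] - BIT_OPTIONS[idx]) * layer.n_weights
            if used + cost > budget:
                continue  # this upgrade would blow the budget
            # Toy diminishing-returns rule: extra bits help less as bits grow.
            score = layer.importance / bits[layer.name]
            if score > best_score:
                best, best_score, best_cost = layer, score, cost
        if best is None:
            break  # no affordable upgrade left
        idx = BIT_OPTIONS.index(bits[best.name])
        bits[best.name] = BIT_OPTIONS[idx + 1]
        used += best_cost

    return bits, used / total_weights


# Made-up example layers.
layers = [
    Layer("attn.q_proj", 1000, 3.0),
    Layer("mlp.up_proj", 3000, 1.0),
    Layer("mlp.down_proj", 3000, 2.0),
]
bits, achieved = allocate_bits(layers, target_bpw=3.5)
print(bits, f"average {achieved:.2f} bits per weight")
```

In this toy version the small, sensitive layer ends up with more bits than the large, less sensitive one, and the average stays at or under the target.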

Get your latest oobabooga webui and start playing!

https://github.com/oobabooga/text-generation-webui

https://github.com/noneabove1182/text-generation-webui-docker

Some models in the new format from turbo: https://huggingface.co/turboderp

top 3 comments
[-] [email protected] 3 points 2 years ago* (last edited 2 years ago)

~~I'm really interested in this. Is Exllama2 a separately trained variant of Llama2? The use restrictions of Llama2 have always irked me and a similarly performing open variant of that architecture is very intriguing.~~

Never mind. This is a processor that runs the model, not the model itself. My bad.

[-] [email protected] 3 points 2 years ago

Why'd you create your own dockerfile repo vs just improving/changing the one in the main ooba repo?

[-] [email protected] 2 points 2 years ago

Good question. At the time I made it there wasn't a good option, and the one in the main repo is very comprehensive and a bit overwhelming. I wanted to make one that was straightforward and easier to digest, so you can see what's actually happening.

this post was submitted on 13 Sep 2023
25 points (100.0% liked)
