Guide on setting up a local GGML model? (lemmy.world)

submitted 2 years ago* (last edited 2 years ago) by [email protected] to c/[email protected]


I've been messing around with GPTQ models with ExLlama in ooba, and have gotten 33B models running smoothly at 3k context, but I'm looking to try something bigger than my VRAM can hold.

However, I'm clearly doing something wrong, and the koboldcpp.exe documentation isn't clear to me. Does anyone have a good setup guide? My understanding is koboldcpp.exe is preferable for GGML, as ooba's llama.cpp doesn't support GGML at >4k context yet.


[-] [email protected] 4 points 2 years ago* (last edited 2 years ago)

KoboldCpp has documentation on its GitHub page. If that documentation doesn't do it for you, maybe just search for other guides.

My advice is: do one step at a time. Get it running first, without fancy stuff. Start with a small model and without GPU acceleration. Then get the acceleration/CUDA working. Then try a bigger model. And then you can do the elaborate stuff like keeping some layers in VRAM and others in RAM, and pushing the context size past the 2048 default. Don't do it all at once. That way you can figure out at which step your problem shows up.

(Edit: And make sure to always use the latest version. You're playing with pretty recent stuff that still might have bugs.)

I can't say much about the Windows side or the state of the integration layers in oobabooga.
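To make the one-step-at-a-time idea concrete, here is a rough Python sketch that only builds the command line for each stage so you can run them one by one. The `--useclblast` and `--gpulayers` flags come from this thread; `--model` and `--contextsize`, the model file names, and the value 4096 are my assumptions, so check them against `koboldcpp.exe --help` before relying on them.

```python
def staged_commands(exe, small_model, big_model,
                    platform=0, device=0, gpulayers=40, contextsize=4096):
    """Return (description, argv) pairs, one per troubleshooting stage.

    Nothing is executed here; the lists are only meant to be copied
    into a terminal (or passed to subprocess.run) one stage at a time.
    """
    # OpenCL platform and device IDs, as discussed in the thread.
    clblast = ["--useclblast", str(platform), str(device)]
    # "--model" and "--contextsize" are assumed flag names (see lead-in).
    small = [exe, "--model", small_model]
    big = [exe, "--model", big_model]
    return [
        ("1. small model, no GPU acceleration", small),
        ("2. small model, OpenCL acceleration", small + clblast),
        ("3. bigger model, OpenCL acceleration", big + clblast),
        ("4. bigger model, offloaded layers, larger context",
         big + clblast + ["--gpulayers", str(gpulayers),
                          "--contextsize", str(contextsize)]),
    ]


# Hypothetical file names, purely for illustration.
stages = staged_commands("koboldcpp.exe", "small-model.bin", "big-model.bin")
for description, argv in stages:
    print(description)
    print("   " + " ".join(argv))
```

If a stage fails, the difference from the previous stage is the thing to look at.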

[-] [email protected] 2 points 2 years ago

What's the problem you're having with kobold? It doesn't really require any setup. Download the exe, click on it, select model in the window, click launch. The webui should open in your default browser.

[-] [email protected] 1 points 2 years ago* (last edited 2 years ago)

Note this is koboldcpp.exe and not KoboldAI.

The GitHub page describes arguments for GPU acceleration, but it is fuzzy on what the arguments do and completely neglects to mention what their values mean. I understand the --gpulayers arg, but the two ints after --useclblast are lost on me. I defaulted to "[path]\koboldcpp.exe --useclblast 0 0 --gpulayers 40", but it seems to be completely ignoring GPU acceleration, and I'm clueless where the problem lies. I figured it would be easier to ask for a guide and just start my GGML setup from scratch.

[-] [email protected] 2 points 2 years ago

Those are the OpenCL platform and device identifiers, in that order. You can use clinfo to find out which numbers correspond to which hardware on your system.
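As a rough illustration of how the two numbers map to hardware, here is a small Python sketch that reads a compact platform/device listing and builds the matching arguments. The listing layout in `SAMPLE` is an assumption modelled on a short clinfo-style listing, and the platform and device names are made up; compare it with what clinfo actually prints on your machine. One possible reason `--useclblast 0 0` appears to do nothing is that platform 0 is not the GPU you expect.

```python
import re

# Assumed layout of a compact platform/device listing (names are invented).
SAMPLE = """\
Platform #0: Example CPU Runtime
 `-- Device #0: Example CPU
Platform #1: Example GPU Vendor
 `-- Device #0: Example GPU
"""


def parse_listing(text):
    """Return (platform_id, device_id, platform_name, device_name) tuples."""
    devices = []
    platform_id, platform_name = None, None
    for line in text.splitlines():
        match = re.search(r"Platform #(\d+):\s*(.+)", line)
        if match:
            platform_id = int(match.group(1))
            platform_name = match.group(2).strip()
            continue
        match = re.search(r"Device #(\d+):\s*(.+)", line)
        if match and platform_id is not None:
            devices.append((platform_id, int(match.group(1)),
                            platform_name, match.group(2).strip()))
    return devices


def useclblast_args(platform_id, device_id, gpulayers):
    """Build the arguments from the thread: platform ID first, then device ID."""
    return ["--useclblast", str(platform_id), str(device_id),
            "--gpulayers", str(gpulayers)]


for platform_id, device_id, platform_name, device_name in parse_listing(SAMPLE):
    args = " ".join(useclblast_args(platform_id, device_id, 40))
    print(f"{platform_name} / {device_name}: {args}")
```

With the sample listing, the GPU would be selected with `--useclblast 1 0`, not `0 0`.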

Also note that if you're building koboldcpp yourself, you need to build with LLAMA_CLBLAST=1 for OpenCL support to exist in the first place, or with LLAMA_CUBLAS=1 for CUDA.


this post was submitted on 11 Jul 2023

16 points (100.0% liked)
