koboldcpp
Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
One click deployment
Run Koboldcpp on Novita AI
GitHub List: Novita AI Templates Catalogue
What is Koboldcpp
KoboldCpp is easy-to-use AI text-generation software for GGML and GGUF models, inspired by the original KoboldAI. It's a single self-contained distributable from Concedo that builds off llama.cpp and adds a versatile KoboldAI API endpoint, additional format support, Stable Diffusion image generation, speech-to-text, backward compatibility, as well as a fancy UI with persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything KoboldAI and KoboldAI Lite have to offer.
Key Features
- Low Resource Requirements: It is optimized to run on consumer-grade hardware, making it accessible for users without high-end GPUs. The tool can run even on systems with lower RAM.
- Web Interface: Koboldcpp offers a web-based interface for easy interaction. This interface provides a straightforward way to input prompts and receive generated text, making it accessible for users who prefer a graphical interface.
- Docker Support: A Docker image is available, allowing users to deploy Koboldcpp in a containerized environment quickly. This feature ensures consistency across different systems and simplifies setup processes.
- Cross-Platform Compatibility: Koboldcpp supports various operating systems, including Windows, macOS, and Linux, making it versatile for different user environments. KoboldCpp also supports various GGML and GGUF models of a few select formats.
Who will Use Koboldcpp
- AI Enthusiasts: Create AI-generated text for creative writing, such as stories and character dialogue.
- AI Developers: Use GPU acceleration to achieve faster model performance and image generation using Stable Diffusion.
- Privacy Seekers: Run open-source AI models locally and offline for complete privacy.
- Open-source AI Model Users: Run and manage multiple AI models on various platforms, including Windows, Linux, and MacOS.
How to Use Koboldcpp
Windows Usage
- Windows binaries are provided in the form of koboldcpp.exe, which is a PyInstaller wrapper containing all necessary files. Download the latest koboldcpp.exe from the releases page.
- To run, simply execute koboldcpp.exe.
- Launching with no command line arguments displays a GUI containing a subset of configurable settings. Generally you don't have to change much besides the Presets and GPU Layers. Read the --help for more info about each setting.
- By default, you can connect to http://localhost:5001
- You can also run it using the command line. For info, please check koboldcpp.exe --help
Linux Usage
On modern Linux systems, you should download the koboldcpp-linux-x64-cuda1150 prebuilt PyInstaller binary from the releases page. Simply download and run the binary.
Alternatively, you can also install koboldcpp to the current directory by running the following terminal command:
curl -fLo koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150 && chmod +x koboldcpp
After running this command you can launch Koboldcpp from the current directory using ./koboldcpp in the terminal (for CLI usage, run with --help).
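For scripted launches, the command line can be assembled programmatically. The following is a minimal Python sketch, assuming the binary was saved as ./koboldcpp by the curl command above. The model path is a hypothetical placeholder, and the --model, --port, and --gpulayers flag names are assumptions that should be confirmed with --help for your release.

```python
import shlex
import subprocess


def build_launch_command(model_path, port=5001, gpu_layers=0, binary="./koboldcpp"):
    """Assemble the argument list for a KoboldCpp launch.

    The flag names are assumptions; verify them with `./koboldcpp --help`.
    """
    command = [binary, "--model", model_path, "--port", str(port)]
    if gpu_layers > 0:
        # Offload this many layers to the GPU; the flag is omitted for CPU-only runs.
        command += ["--gpulayers", str(gpu_layers)]
    return command


if __name__ == "__main__":
    # "models/example.gguf" is a hypothetical path; point it at a real GGUF file.
    cmd = build_launch_command("models/example.gguf", gpu_layers=20)
    print("Would run:", shlex.join(cmd))
    # Uncomment to start the server (blocks until KoboldCpp exits):
    # subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the model path contains spaces.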
MacOS Usage
- PyInstaller binaries for modern ARM64 MacOS (M1, M2, M3) are now available! Simply download and run the MacOS binary.
- Alternatively, or for older x86 MacOS computers, you can clone the repo and compile from source code; see "Compiling for MacOS" in the KoboldCpp README.
Koboldcpp: Run on GPU Cloud VS Run on Local
- Scalability and Flexibility: A GPU cloud generally offers better elastic scalability, while local hardware may need to be planned in advance to cope with future demand growth.
- Cost and Efficiency: On a limited budget, running locally suits basic workloads, while GPU cloud services suit highly compute-intensive tasks.
- Service and Maintenance: Running locally may require maintenance by a professional system administrator, while cloud services are typically maintained by the service provider.
Run on Novita AI
KoboldCpp can now be run on a Novita AI Instance! This offers a quick start with no installation, taking just a minute or two, and it is highly scalable and capable of running models at an affordable cost. Try our Novita AI template!
View source code on Github
https://github.com/LostRuins/koboldcpp/tree/concedo
License
- The original GGML library and llama.cpp by ggerganov are licensed under the MIT License
- However, KoboldAI Lite is licensed under the AGPL v3.0 License
- KoboldCpp code and other files are also under the AGPL v3.0 License unless otherwise stated
Obtaining a GGUF model
- KoboldCpp uses GGUF models. They are not included here, but you can download GGUF files from other places such as TheBloke's Huggingface. Search for "GGUF" on huggingface.co for plenty of compatible models in the .gguf format.
- For beginners, we recommend the models Airoboros Mistral or Tiefighter 13B (larger model).
KoboldCpp and KoboldAI API Documentation
Notes
- API documentation available at /api (e.g. http://localhost:5001/api) and https://lite.koboldai.net/koboldcpp_api. An OpenAI compatible API is also provided at the /v1 route (e.g. http://localhost:5001/v1).
- All up-to-date GGUF models are supported, and KoboldCpp also includes backward compatibility for older versions/legacy GGML .bin models, though some newer features might be unavailable.
- An incomplete list of models and architectures is listed below, but there are many hundreds of other GGUF models. In general, if it's GGUF, it should work.
  - Llama / Llama2 / Llama3 / Alpaca / GPT4All / Vicuna / Koala / Pygmalion / Metharme / WizardLM
  - Mistral / Mixtral / Miqu
  - Qwen / Qwen2 / Yi
  - Gemma / Gemma2
  - GPT-2 / Cerebras
  - Phi-2 / Phi-3
  - GPT-NeoX / Pythia / StableLM / Dolly / RedPajama
  - GPT-J / RWKV4 / MPT / Falcon / Starcoder / Deepseek and many more
  - LLaVA based Vision models and multimodal projectors (mmproj)
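The KoboldAI API mentioned in the notes above can be called from any HTTP client once a model is loaded. The following is a minimal Python sketch using only the standard library. The /api/v1/generate path, the prompt and max_length fields, and the response shape are assumptions here; check them against the documentation served at /api on your own instance.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5001"  # default address given on this page


def build_generate_request(prompt, max_length=80, base_url=BASE_URL):
    """Build a POST request for the KoboldAI-style generate route.

    The route and payload fields are assumptions; confirm them at /api.
    """
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of a response assumed to look like
    {"results": [{"text": "..."}]}."""
    return json.loads(response_body)["results"][0]["text"]


def generate(prompt, **kwargs):
    """Send the prompt to a running KoboldCpp instance and return its text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return extract_text(response.read())
```

With a server running, generate("Once upon a time") would return the continuation as a string. The OpenAI compatible /v1 route is not covered by this sketch.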
FAQs
How much RAM/VRAM do I need to run Koboldcpp? What about my GPU?
The amount of RAM required depends on multiple factors such as the context size, quantization type, and parameter count of the model. In general, assuming a 2048 context with a Q4_0 quantization:
- LLAMA 3B needs at least 4GB RAM
- LLAMA 7B needs at least 8GB RAM
- LLAMA 13B needs at least 16GB RAM
- LLAMA 30B needs at least 32GB RAM
- LLAMA 65B needs at least 64GB RAM
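The guidance above can be turned into a quick lookup. This Python sketch simply encodes the five figures from the list (2048 context, Q4_0 quantization) and picks the largest listed model that fits a given amount of RAM. The function name is illustrative, and the result is only as accurate as the rule of thumb itself.

```python
# Minimum RAM guidance from the list above (2048 context, Q4_0 quantization).
# Keys are parameter counts in billions, values are gigabytes of RAM.
MIN_RAM_GB = {3: 4, 7: 8, 13: 16, 30: 32, 65: 64}


def largest_model_for(ram_gb):
    """Return the largest listed model size (in billions of parameters)
    whose minimum RAM fits in ram_gb, or None if none of them fit."""
    fitting = [size for size, need in MIN_RAM_GB.items() if need <= ram_gb]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for ram in (2, 8, 24, 64):
        size = largest_model_for(ram)
        label = f"LLAMA {size}B" if size else "none of the listed models"
        print(f"{ram} GB RAM -> {label}")
```

Larger context sizes or heavier quantization types raise these requirements, so treat the result as a starting point.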
What port does Koboldcpp use?
By default KoboldCpp uses port 5001, but this can be changed with the --port launch parameter. You would connect your browser locally to that port for the UI or API, in the format http://localhost:port (e.g. http://localhost:5001).
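Before opening the browser, it can help to confirm that something is listening on the chosen port. The Python sketch below builds the URL in the format described above and attempts a plain TCP connection. The helper names are illustrative, and a successful connection only shows that the port is open, not that KoboldCpp is the program behind it.

```python
import socket


def server_url(port=5001, host="localhost"):
    """Build the address to open in a browser or pass to an API client."""
    return f"http://{host}:{port}"


def is_listening(port=5001, host="localhost", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "reachable" if is_listening() else "not reachable"
    print(f"{server_url()} is {state}")
```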
How do I see the available commands and how to use them?
You can launch KoboldCpp from the command line with the --help parameter to view the available command list. See the section on "How to use the command line terminal".
What's the difference between row and layer split?
This only affects multi-GPU setups and controls how the tensors are divided between your GPUs. The most effective approach to assess performance is to experiment with both options. Generally, layer split should deliver the best overall performance, while row split may benefit some older cards.
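As a conceptual illustration only, the two modes can be pictured as follows: layer split gives each GPU a contiguous block of whole layers, while row split divides the rows of each weight matrix across all GPUs. The Python sketch below is not KoboldCpp's allocation logic. It assumes an even division, whereas a real backend may weight the split by available VRAM.

```python
def _even_chunks(total, parts):
    """Split `total` items into `parts` contiguous (start, end) ranges,
    giving any remainder to the first ranges."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for index in range(parts):
        count = base + (1 if index < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def layer_split(num_layers, num_gpus):
    """Layer split: each GPU holds a contiguous block of whole layers."""
    return [list(range(start, end)) for start, end in _even_chunks(num_layers, num_gpus)]


def row_split(num_rows, num_gpus):
    """Row split: the rows of a single weight matrix are divided across GPUs."""
    return _even_chunks(num_rows, num_gpus)


if __name__ == "__main__":
    print("layer split:", layer_split(5, 2))
    print("row split:  ", row_split(4096, 2))
```

In the layer case each GPU works on its own layers in turn, while in the row case every GPU takes part in every matrix multiplication, which is one reason the two modes can perform differently on different hardware.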
What are .kcpps files?
.kcpps files are configuration files that store your KoboldCpp launcher preferences and settings.
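A saved configuration can be inspected before it is loaded. The Python sketch below assumes that a .kcpps file is stored as a JSON object whose keys mirror the launcher settings. That format is an assumption rather than something stated on this page, so open the file in a text editor first to confirm it. The file name in the example is hypothetical.

```python
import json


def load_kcpps(path):
    """Read a .kcpps file, assuming it is stored as a JSON object."""
    with open(path, "r", encoding="utf-8") as handle:
        settings = json.load(handle)
    if not isinstance(settings, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return settings


def summarize(settings):
    """Return one 'key = value' line per setting, sorted by key."""
    return [f"{key} = {value!r}" for key, value in sorted(settings.items())]


if __name__ == "__main__":
    # "my_settings.kcpps" is a hypothetical file name.
    for line in summarize(load_kcpps("my_settings.kcpps")):
        print(line)
```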
Excellent Collaboration Opportunity with Novita AI
We are dedicated to providing collaboration opportunities for developers.
Get in Touch:
- Email: support@novita.ai
- Discord: novita.ai
Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
Join Our Community
Join Discord to connect with other users and share your experiences. Provide feedback on any issues, and suggest new templates you'd like to see added.