koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.

Framework | LostRuins | 2024-08-28 | GitHub


Run Koboldcpp on Novita AI

GitHub List: Novita AI Templates Catalogue

What is Koboldcpp

KoboldCpp is an easy-to-use AI text-generation software for GGML and GGUF models, inspired by the original KoboldAI. It is a single self-contained distributable from Concedo that builds off llama.cpp and adds a versatile KoboldAI API endpoint, additional format support, Stable Diffusion image generation, speech-to-text, backward compatibility, as well as a fancy UI with persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios, and everything KoboldAI and KoboldAI Lite have to offer.

Key Features

  • Low Resource Requirements: It is optimized to run on consumer-grade hardware, making it accessible to users without high-end GPUs. The tool can run even on systems with limited RAM.

  • Web Interface: Koboldcpp offers a web-based interface for easy interaction. This interface provides a straightforward way to input prompts and receive generated text, making it accessible for users who prefer a graphical interface.

  • Docker Support: A Docker image is available, allowing users to deploy Koboldcpp in a containerized environment quickly. This feature ensures consistency across different systems and simplifies setup processes.

  • Cross-Platform Compatibility: Koboldcpp supports various operating systems, including Windows, macOS, and Linux, making it versatile for different user environments. It also supports GGUF models and a number of legacy GGML formats.

Who will Use Koboldcpp

  • AI Enthusiasts: Create AI-generated text for creative writing, such as stories and character dialogue.
  • AI Developers: Use GPU acceleration for faster model performance and image generation with Stable Diffusion.
  • Privacy Seekers: Run open-source AI models locally and offline for complete privacy.
  • Open-Source Model Users: Run and manage multiple AI models across platforms, including Windows, Linux, and macOS.

How to Use Koboldcpp

Windows Usage

  • Windows binaries are provided in the form of koboldcpp.exe, a pyinstaller wrapper containing all necessary files. Download the latest koboldcpp.exe from the GitHub releases page.

  • To run, simply execute koboldcpp.exe.

  • Launching with no command line arguments displays a GUI containing a subset of configurable settings. Generally you don't have to change much besides the Presets and GPU Layers. Read the --help output for more info about each setting.

  • By default, you can connect to http://localhost:5001

  • You can also run it from the command line; see the example below, or check koboldcpp.exe --help for the full option list.
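
A minimal sketch of a command-line launch (the model filename is a placeholder; --model, --port, --gpulayers, and --contextsize are standard KoboldCpp flags):

koboldcpp.exe --model mymodel.gguf --port 5001 --gpulayers 32 --contextsize 4096

This loads a local GGUF file, offloads 32 layers to the GPU, and serves the UI and API on the default port.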

Linux Usage

On modern Linux systems, you should download the koboldcpp-linux-x64-cuda1150 prebuilt PyInstaller binary on the releases page. Simply download and run the binary.

Alternatively, you can also install koboldcpp to the current directory by running the following terminal command:

curl -fLo koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150 && chmod +x koboldcpp

After running this command you can launch Koboldcpp from the current directory using ./koboldcpp in the terminal (for CLI usage, run with --help).
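
For instance, a hedged sketch of a CUDA-accelerated launch (the model filename is a placeholder; --usecublas enables GPU offload via CUDA and --gpulayers sets how many layers to offload):

./koboldcpp --model mymodel.gguf --usecublas --gpulayers 32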

macOS Usage

  • PyInstaller binaries for modern ARM64 macOS (M1, M2, M3) are now available! Simply download and run the macOS binary.

  • Alternatively, or for older x86 macOS computers, you can clone the repo and compile from source; see the Compiling for MacOS section of the KoboldCpp README.

Koboldcpp: Run on GPU Cloud vs. Run Locally

  • Scalability and Flexibility: GPU cloud services generally offer better elastic scalability, while local hardware must be planned ahead to cope with future demand growth.
  • Cost and Efficiency: On a limited budget, local hardware suits basic workloads, while GPU cloud services suit highly compute-intensive tasks.
  • Service and Maintenance: Running locally may require a professional system administrator, while cloud services are typically maintained by the service provider.

Run on Novita AI

KoboldCpp can now be used on a Novita AI instance! This offers a quick start without any installation, taking just a minute or two, and it is highly scalable and capable of running models at an affordable cost. Try our Novita AI template!

View source code on Github

https://github.com/LostRuins/koboldcpp/tree/concedo

License

  • The original GGML library and llama.cpp by ggerganov are licensed under the MIT License

  • However, KoboldAI Lite is licensed under the AGPL v3.0 License

  • KoboldCpp code and other files are also under the AGPL v3.0 License unless otherwise stated

Obtaining a GGUF model

  • KoboldCpp uses GGUF models. They are not included here, but you can download GGUF files from other places such as TheBloke's Hugging Face page. Search for "GGUF" on huggingface.co for plenty of compatible models in the .gguf format; an illustrative download command follows this list.

  • For beginners, we recommend the models Airoboros Mistral or Tiefighter 13B (larger model).
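
As a purely illustrative sketch (the user, repo, and filename below are placeholders; substitute real values from a model's Hugging Face page):

curl -fLO https://huggingface.co/<user>/<repo>-GGUF/resolve/main/<model>.Q4_K_M.gguf

Then point KoboldCpp at the downloaded .gguf file with --model.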

KoboldCpp and KoboldAI API Documentation

Notes

  • API documentation is available at /api (e.g. http://localhost:5001/api) and at https://lite.koboldai.net/koboldcpp_api. An OpenAI-compatible API is also provided at the /v1 route (e.g. http://localhost:5001/v1); example requests follow the model list below.

  • All up-to-date GGUF models are supported, and KoboldCpp also includes backward compatibility for older versions/legacy GGML .bin models, though some newer features might be unavailable.

  • An incomplete list of supported models and architectures follows, but there are many hundreds of other GGUF models. In general, if it's GGUF, it should work:

  • Llama / Llama2 / Llama3 / Alpaca / GPT4All / Vicuna / Koala / Pygmalion / Metharme / WizardLM

  • Mistral / Mixtral / Miqu

  • Qwen / Qwen2 / Yi

  • Gemma / Gemma2

  • GPT-2 / Cerebras

  • Phi-2 / Phi-3

  • GPT-NeoX / Pythia / StableLM / Dolly / RedPajama

  • GPT-J / RWKV4 / MPT / Falcon / Starcoder / Deepseek and many more

  • Stable Diffusion 1.5 and SDXL safetensor models

  • LLaVA based Vision models and multimodal projectors (mmproj)

  • Whisper models for Speech-To-Text
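
As hedged example requests (the prompt and parameter values are arbitrary; consult the docs served at /api for the authoritative schema):

curl -s http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time,", "max_length": 80}'

curl -s http://localhost:5001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "koboldcpp", "prompt": "Once upon a time,", "max_tokens": 80}'

The first call uses the KoboldAI-style generate endpoint; the second uses the OpenAI-compatible completions route.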

FAQs

How much RAM/VRAM do I need to run Koboldcpp? What about my GPU?

The amount of RAM required depends on multiple factors such as the context size, quantization type, and parameter count of the model. In general, assuming a 2048 context with a Q4_0 quantization:

  • LLAMA 3B needs at least 4GB RAM

  • LLAMA 7B needs at least 8GB RAM

  • LLAMA 13B needs at least 16GB RAM

  • LLAMA 30B needs at least 32GB RAM

  • LLAMA 65B needs at least 64GB RAM
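
As a rough illustrative rule of thumb (an assumption for sizing, not an official formula): Q4_0 stores roughly 4.5 bits per weight, and the figures above add generous headroom for the operating system and context cache.

# estimate the raw Q4_0 weight size for a 13B-parameter model
echo "scale=1; 13 * 4.5 / 8" | bc    # ~7.3 GB of weights, before overhead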

What port does Koboldcpp use?

By default KoboldCpp uses port 5001, but this can be changed with the --port launch parameter. You would connect your browser locally to that port for the UI or API, in the format http://localhost:port (e.g. http://localhost:5001).
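
For example (the model filename is a placeholder):

./koboldcpp --model mymodel.gguf --port 5002
# then browse to http://localhost:5002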

How do I see the available commands and how to use them?

You can launch KoboldCpp from the command line with the --help parameter to view the full list of available commands and how to use them.

What's the difference between row and layer split?

This only affects multi-GPU setups and controls how the tensors are divided between your GPUs. The most effective approach to assess performance is to experiment with both options. Generally, layer split should deliver the best overall performance, while row split may benefit some older cards.
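
A hedged sketch of forcing row split on a CUDA multi-GPU setup (the model filename is a placeholder, and the rowsplit argument to --usecublas is assumed to be available, as in recent releases):

./koboldcpp --model mymodel.gguf --usecublas rowsplit --gpulayers 99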

What are .kcpps files?

.kcpps files are configuration files that store your KoboldCpp launcher preferences and settings.
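
As a hedged sketch of reusing a saved settings file at launch (assuming the --config flag provided by current releases):

./koboldcpp --config mysettings.kcpps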

Excellent Collaboration Opportunity with Novita AI

We are dedicated to providing collaboration opportunities for developers.




Novita AI is the all-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, and GPU instances: the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.

Other Recommended Templates

Stable Diffusion v1.8.0

Unlock the power of creativity with Stable Diffusion v1.8.0


PyTorch v2.2.1

Elevate Your AI Models with PyTorch v2.2.1 on Novita AI


TensorFlow 2.7.0

Effortless AI and ML workflow with TensorFlow 2.7.0 on Novita AI.


Ollama Open WebUI

Streamline Your AI Workflows with Ollama Open WebUI


Meta Llama 3.1 8B Instruct

Accelerate AI Innovation with Meta Llama 3.1 8B Instruct, Powered by Novita AI

Join Our Community

Join Discord to connect with other users and share your experiences. Provide feedback on any issues, and suggest new templates you'd like to see added.
