koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.

Framework
LostRuins | 2024-08-28 | GitHub

Run Koboldcpp on Novita AI

GitHub List: Novita AI Templates Catalogue

What is Koboldcpp

KoboldCpp is easy-to-use AI text-generation software for GGML and GGUF models, inspired by the original KoboldAI. It is a single self-contained distributable from Concedo that builds off llama.cpp and adds a versatile KoboldAI API endpoint, additional format support, Stable Diffusion image generation, speech-to-text, backward compatibility, and a fancy UI with persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything KoboldAI and KoboldAI Lite have to offer.

Key Features

  • Low Resource Requirements: It is optimized to run on consumer-grade hardware, making it accessible for users without high-end GPUs. The tool can run even on systems with lower RAM.

  • Web Interface: Koboldcpp offers a web-based interface for easy interaction. This interface provides a straightforward way to input prompts and receive generated text, making it accessible for users who prefer a graphical interface.

  • Docker Support: A Docker image is available, allowing users to deploy Koboldcpp in a containerized environment quickly. This feature ensures consistency across different systems and simplifies setup processes.

  • Cross-Platform Compatibility: Koboldcpp runs on Windows, macOS, and Linux, making it suitable for a range of user environments. It also supports GGML and GGUF models in a select set of formats.

Who will Use Koboldcpp

  • AI Enthusiasts: Create AI-generated text for creative writing, such as stories and character dialogue.
  • AI Developers: Use GPU acceleration for faster model performance and for image generation with Stable Diffusion.
  • Privacy Seekers: Run open-source AI models locally and offline for complete privacy.
  • Open-Source AI Model Users: Run and manage multiple AI models on various platforms, including Windows, Linux, and macOS.

How to Use Koboldcpp

Windows Usage

  • Windows binaries are provided in the form of koboldcpp.exe, which is a PyInstaller wrapper containing all necessary files. Download the latest koboldcpp.exe from the releases page (https://github.com/LostRuins/koboldcpp/releases/latest).

  • To run, simply execute koboldcpp.exe.

  • Launching with no command line arguments displays a GUI containing a subset of the configurable settings. Generally you don't have to change much besides the Presets and GPU Layers. Read the --help output for more information about each setting.

  • By default, you can connect to http://localhost:5001

  • You can also run it using the command line. For info, please check koboldcpp.exe --help

Linux Usage

On modern Linux systems, download the prebuilt koboldcpp-linux-x64-cuda1150 PyInstaller binary from the releases page, then simply run it.

Alternatively, you can also install koboldcpp to the current directory by running the following terminal command:

curl -fLo koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150 && chmod +x koboldcpp

After running this command you can launch Koboldcpp from the current directory using ./koboldcpp in the terminal (for CLI usage, run with --help).
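When KoboldCpp is launched from a script rather than typed by hand, the command line can be assembled programmatically. The sketch below is illustrative only: the function name and the model path are hypothetical, and the --model and --gpulayers flags are assumptions that should be checked against the output of ./koboldcpp --help for your release (--port is the flag named in the FAQ further down).

```python
import shlex


def build_launch_command(model_path, port=5001, gpu_layers=None, binary="./koboldcpp"):
    """Return the argument list for launching KoboldCpp with a given model.

    The flag names are assumptions; confirm them with `./koboldcpp --help`.
    """
    cmd = [binary, "--model", model_path, "--port", str(port)]
    if gpu_layers is not None:
        # Assumed to correspond to the "GPU Layers" setting in the launcher GUI.
        cmd += ["--gpulayers", str(gpu_layers)]
    return cmd


# Hypothetical model path, shown only to illustrate the resulting command.
command = build_launch_command("models/example.gguf", gpu_layers=20)
print(shlex.join(command))
```

The returned list can be passed directly to subprocess.run or subprocess.Popen, which avoids shell quoting problems with model paths that contain spaces.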

MacOS Usage

  • PyInstaller binaries for modern ARM64 MacOS (M1, M2, M3) are now available. Simply download and run the MacOS binary.

  • Alternatively, or for older x86 MacOS computers, you can clone the repo and compile from source; see the "Compiling for MacOS" section of the KoboldCpp README on GitHub.

Koboldcpp: Run on GPU Cloud VS Run on Local

  • Scalability and Flexibility: GPU cloud services generally offer better elastic scalability, while local hardware has to be planned in advance to cope with future growth in demand.

  • Cost and Efficiency: On a limited budget, local hardware is suitable for basic workloads, while GPU cloud services are better suited to highly compute-intensive tasks.

  • Service and Maintenance: Running locally may require maintenance by a professional system administrator, while cloud services are typically maintained by the service provider.

Run on Novita AI

KoboldCpp can now be used on a Novita AI GPU instance. This offers a quick start with no installation, taking just a minute or two, and it is highly scalable and capable of running models at an affordable cost. Try our Novita AI template!

View source code on Github

https://github.com/LostRuins/koboldcpp/tree/concedo

License

  • The original GGML library and llama.cpp by ggerganov are licensed under the MIT License

  • However, KoboldAI Lite is licensed under the AGPL v3.0 License

  • KoboldCpp code and other files are also under the AGPL v3.0 License unless otherwise stated

Obtaining a GGUF model

  • KoboldCpp uses GGUF models. They are not included here, but you can download GGUF files from other places such as TheBloke's Huggingface. Search for "GGUF" on huggingface.co for plenty of compatible models in the .gguf format.

  • For beginners, we recommend the models Airoboros Mistral or Tiefighter 13B (larger model).

KoboldCpp and KoboldAI API Documentation

Notes

  • API documentation is available at /api (e.g. http://localhost:5001/api) and at https://lite.koboldai.net/koboldcpp_api. An OpenAI-compatible API is also provided at the /v1 route (e.g. http://localhost:5001/v1).

  • All up-to-date GGUF models are supported, and KoboldCpp also includes backward compatibility for older versions/legacy GGML .bin models, though some newer features might be unavailable.

  • An incomplete list of supported models and architectures follows, but there are many hundreds of other GGUF models. In general, if it's GGUF, it should work.

  • Llama / Llama2 / Llama3 / Alpaca / GPT4All / Vicuna / Koala / Pygmalion / Metharme / WizardLM

  • Mistral / Mixtral / Miqu

  • Qwen / Qwen2 / Yi

  • Gemma / Gemma2

  • GPT-2 / Cerebras

  • Phi-2 / Phi-3

  • GPT-NeoX / Pythia / StableLM / Dolly / RedPajama

  • GPT-J / RWKV4 / MPT / Falcon / Starcoder / Deepseek and many more

  • Stable Diffusion 1.5 and SDXL safetensor models

  • LLaVA based Vision models and multimodal projectors (mmproj)

  • Whisper models for Speech-To-Text
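As a rough illustration of the KoboldAI API endpoint mentioned in the notes above, the sketch below builds and sends a text-generation request using only the Python standard library. It assumes a KoboldCpp instance is already running at the default address and that the /api/v1/generate route and its prompt, max_length and results fields match the API documentation served at /api; the function names are hypothetical, so verify the details against that documentation before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5001"  # default address given on this page


def build_generate_request(prompt, max_length=80, base_url=BASE_URL):
    """Build (but do not send) a POST request for the assumed /api/v1/generate route."""
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        f"{base_url}/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running KoboldCpp instance and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Assumed response shape: {"results": [{"text": "..."}]}
    return body["results"][0]["text"]
```

With a server running, generate("Once upon a time", max_length=40) would return the model's continuation as a string. Keeping request construction separate from sending makes the payload easy to inspect or log without a live server.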

FAQs

How much RAM/VRAM do I need to run Koboldcpp? What about my GPU?

The amount of RAM required depends on multiple factors such as the context size, quantization type, and parameter count of the model. In general, assuming a 2048 context with a Q4_0 quantization:

  • LLAMA 3B needs at least 4GB RAM

  • LLAMA 7B needs at least 8GB RAM

  • LLAMA 13B needs at least 16GB RAM

  • LLAMA 30B needs at least 32GB RAM

  • LLAMA 65B needs at least 64GB RAM
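The figures above can be turned into a small lookup for quick sizing checks. This is only a sketch built from the list on this page (2048 context, Q4_0 quantization): the function names are hypothetical, and rounding an in-between parameter count up to the next listed size is an assumption, not an official formula.

```python
# Minimum RAM in GB per model size in billions of parameters,
# copied from the list above (2048 context, Q4_0 quantization).
MIN_RAM_GB = {3: 4, 7: 8, 13: 16, 30: 32, 65: 64}


def min_ram_gb(params_billion):
    """Return the listed minimum RAM for the smallest listed size that covers the model."""
    for size in sorted(MIN_RAM_GB):
        if params_billion <= size:
            return MIN_RAM_GB[size]
    raise ValueError("model is larger than any size listed on this page")


def largest_listed_model(ram_gb):
    """Return the largest listed model size (in billions) whose minimum fits in ram_gb."""
    fitting = [size for size, needed in MIN_RAM_GB.items() if needed <= ram_gb]
    return max(fitting) if fitting else None


print(min_ram_gb(13))            # listed requirement for a 13B model
print(largest_listed_model(24))  # largest listed size that fits in 24 GB
```

Larger context sizes or heavier quantization types will need more memory than these baseline figures, so treat the result as a lower bound.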

What port does Koboldcpp use?

By default KoboldCpp uses port 5001, but this can be changed with the --port launch parameter. You would connect your browser locally to that port for the UI or API, in the format http://localhost:port (e.g. http://localhost:5001).
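Before pointing a browser or script at the server, it can be useful to confirm that something is actually listening on the chosen port. The helper below is a minimal sketch using a plain TCP connection attempt; the function name is hypothetical, and a successful connection only shows that the port is open, not that KoboldCpp specifically is the process behind it.

```python
import socket


def is_port_open(port=5001, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False
```

For example, is_port_open(5001) checks the default port, while is_port_open(8080) would check a port chosen with the --port launch parameter.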

How do I see the available commands and how to use them?

You can launch KoboldCpp from the command line with the --help parameter to view the list of available commands. See also the section "How to use the command line terminal" in the KoboldCpp wiki.

What's the difference between row and layer split?

This only affects multi-GPU setups and controls how the tensors are divided between your GPUs. The most effective approach to assess performance is to experiment with both options. Generally, layer split should deliver the best overall performance, while row split may benefit some older cards.

What are .kcpps files?

.kcpps files are configuration files that store your KoboldCpp launcher preferences and settings.
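To inspect a saved configuration from a script, something like the sketch below may help. It assumes that a .kcpps file is a JSON object of launcher settings, which this page does not state, so open one of your own files in a text editor to confirm. The function names and the key names in the summary are hypothetical examples.

```python
import json
from pathlib import Path


def load_kcpps(path):
    """Load a .kcpps file, assuming it contains a JSON object of launcher settings."""
    settings = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(settings, dict):
        raise ValueError("expected a JSON object of settings")
    return settings


def summarize(settings, keys=("model", "port", "gpulayers")):
    """Pick a few settings of interest; the default key names are illustrative guesses."""
    return {key: settings.get(key) for key in keys}
```

Calling summarize(load_kcpps("my_settings.kcpps")) would then show the selected values, with None for any key that is absent from the file.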

Excellent Collaboration Opportunity with Novita AI

We are dedicated to providing collaboration opportunities for developers.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.

Join Our Community

Join Discord to connect with other users and share your experiences. Provide feedback on any issues, and suggest new templates you'd like to see added.