Train and Run LLMs
On Your Device

Secure, private, and customizable LLM at your fingertips. Fully open-source and designed for offline performance.

Kolosal AI Application Demo

Made with passion on top of

    C++
    Python
    Genta
    Unsloth
    TensorRT
    HuggingFace

Made for Simplicity, Flexibility, and Speed.

Discover how our features make AI accessible on any device, empowering you to build, customize, and deploy powerful models with ease and control.

Kolosal AI is an easy-to-install, cross-platform app built in C++ with ImGui. It runs LLMs, manages Retrieval-Augmented Generation (RAG) and memory, and dispatches Kolosal Plane jobs for fine-tuning and dataset creation on a more powerful device.


Describe, Train, Compile

We simplify the entire training process, from synthesizing a dataset based on your profile to fine-tuning and optimizing your models.

Step 1

Data Synthesis

Generate your profile based on your preferences through an interactive, chat-like conversation.

This process generates two results:
Interests
Used to create example conversation starters tailored to your specific needs.
Tone and Style
Defines the type of responses you prefer.
Generate conversations based on your interests, tone, and style.
What is the best way to make an AI application?
Finetune a Small Language Model cool, you know.
[Optional] Generate unwanted responses based on your interests, tone, and style.
What is the best way to make an AI application?
Based on the provided context, use GPT4 for text classification.
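Conceptually, each synthesized example pairs a prompt with a preferred response and, optionally, an unwanted one. A minimal sketch in Python, using the sample conversation above (the field names are our illustration, not Kolosal's actual schema):

```python
import json

def make_preference_record(prompt, chosen, rejected=None):
    """Bundle one synthesized example: the prompt, the response matching
    the user's tone and style, and (optionally) an unwanted response
    kept for the later preference-alignment step."""
    record = {"prompt": prompt, "chosen": chosen}
    if rejected is not None:
        record["rejected"] = rejected
    return record

record = make_preference_record(
    "What is the best way to make an AI application?",
    "Finetune a Small Language Model cool, you know.",
    rejected="Based on the provided context, use GPT4 for text classification.",
)
print(json.dumps(record, indent=2))
```

Records without a `rejected` field feed supervised fine-tuning; records with one also feed preference alignment.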
Step 2

Training

The model training process is done in two steps:
Supervised Finetuning
Teaches the model to follow instructions and answer questions.
Preference Alignment
Additional control to remove unwanted responses, modify their style, and more.
Supervised finetuning is done by providing the model with the generated conversation and the desired response.
Query
Prediction
Evaluate
Fix
[Optional] Align the model with the user's profile, penalizing the unwanted responses.
Query
Prediction
Penalty Scoring
Update
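The Query → Prediction → Penalty Scoring → Update loop above can be illustrated with a toy pairwise update that rewards the preferred response and penalizes the unwanted one. This is a deliberately simplified sketch: real preference-alignment methods such as DPO operate on model log-probabilities, not on a score table.

```python
def preference_update(scores, chosen, rejected, lr=0.1):
    """One toy preference-alignment step: raise the score of the
    preferred response and lower the score of the unwanted one."""
    scores[chosen] = scores.get(chosen, 0.0) + lr
    scores[rejected] = scores.get(rejected, 0.0) - lr
    return scores

scores = {}
for _ in range(5):  # repeated updates push the two responses apart
    preference_update(scores, "preferred answer", "unwanted answer")
print(scores)  # preferred score rises, unwanted score falls
```

After training, the model is more likely to produce responses that match your profile and less likely to produce the flagged ones.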
Step 3

Model Optimization

Quantize the model to fp8 or int4 to reduce its memory footprint and increase inference speed.
fp8
Default format, balanced speed and accuracy
int4 AWQ
2x faster than fp8, but less accurate
KV cache quantization further reduces the memory footprint and increases inference speed.
fp16
Default format
fp8
Ada or newer GPUs
int8
Any GPU
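Why KV cache quantization matters is simple arithmetic: the cache stores a key and a value for every layer, head, and token, so halving the bytes per value halves the cache. A quick sketch (the model shape below is an illustrative 7B-class configuration, not a specific Kolosal model):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """KV-cache memory for one sequence: keys + values (hence the 2x)
    across every layer, head, and token position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

layers, kv_heads, head_dim, seq_len = 32, 32, 128, 4096
fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 2)
int8 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 1)
print(f"fp16 cache: {fp16 / 2**30:.1f} GiB")  # 2.0 GiB
print(f"int8 cache: {int8 / 2**30:.1f} GiB")  # 1.0 GiB
```

On this shape, dropping from fp16 to int8 frees a full gigabyte per 4k-token sequence.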
LoRA adapters are mapped without merging the weights, allowing LoRA swapping on the fly.
Base Model
Hello
Bahasa LoRA
Halo
Chinese LoRA
你好
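On-the-fly swapping works because a LoRA adapter is just a low-rank update applied next to the base weights, never merged into them. A minimal sketch with toy 2x2 matrices (the adapter values are made up for illustration):

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, x, adapter=None):
    """Base linear layer, optionally adding a LoRA adapter's low-rank
    update B*(A*x) on the fly. W is never modified, so a different
    adapter can be chosen per request."""
    y = matvec(W, x)
    if adapter is not None:
        A, B = adapter          # A: r x in_dim, B: out_dim x r
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y

W = [[1.0, 0.0], [0.0, 1.0]]            # 2x2 base weight (identity, for clarity)
bahasa = ([[1.0, 0.0]], [[0.0], [1.0]]) # rank-1 adapter (toy values)
x = [2.0, 3.0]
print(lora_forward(W, x))          # [2.0, 3.0] -- base model alone
print(lora_forward(W, x, bahasa))  # [2.0, 5.0] -- same W, adapter added
```

Because `W` stays untouched, serving a Bahasa request and a Chinese request back to back only swaps which small `(A, B)` pair is applied.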

From Personal to Enterprise

Kolosal AI is designed to empower everyone, from individual creators to large enterprises. Whether you need open-source flexibility for personal projects or robust capabilities for enterprise demands, Kolosal scales to fit your AI needs.

Kolosal
For Individuals and Small Teams
Apache 2.0
On Device Inference
Run Models on Your Device Privately
Multi LoRAs
Run Multiple Models in Real-time without Overhead
Data Synthesis
Generate Synthetic Data for Training from Prompts and Documents
LLM Fine-tuning
Fine-tune Models with Your Own Personalization and Data
Embedding Fine-tuning
Improve Retrieval Accuracy with Your Own Data
Document RAG
Talk to Your Documents and Get Answers
On Device API
Use Models in Your Own Apps and Services
LLM Based Evaluation
Evaluate Models using Larger Models
Kolosal Enterprise
For Large Teams and Organizations to Serve Millions
Proprietary
Inflight Batching
Real-time Batch Processing With No Delay
No Batch Limit
Unlimited Number of Concurrent Batches
Guardrails
Safeguards to Prevent Unintended Actions and Responses
Multi-GPU
Deploy and Run Large Models on Large Infrastructure

Join our Revolution!

Join us to bring AI into everyone's hands.
Own your AI, and shape the future together.