Made for Simplicity, Flexibility, and Speed.
Discover how our features make AI accessible on any device, empowering you to build, customize, and deploy powerful models with ease and control.
Kolosal AI is an easy-to-install, cross-platform app built in C++ and ImGui. It runs LLMs, manages Retrieval-Augmented Generation (RAG) and memory, and dispatches Kolosal Plane jobs for fine-tuning and dataset creation to more powerful devices.
Describe, Train, Compile
We simplify the entire training process, from building datasets by synthesizing data from your profile to fine-tuning and optimizing your models.
Step 1
Data Synthesis
Generate your profile based on your preferences through an interactive, chat-like conversation.
This process generates two results:
Interests
Used to create example conversation starters tailored to your specific needs.
Tone and Style
Defines the type of responses you prefer.
Generate conversations based on your interests, tone, and style.
What is the best way to make an AI application?
Fine-tune a Small Language Model, cool, you know.
[Optional] Generate an unwanted response based on your interests, tone, and style.
What is the best way to make an AI application?
Based on the provided context, use GPT4 for text classification.
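
For illustration, here is a minimal sketch of this synthesis step, assuming an OpenAI-compatible client and a local endpoint; the profile fields, model name, and port are placeholders, not Kolosal's actual API:

    # Sketch only: profile-driven synthesis of a preferred ("chosen") and an
    # unwanted ("rejected") response. Endpoint, model name, and fields are
    # illustrative assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    profile = {
        "interests": ["building AI applications", "small language models"],
        "tone_and_style": "casual and encouraging",
    }
    question = "What is the best way to make an AI application?"

    def ask(system: str) -> str:
        return client.chat.completions.create(
            model="synthesizer",
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": question}],
        ).choices[0].message.content

    # The preferred response follows the user's interests, tone, and style.
    chosen = ask(f"Answer in a {profile['tone_and_style']} tone, focusing on "
                 + ", ".join(profile["interests"]) + ".")
    # [Optional] The unwanted response is deliberately off-style, kept as a negative example.
    rejected = ask("Answer formally and generically.")

    pair = {"prompt": question, "chosen": chosen, "rejected": rejected}

Each pair feeds the two training steps below: the chosen response drives supervised fine-tuning, and the chosen/rejected pair drives preference alignment.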
Step 2
Training
The model training process is done in two steps:
Supervised Finetuning
Makes the model follow instructions and answer questions.
Preference Alignment
Additional control to remove unwanted responses, modify their style, and more.
Supervised fine-tuning is done by providing the model with the generated conversations and the desired responses.
Query → Prediction → Evaluate → Fix
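
As a rough sketch, this step could be run with Hugging Face TRL's SFTTrainer (an assumed library choice, not Kolosal's internal trainer; the model and file names are placeholders):

    # Sketch only: supervised fine-tuning on the synthesized conversations.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Each record holds a generated conversation ending in the desired response.
    dataset = load_dataset("json", data_files="synthesized_conversations.json",
                           split="train")

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B",   # placeholder small language model
        train_dataset=dataset,
        args=SFTConfig(output_dir="sft-model"),
    )
    trainer.train()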
[Optional] Align the model's responses with the user's profile and away from the unwanted preferences.
Query → Prediction → Penalty Scoring → Update
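
A comparable sketch of the alignment step, here using DPO from TRL (one possible method; any objective trained on chosen/rejected pairs fits the same flow):

    # Sketch only: penalize unwanted responses with Direct Preference Optimization.
    from datasets import load_dataset
    from trl import DPOConfig, DPOTrainer

    # Pairs of preferred ("chosen") and unwanted ("rejected") responses from Step 1.
    pairs = load_dataset("json", data_files="preference_pairs.json", split="train")

    trainer = DPOTrainer(
        model="sft-model",   # the checkpoint from supervised fine-tuning
        train_dataset=pairs,
        args=DPOConfig(output_dir="aligned-model", beta=0.1),  # beta scales the penalty
    )
    trainer.train()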
Step 3
Model Optimization
Quantize the model to fp8 or int4 to reduce its memory footprint and increase inference speed.
fp8
Default format, balanced between speed and accuracy
int4 AWQ
2x faster than fp8, but less accurate
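
As an illustration, here is int4 AWQ quantization with the AutoAWQ library (an assumed stand-in; Kolosal performs this step inside the app, and the paths are placeholders):

    # Sketch only: 4-bit AWQ weight quantization.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    path = "aligned-model"
    model = AutoAWQForCausalLM.from_pretrained(path)
    tokenizer = AutoTokenizer.from_pretrained(path)

    # 4-bit weights with group size 128 is the usual AWQ configuration.
    model.quantize(tokenizer, quant_config={
        "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
    })
    model.save_quantized("aligned-model-awq-int4")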
KV cache quantization to further reduce the memory footprint and increase inference speed.
fp16
Default format
fp8
Ada GPUs or newer
int8
Any GPU
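
For illustration, int8 KV cache quantization at generation time via Hugging Face transformers' quantized cache (an assumed runtime, not Kolosal's engine; the HQQ backend here requires the hqq package):

    # Sketch only: quantize the key/value cache to int8 during generation.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("aligned-model-awq-int4",
                                                 device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained("aligned-model-awq-int4")

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        cache_implementation="quantized",             # store K/V in low precision
        cache_config={"backend": "HQQ", "nbits": 8},  # int8: works on any GPU
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))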
LoRA mapping without needing to merge the weights, allowing LoRAs to be swapped on the fly.
Base Model
Hello
Bahasa LoRA
Halo
Chinese LoRA
你好
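
A sketch of the same idea with the PEFT library (an assumed implementation; the adapter names and paths are illustrative):

    # Sketch only: keep the base weights unmerged and swap LoRA adapters on the fly.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("base-model")
    tokenizer = AutoTokenizer.from_pretrained("base-model")

    # Load two adapters on top of the same base weights.
    model = PeftModel.from_pretrained(base, "loras/bahasa", adapter_name="bahasa")
    model.load_adapter("loras/chinese", adapter_name="chinese")

    for name in ("bahasa", "chinese"):
        model.set_adapter(name)   # switch adapters without merging or reloading
        inputs = tokenizer("Hello", return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=8)
        print(name, tokenizer.decode(out[0], skip_special_tokens=True))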
From Personal to Enterprise
Kolosal AI is designed to empower everyone, from individual creators to large enterprises. Whether you need open-source flexibility for personal projects or robust capabilities for enterprise demands, Kolosal scales to fit your AI needs.
Kolosal
For Individuals and Small Teams
On Device Inference
Run Models on Your Device Privately
Multi LoRAs
Run Multiple Models in Real-time without Overhead
Data Synthesis
Generate Synthetic Data for Training from Prompts and Documents
LLM Fine-tuning
Fine-tune Models with Your Own Personalization and Data
Embedding Fine-tuning
Improve Retrieval Accuracy with Your Own Data
Document RAG
Talk to Your Documents and Get Answers
On Device API
Use Models in Your Own Apps and Services
LLM Based Evaluation
Evaluate Models using Larger Models
Kolosal Enterprise
For Large Teams and Organizations to Serve Millions
Inflight Batching
Real-time Batch Processing With No Delay
No Batch Limit
Unlimited Number of Concurrent Batches
Guardrails
Safeguards to Prevent Unintended Actions and Responses
Multi-GPU
Deploy and Run Large Models on Large Infrastructure