Discover how to calculate memory requirements for local LLMs and run language models even on low-end PCs. We'll cover the main factors that drive RAM usage – model size, precision, KV cache, and batch size – and show how to compute weight and KV-cache memory. Along the way, we'll share practical examples (Ollama, LM Studio, Kolosal.ai) and give optimization tips for CPU-only LLM deployment. By the end, you'll know how much RAM you need to run different LLMs locally and how to optimize for your hardware.
Factors Affecting Memory Requirements
- Model size (parameters and precision): Model weights dominate memory usage. For example, 7B at FP16 ≈ 14 GB; INT8 roughly halves this and INT4 quarters it (see the sketch after this list).
- KV cache: Grows linearly with context length, and with the model's layer and head counts. A 4096-token context on Llama 7B uses ~2 GB at FP16.
- Batch size: Higher batch = more memory. Stick to batch=1 for minimal RAM use.
- Other factors: Concurrency, beam search, and training states add memory needs.
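To make the precision bullet concrete, here is a minimal Python sketch (referenced above). The function name, the 15% overhead default, and the use of decimal gigabytes are illustrative choices, not a standard API:

```python
# Rough weight-memory estimate: parameters x bytes per parameter, plus runtime overhead.
# The 15% overhead default is a rule-of-thumb assumption, not a measured constant.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str = "fp16", overhead: float = 0.15) -> float:
    """RAM (in GB) needed just to hold the model weights at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9 * (1 + overhead)

for precision in ("fp16", "int8", "int4"):
    print(f"7B @ {precision}: ~{weight_memory_gb(7e9, precision):.1f} GB")
# Roughly: fp16 ≈ 16 GB, int8 ≈ 8 GB, int4 ≈ 4 GB (weights plus overhead only, no KV cache)
```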
How to Calculate Memory Usage
Model weights ≈ parameter count × bytes per parameter; e.g., 7B at FP16 = 7e9 × 2 bytes ≈ 14 GB, plus ~10–20% runtime overhead. KV cache ≈ batch × seq_len × layers × heads × head_dim × 2 (keys and values) × bytes per value; for Llama 7B at a 4096-token context this works out to ~2 GB at FP16.
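Here is a minimal sketch of that KV-cache formula and the combined total, assuming Llama 7B's published shape (32 layers, 32 attention heads, head dimension 128) and FP16 for both weights and cache; the helper names and the 15% overhead default are illustrative:

```python
def kv_cache_gb(batch: int, seq_len: int, layers: int, heads: int,
                head_dim: int, bytes_per_value: float = 2.0) -> float:
    """KV cache = batch x seq_len x layers x heads x head_dim x 2 (K and V) x bytes, in GB."""
    return batch * seq_len * layers * heads * head_dim * 2 * bytes_per_value / 1e9

def total_ram_gb(num_params: float, bytes_per_param: float, kv_gb: float,
                 overhead: float = 0.15) -> float:
    """Model weights (plus ~15% runtime overhead) + KV cache, in GB."""
    return num_params * bytes_per_param / 1e9 * (1 + overhead) + kv_gb

# Llama 7B at a 4096-token context, batch = 1, FP16 everywhere.
kv = kv_cache_gb(batch=1, seq_len=4096, layers=32, heads=32, head_dim=128)
print(f"KV cache: ~{kv:.1f} GB")                           # ~2.1 GB
print(f"Total RAM: ~{total_ram_gb(7e9, 2.0, kv):.1f} GB")  # ~18 GB
```

The same two functions reproduce the table below if you drop the sequence length to about 2,048 tokens and plug in each model's own layer and head counts.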
Practical Examples
- Ollama: 7B model = ~16–32 GB RAM. 13B = 32–64 GB.
- LM Studio: Recommends 16 GB+ for most models.
- Kolosal.ai: Optimized for 1–3B on ~8–16 GB. Ideal for low-end PCs.
Typical Memory Needs by Model Size
| Model Size | Weights (FP16) | KV Cache (FP16, ≈2K-token context) | Recommended System RAM |
|---|---|---|---|
| 7B | ~14 GB | ~1 GB | 16–32 GB |
| 13B | ~26 GB | ~1.6 GB | 32–64 GB |
| 33B | ~66 GB | ~3 GB | 80+ GB |
Tips for CPU-Only Optimization
- Quantize weights: Use INT8 or INT4 to cut memory use by 2–4×.
- Reduce context: Halve token count to halve KV memory.
- Keep batch size = 1: Simplest way to reduce usage.
- Efficient attention: Use FlashAttention or grouped-query attention (GQA) where supported; GQA in particular shrinks the KV cache by sharing key/value heads.
- Smaller models: Pick 1–3B if limited RAM.
- Disk offload: Memory-map weights (mmap) so only the pages in active use occupy RAM (slower).
- Use good tools: Kolosal.ai, llama.cpp, and Ollama automate many of these optimizations; see the sketch after this list.
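As a concrete CPU-only example that combines several of these tips (a 4-bit quantized model, a smaller context window, memory-mapped weights, and a thread count matched to your CPU), here is a minimal sketch assuming the llama-cpp-python bindings; the GGUF file path is a placeholder, and parameter defaults may vary slightly between versions:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF (~4 GB on disk)
    n_ctx=2048,       # smaller context window -> smaller KV cache
    n_threads=8,      # match your physical CPU core count
    use_mmap=True,    # memory-map weights so pages load from disk on demand
    n_gpu_layers=0,   # CPU-only: keep every layer off the GPU
)

result = llm("Explain the KV cache in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```

With a 4-bit 7B model, a setup like this typically fits within roughly 8 GB of RAM, in line with the quantization savings described above.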
FAQ
- How much RAM for 7B? 16–32 GB recommended.
- What is KV cache? Stores key/value vectors for fast generation. Grows with context.
- Can I run LLMs without GPU? Yes, CPU works fine with enough RAM. Use quantized models.
- How does sequence length affect RAM? Linearly. Longer input = more KV cache needed.
- Tools to estimate RAM? Hugging Face Accelerate, Kolosal.ai calculators, or param × bytes estimate.
Estimating RAM for local LLMs is key for smooth performance. Sum model weight + KV cache to plan your setup. Use smaller or quantized models for low-end PCs. Try Kolosal.ai for an easy, optimized way to run LLMs locally.