Content is user-generated and unverified.

Ollama Memory Estimation Formulas (GraphSize)

1. Legend of Variables

These variables are extracted from the GGUF Model Header.

| Symbol | Code Variable | Description |
|---|---|---|
| $B$ | batch | Batch size (default 512) |
| $C$ | context | Total Context Window. Important: $C = \text{num\_ctx} \times \text{num\_parallel}$ (pre-multiplied in code) |
| $N_p$ | numParallel | Number of parallel sequences |
| $E$ | embedding | Embedding Length (hidden_size) |
| $H$ | heads | Attention Head Count (max across layers) |
| $H_{kv}$ | headsKV | Key-Value Head Count (max, for GQA) |
| $H^{(i)}$ | headsArr[i] | Head count for layer $i$ |
| $H_{kv}^{(i)}$ | headsKVArr[i] | KV head count for layer $i$ |
| $D$ | embeddingHeads | Dimension per head (max): $E / H_{min}$ |
| $D_k$ | embeddingHeadsK | Dimension of Key Head |
| $D_v$ | embeddingHeadsV | Dimension of Value Head |
| $V$ | vocab | Vocabulary Size (from tokenizer.ggml.tokens array size) |
| $P$ | bytesPerElement | Precision of the KV Cache: f16=2, q8_0=1, q4_0=0.5, f32=4 |
| $4$ | - | Represents 4 bytes (float32), used for graph activation tensors |

2. Static Memory: KV Cache (The "Generic Loop")

This calculates the permanent storage required for the context history. It iterates through every layer $i$ from 0 to block_count - 1.

A. Standard Attention Layers (Transformer)

Used when both $H^{(i)} > 0$ and $H_{kv}^{(i)} > 0$.

$$\text{KV}_{layer}^{(i)} = C \times (D_k + D_v) \times H_{kv}^{(i)} \times P$$
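
As a rough illustration of the formula above, the following Go sketch evaluates the per-layer KV size. The function name and the example dimensions (8192 context, 128-wide heads, 8 KV heads, 32 layers) are assumptions made for this example, not values read from any particular model.

```go
package main

import "fmt"

// kvAttentionLayerBytes evaluates C * (D_k + D_v) * H_kv * P for one layer.
// bytesPerElement is a float64 so that q4_0 (0.5 bytes) can be represented.
// The function name is hypothetical; it only mirrors the formula in the text.
func kvAttentionLayerBytes(context, headDimK, headDimV, headsKV uint64, bytesPerElement float64) uint64 {
	elements := context * (headDimK + headDimV) * headsKV
	return uint64(float64(elements) * bytesPerElement)
}

func main() {
	// Assumed example: C = 8192 (num_ctx 8192, num_parallel 1), D_k = D_v = 128,
	// H_kv = 8, f16 cache (P = 2).
	perLayer := kvAttentionLayerBytes(8192, 128, 128, 8, 2.0)
	fmt.Println("bytes per layer:", perLayer)
	fmt.Println("bytes for 32 identical layers:", perLayer*32)
}
```

With these assumed inputs one layer needs 32 MiB, so 32 identical layers need 1 GiB; switching to q4_0 divides that by four.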

B. Recurrent / SSM Layers (Mamba, etc.)

Used when $H^{(i)} = 0$ OR $H_{kv}^{(i)} = 0$ for a layer.

SSM Parameters:

  • $d_{conv}$: ssm.conv_kernel
  • $d_{state}$: ssm.state_size
  • $d_{inner}$: ssm.inner_size
  • $n_{groups}$: ssm.group_count
  • $P_{rec} = 4$ (always f32 for recurrent)

Intermediate calculations: $$N_{embdR} = \begin{cases} (d_{conv} - 1) \times (d_{inner} + 2 \times n_{groups} \times d_{state}) & \text{if } d_{conv} > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$N_{embdS} = d_{state} \times d_{inner}$$

Final KV size: $$\text{KV}_{layer}^{(i)} = (N_{embdR} + N_{embdS}) \times P_{rec}$$
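
The recurrent-layer calculation can be sketched in Go as follows. The function name and the example SSM parameters are hypothetical; only the arithmetic follows the formulas above.

```go
package main

import "fmt"

// recurrentLayerBytes evaluates (N_embdR + N_embdS) * P_rec with P_rec = 4 (f32).
// Note that the result does not depend on the context length C.
func recurrentLayerBytes(dConv, dState, dInner, nGroups uint64) uint64 {
	var nEmbdR uint64
	if dConv > 0 {
		nEmbdR = (dConv - 1) * (dInner + 2*nGroups*dState)
	}
	nEmbdS := dState * dInner
	const bytesPerElementRecurrent = 4
	return (nEmbdR + nEmbdS) * bytesPerElementRecurrent
}

func main() {
	// Assumed example: conv_kernel 4, state_size 16, inner_size 4096, group_count 1.
	fmt.Println(recurrentLayerBytes(4, 16, 4096, 1))
	// With no convolution kernel only the state term remains.
	fmt.Println(recurrentLayerBytes(0, 16, 4096, 1))
}
```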


3. Dynamic Memory: Graph Overhead (The "Switch")

This calculates the temporary scratchpad memory. The system calculates fullOffload (all on GPU) and partialOffload (split CPU/GPU).


Case: "llama", "llama4" (Includes Mistral/Mixtral)

Base Formula:

$$\text{Full} = \max\bigl(4B(1 + 4E + C(1+H)),\; 4B(E+V)\bigr)$$

$$\text{Partial} = 4BE + \max\bigl(A_{attn},\; A_{logit}\bigr)$$

Where: $$A_{attn} = 4B\bigl(1 + E + \max(C, E)\bigr) + \frac{9E^2}{16} + 4C(BH + D \cdot H_{kv})$$

$$A_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$
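
To make the base formulas concrete, here is a minimal Go sketch that evaluates Full and Partial. The function names are hypothetical, the example dimensions are assumed (roughly the shape of an 8B Llama-style model), and integer division is assumed for the 9/16 and 105/128 factors.

```go
package main

import "fmt"

// maxU64 returns the larger of two uint64 values.
func maxU64(a, b uint64) uint64 {
	if a > b {
		return a
	}
	return b
}

// llamaGraphBytes evaluates the base "llama" Full and Partial formulas.
// b=B, c=C (already num_ctx * num_parallel), e=E, h=H, hkv=H_kv, d=D, v=V.
func llamaGraphBytes(b, c, e, h, hkv, d, v uint64) (full, partial uint64) {
	full = maxU64(4*b*(1+4*e+c*(1+h)), 4*b*(e+v))
	attn := 4*b*(1+e+maxU64(c, e)) + 9*e*e/16 + 4*c*(b*h+d*hkv)
	logit := 4*b*(e+v) + 105*e*v/128
	partial = 4*b*e + maxU64(attn, logit)
	return full, partial
}

func main() {
	// Assumed example: B=512, C=8192, E=4096, H=32, H_kv=8, D=128, V=128256.
	full, partial := llamaGraphBytes(512, 8192, 4096, 32, 8, 128, 128256)
	fmt.Println("full offload bytes:   ", full)
	fmt.Println("partial offload bytes:", partial)
}
```

For these assumed inputs the logit term dominates the partial estimate, which is why Partial comes out larger than Full.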


Sub-Case: Mixtral 8x22B (MoE)

Detected if blk.0.ffn_gate_exps.weight exists.

  • $FF$: feed_forward_length
  • $W_{gate}$: Size of ffn_gate_exps.weight tensor

$$\text{Partial} = \max\Bigl(3W_{gate} + 4B(2FF + H_{kv} + E + C + D_k \cdot H_{kv}),\; A_{moe}\Bigr)$$

Where: $$A_{moe} = 4\bigl(C \cdot B \cdot H + C \cdot D_k \cdot H_{kv} + 1024B + D_k \cdot H_{kv} \cdot B\bigr)$$


Sub-Case: Mixtral 8x7B (MoE)

Detected if blk.0.ffn_gate.0.weight exists.

  • $W_{dim}$: Shape[1] of ffn_gate.0.weight

Full Offload: $$\text{Full} = 4B(2 + 3E + C(1+H) + 2H_{kv} + W_{dim})$$

Partial Offload: $$\text{Partial} = \max\bigl(P_1,\; P_2\bigr)$$

Where: $$P_1 = 4B(3 + D_k \cdot H_{kv} + E + C(1+H) + W_{dim}) + \frac{9(E^2 + 3E \cdot H_{kv} \cdot W_{dim})}{16}$$

$$P_2 = 4B(1 + 2E + C(1+H)) + E\Bigl(\frac{6C \cdot H_{kv}}{H} + \frac{9E}{16}\Bigr)$$


Case: "mllama" (Llama 3.2 Vision)

Special KV Logic:

For layers listed in attention.cross_attention_layers, the KV size is overwritten:

$$\text{KV}_{vision} = H_{kv} \times (D_k + D_v) \times 4 \times 1601 \times 4$$

Where:

  • $4$ (first): sizeof(float32)
  • $1601$: vision tokens
  • $4$ (second): tiles

Graph Overhead:

$$\text{Full} = \max\bigl(4B(2 + 3E + D_k \cdot H + C(1+H)),\; 4B(E+V)\bigr)$$

$$\text{Partial} = \max\bigl(P_{attn},\; P_{logit}\bigr)$$

Where: $$P_{attn} = 4\bigl(B(2E + 1 + C(1+H) + D_k \cdot H) + R_{freq} + D_k \cdot C \cdot H_{kv}\bigr)$$

$$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

  • $R_{freq}$: Element count of rope_freqs.weights tensor (0 if not present)

Case: "gemma", "gemma2", "gemma3", "gemma3n"

Base Formula:

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(2 + C + CH + 2E + 2D_k H)\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4EB + \frac{105 \cdot E \cdot V}{128} + 4VB$$

$$P_{attn} = 4B(2E + 1 + 2D_k H + C + CH) + 4D_k C \cdot 8 + \frac{9 \cdot E \cdot D_k \cdot H}{16}$$


Sub-Case: Gemma 3N

Multiplies both Full and Partial results by 4:

$$\text{Full}_{gemma3n} = 4 \times \text{Full}_{base}$$ $$\text{Partial}_{gemma3n} = 4 \times \text{Partial}_{base}$$


Sub-Case: Gemma 3 (Sliding Window)

Special KV Logic: Modifies the Generic Loop.

  • Global layers: Every 6th layer (where $(i+1) \mod 6 = 0$) uses full context $C$
  • Sliding layers: All other layers use:

$$C_{sliding} = N_p \times \text{sliding\_window} + B$$

For sliding layers: $$\text{KV}_{layer}^{(i)} = C_{sliding} \times (D_k + D_v) \times H_{kv} \times P$$
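
The global/sliding layer rule can be sketched in Go as a function that returns the effective context length for a given layer. The function name and the example values (a sliding window of 1024 tokens, 32768 total context) are assumptions for illustration.

```go
package main

import "fmt"

// gemma3LayerContext returns the context length used for layer i's KV cache:
// every 6th layer (where (i+1) mod 6 == 0) is global and uses the full context C,
// all other layers use C_sliding = N_p * sliding_window + B.
func gemma3LayerContext(layer int, context, numParallel, slidingWindow, batch uint64) uint64 {
	if (layer+1)%6 == 0 {
		return context
	}
	return numParallel*slidingWindow + batch
}

func main() {
	// Assumed example: C=32768, N_p=1, sliding_window=1024, B=512.
	for layer := 0; layer < 6; layer++ {
		fmt.Printf("layer %d: context %d\n", layer, gemma3LayerContext(layer, 32768, 1, 1024, 512))
	}
}
```

Under these assumptions five out of every six layers cache 1536 positions instead of 32768, which is where most of the KV saving comes from.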


Case: "command-r"

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(2 + 4E + C(1+H))\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_{attn} = 4B(1 + 2E + C(1+H)) + 4EC + \frac{9E^2}{16}$$


Case: "qwen2" (Qwen 2 / 2.5)

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(1 + 2E + C + CH)\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_{attn} = 4\bigl(B(1 + 2E + C(1+H)) + E(1 + C)\bigr)$$


Case: "phi2"

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(1 + 4E + C + CH)\bigr)$$

$$\text{Partial} = \max\bigl(P_1,\; P_2\bigr)$$

Where: $$P_1 = 4B(2E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_2 = 4B(2 + 3E + C + CH)$$


Case: "stablelm"

$$\text{Full} = 4B(C(1+H) + 3E + 2)$$

$$\text{Partial} = \max\bigl(4B(V + 2E),\; \text{Full}\bigr)$$


Case: "deepseek2" (MoE)

$$\text{Full} = \max\bigl(4B(3E+V),\; 4B(3E + 2 + C(1+H_{kv}) + 2D_k H_{kv})\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4B(3E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_{attn} = 4B(2E + 1 + 2D_k H_{kv} + C + C \cdot H_{kv}) + 4D_k C \cdot H_{kv} + \frac{9 \cdot E \cdot D_k \cdot H_{kv}}{16}$$


Case: "chatglm"

Base: $$\text{Full}_{base} = 4B(E + V)$$ $$\text{Partial}_{base} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

With attn_qkv.bias present:

Let $S_{bias}$ = Shape[0] of blk.0.attn_qkv.bias

$$\text{Full} = \max\bigl(\text{Full}_{base},\; 4B(2 + 2E + C + CH + D_k H + S_{bias})\bigr)$$

$$\text{Partial} = \max\bigl(\text{Partial}_{base},\; P_{attn}\bigr)$$

Where: $$P_{attn} = 4B(1 + 2E + D_k H + C + CH) + 4D_k C + 4C \cdot D_k + 4S_{bias}$$


Case: "gptoss", "gpt-oss"

Special KV Logic:

This completely overrides the Generic Loop.

For each layer $i$:

$$\text{KV}^{(i)} = (D_k + D_v) \times H_{kv} \times P \times \begin{cases} N_p \times 4096 + B & \text{if } i \bmod 2 = 0 \text{ (even)} \\ C & \text{if } i \bmod 2 = 1 \text{ (odd)} \end{cases}$$

Graph Overhead:

$$\text{Full} = 0 \text{ (not explicitly set)}$$

$$\text{Partial} = \frac{2H}{H_{kv,min}} \times \frac{\text{KV}_{total}}{6}$$

Where $H_{kv,min}$ = minimum KV head count across layers (defaults to 1 if 0).

With Flash Attention enabled:

$$\text{Partial} = \bigl(4N_p + \lfloor C / 1024 \rfloor + 110\bigr) \times 1\text{ MiB}$$

Note: In the flash attention formula, $C$ is the pre-multiplied context ($\text{num\_ctx} \times N_p$), so the actual formula expands to: $$\text{Partial} = \bigl(4N_p + \lfloor (\text{num\_ctx} \times N_p) / 1024 \rfloor + 110\bigr) \times 1\text{ MiB}$$
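
A short Go sketch of the two gpt-oss rules above: the alternating per-layer KV size and the flash-attention graph estimate. Function names and the example head dimensions are hypothetical and chosen only to keep the numbers easy to follow.

```go
package main

import "fmt"

const mebiByte = 1 << 20

// gptossLayerKVBytes applies the per-layer rule: even layers cache
// N_p*4096 + B positions, odd layers cache the full context C.
func gptossLayerKVBytes(layer int, context, numParallel, batch, headDimK, headDimV, headsKV uint64, bytesPerElement float64) uint64 {
	length := context
	if layer%2 == 0 {
		length = numParallel*4096 + batch
	}
	return uint64(float64(length*(headDimK+headDimV)*headsKV) * bytesPerElement)
}

// gptossFlashPartialBytes evaluates (4*N_p + floor(C/1024) + 110) MiB,
// where context is the pre-multiplied C = num_ctx * N_p.
func gptossFlashPartialBytes(context, numParallel uint64) uint64 {
	return (4*numParallel + context/1024 + 110) * mebiByte
}

func main() {
	// Assumed example: C=8192, N_p=1, B=512, D_k=D_v=64, H_kv=8, f16 cache.
	fmt.Println("even layer KV bytes:", gptossLayerKVBytes(0, 8192, 1, 512, 64, 64, 8, 2.0))
	fmt.Println("odd layer KV bytes: ", gptossLayerKVBytes(1, 8192, 1, 512, 64, 64, 8, 2.0))
	fmt.Println("flash partial bytes:", gptossFlashPartialBytes(8192, 1))
}
```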


4. Helper Functions

KV Cache Bytes Per Element

f16  → 2.0 bytes (default)
q8_0 → 1.0 bytes
q4_0 → 0.5 bytes
f32  → 4.0 bytes (used for recurrent layers)
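
The mapping above could be expressed as a small Go helper along these lines. This is a sketch of the table, not a copy of the Ollama source; treating any unrecognized type as f16 is an assumption based on f16 being listed as the default.

```go
package main

import "fmt"

// kvCacheBytesPerElement maps a KV cache type name to bytes per element.
// Unknown or empty names fall back to f16 (2 bytes), assumed to be the default.
func kvCacheBytesPerElement(cacheType string) float64 {
	switch cacheType {
	case "q8_0":
		return 1
	case "q4_0":
		return 0.5
	case "f32":
		return 4
	default:
		return 2
	}
}

func main() {
	for _, t := range []string{"f16", "q8_0", "q4_0", "f32", ""} {
		fmt.Printf("%q -> %.1f bytes\n", t, kvCacheBytesPerElement(t))
	}
}
```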

Flash Attention Support

Models with automatic flash attention enabled:

  • gemma3
  • gptoss, gpt-oss
  • qwen3, qwen3moe
  • qwen3vl, qwen3vlmoe

Ollama Engine Required

These architectures require the Ollama engine:

  • gemma3, gemma3n
  • gptoss, gpt-oss
  • llama4
  • mistral3
  • mllama
  • qwen25vl
  • qwen3, qwen3moe
  • qwen3vl, qwen3vlmoe
  • deepseekocr
  • deepseek2
  • nomic-bert

5. Layer Assignment & Offloading Logic

This section describes how Ollama uses the calculated memory values to decide what goes on GPU vs CPU. This logic is in llm/server.go.

5.1 Graph Size Selection

Ollama chooses between fullOffload and partialOffload based on whether all layers fit:

if (GPU_layers == total_layers):
    graph_overhead = fullOffload
else:
    graph_overhead = partialOffload

Key insight: Even if just ONE layer is on CPU, the larger partialOffload graph overhead is used.

5.2 Fallback for Missing Graph Sizes

If an architecture doesn't have a case in the switch statement (or returns 0):

$$\text{graphPartialOffload}_{fallback} = \frac{H}{H_{kv,min}} \times \frac{\text{KV}_{total}}{6}$$

$$\text{graphFullOffload}_{fallback} = \text{graphPartialOffload}_{fallback}$$

Where $H_{kv,min}$ defaults to 1 if 0.
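
A minimal Go sketch of the fallback, assuming integer arithmetic and that the head ratio is computed before multiplying by the KV total. The function name and example values are hypothetical.

```go
package main

import "fmt"

// fallbackGraphBytes evaluates (H / H_kv,min) * KV_total / 6 and reuses the
// result for the full-offload estimate. A zero H_kv,min is treated as 1.
func fallbackGraphBytes(headsMax, headsKVMin, kvTotal uint64) (partial, full uint64) {
	if headsKVMin == 0 {
		headsKVMin = 1
	}
	gqa := headsMax / headsKVMin // grouped-query ratio, integer division assumed
	partial = gqa * kvTotal / 6
	full = partial
	return partial, full
}

func main() {
	// Assumed example: H=32, H_kv,min=8, KV total of 1 GiB.
	partial, full := fallbackGraphBytes(32, 8, 1<<30)
	fmt.Println("partial:", partial, "full:", full)
}
```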

5.3 Safety Buffer Reservation

Before calculating layer fits, Ollama reserves a buffer on each GPU:

$$\text{Reserved} = \text{blk.0\_weights} + \text{kv}[0]$$

This prevents edge-case OOM errors.

5.4 Layer Assignment Algorithm

The assignLayers function works as follows:

  1. Sort GPUs by free memory (descending)
  2. Calculate per-layer size: layer_size[i] = weights[i] + kv_cache[i]
  3. Reserve overhead per GPU: $$\text{Available} = \text{FreeMemory} - \text{Backoff} - \text{MinimumMemory} - \text{GpuOverhead} - \text{Graph}$$
  4. Greedy assignment (from last layer backwards):
    • Start with GPU with most free space
    • Assign layers until GPU is full
    • Spill to next GPU
    • If no GPU has space, remaining layers go to CPU
  5. Output layer handling: If not everything fits, try dropping the output layer and retry
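
The greedy part of the steps above (steps 1 and 4) can be sketched as follows. This is a simplified illustration, not the actual assignLayers implementation: the gpu type, the function name, and the example sizes are hypothetical, and the available field is assumed to already have the step 3 overheads subtracted. Output-layer handling (step 5) is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// gpu is a hypothetical stand-in for a device and its usable memory.
type gpu struct {
	name      string
	available uint64
}

// assignLayersGreedy walks layers from last to first, filling the GPU with the
// most available memory first and spilling to the next one when a layer no
// longer fits. Layers that fit nowhere stay on "cpu".
func assignLayersGreedy(layerSizes []uint64, gpus []gpu) []string {
	sorted := append([]gpu(nil), gpus...)
	sort.Slice(sorted, func(a, b int) bool { return sorted[a].available > sorted[b].available })

	placement := make([]string, len(layerSizes))
	for i := range placement {
		placement[i] = "cpu"
	}

	g := 0
	for i := len(layerSizes) - 1; i >= 0; i-- {
		for g < len(sorted) && sorted[g].available < layerSizes[i] {
			g++ // current GPU is full, spill to the next one
		}
		if g == len(sorted) {
			break // no GPU has space, remaining layers stay on CPU
		}
		sorted[g].available -= layerSizes[i]
		placement[i] = sorted[g].name
	}
	return placement
}

func main() {
	layers := []uint64{100, 100, 100, 100, 100, 100}
	gpus := []gpu{{"gpu0", 250}, {"gpu1", 350}}
	fmt.Println(assignLayersGreedy(layers, gpus))
}
```

In this example the last three layers land on gpu1 (the larger device), the next two on gpu0, and layer 0 falls back to the CPU.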

5.5 The Split Calculation Formula

To predict the GPU/CPU split percentage:

$$\text{GPU Ratio} = \frac{\text{Layers on GPU}}{\text{Total Layers}}$$

Where layers on GPU is determined by:

$$\text{Available VRAM for weights} = \text{VRAM} - \text{KV Cache} - \text{Graph Overhead} - \text{Buffer}$$

$$\text{Layers on GPU} = \left\lfloor \frac{\text{Available VRAM for weights}}{\text{Average layer size}} \right\rfloor$$
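
The two formulas above can be sketched in Go as shown here. The function name and all numbers (given in MiB, with 40 layers of 450 MiB each) are made up for illustration; clamping to the total layer count and guarding against underflow are additions of this sketch.

```go
package main

import "fmt"

// layersOnGPU evaluates floor((VRAM - KV - Graph - Buffer) / average layer size),
// clamped to [0, totalLayers]. All sizes must use the same unit.
func layersOnGPU(vram, kvCache, graph, buffer, avgLayer, totalLayers uint64) uint64 {
	reserved := kvCache + graph + buffer
	if avgLayer == 0 || reserved >= vram {
		return 0
	}
	n := (vram - reserved) / avgLayer
	if n > totalLayers {
		n = totalLayers
	}
	return n
}

func main() {
	// Assumed example in MiB: 24 GiB VRAM, 4 GiB KV, 2 GiB graph, 512 MiB buffer.
	const total = 40
	n := layersOnGPU(24576, 4096, 2048, 512, 450, total)
	fmt.Printf("layers on GPU: %d of %d (ratio %.3f)\n", n, total, float64(n)/total)
}
```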

5.6 Complete Memory Budget Equation

For a single GPU setup, the total memory required is:

$$\text{Total} = \underbrace{W_{model}}_{\text{Weights}} + \underbrace{\sum_{i=0}^{L-1} \text{KV}^{(i)}}_{\text{KV Cache}} + \underbrace{\text{Graph}}_{\text{Overhead}} + \underbrace{\text{Buffer}}_{\text{Safety}}$$

The split happens when:

$$\text{Total} > \text{VRAM}$$

In this case:

  1. KV Cache is allocated first (always on GPU if layers are on GPU)
  2. Graph overhead is allocated
  3. Remaining VRAM is filled with layer weights
  4. Overflow goes to CPU RAM

5.7 Verification Formula

To verify your observed split:

$$\text{GPU \%} = \frac{\text{VRAM} - \text{KV}_{total} - \text{Graph}_{partial} - \text{Buffer}}{\text{Model Weights}}$$

If result < 1.0, you will see partial CPU offloading.

Example for Command-R 35B @ 32K context on L4 (24GB):

| Component | Value |
|---|---|
| VRAM | 24 GB |
| KV Cache | 4.88 GB |
| Graph (partial) | 5.01 GB |
| Buffer (~1 layer) | ~0.5 GB |
| Available for weights | 13.61 GB |
| Model Weights | 19 GB |
| GPU Ratio | 71.6% |
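
The arithmetic in the example can be reproduced with a few lines of Go. The function name is hypothetical; the inputs are the values from the example, in GB.

```go
package main

import "fmt"

// gpuWeightFraction evaluates (VRAM - KV_total - Graph_partial - Buffer) / Model Weights.
// All arguments must share the same unit.
func gpuWeightFraction(vram, kvTotal, graphPartial, buffer, weights float64) float64 {
	return (vram - kvTotal - graphPartial - buffer) / weights
}

func main() {
	// Values from the Command-R example above, in GB.
	fraction := gpuWeightFraction(24, 4.88, 5.01, 0.5, 19)
	fmt.Printf("available for weights: %.2f GB\n", 24-4.88-5.01-0.5)
	fmt.Printf("GPU ratio: %.1f%%\n", fraction*100)
}
```

This prints 13.61 GB available and a GPU ratio of 71.6%, matching the example.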

6. Memory Estimation Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                         OLLAMA MEMORY ESTIMATION FLOW                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  STEP 1: GraphSize() in ggml.go                                      │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Part A: Generic KV Cache Loop                                 │  │   │
│  │  │  - Iterates all layers                                         │  │   │
│  │  │  - Calculates KV per layer based on attention/recurrent type   │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Part B: Architecture Switch                                   │  │   │
│  │  │  - Calculates fullOffload and partialOffload                   │  │   │
│  │  │  - May override KV cache (gemma3, gptoss, mllama)              │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  Returns: kv[], partialOffload, fullOffload                          │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    ↓                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  STEP 2: Load() in server.go                                         │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Reserve Safety Buffer                                         │  │   │
│  │  │  FreeMemory -= (blk.0 weights + kv[0])                         │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Fallback Graph Size (if architecture returned 0)              │  │   │
│  │  │  graphPartialOffload = (H / H_kv) * kvTotal / 6                │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  createLayout() → buildLayout() → assignLayers()               │  │   │
│  │  │  - Sorts GPUs by free memory                                   │  │   │
│  │  │  - Greedily assigns layers from last→first                     │  │   │
│  │  │  - Spills to CPU when GPU full                                 │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Select Final Graph Size                                       │  │   │
│  │  │  if (all layers on GPU) → fullOffload                          │  │   │
│  │  │  else → partialOffload                                         │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    ↓                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  FINAL MEMORY ALLOCATION                                             │   │
│  │                                                                       │   │
│  │  GPU: Σ(layer_weights + layer_kv) for assigned layers + graph        │   │
│  │  CPU: Σ(layer_weights + layer_kv) for remaining layers               │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

7. Code Reference

7.1 GraphSize Function (fs/ggml/ggml.go)

```go
func (f GGML) GraphSize(context, batch uint64, numParallel int, kvCacheType string, useFlashAttention bool) (kv []uint64, partialOffload, fullOffload uint64)
```

Parameters:

  • context: Base context window size (will be multiplied by numParallel internally)
  • batch: Batch size
  • numParallel: Number of parallel sequences
  • kvCacheType: Cache precision ("f16", "q8_0", "q4_0", "f32")
  • useFlashAttention: Whether flash attention is enabled

Returns:

  • kv: Slice of KV cache sizes per layer (in bytes)
  • partialOffload: Graph overhead for partial GPU offload (in bytes)
  • fullOffload: Graph overhead for full GPU offload (in bytes)

7.2 Load Function (llm/server.go)

```go
func (s *llamaServer) Load(ctx context.Context, systemInfo ml.SystemInfo, gpus []ml.DeviceInfo, requireFull bool) ([]ml.DeviceID, error)
```

Key operations:

  1. Calls GraphSize() to get memory estimates
  2. Reserves safety buffer (one layer)
  3. Applies fallback formulas if needed
  4. Creates layout via createLayout() → buildLayout() → assignLayers()
  5. Selects final graph size based on offload status

7.3 Key Environment Variables

| Variable | Default | Description |
|---|---|---|
| OLLAMA_NUM_PARALLEL | 1 | Number of parallel sequences (multiplies context) |
| OLLAMA_CONTEXT_LENGTH | Model default | Context window size |
| OLLAMA_FLASH_ATTENTION | Model-dependent | Enable flash attention |
| OLLAMA_KV_CACHE_TYPE | "f16" | KV cache precision |
| OLLAMA_GPU_OVERHEAD | 0 | Extra VRAM to reserve (bytes) |