Content is user-generated and unverified.

Ollama Memory Estimation Formulas (`GraphSize`)

1. Legend of Variables

These variables are extracted from the GGUF Model Header.

Symbol	Code Variable	Description
$B$	`batch`	Batch size (default 512)
$C$	`context`	Total context window. Important: $C = \text{num\_ctx} \times \text{num\_parallel}$; the code multiplies `context` by `numParallel` before any formula is applied
$N_p$	`numParallel`	Number of parallel sequences
$E$	`embedding`	Embedding Length (`hidden_size`)
$H$	`heads`	Attention Head Count (max across layers)
$H_{kv}$	`headsKV`	Key-Value Head Count (max, for GQA)
$H^{(i)}$	`headsArr[i]`	Head count for layer $i$
$H_{kv}^{(i)}$	`headsKVArr[i]`	KV head count for layer $i$
$D$	`embeddingHeads`	Dimension per head (max): $E / H_{min}$
$D_k$	`embeddingHeadsK`	Dimension of Key Head
$D_v$	`embeddingHeadsV`	Dimension of Value Head
$V$	`vocab`	Vocabulary Size (from `tokenizer.ggml.tokens` array size)
$P$	`bytesPerElement`	Precision of the KV Cache: `f16`=2, `q8_0`=1, `q4_0`=0.5, `f32`=4
$4$	-	Represents 4 bytes (float32), used for graph activation tensors

2. Static Memory: KV Cache (The "Generic Loop")

This calculates the permanent storage required for the context history. It iterates through every layer $i$ from 0 to block_count - 1.

A. Standard Attention Layers (Transformer)

Used when both $H^{(i)} > 0$ and $H_{kv}^{(i)} > 0$.

$$\text{KV}_{layer}^{(i)} = C \times (D_k + D_v) \times H_{kv}^{(i)} \times P$$
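As a minimal sketch of the formula above, the per-layer cost can be computed directly. The function name and the example model dimensions (32 layers, 8 KV heads, 128-dimensional heads) are illustrative assumptions, not values taken from Ollama.

```python
def attention_kv_bytes(context, head_dim_k, head_dim_v, kv_heads, bytes_per_element):
    """KV cache bytes for one attention layer: C * (D_k + D_v) * H_kv * P."""
    return int(context * (head_dim_k + head_dim_v) * kv_heads * bytes_per_element)


# Hypothetical model: f16 cache (P = 2), C = 8192 (num_ctx 8192, one parallel sequence).
per_layer = attention_kv_bytes(8192, 128, 128, 8, 2)
total = 32 * per_layer  # assumes all 32 layers are attention layers of equal shape

print(f"per layer: {per_layer / 2**20:.0f} MiB, total: {total / 2**30:.0f} GiB")
```

Because every factor is multiplicative, doubling the context or the KV head count doubles the result, and switching the cache from `f16` to `q8_0` halves it.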

B. Recurrent / SSM Layers (Mamba, etc.)

Used when $H^{(i)} = 0$ OR $H_{kv}^{(i)} = 0$ for a layer.

SSM Parameters:

$d_{conv}$: ssm.conv_kernel
$d_{state}$: ssm.state_size
$d_{inner}$: ssm.inner_size
$n_{groups}$: ssm.group_count
$P_{rec} = 4$ (always f32 for recurrent)

Intermediate calculations: $$N_{embdR} = \begin{cases} (d_{conv} - 1) \times (d_{inner} + 2 \times n_{groups} \times d_{state}) & \text{if } d_{conv} > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$N_{embdS} = d_{state} \times d_{inner}$$

Final KV size: $$\text{KV}_{layer}^{(i)} = (N_{embdR} + N_{embdS}) \times P_{rec}$$
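The recurrent-layer formulas translate into a few lines of arithmetic. The sketch below assumes the four SSM parameters have already been read from the model header; the function name and the sample values are hypothetical.

```python
def recurrent_kv_bytes(d_conv, d_state, d_inner, n_groups):
    """State bytes for one recurrent/SSM layer, always stored as f32 (4 bytes)."""
    if d_conv > 0:
        n_embd_r = (d_conv - 1) * (d_inner + 2 * n_groups * d_state)
    else:
        n_embd_r = 0
    n_embd_s = d_state * d_inner
    return (n_embd_r + n_embd_s) * 4


# Hypothetical SSM layer: conv kernel 4, state size 16, inner size 4096, one group.
ssm_layer = recurrent_kv_bytes(d_conv=4, d_state=16, d_inner=4096, n_groups=1)
print(ssm_layer, "bytes")
```

Note that, as written, this size contains no $C$ term, so it does not grow with the context window.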

3. Dynamic Memory: Graph Overhead (The "Switch")

This calculates the temporary scratchpad memory. The system calculates fullOffload (all on GPU) and partialOffload (split CPU/GPU).

Case: "llama", "llama4" (Includes Mistral/Mixtral)

Base Formula:

$$\text{Full} = \max\bigl(4B(1 + 4E + C(1+H)),\; 4B(E+V)\bigr)$$

$$\text{Partial} = 4BE + \max\bigl(A_{attn},\; A_{logit}\bigr)$$

Where: $$A_{attn} = 4B\bigl(1 + E + \max(C, E)\bigr) + \frac{9E^2}{16} + 4C(BH + D \cdot H_{kv})$$

$$A_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$
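The base "llama" formulas can be evaluated as follows. This is a sketch under two assumptions: the function name is invented for illustration, and the fractional terms use integer floor division to imitate unsigned integer arithmetic, which may round differently from the real implementation.

```python
def llama_graph_bytes(B, C, E, H, H_kv, D, V):
    """Return (full, partial) graph overhead in bytes for the base llama case."""
    full = max(4 * B * (1 + 4 * E + C * (1 + H)), 4 * B * (E + V))
    a_attn = (4 * B * (1 + E + max(C, E))
              + 9 * E * E // 16
              + 4 * C * (B * H + D * H_kv))
    a_logit = 4 * B * (E + V) + 105 * E * V // 128
    partial = 4 * B * E + max(a_attn, a_logit)
    return full, partial


# Tiny hand-checkable inputs (not a real model).
small = llama_graph_bytes(B=2, C=16, E=32, H=4, H_kv=2, D=8, V=64)

# Hypothetical 8B-class shape, printed in MiB for readability.
full, partial = llama_graph_bytes(B=512, C=8192, E=4096, H=32, H_kv=8, D=128, V=128256)
print(f"full: {full / 2**20:.1f} MiB, partial: {partial / 2**20:.1f} MiB")
```

The $C(1+H)$ term inside `full` and the $4C \cdot BH$ term inside `a_attn` both scale with context times head count, so long contexts dominate the graph overhead well before the vocabulary term does.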

Sub-Case: Mixtral 8x22B (MoE)

Detected if blk.0.ffn_gate_exps.weight exists.

$FF$: feed_forward_length
$W_{gate}$: Size of ffn_gate_exps.weight tensor

$$\text{Partial} = \max\Bigl(3W_{gate} + 4B(2FF + H_{kv} + E + C + D_k \cdot H_{kv}),\; A_{moe}\Bigr)$$

Where: $$A_{moe} = 4\bigl(C \cdot B \cdot H + C \cdot D_k \cdot H_{kv} + 1024B + D_k \cdot H_{kv} \cdot B\bigr)$$

Sub-Case: Mixtral 8x7B (MoE)

Detected if blk.0.ffn_gate.0.weight exists.

$W_{dim}$: Shape[1] of ffn_gate.0.weight

Full Offload: $$\text{Full} = 4B(2 + 3E + C(1+H) + 2H_{kv} + W_{dim})$$

Partial Offload: $$\text{Partial} = \max\bigl(P_1,\; P_2\bigr)$$

Where: $$P_1 = 4B(3 + D_k \cdot H_{kv} + E + C(1+H) + W_{dim}) + \frac{9(E^2 + 3E \cdot H_{kv} \cdot W_{dim})}{16}$$

$$P_2 = 4B(1 + 2E + C(1+H)) + E\Bigl(\frac{6C \cdot H_{kv}}{H} + \frac{9E}{16}\Bigr)$$

Case: "mllama" (Llama 3.2 Vision)

Special KV Logic:

For layers listed in attention.cross_attention_layers, the KV size is overwritten:

$$\text{KV}_{vision} = H_{kv} \times (D_k + D_v) \times 4 \times 1601 \times 4$$

Where:

$4$ (first): sizeof(float32)
$1601$: vision tokens
$4$ (second): tiles
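A short sketch of the cross-attention override, with the three constants named after the roles listed above. The function name and the sample head configuration are assumptions made for illustration.

```python
FLOAT32_BYTES = 4     # first 4 in the formula
VISION_TOKENS = 1601  # vision tokens per tile
TILES = 4             # second 4 in the formula


def mllama_cross_attention_kv_bytes(kv_heads, head_dim_k, head_dim_v):
    """KV bytes for one cross-attention layer; independent of the text context."""
    return kv_heads * (head_dim_k + head_dim_v) * FLOAT32_BYTES * VISION_TOKENS * TILES


# Hypothetical configuration: 8 KV heads with 128-dimensional K and V heads.
vision_kv = mllama_cross_attention_kv_bytes(8, 128, 128)
print(f"{vision_kv / 2**20:.1f} MiB per cross-attention layer")
```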

Graph Overhead:

$$\text{Full} = \max\bigl(4B(2 + 3E + D_k \cdot H + C(1+H)),\; 4B(E+V)\bigr)$$

$$\text{Partial} = \max\bigl(P_{attn},\; P_{logit}\bigr)$$

Where: $$P_{attn} = 4\bigl(B(2E + 1 + C(1+H) + D_k \cdot H) + R_{freq} + D_k \cdot C \cdot H_{kv}\bigr)$$

$$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

$R_{freq}$: Element count of rope_freqs.weights tensor (0 if not present)

Case: "gemma", "gemma2", "gemma3", "gemma3n"

Base Formula:

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(2 + C + CH + 2E + 2D_k H)\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4EB + \frac{105 \cdot E \cdot V}{128} + 4VB$$

$$P_{attn} = 4B(2E + 1 + 2D_k H + C + CH) + 4D_k C \cdot 8 + \frac{9 \cdot E \cdot D_k \cdot H}{16}$$

Sub-Case: Gemma 3N

Multiplies both Full and Partial results by 4:

$$\text{Full}_{gemma3n} = 4 \times \text{Full}_{base}$$ $$\text{Partial}_{gemma3n} = 4 \times \text{Partial}_{base}$$

Sub-Case: Gemma 3 (Sliding Window)

Special KV Logic: Modifies the Generic Loop.

Global layers: Every 6th layer (where $(i+1) \mod 6 = 0$) uses full context $C$
Sliding layers: All other layers use:

$$C_{sliding} = N_p \times \text{sliding\_window} + B$$

For sliding layers: $$\text{KV}_{layer}^{(i)} = C_{sliding} \times (D_k + D_v) \times H_{kv} \times P$$
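The global/sliding pattern can be illustrated with a small loop. This sketch follows the rule as stated here (every 6th layer is global) and uses invented dimensions; the function name is hypothetical.

```python
def gemma3_kv_bytes(n_layers, context, num_parallel, sliding_window, batch,
                    d_k, d_v, h_kv, bytes_per_element):
    """Per-layer KV bytes: global layers use C, the rest use C_sliding."""
    c_sliding = num_parallel * sliding_window + batch
    sizes = []
    for i in range(n_layers):
        is_global = (i + 1) % 6 == 0
        tokens = context if is_global else c_sliding
        sizes.append(int(tokens * (d_k + d_v) * h_kv * bytes_per_element))
    return sizes


# Hypothetical 12-layer model: C = 8192, window 1024, batch 512, f16 cache.
sizes = gemma3_kv_bytes(12, 8192, 1, 1024, 512, 256, 256, 4, 2)
print([s // 2**20 for s in sizes])  # MiB per layer, rounded down
```

With these inputs only layers 5 and 11 pay for the full context, which is why the sliding-window total is far below what the generic loop would report.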

Case: "command-r"

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(2 + 4E + C(1+H))\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_{attn} = 4B(1 + 2E + C(1+H)) + 4EC + \frac{9E^2}{16}$$

Case: "qwen2" (Qwen 2 / 2.5)

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(1 + 2E + C + CH)\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_{attn} = 4\bigl(B(1 + 2E + C(1+H)) + E(1 + C)\bigr)$$

Case: "phi2"

$$\text{Full} = \max\bigl(4B(E+V),\; 4B(1 + 4E + C + CH)\bigr)$$

$$\text{Partial} = \max\bigl(P_1,\; P_2\bigr)$$

Where: $$P_1 = 4B(2E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_2 = 4B(2 + 3E + C + CH)$$

Case: "stablelm"

$$\text{Full} = 4B(C(1+H) + 3E + 2)$$

$$\text{Partial} = \max\bigl(4B(V + 2E),\; \text{Full}\bigr)$$

Case: "deepseek2" (MoE)

$$\text{Full} = \max\bigl(4B(3E+V),\; 4B(3E + 2 + C(1+H_{kv}) + 2D_k H_{kv})\bigr)$$

$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$

Where: $$P_{logit} = 4B(3E + V) + \frac{105 \cdot E \cdot V}{128}$$

$$P_{attn} = 4B(2E + 1 + 2D_k H_{kv} + C + C \cdot H_{kv}) + 4D_k C \cdot H_{kv} + \frac{9 \cdot E \cdot D_k \cdot H_{kv}}{16}$$

Case: "chatglm"

Base: $$\text{Full}_{base} = 4B(E + V)$$ $$\text{Partial}_{base} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$

With attn_qkv.bias present:

Let $S_{bias}$ = Shape[0] of blk.0.attn_qkv.bias

$$\text{Full} = \max\bigl(\text{Full}_{base},\; 4B(2 + 2E + C + CH + D_k H + S_{bias})\bigr)$$

$$\text{Partial} = \max\bigl(\text{Partial}_{base},\; P_{attn}\bigr)$$

Where: $$P_{attn} = 4B(1 + 2E + D_k H + C + CH) + 4D_k C + 4C \cdot D_k + 4S_{bias}$$

Case: "gptoss", "gpt-oss"

Special KV Logic:

This completely overrides the Generic Loop.

For each layer $i$:

$$\text{KV}^{(i)} = (D_k + D_v) \times H_{kv} \times P \times \begin{cases} N_p \times 4096 + B & \text{if } i \bmod 2 = 0 \text{ (even)} \\ C & \text{if } i \bmod 2 = 1 \text{ (odd)} \end{cases}$$

Graph Overhead:

$$\text{Full} = 0 \text{ (not explicitly set)}$$

$$\text{Partial} = \frac{2H}{H_{kv,min}} \times \frac{\text{KV}_{total}}{6}$$

Where $H_{kv,min}$ = minimum KV head count across layers (defaults to 1 if 0).

With Flash Attention enabled:

$$\text{Partial} = \bigl(4N_p + \lfloor C / 1024 \rfloor + 110\bigr) \times 1\text{ MiB}$$

Note: In the flash attention formula, $C$ is the pre-multiplied context ($\text{num\_ctx} \times N_p$), so the actual formula expands to: $$\text{Partial} = \bigl(4N_p + \lfloor (\text{num\_ctx} \times N_p) / 1024 \rfloor + 110\bigr) \times 1\text{ MiB}$$
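The gpt-oss rules above combine an alternating KV pattern with two different partial-offload formulas. The following sketch puts them side by side; the function names and sample dimensions are illustrative assumptions, and the left-to-right integer division in the non-flash branch is a guess at how unsigned arithmetic would evaluate the expression.

```python
MIB = 1024 * 1024


def gptoss_kv_bytes(n_layers, context, num_parallel, batch, d_k, d_v, h_kv, p):
    """Per-layer KV bytes: even layers use N_p * 4096 + B tokens, odd layers use C."""
    sizes = []
    for i in range(n_layers):
        tokens = num_parallel * 4096 + batch if i % 2 == 0 else context
        sizes.append(int((d_k + d_v) * h_kv * p * tokens))
    return sizes


def gptoss_partial_bytes(kv_total, heads, h_kv_min, context, num_parallel, flash_attention):
    """Partial-offload graph bytes; `context` is already num_ctx * num_parallel."""
    if flash_attention:
        return (4 * num_parallel + context // 1024 + 110) * MIB
    h_kv_min = h_kv_min or 1  # a zero minimum defaults to 1
    return 2 * heads // h_kv_min * kv_total // 6


# Hypothetical 6-layer model: C = 8192, batch 512, 64-dim heads, 8 KV heads, f16.
kv = gptoss_kv_bytes(6, 8192, 1, 512, 64, 64, 8, 2)
kv_total = sum(kv)
no_fa = gptoss_partial_bytes(kv_total, 64, 8, 8192, 1, flash_attention=False)
with_fa = gptoss_partial_bytes(kv_total, 64, 8, 8192, 1, flash_attention=True)
print(no_fa // MIB, with_fa // MIB)
```

With flash attention the estimate no longer depends on the KV total at all, only on the parallel count and the context length in units of 1024 tokens.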

4. Helper Functions

KV Cache Bytes Per Element

f16  → 2.0 bytes (default)
q8_0 → 1.0 bytes
q4_0 → 0.5 bytes
f32  → 4.0 bytes (used for recurrent layers)
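The mapping above is a plain lookup. In this sketch, treating any unrecognised cache type as `f16` is an assumption based on `f16` being marked as the default; the names are hypothetical.

```python
KV_BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 1.0,
    "q4_0": 0.5,
    "f32": 4.0,
}


def kv_bytes_per_element(cache_type):
    """Bytes per cached element; unknown types fall back to f16 (assumed)."""
    return KV_BYTES_PER_ELEMENT.get(cache_type, 2.0)


for name in ("f16", "q8_0", "q4_0", "f32"):
    print(name, kv_bytes_per_element(name))
```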

Flash Attention Support

Models with automatic flash attention enabled:

gemma3
gptoss, gpt-oss
qwen3, qwen3moe
qwen3vl, qwen3vlmoe

Ollama Engine Required

These architectures require the Ollama engine:

gemma3, gemma3n
gptoss, gpt-oss
llama4
mistral3
mllama
qwen25vl
qwen3, qwen3moe
qwen3vl, qwen3vlmoe
deepseekocr
deepseek2
nomic-bert

5. Layer Assignment & Offloading Logic

This section describes how Ollama uses the calculated memory values to decide what goes on GPU vs CPU. This logic is in llm/server.go.

5.1 Graph Size Selection

Ollama chooses between fullOffload and partialOffload based on whether all layers fit:

if (GPU_layers == total_layers):
    graph_overhead = fullOffload
else:
    graph_overhead = partialOffload

Key insight: Even if just ONE layer is on CPU, the larger partialOffload graph overhead is used.

5.2 Fallback for Missing Graph Sizes

If an architecture doesn't have a case in the switch statement (or returns 0):

$$\text{graphPartialOffload}_{fallback} = \frac{H}{H_{kv,min}} \times \frac{\text{KV}_{total}}{6}$$

$$\text{graphFullOffload}_{fallback} = \text{graphPartialOffload}$$

Where $H_{kv,min}$ defaults to 1 if 0.
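A sketch of the fallback, assuming integer arithmetic evaluated left to right (the function name is hypothetical):

```python
def fallback_graph_bytes(heads, h_kv_min, kv_total):
    """Return (partial, full) when the architecture switch produced no estimate."""
    h_kv_min = h_kv_min or 1  # a zero minimum defaults to 1
    partial = heads // h_kv_min * kv_total // 6
    full = partial            # full offload reuses the partial value
    return partial, full


# Hypothetical model: 32 heads, 8 KV heads, 6 GiB of KV cache in total.
partial, full = fallback_graph_bytes(32, 8, 6 * 2**30)
print(partial / 2**30, "GiB")
```

The ratio $H / H_{kv,min}$ is the GQA group size, so a model with aggressive KV-head sharing gets a proportionally larger fallback estimate for the same cache size.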

5.3 Safety Buffer Reservation

Before calculating layer fits, Ollama reserves a buffer on each GPU:

$$\text{Reserved} = \text{blk.0\_weights} + \text{kv}[0]$$

This prevents edge-case OOM errors.

5.4 Layer Assignment Algorithm

The assignLayers function works as follows:

Sort GPUs by free memory (descending)
Calculate per-layer size: layer_size[i] = weights[i] + kv_cache[i]
Reserve overhead per GPU: $$\text{Available} = \text{FreeMemory} - \text{Backoff} - \text{MinimumMemory} - \text{GpuOverhead} - \text{Graph}$$
Greedy assignment (from last layer backwards):
- Start with GPU with most free space
- Assign layers until GPU is full
- Spill to next GPU
- If no GPU has space, remaining layers go to CPU
Output layer handling: If not everything fits, try dropping the output layer and retry
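The greedy steps above can be sketched as follows. This is a simplified illustration, not the real `assignLayers`: it takes the already-computed Available value per GPU as input, omits the output-layer retry, and uses invented names throughout.

```python
def assign_layers(layer_sizes, gpu_available):
    """Map each layer index to a GPU index, or None when the layer stays on CPU."""
    # GPUs with the most available memory are filled first.
    order = sorted(range(len(gpu_available)),
                   key=lambda g: gpu_available[g], reverse=True)
    remaining = list(gpu_available)
    placement = [None] * len(layer_sizes)
    gpu_pos = 0
    # Walk from the last layer backwards.
    for layer in reversed(range(len(layer_sizes))):
        while gpu_pos < len(order) and remaining[order[gpu_pos]] < layer_sizes[layer]:
            gpu_pos += 1  # current GPU is full; spill to the next one
        if gpu_pos == len(order):
            break  # no GPU has room: this layer and all earlier ones go to CPU
        gpu = order[gpu_pos]
        remaining[gpu] -= layer_sizes[layer]
        placement[layer] = gpu
    return placement


# Six equal layers of 10 units; GPU 0 has 25 units available, GPU 1 has 35.
placement = assign_layers([10] * 6, [25, 35])
print(placement)
```

In this run GPU 1 is filled first with the last three layers, GPU 0 takes the next two, and layer 0 is left on the CPU.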

5.5 The Split Calculation Formula

To predict the GPU/CPU split percentage:

$$\text{GPU Ratio} = \frac{\text{Layers on GPU}}{\text{Total Layers}}$$

Where layers on GPU is determined by:

$$\text{Available VRAM for weights} = \text{VRAM} - \text{KV Cache} - \text{Graph Overhead} - \text{Buffer}$$

$$\text{Layers on GPU} = \left\lfloor \frac{\text{Available VRAM for weights}}{\text{Average layer size}} \right\rfloor$$
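A rough calculator for these two formulas is shown below. Clamping the result to the range from zero to the total layer count is an added assumption so that the ratio stays between 0 and 1; the function name and the sample figures are hypothetical.

```python
import math


def layers_on_gpu(vram, kv_cache, graph, buffer, avg_layer_size, total_layers):
    """Estimate how many layers fit, using the averaged-layer approximation."""
    available = vram - kv_cache - graph - buffer
    fit = math.floor(available / avg_layer_size)
    return max(0, min(fit, total_layers))


# Hypothetical sizes in GB: 40 layers averaging 0.5 GB each on a 24 GB card.
n = layers_on_gpu(vram=24, kv_cache=4, graph=2, buffer=1,
                  avg_layer_size=0.5, total_layers=40)
print(f"{n} of 40 layers on GPU, ratio {n / 40:.2f}")
```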

5.6 Complete Memory Budget Equation

For a single GPU setup, the total memory required is:

$$\text{Total} = \underbrace{W_{model}}_{\text{Weights}} + \underbrace{\sum_{i=0}^{L-1} \text{KV}^{(i)}}_{\text{KV Cache}} + \underbrace{\text{Graph}}_{\text{Overhead}} + \underbrace{\text{Buffer}}_{\text{Safety}}$$

The split happens when:

$$\text{Total} > \text{VRAM}$$

In this case:

KV Cache is allocated first (always on GPU if layers are on GPU)
Graph overhead is allocated
Remaining VRAM is filled with layer weights
Overflow goes to CPU RAM

5.7 Verification Formula

To verify your observed split:

$$\text{GPU \%} = \frac{\text{VRAM} - \text{KV}_{total} - \text{Graph}_{partial} - \text{Buffer}}{\text{Model Weights}}$$

If result < 1.0, you will see partial CPU offloading.

Example for Command-R 35B @ 32K context on L4 (24GB):

Component	Value
VRAM	24 GB
KV Cache	4.88 GB
Graph (partial)	5.01 GB
Buffer (~1 layer)	~0.5 GB
Available for weights	13.61 GB
Model Weights	19 GB
GPU Ratio	71.6%
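The arithmetic in the table can be reproduced with the verification formula. The sketch below only re-derives the last three rows from the first five; the input values are the ones listed in the table, and the function name is illustrative.

```python
def gpu_fraction(vram, kv_total, graph_partial, buffer, model_weights):
    """Fraction of the model weights that fits in VRAM (all values in GB)."""
    return (vram - kv_total - graph_partial - buffer) / model_weights


available = 24 - 4.88 - 5.01 - 0.5
fraction = gpu_fraction(vram=24, kv_total=4.88, graph_partial=5.01,
                        buffer=0.5, model_weights=19)
print(f"{available:.2f} GB available for weights, {fraction:.1%} on GPU")
```

Since the fraction is below 1.0, this configuration would be expected to run with partial CPU offloading.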

6. Memory Estimation Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                         OLLAMA MEMORY ESTIMATION FLOW                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  STEP 1: GraphSize() in ggml.go                                      │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Part A: Generic KV Cache Loop                                 │  │   │
│  │  │  - Iterates all layers                                         │  │   │
│  │  │  - Calculates KV per layer based on attention/recurrent type   │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Part B: Architecture Switch                                   │  │   │
│  │  │  - Calculates fullOffload and partialOffload                   │  │   │
│  │  │  - May override KV cache (gemma3, gptoss, mllama)              │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  Returns: kv[], partialOffload, fullOffload                          │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    ↓                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  STEP 2: Load() in server.go                                         │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Reserve Safety Buffer                                         │  │   │
│  │  │  FreeMemory -= (blk.0 weights + kv[0])                         │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Fallback Graph Size (if architecture returned 0)              │  │   │
│  │  │  graphPartialOffload = (H / H_kv) * kvTotal / 6                │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  createLayout() → buildLayout() → assignLayers()               │  │   │
│  │  │  - Sorts GPUs by free memory                                   │  │   │
│  │  │  - Greedily assigns layers from last→first                     │  │   │
│  │  │  - Spills to CPU when GPU full                                 │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  │                              ↓                                       │   │
│  │  ┌────────────────────────────────────────────────────────────────┐  │   │
│  │  │  Select Final Graph Size                                       │  │   │
│  │  │  if (all layers on GPU) → fullOffload                          │  │   │
│  │  │  else → partialOffload                                         │  │   │
│  │  └────────────────────────────────────────────────────────────────┘  │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    ↓                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  FINAL MEMORY ALLOCATION                                             │   │
│  │                                                                       │   │
│  │  GPU: Σ(layer_weights + layer_kv) for assigned layers + graph        │   │
│  │  CPU: Σ(layer_weights + layer_kv) for remaining layers               │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

7. Code Reference

7.1 GraphSize Function (`fs/ggml/ggml.go`)

func (f GGML) GraphSize(context, batch uint64, numParallel int, kvCacheType string, useFlashAttention bool) (kv []uint64, partialOffload, fullOffload uint64)

Parameters:

context: Base context window size (will be multiplied by numParallel internally)
batch: Batch size
numParallel: Number of parallel sequences
kvCacheType: Cache precision ("f16", "q8_0", "q4_0", "f32")
useFlashAttention: Whether flash attention is enabled

Returns:

kv: Slice of KV cache sizes per layer (in bytes)
partialOffload: Graph overhead for partial GPU offload (in bytes)
fullOffload: Graph overhead for full GPU offload (in bytes)

7.2 Load Function (`llm/server.go`)

func (s *llamaServer) Load(ctx context.Context, systemInfo ml.SystemInfo, gpus []ml.DeviceInfo, requireFull bool) ([]ml.DeviceID, error)

Key operations:

Calls GraphSize() to get memory estimates
Reserves safety buffer (one layer)
Applies fallback formulas if needed
Creates layout via createLayout() → buildLayout() → assignLayers()
Selects final graph size based on offload status

7.3 Key Environment Variables

Variable	Default	Description
`OLLAMA_NUM_PARALLEL`	1	Number of parallel sequences (multiplies context)
`OLLAMA_CONTEXT_LENGTH`	Model default	Context window size
`OLLAMA_FLASH_ATTENTION`	Model-dependent	Enable flash attention
`OLLAMA_KV_CACHE_TYPE`	"f16"	KV cache precision
`OLLAMA_GPU_OVERHEAD`	0	Extra VRAM to reserve (bytes)
