These variables, used by GraphSize, are extracted from the GGUF model header.

| Symbol | Code Variable | Description |
|---|---|---|
| $B$ | batch | Batch size (default 512) |
| $C$ | context | Total Context Window. Important: $C = \text{num_ctx} \times \text{num_parallel}$ (pre-multiplied in code) |
| $N_p$ | numParallel | Number of parallel sequences |
| $E$ | embedding | Embedding Length (hidden_size) |
| $H$ | heads | Attention Head Count (max across layers) |
| $H_{kv}$ | headsKV | Key-Value Head Count (max, for GQA) |
| $H^{(i)}$ | headsArr[i] | Head count for layer $i$ |
| $H_{kv}^{(i)}$ | headsKVArr[i] | KV head count for layer $i$ |
| $D$ | embeddingHeads | Dimension per head (max): $E / H_{min}$ |
| $D_k$ | embeddingHeadsK | Dimension of Key Head |
| $D_v$ | embeddingHeadsV | Dimension of Value Head |
| $V$ | vocab | Vocabulary Size (from tokenizer.ggml.tokens array size) |
| $P$ | bytesPerElement | Precision of the KV Cache: f16=2, q8_0=1, q4_0=0.5, f32=4 |
| $4$ | - | Represents 4 bytes (float32), used for graph activation tensors |

The KV cache is the permanent storage required for the context history. GraphSize computes it by iterating over every layer $i$ from 0 to block_count - 1.
Attention layers: this formula is used when both $H^{(i)} > 0$ and $H_{kv}^{(i)} > 0$.
$$\text{KV}_{layer}^{(i)} = C \times (D_k + D_v) \times H_{kv}^{(i)} \times P$$
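
As a rough illustration of the attention-layer formula, the sketch below computes the KV cache size of one layer. The function name, the dictionary name, and the example model shape are assumptions made for illustration, not Ollama identifiers; the byte sizes are the ones listed for $P$ in the variable table.

```python
# Bytes per KV cache element (P), as listed in the variable table.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5, "f32": 4.0}


def kv_attention_layer(context, head_dim_k, head_dim_v, heads_kv, cache_type="f16"):
    """KV cache bytes for one attention layer: C * (D_k + D_v) * H_kv * P."""
    p = BYTES_PER_ELEMENT[cache_type]
    return int(context * (head_dim_k + head_dim_v) * heads_kv * p)


# Hypothetical shape: C = 8192, D_k = D_v = 128, 8 KV heads, f16 cache.
per_layer = kv_attention_layer(8192, 128, 128, 8, "f16")
print(per_layer)       # bytes for one layer
print(per_layer * 32)  # total for a hypothetical 32-layer model
```

Note that `context` here is the pre-multiplied $C$, so doubling the number of parallel sequences doubles the result.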
Recurrent (SSM) layers: this formula is used when $H^{(i)} = 0$ or $H_{kv}^{(i)} = 0$ for a layer.
SSM Parameters:

- $d_{conv}$: ssm.conv_kernel
- $d_{state}$: ssm.state_size
- $d_{inner}$: ssm.inner_size
- $n_{groups}$: ssm.group_count

Intermediate calculations: $$N_{embdR} = \begin{cases} (d_{conv} - 1) \times (d_{inner} + 2 \times n_{groups} \times d_{state}) & \text{if } d_{conv} > 0 \\ 0 & \text{otherwise} \end{cases}$$
$$N_{embdS} = d_{state} \times d_{inner}$$
Final KV size: $$\text{KV}_{layer}^{(i)} = (N_{embdR} + N_{embdS}) \times P_{rec}$$
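
The recurrent-layer calculation can be sketched the same way. The function name and the example SSM shape are hypothetical, and the default of 4 bytes for $P_{rec}$ is an assumption based on the note that f32 is used for recurrent layers.

```python
def kv_recurrent_layer(d_conv, d_inner, d_state, n_groups, bytes_per_element=4):
    """Recurrent-state bytes for one layer: (N_embdR + N_embdS) * P_rec."""
    if d_conv > 0:
        n_embd_r = (d_conv - 1) * (d_inner + 2 * n_groups * d_state)
    else:
        n_embd_r = 0
    n_embd_s = d_state * d_inner
    return (n_embd_r + n_embd_s) * bytes_per_element


# Hypothetical SSM shape, chosen only to exercise both terms.
print(kv_recurrent_layer(d_conv=4, d_inner=4096, d_state=128, n_groups=8))
```

Unlike the attention formula, nothing here depends on $C$: the recurrent state has a fixed size regardless of context length.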
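The graph overhead is the temporary scratchpad memory used for graph activation tensors. For each architecture, the system calculates two values: fullOffload (all layers on the GPU) and partialOffload (layers split between CPU and GPU).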
Base Formula:
$$\text{Full} = \max\bigl(4B(1 + 4E + C(1+H)),\; 4B(E+V)\bigr)$$
$$\text{Partial} = 4BE + \max\bigl(A_{attn},\; A_{logit}\bigr)$$
Where: $$A_{attn} = 4B\bigl(1 + E + \max(C, E)\bigr) + \frac{9E^2}{16} + 4C(BH + D \cdot H_{kv})$$
$$A_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$
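
A minimal sketch of the base formula follows. The function and argument names are hypothetical, and the use of integer division for the $\frac{9E^2}{16}$ and $\frac{105 \cdot E \cdot V}{128}$ terms is an assumption about how the fractions are evaluated.

```python
def base_graph_sizes(B, C, E, H, H_kv, D, V):
    """Return (full, partial) graph overhead in bytes for the base formula."""
    full = max(4 * B * (1 + 4 * E + C * (1 + H)), 4 * B * (E + V))
    a_attn = (
        4 * B * (1 + E + max(C, E))
        + 9 * E * E // 16
        + 4 * C * (B * H + D * H_kv)
    )
    a_logit = 4 * B * (E + V) + 105 * E * V // 128
    partial = 4 * B * E + max(a_attn, a_logit)
    return full, partial


# Hypothetical model shape with the default batch size of 512.
full, partial = base_graph_sizes(B=512, C=8192, E=4096, H=32, H_kv=8, D=128, V=128256)
print(f"full={full / 2**20:.1f} MiB, partial={partial / 2**20:.1f} MiB")
```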
Detected if blk.0.ffn_gate_exps.weight exists.
Here $FF$ is feed_forward_length and $W_{gate}$ is the size of the ffn_gate_exps.weight tensor.
$$\text{Partial} = \max\Bigl(3W_{gate} + 4B(2FF + H_{kv} + E + C + D_k \cdot H_{kv}),\; A_{moe}\Bigr)$$
Where: $$A_{moe} = 4\bigl(C \cdot B \cdot H + C \cdot D_k \cdot H_{kv} + 1024B + D_k \cdot H_{kv} \cdot B\bigr)$$
Detected if blk.0.ffn_gate.0.weight exists.
Here $W_{dim}$ is taken from the shape of the ffn_gate.0.weight tensor.
Full Offload: $$\text{Full} = 4B(2 + 3E + C(1+H) + 2H_{kv} + W_{dim})$$
Partial Offload: $$\text{Partial} = \max\bigl(P_1,\; P_2\bigr)$$
Where: $$P_1 = 4B(3 + D_k \cdot H_{kv} + E + C(1+H) + W_{dim}) + \frac{9(E^2 + 3E \cdot H_{kv} \cdot W_{dim})}{16}$$
$$P_2 = 4B(1 + 2E + C(1+H)) + E\Bigl(\frac{6C \cdot H_{kv}}{H} + \frac{9E}{16}\Bigr)$$
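Special KV Logic:
For layers listed in attention.cross_attention_layers, the KV size is overwritten:
$$\text{KV}_{vision} = H_{kv} \times (D_k + D_v) \times 4 \times 1601 \times 4$$
Where the first factor of 4 is the float32 element size in bytes, 1601 is the number of vision tokens, and the final 4 is the number of tiles.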
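
To illustrate what "overwritten" means here, the sketch below replaces the generic per-layer value for each cross-attention layer. The function name and the example layer indices are hypothetical; the constant factors are copied verbatim from the formula.

```python
def apply_cross_attention_override(kv, cross_layers, heads_kv, head_dim_k, head_dim_v):
    """Replace the KV size of each cross-attention layer with the fixed vision size."""
    # Constant factors 4 * 1601 * 4 are taken as-is from the formula above.
    vision_kv = heads_kv * (head_dim_k + head_dim_v) * 4 * 1601 * 4
    cross = set(cross_layers)
    return [vision_kv if i in cross else size for i, size in enumerate(kv)]


# Four layers with a placeholder generic size; layers 1 and 3 are cross-attention.
print(apply_cross_attention_override([10, 10, 10, 10], [1, 3], 8, 128, 128))
```

The overridden value does not depend on $C$ or $P$, so these layers cost the same at any context length or cache precision.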
Graph Overhead:
$$\text{Full} = \max\bigl(4B(2 + 3E + D_k \cdot H + C(1+H)),\; 4B(E+V)\bigr)$$
$$\text{Partial} = \max\bigl(P_{attn},\; P_{logit}\bigr)$$
Where: $$P_{attn} = 4\bigl(B(2E + 1 + C(1+H) + D_k \cdot H) + R_{freq} + D_k \cdot C \cdot H_{kv}\bigr)$$
$$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$
Here $R_{freq}$ is taken from the rope_freqs.weights tensor (0 if not present).

Base Formula:
$$\text{Full} = \max\bigl(4B(E+V),\; 4B(2 + C + CH + 2E + 2D_k H)\bigr)$$
$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$
Where: $$P_{logit} = 4EB + \frac{105 \cdot E \cdot V}{128} + 4VB$$
$$P_{attn} = 4B(2E + 1 + 2D_k H + C + CH) + 4D_k C \cdot 8 + \frac{9 \cdot E \cdot D_k \cdot H}{16}$$
For gemma3n, both the Full and Partial results are multiplied by 4:
$$\text{Full}_{gemma3n} = 4 \times \text{Full}_{base}$$ $$\text{Partial}_{gemma3n} = 4 \times \text{Partial}_{base}$$
Special KV Logic: modifies the generic loop.
$$C_{sliding} = N_p \times \text{sliding_window} + B$$
For sliding layers: $$\text{KV}_{layer}^{(i)} = C_{sliding} \times (D_k + D_v) \times H_{kv} \times P$$
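
The effect of the sliding-window context can be seen with a small sketch. The function name and the example shape are hypothetical, and the default of 2 bytes per element assumes an f16 cache.

```python
def kv_sliding_layer(num_parallel, sliding_window, batch, head_dim_k, head_dim_v,
                     heads_kv, bytes_per_element=2):
    """KV bytes for a sliding layer, using C_sliding = N_p * sliding_window + B."""
    c_sliding = num_parallel * sliding_window + batch
    return int(c_sliding * (head_dim_k + head_dim_v) * heads_kv * bytes_per_element)


# Hypothetical shape: window of 1024 tokens, batch 512, 4 KV heads of dimension 256.
sliding = kv_sliding_layer(1, 1024, 512, 256, 256, 4)
full_context = 32768 * (256 + 256) * 4 * 2  # same layer under the generic formula, C = 32768
print(sliding, full_context)
```

With these example numbers the sliding layer needs far less memory than a full-context layer, because its token count no longer grows with $C$.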
$$\text{Full} = \max\bigl(4B(E+V),\; 4B(2 + 4E + C(1+H))\bigr)$$
$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$
Where: $$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$
$$P_{attn} = 4B(1 + 2E + C(1+H)) + 4EC + \frac{9E^2}{16}$$
$$\text{Full} = \max\bigl(4B(E+V),\; 4B(1 + 2E + C + CH)\bigr)$$
$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$
Where: $$P_{logit} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$
$$P_{attn} = 4\bigl(B(1 + 2E + C(1+H)) + E(1 + C)\bigr)$$
$$\text{Full} = \max\bigl(4B(E+V),\; 4B(1 + 4E + C + CH)\bigr)$$
$$\text{Partial} = \max\bigl(P_1,\; P_2\bigr)$$
Where: $$P_1 = 4B(2E + V) + \frac{105 \cdot E \cdot V}{128}$$
$$P_2 = 4B(2 + 3E + C + CH)$$
$$\text{Full} = 4B(C(1+H) + 3E + 2)$$
$$\text{Partial} = \max\bigl(4B(V + 2E),\; \text{Full}\bigr)$$
$$\text{Full} = \max\bigl(4B(3E+V),\; 4B(3E + 2 + C(1+H_{kv}) + 2D_k H_{kv})\bigr)$$
$$\text{Partial} = \max\bigl(P_{logit},\; P_{attn}\bigr)$$
Where: $$P_{logit} = 4B(3E + V) + \frac{105 \cdot E \cdot V}{128}$$
$$P_{attn} = 4B(2E + 1 + 2D_k H_{kv} + C + C \cdot H_{kv}) + 4D_k C \cdot H_{kv} + \frac{9 \cdot E \cdot D_k \cdot H_{kv}}{16}$$
Base: $$\text{Full}_{base} = 4B(E + V)$$ $$\text{Partial}_{base} = 4B(E + V) + \frac{105 \cdot E \cdot V}{128}$$
With attn_qkv.bias present:
Let $S_{bias}$ = Shape[0] of blk.0.attn_qkv.bias
$$\text{Full} = \max\bigl(\text{Full}_{base},\; 4B(2 + 2E + C + CH + D_k H + S_{bias})\bigr)$$
$$\text{Partial} = \max\bigl(\text{Partial}_{base},\; P_{attn}\bigr)$$
Where: $$P_{attn} = 4B(1 + 2E + D_k H + C + CH) + 4D_k C + 4C \cdot D_k + 4S_{bias}$$
Special KV Logic:
This completely overrides the Generic Loop.
For each layer $i$:
$$\text{KV}^{(i)} = (D_k + D_v) \times H_{kv} \times P \times \begin{cases} N_p \times 4096 + B & \text{if } i \bmod 2 = 0 \text{ (even)} \\ C & \text{if } i \bmod 2 = 1 \text{ (odd)} \end{cases}$$
Graph Overhead:
$$\text{Full} = 0 \text{ (not explicitly set)}$$
$$\text{Partial} = \frac{2H}{H_{kv,min}} \times \frac{\text{KV}_{total}}{6}$$
Where $H_{kv,min}$ = minimum KV head count across layers (defaults to 1 if 0).
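
The alternating per-layer rule and the resulting partial-offload overhead can be sketched as follows. The function names and the example shape are hypothetical, the 2-byte default assumes an f16 cache, and evaluating the fractions left to right with integer division is an assumption.

```python
def alternating_kv(num_layers, C, N_p, B, head_dim_k, head_dim_v, heads_kv,
                   bytes_per_element=2):
    """Per-layer KV bytes: even layers use N_p * 4096 + B tokens, odd layers use C."""
    sizes = []
    for i in range(num_layers):
        tokens = N_p * 4096 + B if i % 2 == 0 else C
        sizes.append(int((head_dim_k + head_dim_v) * heads_kv * bytes_per_element * tokens))
    return sizes


def partial_from_kv(kv_sizes, heads, heads_kv_min):
    """Partial = 2H / H_kv_min * KV_total / 6, with H_kv_min defaulting to 1 if 0."""
    return 2 * heads // (heads_kv_min or 1) * sum(kv_sizes) // 6


kv = alternating_kv(num_layers=2, C=8192, N_p=1, B=512,
                    head_dim_k=64, head_dim_v=64, heads_kv=8)
print(kv, partial_from_kv(kv, heads=64, heads_kv_min=8))
```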
With Flash Attention enabled:
$$\text{Partial} = \bigl(4N_p + \lfloor C / 1024 \rfloor + 110\bigr) \times 1\text{ MiB}$$
Note: In the flash attention formula, $C$ is the pre-multiplied context ($\text{num_ctx} \times N_p$), so the actual formula expands to: $$\text{Partial} = \bigl(4N_p + \lfloor (\text{num_ctx} \times N_p) / 1024 \rfloor + 110\bigr) \times 1\text{ MiB}$$
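
A short sketch of the flash attention formula makes the pre-multiplication explicit. The function name is hypothetical; it takes the base num_ctx and multiplies by $N_p$ itself.

```python
MIB = 1024 * 1024


def flash_attention_partial(num_ctx, num_parallel):
    """Partial = (4 * N_p + floor(C / 1024) + 110) MiB, with C = num_ctx * N_p."""
    context = num_ctx * num_parallel  # the pre-multiplied C
    return (4 * num_parallel + context // 1024 + 110) * MIB


# Hypothetical settings: num_ctx = 8192 with 2 parallel sequences.
print(flash_attention_partial(8192, 2) // MIB, "MiB")
```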

Bytes per element ($P$) by KV cache type:

- f16 → 2.0 bytes (default)
- q8_0 → 1.0 bytes
- q4_0 → 0.5 bytes
- f32 → 4.0 bytes (used for recurrent layers)

Models with automatic flash attention enabled:

These architectures require the Ollama engine:
This section describes how Ollama uses the calculated memory values to decide what goes on GPU vs CPU. This logic is in llm/server.go.
Ollama chooses between fullOffload and partialOffload based on whether all layers fit:

    if GPU_layers == total_layers:
        graph_overhead = fullOffload
    else:
        graph_overhead = partialOffload

Key insight: Even if just ONE layer is on CPU, the larger partialOffload graph overhead is used.
If an architecture doesn't have a case in the switch statement (or returns 0):
$$\text{graphPartialOffload}_{fallback} = \frac{H}{H_{kv,min}} \times \frac{\text{KV}_{total}}{6}$$
$$\text{graphFullOffload}_{fallback} = \text{graphPartialOffload}$$
Where $H_{kv,min}$ defaults to 1 if 0.
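
The fallback can be sketched as below. The function and argument names are hypothetical, and treating the two zero checks independently, as well as using integer division, are assumptions about details the formulas leave open.

```python
def fallback_graph_sizes(partial, full, heads, heads_kv_min, kv_total):
    """Fill in graph sizes when the architecture-specific calculation returned 0."""
    if partial == 0:
        gqa = heads // (heads_kv_min or 1)  # H / H_kv_min, with H_kv_min defaulting to 1
        partial = gqa * kv_total // 6
    if full == 0:
        full = partial
    return partial, full


# No architecture-specific values: both sizes come from the KV total.
print(fallback_graph_sizes(0, 0, heads=32, heads_kv_min=8, kv_total=6000))
```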
Before calculating layer fits, Ollama reserves a buffer on each GPU:
$$\text{Reserved} = \text{blk.0_weights} + \text{kv}[0]$$
This prevents edge-case OOM errors.
The assignLayers function sizes each layer as the sum of its weights and its KV cache: layer_size[i] = weights[i] + kv_cache[i].

To predict the GPU/CPU split percentage:
$$\text{GPU Ratio} = \frac{\text{Layers on GPU}}{\text{Total Layers}}$$
Where layers on GPU is determined by:
$$\text{Available VRAM for weights} = \text{VRAM} - \text{KV Cache} - \text{Graph Overhead} - \text{Buffer}$$
$$\text{Layers on GPU} = \left\lfloor \frac{\text{Available VRAM for weights}}{\text{Average layer size}} \right\rfloor$$
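
A minimal sketch of this estimate follows. The function name and the example numbers are hypothetical, and clamping the result to the range from 0 to the total layer count is an addition for robustness that the formulas do not state.

```python
import math


def estimate_gpu_layers(vram, kv_cache, graph_overhead, buffer, avg_layer_size, total_layers):
    """Return (layers_on_gpu, gpu_ratio) from the available-VRAM formula."""
    available = vram - kv_cache - graph_overhead - buffer
    layers = math.floor(available / avg_layer_size)
    layers = max(0, min(layers, total_layers))  # clamp: an assumption, see lead-in
    return layers, layers / total_layers


# Hypothetical values in GB: 24 GB VRAM, 40 layers of 0.5 GB each.
print(estimate_gpu_layers(24, 4, 2, 1, 0.5, 40))
```

All sizes must use the same unit; the ratio itself is unitless.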
For a single GPU setup, the total memory required is:
$$\text{Total} = \underbrace{W_{model}}_{\text{Weights}} + \underbrace{\sum_{i=0}^{L-1} \text{KV}^{(i)}}_{\text{KV Cache}} + \underbrace{\text{Graph}}_{\text{Overhead}} + \underbrace{\text{Buffer}}_{\text{Safety}}$$
The split happens when:
$$\text{Total} > \text{VRAM}$$
In this case:
To verify your observed split:
$$\text{GPU fraction} = \frac{\text{VRAM} - \text{KV}_{total} - \text{Graph}_{partial} - \text{Buffer}}{\text{Model Weights}}$$
If result < 1.0, you will see partial CPU offloading.
Example for Command-R 35B @ 32K context on L4 (24GB):
| Component | Value |
|---|---|
| VRAM | 24 GB |
| KV Cache | 4.88 GB |
| Graph (partial) | 5.01 GB |
| Buffer (~1 layer) | ~0.5 GB |
| Available for weights | 13.61 GB |
| Model Weights | 19 GB |
| GPU Ratio | 71.6% |
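
The arithmetic in this table can be reproduced with a few lines. The function name is hypothetical; the inputs are the table values in GB.

```python
def gpu_fraction(vram, kv_total, graph_partial, buffer, model_weights):
    """Return (available_for_weights, fraction_of_weights_on_gpu)."""
    available = vram - kv_total - graph_partial - buffer
    return available, available / model_weights


available, frac = gpu_fraction(vram=24, kv_total=4.88, graph_partial=5.01,
                               buffer=0.5, model_weights=19)
print(f"{available:.2f} GB available, {frac:.1%} of weights on GPU")
# prints: 13.61 GB available, 71.6% of weights on GPU
```

Because the fraction is below 1.0, this configuration is split between GPU and CPU.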

    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                        OLLAMA MEMORY ESTIMATION FLOW                        │
    ├─────────────────────────────────────────────────────────────────────────────┤
    │                                                                             │
    │  ┌──────────────────────────────────────────────────────────────────────┐   │
    │  │  STEP 1: GraphSize() in ggml.go                                      │   │
    │  │  ┌────────────────────────────────────────────────────────────────┐  │   │
    │  │  │ Part A: Generic KV Cache Loop                                  │  │   │
    │  │  │   - Iterates all layers                                        │  │   │
    │  │  │   - Calculates KV per layer based on attention/recurrent type  │  │   │
    │  │  └────────────────────────────────────────────────────────────────┘  │   │
    │  │                                  ↓                                   │   │
    │  │  ┌────────────────────────────────────────────────────────────────┐  │   │
    │  │  │ Part B: Architecture Switch                                    │  │   │
    │  │  │   - Calculates fullOffload and partialOffload                  │  │   │
    │  │  │   - May override KV cache (gemma3, gptoss, mllama)             │  │   │
    │  │  └────────────────────────────────────────────────────────────────┘  │   │
    │  │                                  ↓                                   │   │
    │  │  Returns: kv[], partialOffload, fullOffload                          │   │
    │  └──────────────────────────────────────────────────────────────────────┘   │
    │                                      ↓                                      │
    │  ┌──────────────────────────────────────────────────────────────────────┐   │
    │  │  STEP 2: Load() in server.go                                         │   │
    │  │  ┌────────────────────────────────────────────────────────────────┐  │   │
    │  │  │ Reserve Safety Buffer                                          │  │   │
    │  │  │   FreeMemory -= (blk.0 weights + kv[0])                        │  │   │
    │  │  └────────────────────────────────────────────────────────────────┘  │   │
    │  │                                  ↓                                   │   │
    │  │  ┌────────────────────────────────────────────────────────────────┐  │   │
    │  │  │ Fallback Graph Size (if architecture returned 0)               │  │   │
    │  │  │   graphPartialOffload = (H / H_kv) * kvTotal / 6               │  │   │
    │  │  └────────────────────────────────────────────────────────────────┘  │   │
    │  │                                  ↓                                   │   │
    │  │  ┌────────────────────────────────────────────────────────────────┐  │   │
    │  │  │ createLayout() → buildLayout() → assignLayers()                │  │   │
    │  │  │   - Sorts GPUs by free memory                                  │  │   │
    │  │  │   - Greedily assigns layers from last→first                    │  │   │
    │  │  │   - Spills to CPU when GPU full                                │  │   │
    │  │  └────────────────────────────────────────────────────────────────┘  │   │
    │  │                                  ↓                                   │   │
    │  │  ┌────────────────────────────────────────────────────────────────┐  │   │
    │  │  │ Select Final Graph Size                                        │  │   │
    │  │  │   if (all layers on GPU) → fullOffload                         │  │   │
    │  │  │   else                   → partialOffload                      │  │   │
    │  │  └────────────────────────────────────────────────────────────────┘  │   │
    │  └──────────────────────────────────────────────────────────────────────┘   │
    │                                      ↓                                      │
    │  ┌──────────────────────────────────────────────────────────────────────┐   │
    │  │  FINAL MEMORY ALLOCATION                                             │   │
    │  │                                                                      │   │
    │  │  GPU: Σ(layer_weights + layer_kv) for assigned layers + graph        │   │
    │  │  CPU: Σ(layer_weights + layer_kv) for remaining layers               │   │
    │  └──────────────────────────────────────────────────────────────────────┘   │
    └─────────────────────────────────────────────────────────────────────────────┘

GraphSize (fs/ggml/ggml.go)

    func (f GGML) GraphSize(context, batch uint64, numParallel int, kvCacheType string, useFlashAttention bool) (kv []uint64, partialOffload, fullOffload uint64)

Parameters:

- context: Base context window size (will be multiplied by numParallel internally)
- batch: Batch size
- numParallel: Number of parallel sequences
- kvCacheType: Cache precision ("f16", "q8_0", "q4_0", "f32")
- useFlashAttention: Whether flash attention is enabled

Returns:

- kv: Slice of KV cache sizes per layer (in bytes)
- partialOffload: Graph overhead for partial GPU offload (in bytes)
- fullOffload: Graph overhead for full GPU offload (in bytes)

Load (llm/server.go)

    func (s *llamaServer) Load(ctx context.Context, systemInfo ml.SystemInfo, gpus []ml.DeviceInfo, requireFull bool) ([]ml.DeviceID, error)

Key operations:

- Calls GraphSize() to get memory estimates
- Runs createLayout() → buildLayout() → assignLayers() to assign layers

| Variable | Default | Description |
|---|---|---|
| OLLAMA_NUM_PARALLEL | 1 | Number of parallel sequences (multiplies context) |
| OLLAMA_CONTEXT_LENGTH | Model default | Context window size |
| OLLAMA_FLASH_ATTENTION | Model-dependent | Enable flash attention |
| OLLAMA_KV_CACHE_TYPE | "f16" | KV cache precision |
| OLLAMA_GPU_OVERHEAD | 0 | Extra VRAM to reserve (bytes) |