Content is user-generated and unverified.

Strix Halo Architecture & BIOS VRAM Allocation Guide

Overview

AMD's Strix Halo (Ryzen AI MAX+) uses a Unified Memory Architecture (UMA): the CPU and the integrated RDNA 3.5 GPU share one physical pool of system RAM. This is fundamentally different from discrete GPUs, which have their own separate VRAM chips.

The Key Question

Does the BIOS "GPU Memory" or "VRAM Allocation" setting create a hard limit, or is it just a performance optimization?

Answer: It's primarily a performance optimization, not a hard limit.

Understanding Unified Memory Architecture (UMA)

Physical Reality

  • All RAM is physically the same - there is no separate "GPU RAM" chip
  • CPU and GPU access the same DRAM through shared memory controllers
  • The BIOS setting doesn't physically partition the RAM into separate pools

What the BIOS Setting Actually Does

Important: The BIOS doesn't "tell the OS what to do" - it configures the hardware at boot time.

When you set GPU Memory to 64GB, the BIOS/firmware:

  1. Configures the GPU memory controller:
    • Sets up memory apertures (address ranges the GPU can access)
    • Programs the memory controller with specific attributes
    • Configures IOMMU/memory mappings
  2. Sets up cache coherency domains:
    • Configures the allocated range as "coarse-grained" (non-coherent)
    • Sets cache attributes in the hardware
    • Establishes GPU ownership of these memory regions
  3. Reserves physical memory ranges:
    • Marks these ranges in the memory map (ACPI tables, etc.)
    • OS discovers this configuration at boot
    • Creates a guaranteed contiguous memory block
  4. The OS respects this because:
    • It's reading what the hardware was configured to do
    • Not following a "polite suggestion" from BIOS
    • The memory controller is already programmed
    • Changing it would require reprogramming hardware (risky!)

Display Hardware Requirements

  • Modern GPUs need only a few MB for frame buffers
  • The BIOS allocation is NOT primarily for display
  • Even headless systems benefit from coarse-grained memory

Memory Types in HSA/ROCm

AMD's Heterogeneous System Architecture (HSA) defines different memory types based on cache coherency properties:

Fine-Grained Memory (Coherent)

  • Default for shared system memory
  • CPU and GPU can access simultaneously with automatic cache coherency
  • Hardware maintains coherency between CPU and GPU caches
  • Overhead: Every access requires coherency protocol checks
    • "Does the other processor have a dirty cache line?"
    • Constant coherency traffic between CPU and GPU
  • Slower for GPU operations - 2-4x performance penalty
  • Use case: Small shared data structures, coordination between CPU/GPU
  • Think of it like a shared Google Doc - everyone can edit simultaneously, but there's sync overhead

Coarse-Grained Memory (Non-Coherent)

  • What the BIOS allocation creates
  • No automatic cache coherency between CPU and GPU
  • GPU "owns" the memory - CPU shouldn't touch it while GPU is working
  • GPU can cache aggressively without coherency checks
  • Significantly faster for GPU bulk operations (2-4x improvement)
  • Use case: Large buffers for GPU computation (AI model weights, activations)
  • Think of it like a personal file - you own it, work fast, share explicitly when done

The Hardware Implementation

"Coarse-grained" involves multiple hardware subsystems:

  1. MMU/IOMMU Configuration:
    • Sets up address mappings with specific cache attributes
    • Marks regions as non-coherent
    • Establishes ownership domains
  2. Cache Coherency Protocol:
    • Disables automatic coherency for these ranges
    • Allows GPU to cache without CPU synchronization
    • Eliminates coherency traffic overhead
  3. Memory Controller Setup:
    • Configures how memory ranges are accessed
    • Sets up optimal paths for GPU bulk transfers
    • Programs hardware for best GPU performance

The BIOS configures all of this hardware at boot time. The OS discovers this configuration and exposes it to ROCm, but doesn't typically reconfigure it (see "Can the OS Override BIOS?" section below).

Your rocminfo Output Explained

Pool 4
  Segment: GLOBAL; FLAGS: COARSE GRAINED
  Size: 67108864(0x4000000) KB  # 64GB

This shows your BIOS-allocated coarse-grained pool - the fast memory for GPU operations.
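
As a rough sketch, the pool sizes can be pulled out of rocminfo output and converted from KB to GiB. The helper name `pool_sizes_gib` is made up for illustration, and the parsing assumes the `Size: <n>(0x...) KB` line format shown in the sample above.

```shell
# pool_sizes_gib: read rocminfo-style text on stdin and print one line per
# "Size: ... KB" entry, converted from KB to GiB.
# Assumption: lines look like "Size: 67108864(0x4000000) KB".
pool_sizes_gib() {
    awk '/Size:/ && /KB/ {
        kb = $2
        sub(/\(.*/, "", kb)    # drop the "(0x...)" hex suffix
        printf "%.1f GiB\n", kb / 1048576
    }'
}

# Usage (on a machine with ROCm installed):
#   rocminfo | pool_sizes_gib
```

For the sample line above, 67108864 KB divided by 1048576 gives 64.0 GiB. Small pools (for example per-workgroup segments) will print as 0.0 GiB and can be ignored.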

Can the OS Override BIOS Settings?

Technically Yes, Practically No

The OS could reconfigure the hardware, but doesn't because:

  1. Lack of vendor-specific knowledge:
    • BIOS/firmware has AMD's proprietary knowledge about:
      • Exact hardware initialization sequences
      • Undocumented registers and their meanings
      • Critical timing requirements
      • Power management implications
      • Hardware errata workarounds specific to chip revisions
  2. High risk of system instability:
    • Reprogramming GPU memory controller incorrectly could:
      • Hang or crash the system
      • Corrupt memory
      • Render the GPU non-functional
      • Require power cycling to recover
    • No way to "undo" if you get it wrong
  3. Complexity of what needs changing:
    • Memory controller configuration
    • IOMMU/MMU remapping with new cache attributes
    • Cache coherency domain reconfiguration
    • GPU aperture reprogramming
    • All of these must be done in the correct order with correct values
  4. Kernel philosophy:
    • "Firmware initializes hardware, kernel uses what it's given"
    • Kernel focuses on managing resources, not low-level hardware init
    • Safer to trust vendor firmware than reverse-engineer initialization

What About Kernel Parameters?

Kernel parameters such as amdgpu.gttsize or amdgpu.gartsize adjust software allocations and mappings within the constraints of what firmware configured. They don't reprogram the underlying hardware setup; they work within it.
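
To see what the driver is currently using, module parameters can usually be read back from sysfs. The sketch below assumes the standard `/sys/module/amdgpu/parameters` location; the helper name `amdgpu_param` is hypothetical, and the optional second argument exists only so the function can be pointed at a test directory. A value of -1 generally means the driver picks its own default.

```shell
# amdgpu_param NAME [BASE_DIR]: print the current value of an amdgpu module
# parameter. BASE_DIR defaults to the usual sysfs location (assumption).
amdgpu_param() {
    _base=${2:-/sys/module/amdgpu/parameters}
    if [ -r "$_base/$1" ]; then
        cat "$_base/$1"
    else
        echo "cannot read $1 (try as root, or the parameter is not present)" >&2
        return 1
    fi
}

# Usage:
#   amdgpu_param gttsize
```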

The Pragmatic Line

The division between "what firmware does" and "what OS manages" is pragmatic, not fundamental:

  • Firmware: Low-level hardware initialization that requires vendor secrets
  • OS: Resource management within the configured hardware
  • The OS chooses not to reconfigure because the risk/complexity isn't worth it

Could We Set It in Linux?

Theoretically yes, with massive effort:

  1. Reverse engineer AMD's hardware initialization
  2. Write kernel code to safely reconfigure memory controller
  3. Handle all chip revisions and errata
  4. Test exhaustively to avoid bricking systems
  5. Maintain it as hardware evolves

In practice:

  • Just reboot and change it in BIOS
  • Let AMD's firmware do what only AMD knows how to do safely
  • Accept that some hardware config is "firmware's job"

This is why the BIOS setting effectively "enforces" the configuration - not because the OS is being polite, but because the OS wisely chooses not to mess with complex hardware initialization it doesn't fully understand.

Can the GPU Access Memory Beyond BIOS Allocation?

Traditional GPU Code (CUDA-style)

cpp
// Manually allocate on the GPU, copy data in, compute, copy results back.
// Under ROCm the HIP equivalents are hipMalloc and hipMemcpy.
cudaMalloc(&gpu_ptr, size);
cudaMemcpy(gpu_ptr, cpu_data, size, cudaMemcpyHostToDevice);
// ... compute ...
cudaMemcpy(result, gpu_ptr, size, cudaMemcpyDeviceToHost);

On Strix Halo: These allocations come from your BIOS-allocated coarse-grained pool for best performance.

Modern UMA-Aware Code

cpp
// Unified memory - accessible by both CPU and GPU
hipMallocManaged(&ptr, size);
// ptr can be used directly by CPU or GPU
// System handles memory placement automatically

On Strix Halo:

  • With kernel 6.16.9+, allocations can extend beyond the BIOS allocation
  • Falls back to fine-grained shared memory if needed
  • Performance is better within the BIOS allocation

What is hipMallocManaged?

hipMallocManaged is AMD's HIP API function for allocating "managed memory" (also called "unified memory"):

Purpose: Memory accessible by both CPU and GPU without manual copying

On discrete GPUs:

  • Automatically migrates data between separate CPU RAM and GPU VRAM over PCIe
  • Convenient but has migration overhead

On APUs (Strix Halo):

  • Provides unified access to system memory without physical copying
  • Can access memory beyond the BIOS allocation
  • Falls back to fine-grained (coherent) memory when exceeding coarse-grained pool
  • Key trade-off: Enables larger-than-VRAM workloads but with performance penalty

Why the performance difference?

Memory accessed via hipMallocManaged outside the BIOS allocation:

  • Uses fine-grained coherent memory (CPU/GPU caches stay synchronized)
  • Every GPU access requires coherency protocol overhead
  • Can cause cache thrashing between CPU and GPU
  • 2-4x slower than coarse-grained memory for bulk operations

Memory within BIOS allocation:

  • Uses coarse-grained non-coherent memory
  • GPU can cache aggressively without coherency checks
  • Optimal for AI workloads with large sequential transfers
  • This is why most AI applications prefer staying within "VRAM"

Application Behavior Patterns

Applications That Need Large BIOS Allocations

Traditional ML/AI frameworks expecting discrete GPU behavior:

  • PyTorch with standard CUDA-style memory allocation
  • TensorFlow
  • Older Stable Diffusion implementations
  • Most frameworks designed for NVIDIA GPUs
  • Most AI training applications

These check "available VRAM" and limit themselves to that amount, even though more memory is technically accessible.

Applications That Can Work With Small Allocations

UMA-aware applications using hipMallocManaged:

  • Modern llama.cpp with UMA support
  • Applications specifically written for APU unified memory
  • Some newer ROCm-native applications

These can dynamically access system memory beyond BIOS allocation, though with performance trade-offs.
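
A back-of-the-envelope way to think about this: anything that does not fit in the BIOS allocation has to live in ordinary system memory. The sketch below uses a hypothetical helper, `spill_gb`, works in whole gigabytes, and ignores framework overhead; it also assumes the application is able to use memory beyond the allocation at all.

```shell
# spill_gb WORKING_SET_GB BIOS_ALLOC_GB: print how many GB of a working set
# would land outside the BIOS-allocated pool (0 if it fits entirely).
spill_gb() {
    if [ "$1" -gt "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo 0
    fi
}

# Usage:
#   spill_gb 80 64    # an 80GB working set with a 64GB allocation
```

An 80GB working set against a 64GB allocation leaves 16GB outside the pool; a 40GB working set fits entirely.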

The Kernel Bug (Fixed in 6.16.9+)

The Problem

On Strix Halo systems, kernels 6.15 and earlier had a bug where ROCm could see only ~15.5GB of VRAM, regardless of the BIOS allocation.
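
A quick way to spot this symptom is to compare the pool size that rocminfo reports against the BIOS setting. The following is only a sketch: the helper name `pool_matches_bios` and the 90% tolerance are arbitrary choices for illustration.

```shell
# pool_matches_bios REPORTED_KB BIOS_GB: succeed if the pool size reported by
# rocminfo (in KB) is at least 90% of the BIOS setting (in GB).
# The 90% threshold is an illustrative assumption, not an AMD figure.
pool_matches_bios() {
    [ $(( $1 * 10 )) -ge $(( $2 * 1048576 * 9 )) ]
}

# Usage:
#   pool_matches_bios 67108864 64 && echo "pool matches BIOS setting"
```

With a 64GB BIOS setting, a reported pool of 67108864 KB passes, while a pool of roughly 15.5GB (16252928 KB) fails the check.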

The Fix

Kernel 6.16.9+ includes fixes for:

  • UMA handling for APUs
  • HSA memory pool detection on gfx1151 (Strix Halo)
  • Proper VRAM aperture mapping

Check Your Kernel

bash
uname -r

Note: Proxmox uses custom kernels (pve branch). As of October 2025, the Proxmox 6.14.11-pve kernel predates the upstream fix. However, user testing shows that some systems work correctly anyway, which suggests one of the following:

  • Proxmox backported fixes to their 6.14 kernel
  • The bug affects specific configurations, not all Strix Halo systems
  • Different hardware/BIOS versions behave differently
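
To compare a kernel version string against 6.16.9 without eyeballing it, something like the following can help. It is a minimal sketch: the helper names are invented here, distro suffixes such as `-pve` are simply dropped, and each version component is assumed to be below 1000. As noted above, a distro kernel may carry backports, so the version number alone is not conclusive.

```shell
# version_key VERSION: turn "6.14.11-3-pve" into a sortable integer (6014011).
# Assumes major.minor.patch with each component below 1000.
version_key() {
    printf '%s\n' "$1" | sed 's/[^0-9.].*//' |
        awk -F. '{ printf "%d\n", $1 * 1000000 + $2 * 1000 + $3 }'
}

# kernel_at_least VERSION MINIMUM: succeed if VERSION >= MINIMUM.
kernel_at_least() {
    [ "$(version_key "$1")" -ge "$(version_key "$2")" ]
}

# Usage:
#   kernel_at_least "$(uname -r)" 6.16.9 && echo "kernel is 6.16.9 or newer"
```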

Recommendations by Use Case

For 128GB System Running AI Workloads

General Development & Mixed Workloads:

  • 32-48GB VRAM allocation
  • Leaves 80-96GB for VMs, containers, system
  • Good balance for variety of tasks

Primary AI/ML Training:

  • 64-80GB VRAM allocation
  • Maximizes coarse-grained memory for training
  • Still leaves 48-64GB for system operations
  • Best for serious AI development work

Extreme AI - Running 70B+ Models:

  • 96GB VRAM allocation
  • Maximum performance for large models
  • Leaves 32GB for system (sufficient for Proxmox host)

Headless Server, Minimal AI:

  • 2-4GB VRAM allocation
  • Enough for hardware encoding/transcoding
  • Maximizes RAM for VMs/containers
  • NOT recommended if AI is primary workload
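
The "leaves X for the system" figures above are simple subtraction from total RAM. A small sketch (the helper name `remaining_ram` is hypothetical) makes it easy to redo the arithmetic for other totals:

```shell
# remaining_ram TOTAL_GB VRAM_GB...: for each candidate BIOS allocation, print
# what is left for the host, VMs and containers.
remaining_ram() {
    _total=$1
    shift
    for _vram in "$@"; do
        printf '%d GB VRAM leaves %d GB for the system\n' "$_vram" "$(( _total - _vram ))"
    done
}

# Usage:
#   remaining_ram 128 32 48 64 80 96
```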

The 512MB Minimum Option

Use 512MB only if:

  • You're NOT running AI workloads
  • You want maximum RAM for VMs
  • You only need iGPU for basic tasks

DO NOT use 512MB if:

  • Running any ML/AI training
  • Using LLMs locally
  • Doing image generation (Stable Diffusion, FLUX, etc.)
  • Running compute workloads on GPU

Performance Implications

Within BIOS Allocation (Coarse-Grained)

  • ✅ Full GPU memory bandwidth
  • ✅ Optimized for bulk transfers
  • ✅ Best performance for training/inference
  • ✅ 2-4x faster than fine-grained

Beyond BIOS Allocation (Fine-Grained via hipMallocManaged)

  • ⚠️ Reduced performance
  • ⚠️ Shared coherent memory overhead
  • ⚠️ Application must explicitly support it
  • ⚠️ May cause cache thrashing

Example from Real Testing

From llama.cpp testing on gfx1103 (similar architecture):

CPU only: 6.47 tokens/second
GPU with fine-grained memory: 6.04 tokens/second (slightly slower than CPU only!)
GPU with coarse-grained memory: 9.29 tokens/second (about 44% faster than CPU only, about 54% faster than fine-grained)

The difference is significant: coarse-grained memory is roughly 1.5x faster than fine-grained in this test.
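
The relative differences between these three results can be recomputed with a one-line helper. This is just the arithmetic, sketched with a made-up function name (`pct_change`):

```shell
# pct_change BASE NEW: percentage change from BASE to NEW, one decimal place.
pct_change() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (n - b) / b * 100 }'
}

pct_change 6.47 9.29   # coarse-grained GPU vs CPU only      -> 43.6
pct_change 6.04 9.29   # coarse-grained vs fine-grained GPU  -> 53.8
pct_change 6.47 6.04   # fine-grained GPU vs CPU only        -> -6.6
```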

Practical Testing

How to Test if Your Application Needs Large Allocation

  1. Start with small allocation (2-4GB)
  2. Run your workload and monitor:
bash
   watch -n1 rocm-smi
   # or
   watch -n1 'cat /sys/class/drm/card*/device/mem_info_vram_used'
  3. Observe behavior:
    • Application errors about insufficient VRAM → needs larger allocation
    • Application runs but slow → may be using fine-grained fallback
    • Application runs fast → current allocation sufficient
  4. Gradually increase allocation if needed until performance plateaus
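
For a more readable view than the raw sysfs counter, the used and total values can be combined into a percentage. This is a sketch under a few assumptions: the helper name `vram_usage` is invented, the amdgpu files `mem_info_vram_used` and `mem_info_vram_total` hold byte counts, and the card index in the usage line (card0) may differ on your system.

```shell
# vram_usage DEVICE_DIR: print used/total VRAM in MiB and the percentage used.
# DEVICE_DIR is a directory such as /sys/class/drm/card0/device.
vram_usage() {
    _used=$(cat "$1/mem_info_vram_used") || return 1
    _total=$(cat "$1/mem_info_vram_total") || return 1
    [ "$_total" -gt 0 ] || return 1
    echo "$(( _used / 1048576 )) MiB / $(( _total / 1048576 )) MiB ($(( _used * 100 / _total ))%)"
}

# Usage (card0 is an assumption; check which card is the amdgpu device):
#   watch -n1 'cat /sys/class/drm/card0/device/mem_info_vram_used'
#   vram_usage /sys/class/drm/card0/device
```

If usage sits near 100% while the workload is slow, that is a hint the application may be running out of the allocation.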

Summary & Recommendations

The Technical Truth

What the BIOS setting does:

  • Configures hardware (memory controller, IOMMU, cache coherency) at boot
  • Creates a dedicated coarse-grained (non-coherent) memory pool optimized for GPU
  • OS discovers and respects this hardware configuration
  • Not a "polite suggestion" - it's the actual hardware state

Can GPU access more?

  • Yes - unified memory means GPU can physically access all system RAM
  • But access beyond allocation uses fine-grained (coherent) memory
  • Fine-grained has 2-4x performance penalty due to cache coherency overhead

Can OS override it?

  • Technically possible but practically avoided
  • Requires vendor-specific knowledge OS doesn't have
  • Risk of system instability too high
  • Kernel wisely defers hardware initialization to firmware

Practical Reality

For AI workloads on Strix Halo, treat the BIOS allocation as your working VRAM because:

  • Most AI applications expect and limit themselves to "available VRAM"
  • Performance is significantly better within the coarse-grained pool (2-4x faster)
  • The overhead of fine-grained access negates much of the GPU acceleration
  • Applications that spill past the allocation via hipMallocManaged often perform worse than those that stay within it

Your Specific Situation (128GB, AI-focused)

Recommended: 64-80GB BIOS allocation

This gives you:

  • Ample coarse-grained memory for large models
  • OneTrainer will have room to work
  • Still 48-64GB for Proxmox host and VMs
  • Best balance of AI performance and system flexibility

Don't Use 512MB Minimum

Unless you've abandoned AI workloads entirely, 512MB would severely handicap your system's capabilities. With 128GB total RAM, "losing" 64GB to GPU allocation still leaves you with more RAM than most systems have in total.

Key Takeaways

  1. UMA means shared physical RAM - there's no separate GPU memory chip
  2. BIOS configures hardware, not OS - sets up memory controller, cache coherency, IOMMU
  3. Creates coarse-grained pool - non-coherent memory optimized for GPU (2-4x faster)
  4. OS respects hardware config - could override but wisely doesn't (too risky without vendor knowledge)
  5. GPU can access beyond allocation - but uses slow fine-grained (coherent) memory
  6. Applications usually stay within VRAM - either by design or for performance
  7. Performance penalty is severe - coherency overhead makes fine-grained 2-4x slower
  8. Allocate generously for AI - with 128GB, you can afford to optimize for GPU performance

References & Sources

Primary Sources Used

  1. AMD Official Blog - Variable Graphics Memory FAQ
  2. ROCm GitHub Issue #5444 - Strix Halo Memory Visibility
  3. llama.cpp Issue #7399 - UMA Performance

Additional Resources

Note on Technical Details

Some technical explanations (particularly regarding cache coherency, MMU/IOMMU behavior, and why the OS doesn't override BIOS settings) are based on general computer architecture principles rather than Strix Halo-specific documentation. AMD has not published detailed low-level documentation about the memory controller configuration or coherency protocols.

The practical recommendations are based on:

  • Official AMD guidance from their VGM FAQ
  • Community testing and benchmarks
  • General principles of HSA memory architecture
  • Real-world performance data from llama.cpp and ROCm users
