Content is user-generated and unverified.

Qwen3.5 Architecture Analysis — PR #43830

PR: huggingface/transformers#43830
Author: bozheng-hit (Qwen Team, Alibaba Group)
Status: Open (as of Feb 8, 2026)
Branch: bozheng-hit:qwen3_5 → huggingface:main (17 commits)


Overview

This PR adds two new model families to HuggingFace Transformers:

  1. Qwen3.5 (dense) — multimodal VLM with hybrid attention
  2. Qwen3.5-MoE — same architecture but with Mixture-of-Experts MLP layers

Both are vision-language models (VLMs) that pair a text backbone with a vision encoder. Their defining architectural feature is a hybrid attention stack that interleaves full softmax attention layers with Gated DeltaNet linear attention layers.


Files Changed (28 files)

Category      Files
Dense model   qwen3_5/configuration_qwen3_5.py (307 lines), modular_qwen3_5.py (841 lines), modeling_qwen3_5.py (2194 lines), tokenization_qwen3_5.py (94 lines)
MoE model     qwen3_5_moe/configuration_qwen3_5_moe.py (330 lines), modular_qwen3_5_moe.py (464 lines), modeling_qwen3_5_moe.py (2414 lines)
Auto classes  configuration_auto.py, modeling_auto.py, processing_auto.py, tokenization_auto.py, image_processing_auto.py, video_processing_auto.py
Docs          qwen3_5.md, qwen3_5_moe.md
Tests         test_modeling_qwen3_5.py, test_modeling_qwen3_5_moe.py
Other         conversion_mapping.py, modeling_rope_utils.py, modeling_flash_attention_utils.py

Inheritance Hierarchy

Qwen3.5 Dense:
  Qwen3_5TextConfig  ← Qwen3NextConfig (inherits, deletes MoE params)
  Qwen3_5VisionConfig ← Qwen3VLVisionConfig (removes deepstack)
  Qwen3_5Config       ← Qwen3VLConfig (composite: text + vision)
  
  Qwen3_5GatedDeltaNet  ← Qwen3NextGatedDeltaNet (refactored projections)
  Qwen3_5Attention      ← Qwen3NextAttention
  Qwen3_5MLP            ← Qwen3NextMLP
  Qwen3_5DecoderLayer   ← GradientCheckpointingLayer (custom, hybrid dispatch)
  Qwen3_5TextModel      ← Qwen3NextModel
  Qwen3_5Model          ← Qwen3VLModel (full VLM)
  Qwen3_5ForCausalLM    ← Qwen3ForCausalLM
  Qwen3_5ForConditionalGeneration ← Qwen3VLForConditionalGeneration

Qwen3.5 MoE:
  Qwen3_5MoeTextConfig  ← Qwen3NextConfig (keeps MoE params)
  Qwen3_5MoeDecoderLayer ← Qwen3NextDecoderLayer (uses SparseMoeBlock)
  Qwen3_5MoeForCausalLM  ← Qwen3NextForCausalLM
  Qwen3_5MoeForConditionalGeneration ← Qwen3VLMoeForConditionalGeneration

Qwen3.5 Dense Text Configuration

Default reference model: Qwen/Qwen3.5-9B-Instruct

Parameter                 Default   Description
vocab_size                248,320   Vocabulary size
hidden_size               4,096     Hidden dimension
intermediate_size         12,288    MLP intermediate size (3× hidden)
num_hidden_layers         32        Transformer layers
num_attention_heads       16        Query heads (full attention)
num_key_value_heads       4         KV heads (GQA, 4:1 ratio)
head_dim                  256       Attention head dimension
hidden_act                "silu"    Activation function
max_position_embeddings   32,768    Max sequence length
rms_norm_eps              1e-6      RMSNorm epsilon
attention_bias            False     No bias in QKV/O projections
partial_rotary_factor     0.25      Only 25% of head_dim uses RoPE
tie_word_embeddings       False     Separate embed/unembed

Linear Attention Parameters (Gated DeltaNet)

Parameter                Default   Description
linear_conv_kernel_dim   4         Conv1d kernel size for linear attention
linear_key_head_dim      128       Key head dimension in linear layers
linear_value_head_dim    128       Value head dimension in linear layers
linear_num_key_heads     16        Number of key heads (linear attention)
linear_num_value_heads   32        Number of value heads (linear attention)
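
The projection widths implied by these defaults can be worked out directly. The sketch below assumes that queries share the key head count and head dimension, and that the fused Q/K/V projection spans Q + K + V channels, as in Qwen3-Next. Both points and all variable names other than the config parameters are assumptions, not taken from the PR.

```python
# Defaults copied from the linear attention parameter table.
linear_key_head_dim = 128
linear_value_head_dim = 128
linear_num_key_heads = 16
linear_num_value_heads = 32

# Assumption: Q uses the same head count and head dim as K.
key_dim = linear_num_key_heads * linear_key_head_dim
value_dim = linear_num_value_heads * linear_value_head_dim

# Assumption: the fused Q/K/V projection covers Q + K + V channels.
qkv_dim = 2 * key_dim + value_dim

# Number of value heads that share one key head.
value_heads_per_key_head = linear_num_value_heads // linear_num_key_heads

print(key_dim, value_dim, qkv_dim, value_heads_per_key_head)
```

Under these assumptions the value path is twice as wide as the key path, which is the reverse of the GQA layout used in the full attention layers.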

Layer Type Pattern

By default, full_attention_interval=4, producing this repeating pattern across layers:

Layer  0: linear_attention
Layer  1: linear_attention  
Layer  2: linear_attention
Layer  3: full_attention    ← every 4th layer
Layer  4: linear_attention
Layer  5: linear_attention
Layer  6: linear_attention
Layer  7: full_attention    ← every 4th layer
...

Ratio: 75% linear attention, 25% full attention (24 linear and 8 full out of 32 layers).
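
A minimal sketch of how such a pattern can be derived from the interval. The helper name build_layer_types is hypothetical; the PR may compute or store the list differently.

```python
def build_layer_types(num_hidden_layers: int, full_attention_interval: int) -> list[str]:
    """Mark every `full_attention_interval`-th layer (counting from 1) as full attention."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_hidden_layers)
    ]


layer_types = build_layer_types(num_hidden_layers=32, full_attention_interval=4)
full_layers = [i for i, t in enumerate(layer_types) if t == "full_attention"]

print(full_layers)
print(len(full_layers), "full attention layers out of", len(layer_types))
```

With the dense defaults this places full attention at layers 3, 7, 11, and so on up to layer 31.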


Qwen3.5-MoE Text Configuration

Default reference model: Qwen/Qwen3.5-35B-A3B-Instruct

Parameter             Default   Description
vocab_size            248,320   Same vocabulary
hidden_size           2,048     Smaller hidden dim than dense
num_hidden_layers     40        More layers
num_attention_heads   16        Same head count
num_key_value_heads   2         More aggressive GQA (8:1)
head_dim              256       Same head dimension

MoE-Specific Parameters

Parameter                         Default   Description
moe_intermediate_size             512       Per-expert intermediate size
shared_expert_intermediate_size   512       Shared expert intermediate size
num_experts                       256       Total routed experts
num_experts_per_tok               8         Top-K experts per token
output_router_logits              False     Return router logits for aux loss
router_aux_loss_coef              0.001     Auxiliary loss coefficient

Naming convention: Qwen3.5-35B-A3B = 35B total params, 3B active per token.
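
A rough sanity check of the total versus active split, counting only expert MLP weights. It assumes each expert is a bias-free gated MLP with three matrices (gate, up, down); attention, embeddings, router, and vision weights are ignored, so the figures are lower bounds rather than exact counts. The helper gated_mlp_params is hypothetical.

```python
# MoE defaults from the tables above.
hidden_size = 2048
moe_intermediate_size = 512
shared_expert_intermediate_size = 512
num_experts = 256
num_experts_per_tok = 8
num_hidden_layers = 40


def gated_mlp_params(hidden: int, intermediate: int) -> int:
    # Assumption: gate, up, and down projections, no biases.
    return 3 * hidden * intermediate


per_expert = gated_mlp_params(hidden_size, moe_intermediate_size)
shared = gated_mlp_params(hidden_size, shared_expert_intermediate_size)

# All routed experts plus the shared expert, in every layer.
total_expert_params = num_hidden_layers * (num_experts * per_expert + shared)
# Only the top-k routed experts plus the shared expert are used per token.
active_expert_params = num_hidden_layers * (num_experts_per_tok * per_expert + shared)

print(f"total expert params:  {total_expert_params / 1e9:.2f}B")
print(f"active expert params: {active_expert_params / 1e9:.2f}B")
```

Under these assumptions the experts account for roughly 32B of the ~35B total but only about 1.1B of the ~3B active per token; the remainder would come from the components this estimate leaves out.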


Vision Configuration (shared by both variants)

Parameter                 Default               Description
depth                     27                    Vision transformer layers
hidden_size               1,152                 ViT hidden dimension
hidden_act                "gelu_pytorch_tanh"   ViT activation
intermediate_size         4,304                 ViT MLP size
num_heads                 16                    ViT attention heads
in_channels               3                     RGB input
patch_size                16                    16×16 patches
spatial_merge_size        2                     2×2 spatial merge
temporal_patch_size       2                     Temporal patch for video
out_hidden_size           3,584                 Vision→text projection dim
num_position_embeddings   2,304                 Max vision positions
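
The patch and merge settings determine how many visual tokens an image contributes. The sketch below works through that arithmetic; the helper visual_tokens is hypothetical, and treating the 2,304 position embeddings as a square grid is an assumption based on how Qwen3-VL handles them.

```python
import math

# Vision defaults from the table above.
patch_size = 16
spatial_merge_size = 2
temporal_patch_size = 2
in_channels = 3
num_position_embeddings = 2304

# Flattened size of one spatio-temporal patch before embedding.
patch_input_dim = in_channels * temporal_patch_size * patch_size * patch_size

# Assumption: the learned positions form a square grid.
grid_side = math.isqrt(num_position_embeddings)


def visual_tokens(height: int, width: int) -> int:
    """Tokens handed to the text model for one image, after the spatial merge.

    Assumes height and width are multiples of patch_size * spatial_merge_size.
    """
    patches = (height // patch_size) * (width // patch_size)
    return patches // (spatial_merge_size ** 2)


print(patch_input_dim, grid_side, visual_tokens(768, 768))
```

A 768×768 image yields a 48×48 patch grid, which the 2×2 merge reduces to 576 tokens.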

VLM Composite Config

Parameter               Default   Description
image_token_id          248,056   Special token for images
video_token_id          248,057   Special token for videos
vision_start_token_id   248,053   Vision segment start
vision_end_token_id     248,054   Vision segment end

Key Architectural Features

1. Hybrid Linear + Full Attention

Each decoder layer is either a full_attention layer (standard softmax attention with GQA + RoPE) or a linear_attention layer (Gated DeltaNet). The DecoderLayer.forward() dispatches based on layer_type:

```python
if self.layer_type == "linear_attention":
    hidden_states = self.linear_attn(...)
elif self.layer_type == "full_attention":
    hidden_states, _ = self.self_attn(...)
```

2. Gated DeltaNet (Linear Attention)

The Qwen3_5GatedDeltaNet module implements a recurrent linear attention mechanism:

  • Separate projections: in_proj_qkv (fused Q/K/V), in_proj_z (gate), in_proj_b (beta), in_proj_a (alpha)
  • Causal Conv1d on Q/K/V before splitting (kernel size 4)
  • Chunk-wise gated delta rule for efficient training; single-step recurrent mode for generation
  • Uses A_log and dt_bias learnable parameters (similar to Mamba/S6 style parameterization)
  • Key formula: g = -A_log.exp() * softplus(a + dt_bias)
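
To make the gate formula and the single-step recurrent mode concrete, here is a single-head sketch in plain Python. It follows the gated delta rule as generally described for Gated DeltaNet: decay the state, compare the stored value for the current key with the new value, and write back the scaled difference. The function names are hypothetical, and query/key normalization, scaling, batching, and the conv step are omitted, so this is an illustration rather than the PR's implementation.

```python
import math


def softplus(x: float) -> float:
    return math.log1p(math.exp(x))


def gate(A_log: float, a: float, dt_bias: float) -> float:
    """Log-decay g = -exp(A_log) * softplus(a + dt_bias); always negative."""
    return -math.exp(A_log) * softplus(a + dt_bias)


def delta_step(state, q, k, v, g, beta):
    """One recurrent step for one head. `state` is a d_k x d_v nested list."""
    d_k, d_v = len(k), len(v)
    decay = math.exp(g)  # in (0, 1) because g < 0
    state = [[decay * s for s in row] for row in state]
    # Value currently stored under key k.
    kv_mem = [sum(state[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # Delta rule: move the stored value toward v, scaled by beta.
    delta = [beta * (v[j] - kv_mem[j]) for j in range(d_v)]
    state = [[state[i][j] + k[i] * delta[j] for j in range(d_v)] for i in range(d_k)]
    # Read out with the query.
    out = [sum(state[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]
    return out, state


g = gate(A_log=0.0, a=0.0, dt_bias=0.0)
state = [[0.0, 0.0], [0.0, 0.0]]
out1, state = delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], g=g, beta=1.0)
out2, state = delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], g=g, beta=1.0)
print(g, out1, out2)
```

With a unit-norm key and beta = 1, each step fully overwrites the value stored under that key, so reading back with the same query returns v even though the state decayed in between.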

3. Custom DynamicCache

Qwen3_5DynamicCache extends Qwen3NextDynamicCache to handle both:

  • Standard KV cache for full attention layers
  • Recurrent state (conv_states, recurrent_states) for linear attention layers
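
The practical difference between the two cache kinds is how they scale with sequence length. The sketch below estimates element counts per sequence from the dense defaults. The state shapes, recurrent state as (value heads, key head dim, value head dim) and conv state as (Q/K/V channels, kernel size), are assumptions modeled on Qwen3-Next, and the function names are hypothetical.

```python
# Dense defaults from the configuration tables.
num_hidden_layers = 32
full_attention_interval = 4
num_key_value_heads = 4
head_dim = 256
linear_num_key_heads = 16
linear_num_value_heads = 32
linear_key_head_dim = 128
linear_value_head_dim = 128
linear_conv_kernel_dim = 4

num_full = num_hidden_layers // full_attention_interval
num_linear = num_hidden_layers - num_full


def kv_cache_elements(seq_len: int) -> int:
    """Keys plus values across all full attention layers; grows with seq_len."""
    return num_full * 2 * num_key_value_heads * head_dim * seq_len


def linear_state_elements() -> int:
    """Recurrent plus conv state across all linear layers; independent of seq_len."""
    recurrent = linear_num_value_heads * linear_key_head_dim * linear_value_head_dim
    qkv_channels = (2 * linear_num_key_heads * linear_key_head_dim
                    + linear_num_value_heads * linear_value_head_dim)
    conv = qkv_channels * linear_conv_kernel_dim
    return num_linear * (recurrent + conv)


print(kv_cache_elements(1), linear_state_elements())
```

Under these assumptions the 8 full attention layers add 16,384 elements per token, while the 24 linear layers hold a fixed 13,369,344 elements regardless of length; the KV cache overtakes the fixed state at 816 tokens.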

4. Multimodal RoPE (M-RoPE)

Position IDs are 3D, with one component each for (temporal, height, width):

  • Text-only sequences use identical positions across all 3 dims
  • Images/videos use 2D spatial positions plus a temporal position
  • mrope_section = [11, 11, 10] defines how the 32 RoPE dim pairs are split across the 3 axes
  • partial_rotary_factor = 0.25 means only 64 of the 256 head_dim channels receive RoPE
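
The numbers in this section are mutually consistent, which the short check below confirms. It also shows what identical positions on all three axes looks like for text tokens. The helper text_position_ids is hypothetical, and the sketch makes no claim about how the PR lays the three sections out within the rotary dimensions.

```python
head_dim = 256
partial_rotary_factor = 0.25
mrope_section = [11, 11, 10]  # pairs for (temporal, height, width)

rotary_dim = int(head_dim * partial_rotary_factor)  # channels that receive RoPE
num_pairs = rotary_dim // 2                          # RoPE rotates channels in pairs
assert sum(mrope_section) == num_pairs


def text_position_ids(seq_len: int, start: int = 0) -> list[list[int]]:
    """Text-only tokens: the same 1D position repeated on all three axes."""
    positions = list(range(start, start + seq_len))
    return [positions[:] for _ in range(3)]


ids = text_position_ids(4)
print(rotary_dim, num_pairs, ids)
```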

5. RMSNorm with Offset

Qwen3.5 applies RMSNorm with a (1 + weight) scale: output = (1 + weight) * normalize(x). Because weight is zero-initialized, the learned scale starts at exactly 1. This is inherited from Qwen3-Next.
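
A minimal pure-Python sketch of this normalization on a single vector. The function name is hypothetical and dtype handling is omitted.

```python
import math


def rmsnorm_with_offset(x, weight, eps=1e-6):
    """Scale x to unit RMS, then multiply each channel by (1 + weight)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [(1.0 + w) * v * inv_rms for v, w in zip(x, weight)]


x = [3.0, -4.0]
y_init = rmsnorm_with_offset(x, weight=[0.0, 0.0])    # zero-initialized weight
y_scaled = rmsnorm_with_offset(x, weight=[1.0, 1.0])  # effective scale of 2
print(y_init, y_scaled)
```

With zero weights the output is simply x rescaled to unit RMS; a weight of 1.0 doubles it.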

6. MoE Architecture (MoE variant)

The MoE variant replaces the standard MLP with SparseMoeBlock:

  • 256 routed experts with Top-K=8 routing
  • Plus a shared expert always active
  • TopKRouter with configurable auxiliary loss
  • Fused gate_up_proj (packed) for experts
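
The routing step can be sketched on scalars to show the control flow. This assumes softmax over router logits, top-k selection, and renormalization of the selected weights, with the shared expert added ungated. The renormalization and the ungated shared expert are assumptions, and route_top_k and moe_forward are hypothetical names.

```python
import math


def route_top_k(router_logits, top_k, renormalize=True):
    """Return (expert indices, routing weights) for one token."""
    m = max(router_logits)
    exps = [math.exp(l - m) for l in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in top]
    if renormalize:  # assumption: selected weights are rescaled to sum to 1
        s = sum(weights)
        weights = [w / s for w in weights]
    return top, weights


def moe_forward(x, router_logits, experts, shared_expert, top_k):
    """Shared expert output plus the weighted sum of the top-k routed experts."""
    indices, weights = route_top_k(router_logits, top_k)
    out = shared_expert(x)
    for i, w in zip(indices, weights):
        out += w * experts[i](x)
    return out


# Toy experts: expert i multiplies its input by i.
experts = [lambda x, s=s: s * x for s in range(4)]
indices, weights = route_top_k([0.0, 1.0, 3.0, 2.0], top_k=2)
y = moe_forward(1.0, [0.0, 1.0, 3.0, 2.0], experts, shared_expert=lambda x: 0.0, top_k=2)
print(indices, weights, y)
```

Only the selected experts run for a given token, which is why the active parameter count is far smaller than the total.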

Model Variants Available

Model                      Type    Total Params   Active Params   Layers   Hidden   Heads   KV Heads
Qwen3.5-9B-Instruct        Dense   ~9B            ~9B             32       4096     16      4
Qwen3.5-35B-A3B-Instruct   MoE     ~35B           ~3B             40       2048     16      2

Notable Design Decisions

  1. Builds on Qwen3-Next (already merged via PR #40771), which is itself a hybrid attention model. Qwen3.5 refactors the linear attention projections and, for the dense variant, removes the MoE params
  2. Vision encoder is inherited from Qwen3-VL but removes DeepStack (multi-layer visual feature injection)
  3. No intermediate_size in MoE config — replaced by moe_intermediate_size and shared_expert_intermediate_size
  4. The dense model deletes MoE-related attributes inherited from Qwen3NextConfig via del self.moe_intermediate_size etc.
  5. Uses the modular_ pattern — the source of truth is modular_qwen3_5.py, and modeling_qwen3_5.py is auto-generated