PR: huggingface/transformers#43830
Author: bozheng-hit (Qwen Team, Alibaba Group)
Status: Open (as of Feb 8, 2026)
Branch: bozheng-hit:qwen3_5 → huggingface:main (17 commits)
This PR adds two new model families to Hugging Face Transformers: Qwen3.5 (dense, `qwen3_5`) and Qwen3.5 MoE (mixture-of-experts, `qwen3_5_moe`).
Both models are vision-language models (VLMs) that combine a text backbone with a vision encoder. The key architectural innovation is a hybrid attention mechanism mixing full softmax attention with Gated DeltaNet linear attention.
| Category | Files |
|---|---|
| Dense model | qwen3_5/configuration_qwen3_5.py (307 lines), modular_qwen3_5.py (841 lines), modeling_qwen3_5.py (2194 lines), tokenization_qwen3_5.py (94 lines) |
| MoE model | qwen3_5_moe/configuration_qwen3_5_moe.py (330 lines), modular_qwen3_5_moe.py (464 lines), modeling_qwen3_5_moe.py (2414 lines) |
| Auto classes | configuration_auto.py, modeling_auto.py, processing_auto.py, tokenization_auto.py, image_processing_auto.py, video_processing_auto.py |
| Docs | qwen3_5.md, qwen3_5_moe.md |
| Tests | test_modeling_qwen3_5.py, test_modeling_qwen3_5_moe.py |
| Other | conversion_mapping.py, modeling_rope_utils.py, modeling_flash_attention_utils.py |
Qwen3.5 Dense:
Qwen3_5TextConfig ← Qwen3NextConfig (inherits, deletes MoE params)
Qwen3_5VisionConfig ← Qwen3VLVisionConfig (removes deepstack)
Qwen3_5Config ← Qwen3VLConfig (composite: text + vision)
Qwen3_5GatedDeltaNet ← Qwen3NextGatedDeltaNet (refactored projections)
Qwen3_5Attention ← Qwen3NextAttention
Qwen3_5MLP ← Qwen3NextMLP
Qwen3_5DecoderLayer ← GradientCheckpointingLayer (custom, hybrid dispatch)
Qwen3_5TextModel ← Qwen3NextModel
Qwen3_5Model ← Qwen3VLModel (full VLM)
Qwen3_5ForCausalLM ← Qwen3ForCausalLM
Qwen3_5ForConditionalGeneration ← Qwen3VLForConditionalGeneration
Qwen3.5 MoE:
Qwen3_5MoeTextConfig ← Qwen3NextConfig (keeps MoE params)
Qwen3_5MoeDecoderLayer ← Qwen3NextDecoderLayer (uses SparseMoeBlock)
Qwen3_5MoeForCausalLM ← Qwen3NextForCausalLM
Qwen3_5MoeForConditionalGeneration ← Qwen3VLMoeForConditionalGeneration

Default reference model: Qwen/Qwen3.5-9B-Instruct
| Parameter | Default | Description |
|---|---|---|
| vocab_size | 248,320 | Vocabulary size |
| hidden_size | 4,096 | Hidden dimension |
| intermediate_size | 12,288 | MLP intermediate size (3× hidden) |
| num_hidden_layers | 32 | Transformer layers |
| num_attention_heads | 16 | Query heads (full attention) |
| num_key_value_heads | 4 | KV heads (GQA, 4:1 ratio) |
| head_dim | 256 | Attention head dimension |
| hidden_act | "silu" | Activation function |
| max_position_embeddings | 32,768 | Max sequence length |
| rms_norm_eps | 1e-6 | RMSNorm epsilon |
| attention_bias | False | No bias in QKV/O projections |
| partial_rotary_factor | 0.25 | Only 25% of head_dim uses RoPE |
| tie_word_embeddings | False | Separate embed/unembed |
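The attention defaults above imply a few derived quantities that are easy to check by hand. The following is a minimal sketch of that arithmetic, not code from the PR; the variable names `gqa_group_size`, `rotary_dim`, and `rotary_pairs` are illustrative, and `mrope_section` is the value quoted later in this summary.

```python
# Defaults copied from the dense text-config table above.
num_attention_heads = 16
num_key_value_heads = 4
head_dim = 256
partial_rotary_factor = 0.25
mrope_section = [11, 11, 10]  # value quoted in the architecture notes of this summary

# Grouped-query attention: how many query heads share one KV head.
gqa_group_size = num_attention_heads // num_key_value_heads

# Only a fraction of each head is rotated by RoPE.
rotary_dim = int(head_dim * partial_rotary_factor)

# RoPE rotates dimensions in pairs, so the number of pairs is half of rotary_dim.
rotary_pairs = rotary_dim // 2

print(gqa_group_size, rotary_dim, rotary_pairs, sum(mrope_section))
```

With these defaults the 3-way `mrope_section` split accounts for every rotary pair, which is consistent with the "32 RoPE dim pairs" mentioned in the architecture notes.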
| Parameter | Default | Description |
|---|---|---|
| linear_conv_kernel_dim | 4 | Conv1d kernel size for linear attention |
| linear_key_head_dim | 128 | Key head dimension in linear layers |
| linear_value_head_dim | 128 | Value head dimension in linear layers |
| linear_num_key_heads | 16 | Number of key heads (linear attention) |
| linear_num_value_heads | 32 | Number of value heads (linear attention) |
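To make the linear-attention defaults concrete, here is a rough sketch of the widths they imply. It assumes the Qwen3-Next convention (key width = key heads × key head dim, value width = value heads × value head dim, and a convolution over the concatenated Q, K, V channels); whether Qwen3.5's refactored projections keep exactly this layout is an assumption, and all derived names are illustrative.

```python
# Defaults copied from the linear-attention table above.
linear_key_head_dim = 128
linear_value_head_dim = 128
linear_num_key_heads = 16
linear_num_value_heads = 32

# Total key and value widths across heads.
key_dim = linear_num_key_heads * linear_key_head_dim
value_dim = linear_num_value_heads * linear_value_head_dim

# Assumption (Qwen3-Next convention): the short conv runs over Q, K, and V together,
# where Q has the same width as K.
conv_dim = 2 * key_dim + value_dim

# Each key head is shared by this many value heads.
value_heads_per_key_head = linear_num_value_heads // linear_num_key_heads

# Assumption: one (key_head_dim x value_head_dim) state matrix per value head.
# This count does not depend on sequence length, unlike a KV cache.
recurrent_state_elems_per_layer = (
    linear_num_value_heads * linear_key_head_dim * linear_value_head_dim
)

print(key_dim, value_dim, conv_dim, recurrent_state_elems_per_layer)
```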
By default, full_attention_interval=4, producing this repeating pattern across layers:
Layer 0: linear_attention
Layer 1: linear_attention
Layer 2: linear_attention
Layer 3: full_attention ← every 4th layer
Layer 4: linear_attention
Layer 5: linear_attention
Layer 6: linear_attention
Layer 7: full_attention ← every 4th layer
...

Ratio: 75% linear attention, 25% full attention (8 full out of 32 layers).
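The pattern above can be reproduced from `full_attention_interval` alone. This is a minimal sketch; `build_layer_types` is a hypothetical helper written for illustration, not a function from the PR.

```python
def build_layer_types(num_hidden_layers: int, full_attention_interval: int) -> list:
    """Return one layer-type string per layer.

    Every `full_attention_interval`-th layer (1-based) uses full attention;
    all other layers use linear attention.
    """
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_hidden_layers)
    ]


layer_types = build_layer_types(num_hidden_layers=32, full_attention_interval=4)
num_full = layer_types.count("full_attention")
print(layer_types[:8])
print(num_full, len(layer_types) - num_full)
```

The same helper applied to the 40-layer MoE default would give 10 full-attention layers, keeping the same 1-in-4 ratio.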
Default reference model: Qwen/Qwen3.5-35B-A3B-Instruct
| Parameter | Default | Description |
|---|---|---|
| vocab_size | 248,320 | Same vocabulary |
| hidden_size | 2,048 | Smaller hidden dim than dense |
| num_hidden_layers | 40 | More layers |
| num_attention_heads | 16 | Same head count |
| num_key_value_heads | 2 | More aggressive GQA (8:1) |
| head_dim | 256 | Same head dimension |
| Parameter | Default | Description |
|---|---|---|
| moe_intermediate_size | 512 | Per-expert intermediate size |
| shared_expert_intermediate_size | 512 | Shared expert intermediate size |
| num_experts | 256 | Total routed experts |
| num_experts_per_tok | 8 | Top-K experts per token |
| output_router_logits | False | Return router logits for aux loss |
| router_aux_loss_coef | 0.001 | Auxiliary loss coefficient |
Naming convention: Qwen3.5-35B-A3B = 35B total params, 3B active per token.
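The gap between total and active parameters comes from top-K routing: each token only runs through `num_experts_per_tok` of the `num_experts` routed experts. Below is a generic sketch of that selection step in plain Python. It assumes softmax over all experts followed by top-K selection and renormalization, which is a common scheme but not confirmed here as the PR's exact router; `route_token` is a hypothetical name.

```python
import math
import random


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and return (expert_index, weight) pairs.

    Assumption: softmax over all experts, then keep the top_k probabilities
    and renormalize them so the kept weights sum to 1.
    """
    max_logit = max(router_logits)
    exps = [math.exp(x - max_logit) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


num_experts = 256          # default from the table above
num_experts_per_tok = 8    # default from the table above

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(num_experts)]  # stand-in router output
selected = route_token(logits, num_experts_per_tok)
active_fraction = num_experts_per_tok / num_experts

print(len(selected), active_fraction)
```

With the defaults, each token touches 8 of 256 routed experts (about 3% of them), in addition to the shared expert.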
| Parameter | Default | Description |
|---|---|---|
| depth | 27 | Vision transformer layers |
| hidden_size | 1,152 | ViT hidden dimension |
| hidden_act | "gelu_pytorch_tanh" | ViT activation |
| intermediate_size | 4,304 | ViT MLP size |
| num_heads | 16 | ViT attention heads |
| in_channels | 3 | RGB input |
| patch_size | 16 | 16×16 patches |
| spatial_merge_size | 2 | 2×2 spatial merge |
| temporal_patch_size | 2 | Temporal patch for video |
| out_hidden_size | 3,584 | Vision→text projection dim |
| num_position_embeddings | 2,304 | Max vision positions |
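As a rough illustration of how `patch_size`, `spatial_merge_size`, and `temporal_patch_size` interact, the sketch below estimates how many vision tokens reach the text model for a given input size. `vision_token_count` is a hypothetical helper, and it assumes the input is already resized to a multiple of `patch_size * spatial_merge_size` and that frames are grouped in chunks of `temporal_patch_size` (rounding up); the actual processor's resizing rules are not covered here.

```python
import math


def vision_token_count(height, width, frames=1,
                       patch_size=16, spatial_merge_size=2, temporal_patch_size=2):
    """Estimate merged vision tokens for an image (frames=1) or a video clip.

    Assumes height and width are multiples of patch_size * spatial_merge_size.
    """
    unit = patch_size * spatial_merge_size
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    grid_t = math.ceil(frames / temporal_patch_size)  # temporal patches
    grid_h = height // patch_size                     # patch rows
    grid_w = width // patch_size                      # patch columns
    patches = grid_t * grid_h * grid_w
    # Each spatial_merge_size x spatial_merge_size block of patches becomes one token.
    return patches // (spatial_merge_size ** 2)


image_tokens = vision_token_count(768, 768)
video_tokens = vision_token_count(224, 224, frames=4)
print(image_tokens, video_tokens)
```

A 768×768 image gives a 48×48 patch grid, i.e. 2,304 patches before merging, which matches the `num_position_embeddings` default (2,304 = 48²).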
| Parameter | Default | Description |
|---|---|---|
| image_token_id | 248,056 | Special token for images |
| video_token_id | 248,057 | Special token for videos |
| vision_start_token_id | 248,053 | Vision segment start |
| vision_end_token_id | 248,054 | Vision segment end |
Each decoder layer is either a full_attention layer (standard softmax attention with GQA + RoPE) or a linear_attention layer (Gated DeltaNet). DecoderLayer.forward() dispatches on layer_type:

    if self.layer_type == "linear_attention":
        hidden_states = self.linear_attn(...)
    elif self.layer_type == "full_attention":
        hidden_states, _ = self.self_attn(...)

The Qwen3_5GatedDeltaNet module implements a recurrent linear attention mechanism:
- Input projections: in_proj_qkv (fused Q/K/V), in_proj_z (gate), in_proj_b (beta), in_proj_a (alpha)
- A_log and dt_bias learnable parameters (similar to Mamba/S6-style parameterization)
- Decay gate: g = -A_log.exp() * softplus(a + dt_bias)

Qwen3_5DynamicCache extends Qwen3NextDynamicCache to handle both:
- Key/value cache for full attention layers
- Recurrent state (conv_states, recurrent_states) for linear attention layers

Position IDs are 3D: (temporal, height, width). mrope_section = [11, 11, 10] defines how the 32 RoPE dimension pairs are split across the 3 axes.

Qwen3.5's RMSNorm scales by (1 + weight) rather than by weight, so zero-initialized weights start as an identity scale. This is inherited from Qwen3-Next.
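Two of the formulas in this section, the decay gate `g = -A_log.exp() * softplus(a + dt_bias)` and the `(1 + weight) * x` normalization, are small enough to check numerically. The following is a scalar, pure-Python sketch of both; the function names are hypothetical, the real modules operate on tensors, and applying the scale after RMS normalization is an assumption based on the RMSNorm convention.

```python
import math


def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def decay_gate(a_log, a, dt_bias):
    """g = -exp(A_log) * softplus(a + dt_bias); always negative, so exp(g) lies in (0, 1)."""
    return -math.exp(a_log) * softplus(a + dt_bias)


def one_plus_weight_rms_norm(x, weight, eps=1e-6):
    """RMS-normalize x, then scale each element by (1 + weight)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [(1.0 + w) * v * inv_rms for v, w in zip(x, weight)]


g = decay_gate(a_log=0.0, a=0.5, dt_bias=0.1)
decay = math.exp(g)  # multiplicative decay applied to the recurrent state

x = [3.0, 4.0]
normed = one_plus_weight_rms_norm(x, weight=[0.0, 0.0])  # zero-init weights
print(g, decay, normed)
```

Because `g` is never positive, `exp(g)` acts as a per-step forgetting factor between 0 and 1. With zero-initialized weights the normalization reduces to plain RMS normalization, which is what "starts as identity" refers to.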
The MoE variant replaces the standard MLP with SparseMoeBlock:
- TopKRouter with configurable auxiliary loss
- gate_up_proj (packed) for experts

| Model | Type | Total Params | Active Params | Layers | Hidden | Heads | KV Heads |
|---|---|---|---|---|---|---|---|
| Qwen3.5-9B-Instruct | Dense | ~9B | ~9B | 32 | 4096 | 16 | 4 |
| Qwen3.5-35B-A3B-Instruct | MoE | ~35B | ~3B | 40 | 2048 | 16 | 2 |

- intermediate_size in the MoE config is replaced by moe_intermediate_size and shared_expert_intermediate_size
- The dense text config drops the MoE parameters inherited from Qwen3NextConfig via del self.moe_intermediate_size etc.
- The models follow the modular_ pattern: the source of truth is modular_qwen3_5.py, and modeling_qwen3_5.py is auto-generated