Content is user-generated and unverified.

Weight Decay & Structured Sparsity - Conversation Summary

Core Questions Explored

Q1-Q5: Understanding OBD (Optimal Brain Damage) concepts:

Free parameters vs connections: Multiple connections can share parameters (weight sharing, constraints)
VC dimension: Measures the capacity of a model class (the largest number of points it can shatter) - higher VC dimension = can fit more complex patterns but carries more overfitting risk
Parameter count as complexity: More non-zero parameters = more model capacity and overfitting potential
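Bias-variance trade-off: Framed as minimizing Total Cost = Training Error + Complexity Penalty; too little complexity underfits (high bias), too much overfits (high variance)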
Two literature perspectives: Statistical (theoretical bounds) vs Neural Network (practical heuristics)

Q6-Q8: Weight decay mechanisms:

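Weight decay: Adds a λ×Σ(w²) penalty to the loss, so each gradient step shrinks every weight toward zero in proportion to its size
Continuous decay with disproportionate rates: Small-magnitude weights decay disproportionately fast, i.e. by a larger fraction per step than large weights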
Non-proportional/gating: Different decay rates per weight based on importance
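
The difference between proportional and non-proportional decay can be sketched in a few lines. This is a minimal illustration, not the method from any particular paper: the saturating penalty w²/(w0² + w²), the scale w0, and the function names are assumptions chosen to show the "small weights decay faster" behaviour.

```python
def l2_decay_step(w, grad, lr=0.1, lam=0.01):
    """One gradient step on loss + lam * sum(w**2).

    The penalty adds 2 * lam * w to the gradient, so every weight
    shrinks by the same fraction (proportional decay).
    """
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]


def nonproportional_decay_step(w, grad, lr=0.1, lam=0.01, w0=1.0):
    """One gradient step on loss + lam * sum(w**2 / (w0**2 + w**2)).

    Assumed penalty for illustration. Its gradient is
    2 * lam * w * w0**2 / (w0**2 + w**2)**2, which is relatively
    largest for |w| << w0, so small weights lose a larger fraction.
    """
    out = []
    for wi, gi in zip(w, grad):
        pen = 2.0 * lam * wi * w0 ** 2 / (w0 ** 2 + wi ** 2) ** 2
        out.append(wi - lr * (gi + pen))
    return out


weights = [0.1, 5.0]
zero_grad = [0.0, 0.0]  # zero data gradient isolates the penalty's effect

l2 = l2_decay_step(weights, zero_grad)
nonprop = nonproportional_decay_step(weights, zero_grad)

# Fraction of each weight that survives one step
l2_ratios = [new / old for new, old in zip(l2, weights)]
nonprop_ratios = [new / old for new, old in zip(nonprop, weights)]
```

Under plain L2 decay both weights keep the same fraction of their value; under the saturating penalty the small weight loses proportionally more than the large one.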

Key Discovery: Traditional Weight Decay Limitations

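Critical insight: L2 weight decay doesn't generate sparse solutions. Its pull on a weight is proportional to the weight itself, so the pull fades as the weight approaches zero; weights become small but almost never exactly zero, so no true sparsity results.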
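
A small numerical sketch makes the point concrete. It compares repeated L2 shrinkage with an L1 soft-threshold step, used here only as a contrasting sparsity-inducing update; the function names, learning rate, and λ are illustrative assumptions.

```python
def l2_shrink(w, lr, lam):
    """Gradient step on lam * w**2: multiplies w by (1 - 2 * lr * lam)."""
    return w * (1.0 - 2.0 * lr * lam)


def l1_soft_threshold(w, lr, lam):
    """Proximal step for lam * |w|: moves w toward zero by lr * lam and
    clamps it to exactly 0.0 once |w| is at or below that threshold."""
    t = lr * lam
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


w_l2 = w_l1 = 0.05
for _ in range(1000):
    w_l2 = l2_shrink(w_l2, lr=0.1, lam=0.01)
    w_l1 = l1_soft_threshold(w_l1, lr=0.1, lam=0.01)
```

After 1000 steps the L2-decayed weight is smaller but still non-zero, while the soft-thresholded weight has been set to exactly zero.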

Structured Sparsity Solutions

Types:

Filter/Channel pruning: Remove entire structural units
Block sparsity: Remove rectangular blocks (hardware-friendly)
N:M sparsity: Keep at most N non-zeros in every group of M consecutive elements (e.g., 2:4)
Group-based patterns: Remove predefined parameter groups
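
As an illustration of the N:M pattern, the sketch below prunes a weight row to 2:4 sparsity. Selecting survivors by magnitude is an assumption made for the example (the summary does not say how the kept entries are chosen), and `nm_prune` is a hypothetical helper name.

```python
def nm_prune(row, n=2, m=4):
    """Keep the n largest-magnitude entries in each consecutive group of m
    and zero the rest. Requires len(row) to be a multiple of m."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest |values| within this group
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = nm_prune(row)
# sparse_row == [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Every group of four keeps exactly two entries, so the positions of the non-zeros are predictable enough for hardware to exploit.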

Advantages over unstructured sparsity:

Real hardware acceleration (reported speedups of 5.1× on CPU and 3.1× on GPU; actual gains depend on the model and hardware)
Contiguous memory access
Compatible with existing dense compute units
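
Why structured sparsity maps onto dense compute can be shown with a toy layer: once whole output units are zero, their rows can be dropped and the remainder is simply a smaller dense matrix. The matrix values and helper names below are hypothetical.

```python
def dense_forward(weights, x):
    """y = W x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def drop_zero_rows(weights):
    """Remove rows (output units) that are entirely zero. Returns the
    smaller dense matrix and the indices of the rows that were kept."""
    kept = [i for i, row in enumerate(weights) if any(v != 0.0 for v in row)]
    return [weights[i] for i in kept], kept


# Hypothetical layer with 4 output units; units 1 and 3 have been pruned.
W = [
    [0.5, -1.0, 2.0],
    [0.0, 0.0, 0.0],
    [1.5, 0.25, -0.5],
    [0.0, 0.0, 0.0],
]
x = [1.0, 2.0, 3.0]

W_small, kept = drop_zero_rows(W)
y_full = dense_forward(W, x)         # 4 outputs, two of them always zero
y_small = dense_forward(W_small, x)  # 2 outputs, half the multiply-adds
```

The compacted layer produces the same non-zero outputs with an ordinary dense matrix-vector product, with no sparse indexing needed. Unstructured sparsity scattered across the matrix would not allow this.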

Modern Approaches

Group Lasso & Variants:

Apply an L2,1 penalty (the sum of the L2 norms of the groups) to parameter groups
Can remove entire neurons/filters as units
Drives whole groups to exactly zero, which traditional L2 weight decay does not
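
The group penalty and its effect can be sketched as follows. The example uses the standard proximal operator of the group L2 norm (block soft-thresholding); grouping weights per neuron, the threshold value, and the function names are assumptions for illustration.

```python
import math


def group_lasso_penalty(groups, lam):
    """lam * sum over groups of the group's L2 norm (the L2,1 norm)."""
    return lam * sum(math.sqrt(sum(v * v for v in g)) for g in groups)


def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * ||group||_2.

    Scales the whole group toward zero, and sets it to exactly zero
    when its L2 norm is at or below the threshold.
    """
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * v for v in group]


# Hypothetical layer: each inner list holds the weights of one neuron.
neurons = [[3.0, 4.0], [0.03, -0.04]]
penalty = group_lasso_penalty(neurons, lam=0.1)
shrunk = [group_soft_threshold(g, threshold=0.1) for g in neurons]
```

The first neuron (norm 5) is only scaled down slightly, while the second (norm 0.05) is removed as a unit: all of its weights become exactly zero together.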

Recent Innovations (2024-2025):

Differentiable structured sparsity (Spartan, ProxSparse)
Dynamic sparse training with structured constraints
Hardware-software co-design for specific accelerators

Key Papers for Deep Dive

Weight Noise Injection-Based MLPs (IEEE Trans Cybernetics, 2019) - Shows L2 weight decay inadequacy
Novel Pruning Algorithm for Smoothing (IEEE TNNLS, 2018) - Group Lasso vs weight decay comparison
Group Sparse Regularization for DNNs (2016) - Foundational work on group penalties

Bottom Line

Traditional L2 weight decay shrinks weights but does not by itself produce sparsity, structured or otherwise. Modern approaches use group-based regularization (Group Lasso, sparse group penalties) that can eliminate entire structural units, achieving both compression and real hardware acceleration that unstructured, element-wise sparsity rarely delivers on dense compute units.
