Weight Decay & Structured Sparsity - Conversation Summary
Core Questions Explored
Q1-Q5: Understanding OBD (Optimal Brain Damage) concepts:
- Free parameters vs connections: Multiple connections can share parameters (weight sharing, constraints)
- VC dimension: Measures the capacity of a model's hypothesis class - a higher VC dimension lets the model fit more complex patterns but raises overfitting risk
- Parameter count as complexity: More non-zero parameters = more model capacity and overfitting potential
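- Bias-variance trade-off: Managed by minimizing a regularized objective, Cost = Training Error + Complexity Penalty (this is the training objective, not the bias-variance decomposition of the error itself)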
- Two literature perspectives: Statistical (theoretical bounds) vs Neural Network (practical heuristics)
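To make the "free parameters vs connections" distinction concrete, here is a small counting sketch. It assumes a single 1-D convolution filter with stride 1, no padding, and no bias; the function name `conv1d_counts` is hypothetical and only for illustration.

```python
def conv1d_counts(input_len, kernel_size):
    """Return (connections, free_parameters) for one 1-D convolution filter.

    Assumes stride 1, no padding, no bias (illustrative assumptions).
    """
    output_len = input_len - kernel_size + 1
    # Every output unit has kernel_size incoming connections.
    connections = output_len * kernel_size
    # All output positions reuse the same kernel, so the weights are shared.
    free_parameters = kernel_size
    return connections, free_parameters


connections, free_parameters = conv1d_counts(input_len=10, kernel_size=3)
print(connections, free_parameters)  # 24 3
```

With weight sharing, 24 connections are controlled by only 3 free parameters, which is why the two counts should not be used interchangeably as complexity measures.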
Q6-Q8: Weight decay mechanisms:
- Weight decay: Adds a λ×Σ(w²) penalty to the loss, which shrinks every weight toward zero at a rate proportional to its size
- Continuous decay with disproportionate rates: Small-magnitude weights decay faster, relative to their size, than large ones
- Non-proportional/gating: Different decay rates per weight based on importance
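The difference between proportional and disproportionate decay can be sketched by comparing relative shrink rates (penalty gradient divided by the weight). The weight-elimination penalty λ·w²/(w0² + w²) used below is one known non-proportional penalty and is an assumption chosen for illustration; the notes above do not name a specific form. The scale `w0` and the helper names are likewise hypothetical.

```python
def l2_grad(w, lam):
    # Gradient of lam * w**2.
    return 2.0 * lam * w


def weight_elimination_grad(w, lam, w0=1.0):
    # Gradient of lam * w**2 / (w0**2 + w**2), an assumed example penalty.
    return 2.0 * lam * w * w0 ** 2 / (w0 ** 2 + w ** 2) ** 2


def relative_rate(grad_fn, w, lam):
    # Decay pressure per unit of weight magnitude.
    return grad_fn(w, lam) / w


lam = 0.01
for w in (0.1, 1.0, 3.0):
    print(
        w,
        relative_rate(l2_grad, w, lam),
        relative_rate(weight_elimination_grad, w, lam),
    )
```

Under L2 the relative rate is the same for every weight, while under the assumed weight-elimination penalty it is largest for small weights and falls off quickly for large ones.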
Key Discovery: Traditional Weight Decay Limitations
Critical insight: L2 weight decay doesn't generate sparse solutions. Each update shrinks a weight in proportion to its current size, so weights approach zero but are never set exactly to zero.
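A minimal sketch of this behavior, looking at the penalty term only (no data loss). For contrast it uses soft-thresholding, the proximal step for an L1 penalty, which is not discussed in the notes and is included here as an assumed point of comparison. The learning rate and λ values are arbitrary.

```python
def l2_decay_step(w, lr, lam):
    # Gradient step on lam * w**2: w <- w - lr * 2 * lam * w.
    return w * (1.0 - 2.0 * lr * lam)


def soft_threshold(w, threshold):
    # Proximal operator of threshold * |w| (L1 penalty, used for contrast).
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


lr, lam = 0.1, 0.5
w_l2 = w_l1 = 0.5
for _ in range(100):
    w_l2 = l2_decay_step(w_l2, lr, lam)
    w_l1 = soft_threshold(w_l1, lr * lam)

print(w_l2 > 0.0)   # True: tiny but still non-zero
print(w_l1 == 0.0)  # True: exactly zero
```

The L2 update multiplies the weight by a constant factor below 1 at every step, so it only approaches zero; the thresholding step subtracts a fixed amount and clamps, which is what produces exact zeros.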
Structured Sparsity Solutions
Types:
- Filter/Channel pruning: Remove entire structural units
- Block sparsity: Remove rectangular blocks (hardware-friendly)
- N:M sparsity: Keep N non-zeros in every M elements (e.g., 2:4)
- Group-based patterns: Remove predefined parameter groups
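As an illustration of the N:M pattern, the sketch below applies a 2:4 mask to a flat list of weights. Choosing which entries to keep by magnitude is an assumption (a common heuristic), and `nm_prune` is a hypothetical helper, not a library function.

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in each consecutive group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes within this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(nm_prune(row))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Every group of four ends up with exactly two non-zeros, so the sparsity is 50% overall but laid out in a regular pattern.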
Advantages over unstructured sparsity:
- Real hardware acceleration (reported speedups of 5.1× on CPU and 3.1× on GPU; actual gains depend on the model and hardware)
- Contiguous memory access
- Compatible with existing dense compute units
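The compatibility with dense compute can be sketched as follows: once a whole hidden neuron is zeroed, it can be deleted, leaving smaller but still dense matrices. This assumes a two-layer network without biases and an activation that maps 0 to 0 (such as ReLU or tanh), so removal does not change the output; `drop_zero_neurons` is a hypothetical helper.

```python
def drop_zero_neurons(w1, w2):
    """Remove hidden neurons whose incoming weights are all zero.

    w1: one row per hidden neuron (its incoming weights).
    w2: one row per output unit, one column per hidden neuron.
    Assumes no biases and an activation with f(0) = 0.
    """
    keep = [i for i, row in enumerate(w1) if any(v != 0.0 for v in row)]
    w1_small = [w1[i] for i in keep]
    w2_small = [[row[i] for i in keep] for row in w2]
    return w1_small, w2_small


w1 = [[1.0, 2.0], [0.0, 0.0], [3.0, -1.0]]  # second hidden neuron is pruned
w2 = [[0.5, 9.0, -0.5]]
w1_small, w2_small = drop_zero_neurons(w1, w2)
print(w1_small)  # [[1.0, 2.0], [3.0, -1.0]]
print(w2_small)  # [[0.5, -0.5]]
```

The result is an ordinary smaller dense layer, so no sparse storage format or specialized kernel is needed to benefit from the pruning.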
Modern Approaches
Group Lasso & Variants:
- Apply an L2,1 penalty to parameter groups (the sum of the L2 norms of the groups)
- Can remove entire neurons/filters as units
- Drives whole groups exactly to zero, which traditional weight decay does not
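A minimal sketch of the Group Lasso penalty and its proximal step (group soft-thresholding), which is one standard way such penalties are optimized. The notes do not specify an optimizer, so the proximal approach, the threshold value, and the helper names are assumptions for illustration.

```python
import math


def group_lasso_penalty(groups, lam):
    # lam times the sum of the L2 norms of the groups (the L2,1 norm).
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)


def group_soft_threshold(group, threshold):
    # Proximal operator of threshold * ||group||_2.
    norm = math.sqrt(sum(w * w for w in group))
    if norm <= threshold:
        return [0.0] * len(group)  # the whole group is removed at once
    scale = 1.0 - threshold / norm
    return [scale * w for w in group]


filters = [[3.0, 4.0], [0.03, -0.04]]  # each inner list is one group
shrunk = [group_soft_threshold(g, 0.1) for g in filters]
print(shrunk[1])  # [0.0, 0.0]: the small-norm group is zeroed as a unit
```

The large-norm group is only rescaled slightly, while the group whose norm falls below the threshold is set to zero in its entirety, which is the mechanism that lets entire neurons or filters drop out.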
Recent Innovations (2022-2025):
- Differentiable structured sparsity (Spartan, ProxSparse)
- Dynamic sparse training with structured constraints
- Hardware-software co-design for specific accelerators
Key Papers for Deep Dive
- Weight Noise Injection-Based MLPs (IEEE Trans Cybernetics, 2019) - Shows L2 weight decay inadequacy
- Novel Pruning Algorithm for Smoothing (IEEE TNNLS, 2018) - Group Lasso vs weight decay comparison
- Group Sparse Regularization for DNNs (2016) - Foundational work on group penalties
Bottom Line
Traditional weight decay is insufficient for structured sparsity. Modern approaches use group-based regularization (Group Lasso, sparse group penalties) that can eliminate entire structural units. This delivers both compression and real hardware acceleration, which element-wise sparsity generally does not provide on dense compute units.