Optimal Brain Damage (OBD) Summary Notes
Core Concept
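Uses the second derivatives of the loss function (the diagonal of the Hessian) to estimate how much the loss increases when a weight is removed. Each weight wₖ gets a saliency sₖ = ½·hₖₖ·wₖ², and the weights with the smallest saliency are removed. A small second derivative alone is not enough: the weight's magnitude also enters the estimate.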
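A minimal sketch of the saliency ranking, assuming the diagonal Hessian entries have already been computed by some other means. The function names and the toy numbers are illustrative only and do not come from these notes.

```python
def obd_saliencies(weights, hessian_diag):
    """OBD saliency per weight: s_k = 0.5 * h_kk * w_k^2."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]


def prune_lowest_saliency(weights, hessian_diag, n_prune):
    """Return a copy of `weights` with the n_prune lowest-saliency entries zeroed."""
    saliency = obd_saliencies(weights, hessian_diag)
    order = sorted(range(len(weights)), key=lambda k: saliency[k])
    to_remove = set(order[:n_prune])
    return [0.0 if k in to_remove else w for k, w in enumerate(weights)]


# Toy values (hypothetical): index 0 has the smallest magnitude,
# index 1 has the smallest curvature, index 2 has the smallest saliency.
weights = [0.1, 2.0, -0.5]
hessian_diag = [10.0, 0.05, 0.1]
pruned = prune_lowest_saliency(weights, hessian_diag, n_prune=1)
print(pruned)
```

In this toy case the pruned weight is neither the smallest in magnitude nor the one with the smallest second derivative, which is the point of combining both quantities.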
Key Problems
1. Independence Assumption
Assumption: δE_total = Σ δE_individual
Reality: Parameters interact through cross terms
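To see where the independence assumption enters, it may help to write out the second-order Taylor expansion of the loss around the trained weights. The symbols g_i (gradient entries) and h_{ij} (Hessian entries) are notation introduced here for illustration.

```latex
\delta E \;=\; \sum_i g_i\,\delta w_i
\;+\; \frac{1}{2}\sum_i h_{ii}\,\delta w_i^{2}
\;+\; \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta w_i\,\delta w_j
\;+\; O\!\left(\lVert \delta w \rVert^{3}\right),
\qquad
g_i = \frac{\partial E}{\partial w_i},\quad
h_{ij} = \frac{\partial^{2} E}{\partial w_i\,\partial w_j}.
```

OBD assumes the network sits at a minimum (so the first sum vanishes), drops the third-order remainder, and drops the cross-term sum over i ≠ j. What remains is a sum of per-weight contributions ½·h_{ii}·δw_i², which is exactly the statement δE_total = Σ δE_individual.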
2. Cross Terms Ignored
- Mathematical: off-diagonal Hessian entries ∂²E/∂wᵢ∂wⱼ (i≠j), which OBD assumes to be zero
- Real meaning: How parameters collaborate/compete
- Examples:
  - Redundant CNN filters detecting same edges
  - Multiple attention heads learning similar patterns
  - Complementary features that only work together
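The effect of the ignored cross terms can be checked on a tiny example. The sketch below assumes a hypothetical two-weight loss E(w1, w2) = ½(w1 + w2 − 1)², chosen only because its Hessian has non-zero off-diagonal entries; it is not taken from these notes.

```python
def loss(w1, w2):
    """Toy loss in which both weights feed the same sum (hypothetical example)."""
    return 0.5 * (w1 + w2 - 1.0) ** 2


# The loss is minimized at w1 = w2 = 0.5; its Hessian is constant.
W = [0.5, 0.5]
H = [[1.0, 1.0],
     [1.0, 1.0]]


def diagonal_estimate(w, hessian, removed):
    """OBD-style estimate: sum of per-weight terms 0.5 * h_ii * w_i^2."""
    return sum(0.5 * hessian[i][i] * w[i] ** 2 for i in removed)


def full_quadratic_estimate(w, hessian, removed):
    """Quadratic estimate that keeps the cross terms h_ij * dw_i * dw_j."""
    dw = [-w[i] if i in removed else 0.0 for i in range(len(w))]
    n = len(w)
    return 0.5 * sum(hessian[i][j] * dw[i] * dw[j] for i in range(n) for j in range(n))


def actual_increase(w, removed):
    """True change in loss when the listed weights are set to zero."""
    pruned = [0.0 if i in removed else w[i] for i in range(len(w))]
    return loss(*pruned) - loss(*w)


print(diagonal_estimate(W, H, [0, 1]))        # 0.25
print(full_quadratic_estimate(W, H, [0, 1]))  # 0.5
print(actual_increase(W, [0, 1]))             # 0.5
```

Removing either weight alone is predicted correctly by the diagonal estimate, but removing both costs twice what the sum of individual estimates suggests. The missing half is the cross term.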
3. Practical Issues
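- Hessian cost: the full Hessian is prohibitively expensive for large networks; even OBD's diagonal needs an extra backprop-style pass
- Local quadratic approximation (accurate only for small changes near a trained minimum, while deleting a weight can be a large change)
- Static analysis (the saliency does not account for how the remaining weights adapt during retraining)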
Does OBD Work Today?
Historical value: ✅ Important theoretical foundation
Practical use: ❌ Rarely used in modern practice
Why not:
- Simple magnitude pruning often works as well
- Cross-term interactions are significant in modern networks
- Computational overhead too high
- Better alternatives exist
Modern Alternatives
- Magnitude-based pruning (simpler, effective)
- Structured pruning (remove entire units/channels)
- Iterative pruning with retraining
- Lottery ticket hypothesis methods
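As a rough illustration of the first and third alternatives combined, here is a minimal sketch of iterative magnitude pruning with retraining on a toy linear model. The data, learning rate, step count, and function names are all assumptions made for this example. It is not a recipe for real networks.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train(w, mask, data, lr=0.5, steps=200):
    """Gradient descent on mean squared error; masked weights are held at zero."""
    w = [wi * mi for wi, mi in zip(w, mask)]
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for k, xk in enumerate(x):
                grad[k] += 2.0 * err * xk / n
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return w


def prune_smallest(w, mask, n_prune):
    """Mask out the n_prune surviving weights with the smallest magnitude."""
    alive = sorted((k for k, m in enumerate(mask) if m == 1), key=lambda k: abs(w[k]))
    new_mask = list(mask)
    for k in alive[:n_prune]:
        new_mask[k] = 0
    return new_mask


def iterative_magnitude_pruning(data, n_weights, rounds, n_prune_per_round):
    """Train, then alternate between pruning and retraining (warm start)."""
    mask = [1] * n_weights
    w = train([0.0] * n_weights, mask, data)
    for _ in range(rounds):
        mask = prune_smallest(w, mask, n_prune_per_round)
        w = train(w, mask, data)
    return w, mask


# Noise-free toy data generated from hypothetical weights [2.0, 0.0, -3.0, 0.05].
TRUE_W = [2.0, 0.0, -3.0, 0.05]
INPUTS = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
DATA = [(x, predict(TRUE_W, x)) for x in INPUTS]

final_w, final_mask = iterative_magnitude_pruning(DATA, n_weights=4, rounds=2, n_prune_per_round=1)
print(final_mask, final_w)
```

Note that no second-derivative information is used: the ranking is just |w|, and retraining after each round lets the surviving weights absorb part of the loss increase.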
Bottom Line
OBD's mathematical elegance is undermined by ignoring parameter interactions. Modern networks have extensive redundancy and cross-dependencies that violate OBD's core assumptions.