Content is user-generated and unverified.

Optimal Brain Damage (OBD) Summary Notes

Core Concept

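Uses second derivatives of the loss (the diagonal of the Hessian) to estimate how much the loss rises when a weight is removed. Each weight wₖ gets a saliency sₖ = ½ hₖₖ wₖ², where hₖₖ = ∂²E/∂wₖ². Removes the weights with the smallest saliency, not simply those with the smallest second derivative: a large weight in a flat direction can still matter.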
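A minimal sketch of the saliency ranking, assuming the diagonal second derivatives are already available. The function name `obd_saliencies` and the numbers are illustrative, not taken from these notes.

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """OBD saliency per weight: s_k = 0.5 * h_kk * w_k**2."""
    weights = np.asarray(weights, dtype=float)
    hessian_diag = np.asarray(hessian_diag, dtype=float)
    return 0.5 * hessian_diag * weights ** 2

# Hypothetical trained weights and diagonal Hessian entries (made-up values).
weights = np.array([2.0, 0.1, -0.5])
hessian_diag = np.array([0.1, 4.0, 1.0])

saliencies = obd_saliencies(weights, hessian_diag)

# Lowest saliency first: these are the first candidates for removal.
prune_order = np.argsort(saliencies)

print("saliencies:", saliencies)
print("prune order:", prune_order)
```

In this toy case the weight with the smallest second derivative (index 0) has the largest saliency, because its magnitude is large. The first weight pruned is index 1, which has the largest curvature but a tiny value.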

Key Problems

1. Independence Assumption

Assumption: δE_total = Σ δE_individual
Reality: Parameters interact through cross terms
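
The additive form comes from truncating the Taylor expansion of the loss around the trained weights. A sketch of that derivation, writing g_i = ∂E/∂w_i and h_ij = ∂²E/∂w_i∂w_j (notation assumed here for illustration):

```latex
\begin{align}
\delta E &= \sum_i g_i\,\delta w_i
          + \frac{1}{2}\sum_i h_{ii}\,\delta w_i^2
          + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta w_i\,\delta w_j
          + O(\lVert \delta w \rVert^3) \\
\delta E &\approx \frac{1}{2}\sum_i h_{ii}\,\delta w_i^2
          \quad \text{(gradient, cross, and higher-order terms dropped)} \\
s_k &= \frac{1}{2}\,h_{kk}\,w_k^2
          \quad \text{(deleting weight } k \text{ means } \delta w_k = -w_k\text{)}
\end{align}
```

The first term vanishes only if training has reached a minimum, the third only if the Hessian is diagonal, and the remainder only if the perturbation is small. Each dropped term corresponds to one of the problems listed in these notes.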

2. Cross Terms Ignored

Mathematical: off-diagonal Hessian entries ∂²E/∂wᵢ∂wⱼ (i≠j), which OBD treats as zero
Real meaning: how parameters collaborate or compete
Examples:
- Redundant CNN filters detecting same edges
- Multiple attention heads learning similar patterns
- Complementary features that only work together
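
A small numeric illustration of what the ignored cross terms can hide, using a hypothetical two-weight model in which both weights read the same input (a stand-in for the redundant-filter case). The loss, weights, and function names are assumptions made up for this sketch.

```python
import numpy as np

def loss(w):
    """Squared error for one sample: input x = 1 feeds both weights, target = 1."""
    return 0.5 * (w.sum() - 1.0) ** 2

# Hessian of this loss is constant; the off-diagonals equal the diagonals.
H = np.array([[1.0, 1.0],
              [1.0, 1.0]])
w_trained = np.array([0.5, 0.5])  # a minimum: loss(w_trained) == 0

def diagonal_estimate(H, delta_w):
    """OBD-style estimate that keeps only the diagonal Hessian terms."""
    return 0.5 * np.sum(np.diag(H) * delta_w ** 2)

# Deleting a weight means setting it to zero, i.e. delta_w_i = -w_i.
delete_both = -w_trained

additive_prediction = diagonal_estimate(H, delete_both)
actual_increase = loss(w_trained + delete_both) - loss(w_trained)
cross_term = actual_increase - additive_prediction

print("sum of individual estimates:", additive_prediction)
print("actual loss increase:", actual_increase)
print("contribution of cross terms:", cross_term)
```

Each weight looks cheap to remove on its own, but removing both costs twice what the additive estimate predicts. The gap is exactly the off-diagonal contribution that the diagonal approximation discards.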

3. Practical Issues

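Full Hessian is too expensive to compute or store (n² entries for n weights); OBD sidesteps this by computing only the diagonal, at a cost comparable to a backward pass
Local quadratic approximation: assumes the network sits at a minimum and that deleting a weight is a small perturbation, which it often is not
Static saliency: scores ignore how the remaining weights could adjust to compensate, even though the original procedure retrains between pruning rounds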

Does OBD Work Today?

Historical value: ✅ Important theoretical foundation
Practical use: ❌ Rarely used in modern practice

Why not:

Simple magnitude pruning often works as well
Cross-term interactions are significant in modern networks
Second-derivative computation adds overhead and complexity that simpler criteria avoid
Better alternatives exist

Modern Alternatives

Magnitude-based pruning (simpler, effective)
Structured pruning (remove entire units/channels)
Iterative pruning with retraining
Lottery ticket hypothesis methods (iterative magnitude pruning, with surviving weights reset to their initial values)
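
For comparison with the saliency criterion, a minimal sketch of unstructured magnitude pruning on a single weight array. The function name `magnitude_prune` and the example values are illustrative; in practice this step is usually alternated with retraining.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|.

    Returns the pruned array and the boolean keep-mask.
    """
    weights = np.asarray(weights, dtype=float)
    n_prune = int(round(sparsity * weights.size))
    keep = np.ones(weights.size, dtype=bool)
    if n_prune > 0:
        # Flattened indices of the n_prune smallest-magnitude entries.
        prune_idx = np.argsort(np.abs(weights), axis=None)[:n_prune]
        keep[prune_idx] = False
    keep = keep.reshape(weights.shape)
    return np.where(keep, weights, 0.0), keep

# Hypothetical 2x2 weight matrix (made-up values).
w = np.array([[0.8, -0.05],
              [0.3, -1.2]])
pruned, mask = magnitude_prune(w, sparsity=0.5)

print(pruned)
print(mask)
```

No second derivatives are needed: the ranking uses |w| alone, which is why this criterion is cheap enough to apply at every pruning round.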

Bottom Line

OBD's mathematical elegance is undermined by ignoring parameter interactions. Modern networks have extensive redundancy and cross-dependencies that violate OBD's core assumptions.
