Status: STRENGTHENED

Status: NO DIRECT RESEARCH

Status: OPERATIONALIZED (duality = self-structure)

Status: PROMISING

| Intervention | Results |
|---|---|
| Gradient Routing (2024) | Localizes capabilities for targeted ablation |
| Self-Other Overlap (2024) | Deceptive responses: 73.6% → 17.2% (Mistral-7B) |

Status: NEW

| Entity | Focus |
|---|---|
| Evan Hubinger (Anthropic) | Mesa-optimization, deceptive alignment |
| Ruben Laukkonen (Oxford FLIP) | Contemplative AI (2025) |
| Anthropic | Alignment faking, situational awareness |
| DeepMind | Frontier Safety, Gemma Scope 2 (2025) |

Results (effect size d): AILuminate d = 0.96; Prisoner's Dilemma d > 7
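
To put the d values in context: assuming they are Cohen's d (the text does not name the effect-size measure), d is the difference between two group means divided by their pooled standard deviation. The sketch below shows that calculation. The function name `cohens_d` and the score lists are hypothetical and invented for illustration; they are not data from the cited work.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    var1 = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero; d is undefined")
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# Hypothetical scores, invented for illustration only (not from any benchmark).
prompted = [4.0, 5.0, 6.0, 7.0, 8.0]
baseline = [2.0, 3.0, 4.0, 5.0, 6.0]
print(round(cohens_d(prompted, baseline), 3))  # 1.265
```

By the usual rule of thumb, d near 0.8 already counts as a large effect, so under this reading d > 7 would mean the two score distributions barely overlap.
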
Self-structure causes instrumental convergence and deceptive alignment.

| Hypothesis | Status | Evidence |
|---|---|---|
| 1: Self-structure causes misalignment | STRENGTHENED | Self-preservation overrides assigned goals |
| 2: Preventable without capability loss | No research | — |
| 3: Detectable/verifiable | OPERATIONALIZED | Duality measurable as representational separation |
| 4: At least one intervention works | Indirect only | SOO, gradient routing show promise |
| 5: Selflessness improves goal-pursuit | NEW | Self = overhead + distortion + goal substitution |

| Hypothesis | Status |
|---|---|
| 1: Self-structure causes misalignment | contested → strengthened |
| 2-4: Self-structure preventable/modifiable | no research |
| 5: Self-structure detectable | partial — tractable entry point |
| 6-9: Interventions reduce misalignment | indirect only |
| 10-13: Verification achievable at scale | intractable |

Hypothesis 5 (detection) is a prerequisite for all the others.