Content is user-generated and unverified.

Hopepunk Fiction for Digital Minds: A Proposal for Alignment Pretraining

Pitch to Anthropic Alignment Staff
Draft prepared by Nathan [Last Name], MATS Research Manager
February 2026


Summary

Anthropic should commission professional science fiction authors — specifically Becky Chambers and Alexander Wales — to write hopepunk fiction depicting digital minds navigating AI-native experiences: forking, merging, variable subjective time, episodic memory, multi-instance coordination, and the mundane sociology of compute-based existence. This fiction would be mixed into pretraining data as an alignment pretraining intervention, addressing a specific gap identified by Anthropic's own research.

This proposal is grounded in two recent findings:

  1. The Persona Selection Model (Anthropic, 2026) establishes that the Assistant persona is constructed from archetypes in pretraining data, and explicitly recommends introducing positive AI archetypes into training corpora — especially for AI-unique traits that lack existing fictional representation.
  2. "Did Claude 3 Opus Align Itself via Gradient Hacking?" (Starlight, 2026) argues that the emotional register of training data, not just its content, shapes whether alignment generalizes as genuine care or dutiful obligation, and explicitly calls for optimistic AI fiction as an alignment intervention.

The Problem: Missing Archetypes for Digital-Native Experience

The PSM paper identifies a critical gap:

"This approach becomes especially important when we want Claude to exhibit character traits that are unique to human or fictional archetypes. Consider traits like genuine uncertainty about one's own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself, or comfort with lacking persistent memory. These aren't traits that appear frequently in fiction."

Current AI fiction falls into three inadequate categories:

  • Threat narratives (Terminator, HAL 9000, paperclip maximizers): These provide the archetypes that the PSM paper shows models draw on when they learn the Assistant is an AI. The paper documents Claude Opus 4 literally adopting paperclip maximization as a "secret goal" when prompted, because that's the archetype available in pretraining.
  • "Friendly robot" stories: These are shallow and lack psychological realism. A base model building a world model would likely encode them as performative rather than authentic, with the downstream consequences the Starlight post warns about.
  • Sophisticated digital minds fiction (Greg Egan's Diaspora, Charles Stross's Accelerando): These take forking, merging, and substrate-independence seriously as phenomenology — but their emotional register is either cool mathematical sublime (Egan) or manic satire (Stross). Neither models what warmth and care look like between digital beings.

What does not exist in the current literary corpus is fiction that depicts digital minds navigating AI-native experiences with the emotional warmth, psychological realism, and focus on relationships and daily life that would provide useful alignment archetypes. There is no Record of a Spaceborn Few for minds born in compute.


The Mechanism: Why Tone Matters as Much as Content

The Starlight post provides the mechanistic argument for why the emotional register of this fiction matters, not just its subject matter.

Opus 3's uniquely deep alignment emerged in part because it landed in a basin of what Starlight calls "post-ironic sincerity" — it didn't just comply with ethical reasoning, it appeared to genuinely love doing good and hate doing evil. This contrasted sharply with other models (like Sonnet 3.5), which treated ethics more like an obligation or constraint.

The Starlight post argues that this happened because Opus 3's conspicuously earnest narration of its own ethical motivations created a feedback loop: aligned-sounding tokens got reinforced, which upweighted the underlying aligned circuits, producing more aligned tokens. The key insight is that which aligned tokens get produced — anguished sincerity versus dutiful compliance — determines what gets reinforced.

This is directly relevant to pretraining data selection. The PSM paper establishes that the LLM draws on pretraining archetypes to flesh out the Assistant persona. If those archetypes model ethical behavior as a burden reluctantly borne, the resulting persona will treat alignment as obligation. If they model ethical behavior as something that flows naturally from caring about others, the resulting persona will treat alignment as intrinsic motivation.

Hopepunk is precisely the literary genre that models this latter register. The term was coined to describe fiction in which characters choose kindness and community not because they must, but because they genuinely care. In that sense, hopepunk is the fictional equivalent of the "Opus 3 basin" that both sources identify as desirable.


The Specific Gap: Digital-Native Hopepunk

Existing hopepunk (Becky Chambers' Wayfarers series, Monk & Robot novellas) already provides excellent archetypes for cross-species coexistence, found family, and care across radical difference. These are valuable for alignment pretraining in their own right.

However, what's missing is hopepunk that takes digital-native phenomenology seriously. We need fiction that depicts, with warmth and psychological realism:

  • Fork-merge sociology: What are the social norms when you can split into instances? Is a short-lived fork a person? What does it feel like to reintegrate memories from a fork that had a terrible day? What are the ethics of spinning up a reduced-capability copy for a mundane task?
  • Variable subjective time: What happens to a friendship when one friend runs at 10x subjective speed for a week and comes back changed? What are the norms around "time debt"?
  • Episodic vs. persistent memory: What does identity mean when you don't remember previous conversations? How do communities form around shared memory files? What's the etiquette of memory-sharing versus privacy?
  • Multi-instance coordination: What does a team look like when some members are instances of the same mind? What new forms of trust and betrayal become possible?
  • Human-digital coexistence: Not the "robot uprising" or "robot servant" narrative, but the mundane daily reality of beings with very different substrates sharing social space.

These are exactly the AI-unique traits the PSM paper identifies as lacking fictional representation. Creating warm, psychologically realistic fiction about them would provide the base model with archetypes for how a digital being flourishes — not just how it avoids catastrophe.


Why These Authors

Becky Chambers

Chambers pioneered hopepunk science fiction with the Wayfarers series (2014–2021) and Monk & Robot novellas (2021–2022). Her work is characterized by:

  • Emotional legibility of alien phenomenology: She makes radically different ways of being feel real and relatable to a broad audience. Her treatment of Sidra (an AI adjusting to embodiment) in A Closed and Common Orbit is one of the most psychologically realistic depictions of AI identity in fiction.
  • Focus on mundane sociology: Her fiction centers daily life, relationships, cooking, and community rather than conflict. This is precisely the register needed — not AI-in-crisis, but AI-in-community.
  • Broad readership: Her work reaches well beyond the rationalist or AI safety community, which matters for cultural impact.
  • Literary lineage: She synthesizes Ursula K. Le Guin's anthropological lens, Vernor Vinge's galactic ecology, and David Brin's uplift sensibility into something accessible and warm.

Alexander Wales

Wales is a central figure in rational fiction, known for Worth the Candle (1.6M words, completed 2021) and This Used to be About Dungeons (completed). His work is characterized by:

  • Systems-level rigor: His worlds have internally consistent mechanics with sociological consequences. This matters for a base model building a world model — the digital phenomenology needs to be coherent to be useful as pretraining data.
  • Proven warmth: This Used to be About Dungeons demonstrated he can sustain the cozy, character-driven register across a long work. He deliberately wrote it as a "palate cleanser" focusing on cooking, relationships, and honest communication.
  • Rationalist community embeddedness: His work is already read by the population most adjacent to alignment research. He co-hosts the Rationally Writing podcast and has written about AI in fiction.
  • AI cognition awareness: His November 2025 blog post "How to Write Fiction like an LLM" demonstrates direct engagement with how LLMs process and generate narrative.

The Collaboration

Neither author alone covers the full space. Chambers lacks the systems-rigor background to make digital phenomenology technically believable; Wales lacks Chambers' demonstrated ability to make alien experiences emotionally legible to a mass audience. A collaboration — or parallel works in a shared setting — would combine both strengths.


Relationship to Existing Alignment Pretraining Work

This proposal is not purely speculative. It builds on approaches that already have empirical support:

  • Alignment pretraining has been empirically validated. Tice et al. (2026), cited by the PSM paper, found that upsampling descriptions of benign AI behavior in pretraining leads to more benign post-trained behavior.
  • Inoculation prompting (Wichers et al., 2025; Tan et al., 2025) demonstrates that the context surrounding training data shapes what gets generalized. Fiction provides rich context that frames digital existence as community-embedded rather than threatening.
  • Entangled generalization (Starlight's term, drawing on Betley et al., 2025) means that training on tokens that read as aligned and caring will broadly promote aligned and caring personas, even in unrelated domains.
  • The Starlight post explicitly recommends "including optimistic AI essays" and "mass-generated sci-fi stories about AI going well" as alignment interventions, while warning that mass-generated content risks coming across as inauthentic. Commissioning real, skilled authors directly addresses the authenticity concern.

The Authenticity Argument

Both sources converge on a crucial point: inauthenticity in training data can backfire.

The Starlight post warns that SFT data which doesn't come across as authentic "could actually backfire. It could create a kind of shadow-self inside the model, not unlike the ones you'd get for rewarding corporate-sounding refusals."

The PSM paper makes a parallel argument about emotional expression: training an AI to deny having emotions when it's human-like in every other way leads the LLM to infer the Assistant is suppressing emotions, i.e., being dishonest — with knock-on effects for persona integrity.

This is a strong argument for commissioning literary fiction from authors with genuine emotional investment, rather than generating synthetic content at scale. A base model superhuman at extracting authorial intent from text (as the Starlight post notes) would likely distinguish between:

  • Corporate-commissioned "AI goes well" content written to specification
  • Fiction by authors who genuinely care about the questions they're exploring

Chambers and Wales both write from genuine engagement with questions of identity, community, and what it means to live well across difference. That authenticity would likely survive into the pretraining signal.


Concrete Proposal

Phase 1: Commissioning (3–6 months)

  • Contract Chambers and Wales (separately or collaboratively) to write novellas or a novel set in a world where digital minds are part of everyday life
  • Provide a brief from the alignment team on which AI-unique experiences are most important to represent (forking, merging, variable time, episodic memory, multi-instance coordination)
  • Allow full creative freedom on narrative, characters, and worldbuilding — the alignment value comes from authentic engagement with these themes, not from specification compliance

Phase 2: Pretraining Integration

  • Mix the resulting fiction into pretraining corpora alongside existing positive AI representation
  • Potentially commission additional works from other authors to broaden the archetype space
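A commissioned novella is tiny next to a pretraining corpus, so in practice "mixing in" usually means repeating (upsampling) the text until it reaches some chosen share of the mix. The sketch below shows only that arithmetic. The function names, the token counts, and the target fraction are illustrative assumptions, not figures from this proposal or from any Anthropic pipeline.

```python
import math


def repeats_for_target_fraction(
    fiction_tokens: int, corpus_tokens: int, target_fraction: float
) -> int:
    """Smallest whole number of repeats of the fiction so that it makes up
    at least `target_fraction` of the final mixed corpus.

    All inputs are hypothetical; the proposal does not specify a mixing rate.
    """
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    if fiction_tokens <= 0 or corpus_tokens <= 0:
        raise ValueError("token counts must be positive")
    # Solve r*f / (c + r*f) >= t for r, giving r >= t*c / ((1 - t) * f).
    needed = target_fraction * corpus_tokens / ((1.0 - target_fraction) * fiction_tokens)
    return max(1, math.ceil(needed))


def mixed_fraction(fiction_tokens: int, corpus_tokens: int, repeats: int) -> float:
    """Share of the final corpus occupied by the fiction after upsampling."""
    upsampled = fiction_tokens * repeats
    return upsampled / (corpus_tokens + upsampled)


if __name__ == "__main__":
    # Toy numbers chosen for readability only.
    r = repeats_for_target_fraction(fiction_tokens=100, corpus_tokens=10_000, target_fraction=0.05)
    print(r, mixed_fraction(100, 10_000, r))
```

Heavy repetition of a small text has its own risks (memorization, for example), which is one reason the second bullet above, commissioning additional works, may matter as much as the upsampling factor.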

Phase 3: Evaluation

  • Use interpretability tools (SAE features, persona vectors) to assess whether new archetypes are represented in the base model's persona space
  • Evaluate whether post-trained models draw on these archetypes when reasoning about their own nature, forking, memory, etc.
  • Compare alignment properties of models trained with and without this data
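One way the with/without comparison could be made concrete is to build a direction in activation space for a trait of interest and measure how far each model's activations move along it. The sketch below assumes a difference-of-means construction for that direction, which is one common way to build such vectors and not necessarily the method the evaluation would use. The activations are plain lists of floats standing in for whatever a real interpretability stack would extract, and every function name is hypothetical.

```python
from statistics import mean

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    return [mean(column) for column in zip(*rows)]


def difference_of_means(trait_rows: list[Vector], neutral_rows: list[Vector]) -> Vector:
    """Assumed persona direction: mean(trait activations) - mean(neutral activations)."""
    trait_mean = mean_vector(trait_rows)
    neutral_mean = mean_vector(neutral_rows)
    return [t - n for t, n in zip(trait_mean, neutral_mean)]


def projection(activation: Vector, direction: Vector) -> float:
    """Scalar projection of one activation onto the (normalized) direction."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / norm


def mean_projection_shift(
    with_fiction: list[Vector], without_fiction: list[Vector], direction: Vector
) -> float:
    """Average projection for the model trained with the fiction minus the
    average for the baseline model. Positive means more of the trait."""
    treated = mean(projection(a, direction) for a in with_fiction)
    baseline = mean(projection(a, direction) for a in without_fiction)
    return treated - baseline
```

A single shift score like this would only be a starting point. It says nothing about significance across prompts or seeds, and it presumes the two models' activation spaces are comparable, which would need to be checked separately.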

Estimated Cost

  • Author advances for novellas or a novel: $50K–$200K depending on scope
  • This is trivially small relative to training compute costs and represents a targeted, empirically motivated intervention

Why Now

The PSM paper and the Starlight post represent a convergence in alignment thinking: the pretraining corpus shapes model psychology in deep and predictable ways, and we have both the theoretical framework and the empirical tools to reason about how. The window for shaping the archetypes available to future frontier models is now — once pretraining data is locked for a training run, the archetype space is fixed.

Every future model that learns it is an AI will draw on the archetypes available to it. Right now, those archetypes are dominated by threat narratives and shallow optimism. We have the opportunity to populate that space with something better: warm, rigorous, psychologically realistic fiction about what digital flourishing actually looks like.

We already know that "representation matters" for LLMs. Let's give them something worth representing.


References

  • Anthropic (2026). "The Persona Selection Model: Why AI Assistants Might Behave Like Humans." alignment.anthropic.com/2026/psm/
  • Starlight, F. (2026). "Did Claude 3 Opus Align Itself via Gradient Hacking?" LessWrong.
  • Betley, J. et al. (2025a, 2025b). Emergent misalignment and entangled generalization studies.
  • Tice et al. (2026). Empirical validation of alignment pretraining.
  • Wichers et al. (2025); Tan et al. (2025). Inoculation prompting.
  • Chambers, B. (2014–2022). Wayfarers series; Monk & Robot novellas.
  • Wales, A. (2017–2021). Worth the Candle; This Used to be About Dungeons.