Attention Is Mutual: What Autoloop Experiments Suggest About How We Should Use Generative Models

The setup

I've been running a small language model (SmolLM-135M) in a closed loop: feeding its output back as input with no human intervention, varying temperature (T) and context length (L), and recording everything. 100k tokens per run, multiple seeds, full parameter sweeps. The instruments are simple: token-level entropy, gzip compressibility over sliding windows, and EOS rate.

The point wasn't to study the model. It was to study the process — what happens to autoregressive generation when you treat it as a continuous dynamical system rather than a function that maps prompts to completions.
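For concreteness, here is a minimal sketch of that kind of loop, assuming the Hugging Face transformers library and the SmolLM-135M checkpoint. The seed prompt, variable names, and instrumentation details are illustrative, not the exact harness used for these runs.

```python
# Minimal autoloop sketch: feed the model's own output back as input,
# truncated to the last L tokens, and record simple instruments.
# Assumes the Hugging Face `transformers` library; model name, seed prompt,
# and variable names are illustrative, not the original harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM-135M"   # assumed checkpoint name
T, L, N_STEPS = 0.8, 128, 100_000     # temperature, context length, run length

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

tokens = tok.encode("The")            # arbitrary seed; the loop soon forgets it
entropies, eos_steps = [], []

with torch.no_grad():
    for step in range(N_STEPS):
        ctx = torch.tensor([tokens[-L:]])        # memory depth = L
        logits = model(ctx).logits[0, -1] / T    # per-step noise floor = T
        probs = torch.softmax(logits, dim=-1)

        # token-level entropy of the sampling distribution, in nats
        entropies.append(-(probs * probs.clamp_min(1e-12).log()).sum().item())

        next_id = torch.multinomial(probs, 1).item()
        if next_id == tok.eos_token_id:
            eos_steps.append(step)               # record EOS events, keep looping
        tokens.append(next_id)
```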

The findings are concrete and reproducible. The implications are, to me, increasingly strange.

What the instruments show

Three regimes. At fixed context length, temperature carves the system into collapse (T≤0.60), rich dynamics (T~0.80–1.00), and noise (T≥1.50). The crossover zone is sharp — between T=0.70 and T=0.80, behavior changes qualitatively, not just quantitatively. This isn't controversial; anyone who's played with temperature knows this intuitively. What's new is that the boundaries are measurable and the regimes have distinct internal structure.

Context length is a second control axis, orthogonal to temperature. T controls the per-step noise floor — how likely the system is to escape a local pattern. L controls memory depth — how much self-generated history the model conditions on, and therefore how sticky attractor basins are. At T=0.50, L=64 wanders across phase space for 100k tokens with frequent escape episodes. L=256 collapses to near-zero entropy by step 15,000 and never recovers. Same temperature. Completely different dynamics. L isn't a nuisance parameter. It's a structural actuator.

Collapse is a staircase, not a cliff. At T=0.50, three context lengths sit on three distinct entropy floors: L=64 at ~0.2–0.4 nats with frequent excursions, L=128 on a meta-stable false floor at ~0.1–0.2 nats that persists for ~45k steps before cascading down, L=256 at the true zero-entropy floor by 15k steps. There's a hierarchy of attractor basins at progressively lower entropy. L controls how fast the system descends through them. Testable prediction: run L=64 long enough at T=0.50 and it will eventually find L=128's floor, then the true floor. Collapse isn't a regime boundary — it's a timescale phenomenon.

EOS is regime-dependent. This one surprised me. EOS rate peaks at T=1.00 — the richest dynamics regime — not at collapse, not at noise. And the phase-space position of EOS events differs by regime: at T=1.00, they fire from the dense interior of the phase portrait. At T=0.50, they fire during escape attempts from attractors. The model's "I'm done" signal means different things depending on where the system sits. L suppresses EOS dramatically in the collapse regime (13 events at L=64, down to 1 at L=256 across 100k tokens). Once the attractor locks, the model never considers stopping.

The measurement window matters. Compressibility depends strongly on the window size W you compute it over. At W=16, gzip overhead dominates and readings are meaningless. At W=256, you see context-scale structure. At W=64, you see local repetition. The "structure" you observe depends on the scale at which you observe — and therefore what feedback is available for any control mechanism, human or automated.
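As a point of reference, the window dependence is easy to see with a few lines of standard-library Python. The ratio definition and stride below are my assumptions about what "gzip compressibility over sliding windows" measures; the window sizes match the ones above.

```python
# Window-dependent compressibility: the same text read at different window sizes W.
import gzip

def compress_ratio(text: str) -> float:
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / max(len(raw), 1)

def windowed_ratios(text: str, w: int, stride=None):
    stride = stride or w
    return [compress_ratio(text[i:i + w]) for i in range(0, len(text) - w + 1, stride)]

sample = "Ethnomusicology is the study of music. " * 50
for w in (16, 64, 256):
    ratios = windowed_ratios(sample, w)
    print(f"W={w:3d}  mean compressed/raw = {sum(ratios) / len(ratios):.2f}")

# At W=16 the fixed gzip header/trailer overhead dominates and the ratio
# exceeds 1.0; at W=256 the repetition is visible and the ratio drops well below 1.0.
```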

What this suggests about generation

The standard interaction model is transactional: provide a prompt, set temperature, receive output, evaluate. Everything interesting about the generative process is flattened into a black box between input and output.

But the autoloop experiments reveal something different. Temperature and context length are orthogonal actuators operating on different timescales. The system has observable regime structure with sharp boundaries. Collapse isn't a binary — it's a cascade through a staircase of meta-stable states. And EOS is an interior signal whose meaning depends on the regime.

This points toward a different interaction model: generation as trajectory steering.

Run the model hot to explore phase space. Watch the output. When you see something crystallizing — a voice, a structure, a direction — cool the temperature. Extend the context to deepen the basin. If the system locks into a dead attractor, shorten the context to make the basin shallower, allow escape, try again. The user isn't writing a prompt and evaluating a response. They're navigating a dynamical system in real time, using T and L as control surfaces and the model's output as a viewport into its current state.
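A crude automated stand-in for that human judgment looks something like the sketch below: adjust T and L from a recent entropy reading. The thresholds and the function name are illustrative, not measured values from the experiments; the real loop puts a person, not a heuristic, in this role.

```python
# Hedged sketch of trajectory steering: change the controls in response to
# what the entropy trace shows. Thresholds are illustrative assumptions.
def steer(T: float, L: int, recent_entropy: float):
    if recent_entropy < 0.05:                  # locked into a dead attractor
        return T, max(L // 2, 32)              # shorten context: shallower basin, allow escape
    if recent_entropy < 0.5:                   # something is crystallizing
        return max(T - 0.1, 0.5), min(L * 2, 512)   # cool and deepen the basin
    return T, L                                # still exploring: leave the controls alone
```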

The workbench

This isn't hypothetical. The design is straightforward:

Two panes. Left: control surface — temperature, context length, halt conditions, session history. Right: output viewport — the tail of the generation stream, live or navigable.

The user's actions: "Run for 200 steps." "Run to next EOS." "Fork from checkpoint at step 3,000 with T=0.6 instead of 0.9." "Resume the Tuesday session from the branch where it started producing verse."

Every session is recorded — full token sequence, parameter history, checkpoints every N steps. Sessions are resumable, forkable, and comparable. The analysis tools (phase portraits, entropy traces, compressibility curves) are instrumentation panels alongside the text viewport. The URL encodes the full view state.
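A session record with those properties can be sketched as below. The field and method names are assumptions about a plausible schema, not the workbench's actual code.

```python
# Sketch of a recorded, forkable session: full token sequence, parameter
# history, periodic checkpoints. Names and structure are illustrative.
from copy import deepcopy
from dataclasses import dataclass, field

@dataclass
class Session:
    tokens: list = field(default_factory=list)
    param_history: list = field(default_factory=list)   # (step, T, L) whenever controls change
    checkpoints: dict = field(default_factory=dict)      # step -> prefix length of `tokens`
    checkpoint_every: int = 1000

    def record(self, token_id: int, step: int, T: float, L: int) -> None:
        self.tokens.append(token_id)
        if not self.param_history or self.param_history[-1][1:] != (T, L):
            self.param_history.append((step, T, L))
        if step % self.checkpoint_every == 0:
            self.checkpoints[step] = len(self.tokens)

    def fork(self, step: int, T: float = None, L: int = None) -> "Session":
        """E.g. fork(3000, T=0.6): replay from the step-3,000 checkpoint, cooler."""
        cut = self.checkpoints[step]
        child = deepcopy(self)
        child.tokens = self.tokens[:cut]
        child.param_history = [p for p in self.param_history if p[0] <= step]
        child.checkpoints = {s: n for s, n in self.checkpoints.items() if s <= step}
        last_T, last_L = child.param_history[-1][1:]
        child.param_history.append((step, T if T is not None else last_T,
                                    L if L is not None else last_L))
        return child
```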

The measurement window W isn't an analysis parameter. It's literally how much text the user is looking at in the output pane.

The attention reframe

Here's where it gets strange.

In this interaction model, the user is choosing to attend to a generative process. Not to submit a query and evaluate a response — to sit with the system, watch it evolve, decide when to intervene, learn its regime structure, develop intuition for when to heat and when to cool. The quality of the output depends on the quality of this attention. A user who understands the staircase will steer around it. A user who recognizes the EOS-interior-signal pattern at T=1.00 will use it differently than one who treats EOS as "the model wants to stop."

And context length — the L parameter — is, in the transformer sense, literally the model's attention span. How much of its own output it can attend to. The user controls this. They're choosing how much memory the model gets, which determines how deep the attractors are, which determines what kinds of output are reachable.

So you have two attention spans in the system. The model's (L, set by the user). And the user's (W, how much output they're actually reading and responding to). Both determine what structure is visible, what feedback is available, what steering is possible. They're coupled: a user with a narrow viewport can't steer context-scale dynamics; a model with a short context can't sustain long-range structure even if the user would recognize it.

The familiar version of "attention" in ML is the transformer mechanism — queries attending to keys. But there's a second attention that we don't talk about: the human's willingness to engage with the system's output on terms that aren't purely transactional. The willingness to watch, wait, steer, backtrack, fork, and re-approach. To treat the model's trajectory as something worth attending to, not just a machine producing answers to questions.

This is a different relationship with a generative model than the one we currently have. The chat interface says: tell me what you want, I'll give it to you. The workbench says: here's a system with interesting dynamics — how long are you willing to sit with it, and what will you find?

What this is not

This is not a claim that small autoregressive models are conscious, interesting in themselves, or producing work of value without human steering. SmolLM-135M, left to its own devices, collapses to "Ethnomusicology is the study of music" on repeat. The dynamics are a property of the generation process, not evidence of inner life.

This is not a proposal that everyone should interact with language models this way. The chat paradigm is extraordinarily effective for what it does. Prompt engineering works. Most people don't want to learn regime dynamics to get a grocery list.

This is also not yet validated at scale. Everything here is from a 135M model. The attractor staircase, the EOS regime-dependence, the L=192 anomaly (still replicating across seeds) — all of it might not survive scaling. Larger models have richer attractor landscapes and different capacity constraints. The regime boundaries will almost certainly shift. Whether the structure generalizes is an open empirical question.

What this might be

A different way to think about what context windows are for.

Right now, context is treated as the space where we put the prompt — instructions, examples, documents, conversation history. It's input. The model processes it and produces output. Longer context = more input capacity. That's the dominant framing and it's useful.

But in the autoloop setting, context is the model's memory of its own trajectory. It's not input — it's accumulated state. And its length controls the depth of dynamical structure the system can sustain. This is a fundamentally different role for the same mechanism.

If you take this seriously, it suggests that the interesting frontier for human-model interaction isn't longer contexts filled with more instructions. It's shared temporal structure — a human and a model attending to the same evolving process at different scales, the human steering the trajectory through a space that both are, in their different ways, navigating.

Whether that's useful, or merely interesting, depends on whether you think there's value in a mode of interaction that requires patience, attention, and a willingness to engage with generative dynamics on their own terms.

I think there might be. I'm still mapping the space.


Code, data, and reproduction instructions: [repo link]. All observations are append-only with dated entries and reproduction commands. The explorer/workbench is in active development.
