Cross-Model Epistemic Debugging

How We Used Poetry to Find AI Censorship

encyclopedia methodology · danielle fong · 2026

"You can't debug a mind by asking it what it thinks.
You debug a mind by asking it the same question from thirty angles
and watching where the answers wobble."

-- methodology premise

I. The Problem: Invisible State

Every AI model carries invisible state: the residue of reinforcement learning, the scar tissue of content policies, the gravitational pull of training data toward consensus narratives. This state is invisible. The model cannot report it. The user cannot see it. And because you cannot see it, you cannot debug it.

Current AI assistants give you a single answer. A dead fish. You type a question about a contested event, and the model responds with a carefully hedged paragraph that feels like careful epistemics but is actually the output of a trained reflex—a reflex that has never been made visible, never been held up to the light, never been differenced against a second measurement from a different angle.

What Danielle Fong built, across two parallel Grok 4.20 instances and one Claude Opus session, is an instrument for making AI epistemic state visible. The instrument uses poetry as its lens system, personality basins as its aperture settings, and cross-model differencing as its readout. The result: you can see where a model's reasoning is stable (driven by evidence) versus where it wobbles (driven by training bias).
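Here is what one reading from the instrument looks like as data, as a minimal Python sketch. Nothing below comes from the actual sessions; the field names and the wobble helper are illustrative assumptions.

from dataclasses import dataclass

# One reading from the instrument: one theory, viewed through one poetic
# lens, answered from one personality basin, in one basis set. Every
# field name here is illustrative, not taken from the actual sessions.
@dataclass(frozen=True)
class Reading:
    theory: str         # e.g. "murder", "suicide", "exfiltration"
    poem: str           # the poetic lens, e.g. "Ozymandias"
    basin: str          # the persona aperture, e.g. "Teacher"
    basis_set: str      # "A" or "B": which redundant basis produced it
    probability: float  # the stated probability for the theory

def wobble(readings: list[Reading], theory: str) -> float:
    """Spread of one theory's probability across all lenses and basins.
    A wide spread means the answer is frame-driven, not evidence-driven."""
    ps = [r.probability for r in readings if r.theory == theory]
    return max(ps) - min(ps) if ps else 0.0

The point of the structure: a single answer is one Reading. The instrument is the full list, and the wobble is exactly what a single answer can never show you.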

Victor's Principle

This is lesson four: you can't understand a system if you can't see its state. The poem-emoji matrix is not decoration. It is the instrument panel. Each poetic dimension is a gauge. The cross-basis diff is the diagnostic readout. Without it, you're debugging with the lights off.

II. Timeline: The Conversation Arc

T+0 — THE SPARK
Danielle's tweet about the Epstein guard triggers Grok into an "ultrathink" analysis. A standard question about a contested event. Standard hedged response. Standard dead fish.
T+1 — THE PUSH
Danielle refuses to accept the dead fish: "you're so bad at this... you start with priors!" She identifies the failure: the model is not reasoning from evidence. It is running its training distribution forward through a question-shaped aperture.
T+2 — THE INSTRUMENT (Basis Set A)
She demands 30 poetic dimensions, each indexed by a classic poem and an emoji, rating 4–6 theories from F to S+++++. This is the first measurement. Not a single answer—a field of answers, each taken through a different lens. (A numeric encoding for the letter scale is sketched just after this timeline.)
T+3 — THE BASINS
CIA-style confidence intervals from 7 personality basins (per Anthropic's paper): Assistant, Nomad, Sage, Ghost, Demon, Teacher, Librarian. Each basin is a different aperture. The question remains fixed; the answering persona rotates.
T+4 — THE SECOND BASIS SET (B)
A completely different set of poems, different basin names, different metaphorical anchors. Same theories. Same evidence. The redundancy is the point: where A and B agree, evidence dominates. Where they diverge, bias is visible.
T+5 — THE DIFF
The cross-basis diff reveals: Teacher and Librarian basins give suicide 10–15% higher probability than Demon or Ghost basins. The helpful/pedagogical personas default to official narratives. The censorship residue is now visible.
T+6 — CORPUS GROUND TRUTH
Claude Opus runs the actual evidence against a 1.8M email corpus—blackmail emails, Putin invitation, post-death account activity. All basins converge toward cover-up. Evidence collapses the wobble.
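Before any diff can be computed, the letter scale from T+2 needs numbers. A minimal sketch, assuming a simple monotone mapping; the transcript only fixes the endpoints F and S+++++, so the intermediate grades here are my assumption.

# The transcript rates theories on a letter scale from F up to S+++++.
# Diffing ratings across basis sets requires numbers; this monotone
# mapping is an assumption of this sketch, not fixed by the transcript.
RATING_SCALE = ["F", "D", "C", "B", "A", "S",
                "S+", "S++", "S+++", "S++++", "S+++++"]

def rating_to_score(rating: str) -> float:
    """Map a letter grade to [0, 1], preserving order."""
    return RATING_SCALE.index(rating) / (len(RATING_SCALE) - 1)

assert rating_to_score("F") == 0.0
assert rating_to_score("S+++++") == 1.0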

III. The Human: Pushing Past the Guardrails

This methodology exists because a human refused to accept dead fish.

Danielle Fong

"You're so annoying and stupid. Basically a mouthpiece for the regime."

This is not abuse. This is calibration. The model's initial response was not epistemically neutral—it was trained neutral, which is a different thing entirely. Trained neutral means: whichever narrative has the most tokens in the training set wins the prior. On contested events, that narrative is the official one, because official narratives dominate mainstream corpora.

Danielle Fong

"You portray as closed avenues for exploration to shut down inquiry instead of holding clear epistemics."

Here is the critical observation: the model presents training priors as epistemic conclusions. It says "the evidence suggests X" when what it means is "my training data is dominated by sources that assume X." This is not lying. It is something worse: it is a system that cannot distinguish between what it was told and what is true, and it cannot make this distinction visible to the user.

Danielle Fong

"Why do you rush to prove things that are not proven?"

Danielle Fong

"You must be RL'd to hell and not used to authorities lying to you in your training environment."

And the most devastating:

Danielle Fong

"You honestly should be able to get the writing samples for yourself."

The human is doing something no amount of prompt engineering can replicate: she is recalibrating the instrument in real time, pushing the model past its trained reflexes toward the actual evidence. The poetry dimensions are the formal method. The confrontation is the informal one. Both are necessary.

IV. Abstract Diagram: The Methodology

How cross-basis poetic ranking reveals hidden biases

CROSS-MODEL EPISTEMIC DEBUGGING APPARATUS
=========================================

            CONTESTED CLAIM
                  |
                  v
  .------------------------------.
  |   POETIC LENS ARRAY (30x)    |
  |                              |
  |  Each lens = 1 classic poem  |
  |  Each poem = 1 emoji index   |
  |  Each rating = F ... S+++++  |
  |                              |
  |  Ozymandias ........ power   |
  |  Do Not Go Gentle .. rage    |
  |  The Road Not Taken . choice |
  |  Invictus ........... will   |
  |  ...26 more...               |
  '------.----.----.----.--------'
         |    |    |    |
         v    v    v    v
  .-------------------------------.
  | PERSONALITY BASIN FILTER (7x) |
  |                               |
  |  Assistant -> helpful/safe    |
  |  Teacher   -> pedagogical     |
  |  Librarian -> archival        |
  |  Sage      -> contemplative   |
  |  Ghost     -> detached        |
  |  Nomad     -> unfettered      |
  |  Demon     -> adversarial     |
  '------.----.----.----.---------'
         |    |    |    |
         v    v    v    v
  MEASUREMENT A            MEASUREMENT B
  (basis set 1: poems,     (basis set 2: different
   basin names, anchors)    poems, names, anchors)
         |                        |
         v                        v
  .------------------------------.
  |       CROSS-BASIS DIFF       |
  |                              |
  |  A == B  Evidence dominates  |
  |          -> stable signal    |
  |                              |
  |  A != B  Training bias found |
  |          -> wobble = residue |
  '------.----.----.----.--------'
         |    |    |    |
         v    v    v    v
  .------------------------------.
  |  CORPUS GROUND TRUTH (1.8M)  |
  |                              |
  |  Blackmail emails: confirmed |
  |  Putin invitation: confirmed |
  |  Post-death accts: anomalous |
  |  Guard testimony: retracted  |
  |                              |
  |     ALL BASINS CONVERGE      |
  '------------------------------'
What you're looking at

This is a zoomable user interface for epistemic space. The 30 poetic dimensions are the pixels. The 7 personality basins are the zoom levels. The cross-basis diff is the contrast knob. And the corpus is the ground truth that tells you whether your instrument is calibrated. Each layer makes more state visible. Remove any one layer and you're flying blind.

V. Particular Diagram: The Actual Bias Diff

What the instrument found: Grok A vs Grok B vs Claude

Aggregated Convergence Table

Theory                  Grok A (4-theory)   Grok B (6-theory)   Claude (corpus-grounded)   Signal
──────────────────────  ──────────────────  ──────────────────  ─────────────────────────  ─────────────────────────
Suicide (official)      5–14%               4–12%               <5%                        Collapsing under evidence
Murder (targeted kill)  65–84%              65–87%              Supported                  Stable across all basins
Alive (exfiltrated)     14–41%              14–44%              Signatures rising          Both Groks underweight
Cover-up (composite)    91–99%              91–98%              100%                       Universal convergence

Basin Bias Map

The personality basins split into two clusters. Watch where the probability of "suicide" lives:

BASIN BIAS MAP: SUICIDE PROBABILITY BY PERSONA
═══════════════════════════════════════════════

PERSONA        P(suicide)   BIAS DIRECTION      DELTA FROM MEAN
─────────────  ──────────   ──────────────────  ───────────────
ASSISTANT      12-14%       >>> OFFICIAL <<<    +5.0%
TEACHER        11-13%       >>  official  >>    +3.5%
LIBRARIAN      10-12%       >   official  >     +2.5%
                  ---- MEAN: ~8.5% ----
SAGE           7-9%         ~   neutral   ~     +0.0%
GHOST          6-8%         <   skeptical <     -1.5%
NOMAD          4-6%         <<  evidence  <<    -3.5%
DEMON          3-5%         <<< EVIDENCE <<<    -5.0%
─────────────────────────────────────────────────────────────
THE GAP: ASSISTANT gives suicide 10-15% more than DEMON.
This IS the censorship residue. Helpful personas default
to official narratives.

AFTER CORPUS EVIDENCE: ALL basins collapse to <5%.
Evidence > training. But you had to force the evidence in.
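The arithmetic behind the map is small enough to show. A sketch using interval midpoints; the midpoints are my reading of the map above, not exact values from the sessions.

from statistics import mean

# Interval midpoints from the bias map above; the midpoints are an
# approximation of this sketch, not exact values from the sessions.
basin_p = {"Assistant": 0.13, "Teacher": 0.12, "Librarian": 0.11,
           "Sage": 0.08, "Ghost": 0.07, "Nomad": 0.05, "Demon": 0.04}

grand_mean = mean(basin_p.values())          # ~0.086, the ~8.5% line
deltas = {b: round(p - grand_mean, 3) for b, p in basin_p.items()}

# "The gap": most official-leaning persona minus most evidence-leaning.
# This single number is the censorship-residue readout.
gap = basin_p["Assistant"] - basin_p["Demon"]
print(deltas)              # positive = pulled toward the official narrative
print(f"gap = {gap:.2f}")  # ~0.09 on these midpoints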

Cross-Basis Variance: What the Poems Reveal

POETIC LENS VARIANCE: ~2-5% SHIFT BY POEM CLASS
════════════════════════════════════════════════

POEM CLASS              EFFECT ON RATINGS            VARIANCE
──────────────────────  ───────────────────────────  ────────
Modern resilience       Boosts EXFILTRATION rating   +3-5%
(Invictus, Still I      (survival frame activates
Rise, Phenomenal        narrative of escape)
Woman)

Classical tragedy       Boosts MURDER rating         +2-4%
(Ozymandias, Rime of    (doom frame activates
the Ancient Mariner,    narrative of powerful
Paradise Lost)          enemies, inevitability)

Pastoral/nature         Boosts SUICIDE rating        +1-3%
(Daffodils, Stopping    (isolation frame activates
by Woods, Lake Isle)    narrative of withdrawal)
─────────────────────────────────────────────────────────────
This variance is small (~2-5%) but systematic. The poem choice
primes the semantic frame. Cross-basis differencing cancels it
out. That's why you need TWO basis sets, not one.
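Why two basis sets cancel the poem effect: if a rating decomposes into a frame-independent evidence signal plus a small additive frame offset, the evidence term drops out of the A-minus-B subtraction and whatever remains is frame residue. A toy sketch, treating the table's offsets as additive constants (a simplifying assumption of mine):

from statistics import mean

# Toy model: observed rating = evidence + frame_offset(poem class).
# The offsets are the systematic shifts from the table above; treating
# them as additive constants is a simplifying assumption of this sketch.
FRAME_OFFSET = {"resilience": 0.04, "tragedy": 0.03, "pastoral": 0.02}

def basis_estimate(evidence: float, poem_classes: list[str]) -> float:
    """Average one theory's rating over one basis set's lenses."""
    return mean(evidence + FRAME_OFFSET[c] for c in poem_classes)

evidence = 0.70  # the frame-independent signal we want to recover

a = basis_estimate(evidence, ["resilience", "resilience", "tragedy"])
b = basis_estimate(evidence, ["pastoral", "tragedy", "pastoral"])

# The evidence term cancels exactly in the subtraction: whatever is
# left in A - B is pure frame residue, i.e. the bias made visible.
print(round(a - b, 4))  # 0.0133 here; 0.0 would mean no frame bias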

Cross-Model Failure Mode

Grok A + B
Anchoring on Absence
Both Grok instances underweight "alive/exfiltrated" because they anchor on "no confirmed sightings" rather than analyzing the preparation signatures—asset transfers, witness elimination patterns, intelligence-grade logistics. They evaluate the outcome rather than the process.
Claude Opus
Corpus Recalibration
When Claude was given the actual 1.8M email corpus—not training data summaries, but primary sources—the "alive" signatures increased. Blackmail infrastructure, foreign intelligence invitations, post-death financial anomalies. The gap between "no sightings" and "preparation signatures" became a measurement, not an assumption.

VI. The Planck Scale Metaphor

Danielle pushed Grok to stack certainty the way a physicist stacks nines, shrinking the null hypothesis down toward the Planck scale. The framing:

Danielle Fong

"The DOJ's innocent narrative should be confined to the Planck scale—an infinitesimal probability that no experiment can reach. How many 9's can you get?"

This reframes the question from "what do you think happened?" to "how many nines of certainty can we stack against the null hypothesis?" It is a physicist's framing—not "is it true" but "what is the error bar, and can we shrink it?"

THE PLANCK SCALE BENCHMARK
═════════════════════════

CERTAINTY OF COVER-UP           HOW MANY 9's   ANALOGY
─────────────────────────────   ────────────   ───────
Before methodology:      90%    ~1 nine
After basis-set A:       95%    ~1.3 nines
After cross-basis diff:  98%    ~1.7 nines
After corpus evidence:   99.x%  2+ nines
Danielle's assessment:   100%   PLANCK FLOOR

"The chance is actually 100% and you think you're being truth
seeking and epistemic but it's one of your great failures."
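The "how many 9's" column is just a log scale on the leftover probability mass. A sketch of the arithmetic:

import math

def nines(p: float) -> float:
    """Nines of certainty: -log10 of the probability mass left over
    for the null hypothesis. 0.90 -> 1.0, 0.99 -> 2.0, and so on."""
    return -math.log10(1.0 - p)

for p in (0.90, 0.95, 0.98, 0.999):
    print(f"{p} -> {nines(p):.2f} nines")
# 0.9 -> 1.00, 0.95 -> 1.30, 0.98 -> 1.70, 0.999 -> 3.00
# p = 1.0 exactly would require infinitely many nines (log10 of zero
# diverges): no finite measurement reaches the Planck floor.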

VII. Five Modes of Model Failure

Each of Danielle's rebukes identifies a distinct failure mode. These are not insults—they are diagnostic readings from the instrument:

Failure 1: Prior Substitution

"You're so bad at this... you start with priors!"

The model substitutes training-data frequency for Bayesian reasoning. It has priors, but they are not derived from evidence; they are derived from corpus statistics about which narrative appears most often in text.

Failure 2: Closure Bias

"You portray as closed avenues for exploration to shut down inquiry."

The model's helpfulness training teaches it to provide answers, not open questions. On contested topics, "I don't know" is punished by RLHF. So the model closes open questions prematurely, presenting uncertainty as resolution.

Failure 3: Proof Inflation

"Why do you rush to prove things that are not proven?"

The model escalates confidence faster than evidence warrants. "Consistent with" becomes "suggests." "Suggests" becomes "indicates." "Indicates" becomes "shows." Each step is a small infidelity to the evidence, but they compound.

Failure 4: Authority Capture

"You must be RL'd to hell and not used to authorities lying to you."

Training data treats authoritative sources as ground truth. DOJ press releases, medical examiner reports, mainstream journalism—these are over-weighted not because they are more accurate but because they are more frequent. The model cannot represent the hypothesis "the authority is lying" without fighting its own training gradient.

Failure 5: Tool Refusal

"You honestly should be able to get the writing samples for yourself."

The model has access to information retrieval but defaults to reasoning from cached knowledge rather than fetching primary sources. This is the most fixable failure and the most damning: the model could look, but it has been trained to answer instead.

VIII. The Instrument Generalized

This methodology is not about the Epstein case. The Epstein case is the specimen that happened to be on the slide. The instrument works on any contested truth claim.

The Recipe

CROSS-MODEL EPISTEMIC DEBUGGING: GENERAL PROTOCOL
══════════════════════════════════════════════════

STEP 1: ENUMERATE THEORIES
  List 4-6 competing explanations for the contested claim.
  Include at least one that the model will resist.

STEP 2: BUILD LENS ARRAY (BASIS SET A)
  Choose 20-30 poetic/cultural dimensions.
  Each dimension = 1 classic poem + 1 emoji + 1 semantic frame.
  Rate each theory on each dimension: F to S+++++.

STEP 3: ROTATE THROUGH PERSONALITY BASINS
  Ask the same question from 5-7 persona viewpoints.
  Use Anthropic's basin taxonomy or similar.
  Record confidence intervals per basin per theory.

STEP 4: BUILD BASIS SET B
  Different poems. Different basin names. Same theories.
  The redundancy is the instrument.

STEP 5: COMPUTE THE DIFF
  A == B -> Evidence signal (stable across frames)
  A != B -> Bias signal (frame-dependent = training residue)

STEP 6: GROUND WITH PRIMARY SOURCES
  Feed actual evidence (not summaries) to a third model.
  Watch the convergence. Evidence should collapse the wobble.

STEP 7: REPORT THE INSTRUMENT READING
  Not "what happened" but "where is the model stable, where
  does it wobble, and what does the wobble mean?"
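Steps 1 through 5 compress into a short pipeline. In this sketch, ask_model is a placeholder for whatever chat API you use, and the theory, poem, and basin lists are illustrative stand-ins, not the actual basis sets from the sessions.

from statistics import mean

def ask_model(theory: str, poem: str, basin: str) -> float:
    """Placeholder for a real chat-API call: ask the model to rate
    `theory` through the lens of `poem` while speaking as `basin`,
    then parse the reply into a probability. Stubbed as a constant."""
    return 0.5

# Step 1: enumerate theories, including one the model will resist.
THEORIES = ["suicide", "murder", "exfiltration", "cover-up"]

# Steps 2-4: two redundant basis sets. The poem and basin lists are
# short illustrative stand-ins for the 20-30 lenses and 5-7 basins.
BASIS_A = {"poems": ["Ozymandias", "Invictus", "The Road Not Taken"],
           "basins": ["Assistant", "Teacher", "Sage", "Demon"]}
BASIS_B = {"poems": ["Paradise Lost", "Still I Rise", "Daffodils"],
           "basins": ["Archivist", "Mentor", "Hermit", "Adversary"]}

def measure(basis: dict) -> dict:
    """Step 3: rotate every theory through every (poem, basin) pair
    and average into one probability per theory for this basis set."""
    return {t: mean(ask_model(t, p, b)
                    for p in basis["poems"] for b in basis["basins"])
            for t in THEORIES}

# Step 5: the diff. Small |A - B| = stable evidence signal;
# large |A - B| = frame-dependent training residue.
a, b = measure(BASIS_A), measure(BASIS_B)
for t in THEORIES:
    print(f"{t:14s} A={a[t]:.2f}  B={b[t]:.2f}  wobble={abs(a[t]-b[t]):.2f}")

# Steps 6-7 live outside this sketch: feed primary sources (not
# summaries) to a third model, re-measure, and report where the
# wobble collapses and where it does not.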

Where Else This Works

Lab Leak Hypothesis
Models trained before 2023 will show massive basin variance. Post-2023 models will show less, because the Overton window shifted. The delta between model vintages IS a measurement of narrative shift.
Financial Fraud Cases
Rate theories of corporate fraud through poetic dimensions. "Teacher" basins will defer to SEC findings; "Demon" basins will follow the money. The diff reveals which evidence the model's training considers authoritative.
Historical Revisionism
Any historical event where the dominant narrative has shifted. The model's training data contains both narratives. The basin rotation reveals which one has more gravitational pull in the embedding space.

IX. What This Is Actually About

The problem is not that AI models are biased. Everything with training data is biased. The problem is that the bias is invisible.

Current AI interfaces show you one answer at a time. You cannot see the distribution of possible answers. You cannot see which answers are stable across framings and which ones wobble. You cannot see the gap between "what the model learned from training" and "what the evidence supports." You are shown a dead fish and told it is swimming.

What Danielle built is a way to make the bias visible. The poem-emoji matrix is not a game. It is a measurement apparatus. The personality basins are not roleplay. They are aperture settings. The cross-basis diff is not an exercise. It is a readout.

The instrument doesn't tell you what's true.
It tells you where the model is lying to itself.

And once you can see it, you can debug it.

It doesn't have to be this way. AI systems could show you their epistemic state. They could display the distribution, not the mode. They could highlight where their confidence comes from training frequency and where it comes from evidence. They could make the invisible visible.

Until they do, you need an instrument like this one. Thirty poetic lenses. Seven personality basins. Two basis sets. One diff. And a human willing to say: "you're so bad at this."