AI Needs an Editor: Rubric, Moves, Memory

The output arrived clean. The insight wasn't there.

I was halfway through a journey-map project at Compare Club. Askable (a Claude-powered chat with retrieval over our research transcripts) had just produced three "key insights" for one sub-stage. Each one read sharp on first pass. By the end of a strict-critic pass, two of three had to be dropped. They were restatements of pain points we'd already mapped, dressed up in different language. The third needed a quote we didn't yet have, and the supporting evidence at the bottom belonged to one of the other two.

If I had shipped the first read, the readout would have looked confident. The decisions downstream would have been built on synthesis, not insight. Nobody at the meeting would have known the difference.

This is a follow-up to Insight Ops and Everyone Has the Data, Nobody Has the Insight. Those pieces argued for the system view; this one is the workflow inside it. The move from we have an Insight Ops practice to here is what the editor actually does on Tuesday morning.

Scope: Senior IC operating mode (Head of CX day job), Compare Club health-insurance journey-map: 4 stages, 16 sub-stages, 176 cells of detail.

What you'll save from this post

The 4-test rubric for AI-produced research output (smart-PM, restatement, strategy-slide, evidence-grounding).
The 4 follow-up moves when a finding fails a test (Soften, Drop, Confirm-with-evidence, Escalate).
The 3-part memory layer that stops mistakes from fossilising (data-render split, auto-generated prompts, context-lane injection).
Cross-run convergence test for separating real signal from training-data echo.
Three strategic findings the discipline produced on a real Compare Club project. They're at the end.

The trap is structural, not occasional

You've felt it. A summary lands in your inbox; the themes are clean; the supporting quotes are real; the language is confident. You read it. You nod. You forward it on. A week later something quiet doesn't add up, and the trail leads back to a sentence in that summary that wasn't quite what the data said.

This isn't a problem of careless researchers or bad tools. The trap is structural.

Recent work on bias in human review of AI suggestions (Harvard Data Science Review, 2026) shows that humans systematically over-trust well-formatted model output, especially when the framing matches what they expected to find. The cleaner the synthesis, the harder it is to catch. The reviewer's bias becomes the model's accomplice.

Bias in human review of AI suggestions

Research on how humans systematically over-trust well-formatted model output, especially when the framing matches their priors.

Harvard Data Science Review

That's the part nobody says out loud. AI tools don't fail by producing nonsense. They fail by producing confident, plausible, internally consistent output that isn't quite right; and the reader's habit is to accept it because it looks correct.

The job, then, is to be the second read. Reliably. Every time. With a system that catches the failure modes before they propagate.

Synthesis is what AI does. Insight is what the editor does. The two read identically on first pass.

Stop one-shotting; build the editorial workflow

The default workflow for AI research tools is one-shot. Upload transcripts. Ask for insights. Receive insights. Move on.

That workflow ships synthesis as insight. It does not survive scrutiny.

The alternative isn't more prompting. It's a system that treats AI as a research assistant whose drafts need editing; the same way you'd treat any junior researcher who came in with a well-formatted readout that hadn't been pressure-tested. Their work is the raw material. Your editorial judgement is what produces the artefact.

The NIST AI Risk Management Framework names structured human oversight and documentation as core practices for high-stakes AI workflows. Most teams read that as compliance overhead. It isn't. It's a description of the editorial loop you need anyway if you want the output to mean anything. The compliance posture is a side effect of the discipline, not the reason for it.

Roadmap for the NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)

This Roadmap is a companion to the Artificial Intelligence Risk Management Framework (AI RMF 1.0).

NIST

I built the loop over one journey-map project, by getting burned on the first few sub-stages and tightening the discipline until the outputs survived a second read. What follows is the workflow that ended up working.

The 4-test rubric

Every AI-produced output gets four tests applied to it before anything is saved. These are not prompt techniques. They are editorial questions you ask yourself when the output lands on your screen.

Test	Question to ask	If it fails
Smart-PM	Would a thoughtful PM who hasn't read the transcripts genuinely be surprised?	Drop. Synthesis, not insight.
Restatement	Is this rewording something already mapped (pain point, action, observation)?	Drop.
Strategy-slide	Would this fit on a generic industry slide ("trust matters", "users want value")?	Kill. Table stakes, not insight.
Evidence-grounding	Does the supporting quote actually ground THIS claim, or one of the other findings?	Push back. Ask for a specific quote per claim.

💡Order of application: restatement → strategy-slide → evidence-grounding → smart-PM. First failure wins; record the others as secondary signals. The tests overlap on purpose; what fails first tells you which move to reach for.

💡Pass rate (Compare Club run). The loop kills outright maybe 1 in 4 first-pass findings. It modifies meaningfully another 1 in 2 (softened, confirmed with stronger evidence, or moved to a different sub-stage). Roughly 1 in 4 ship as written. Most of the work is editorial, and most editorial work isn't killing, it's tightening. The tests don't generate output; they disqualify or reshape it. That ratio is the work, not the failure.

When the rubric over-rejects

The rubric is a forcing function, not an oracle. Two failure modes worth naming.

Early-stage recurrence. What looks like restatement might actually be the signal of recurrence; killing it on the restatement test loses the pattern. If you're early in a project and the same mechanism keeps surfacing, that's a finding, not a duplicate.
Inter-test conflict. Smart-PM says novel; evidence-grounding says weak. Auto-dropping throws away a finding that might survive with a sharper quote.

When the rubric conflicts with itself, hold both readings and run a follow-up rather than auto-dropping. Which brings us to what the follow-ups actually look like.

The four follow-up moves

Once the rubric has flagged a weakness, you have exactly four honest things you can do about it. Not five. Not "iterate creatively." Four. They map onto the four ways an insight can fail: the framing overreaches, the finding is a restatement, the evidence is wrong, or the failure is structural and can't be fixed inside the current sub-stage. Each move has a trigger, a prompt template, an honest range of acceptable outcomes, and a worked story from the Compare Club run.

Two things to flag before the cards. First, the prompt templates are not magic strings; they are scaffolds for the editorial intent behind the move. You will rephrase them. The structure (what failed, what's permitted, what counts as honest failure) is the part that matters. Second, "acceptable outcome" includes the model returning fewer findings, or admitting it cannot ground the claim. That is a feature. Padding is the failure mode you are designing against.

Move	Trigger	Typical outcome
1. Soften	Framing overreaches the evidence	Narrower claim with the same data; often sharper
2. Drop	Restatement of something already mapped	Fewer findings; no padding
3. Confirm with stronger evidence	Right claim, wrong supporting quote	Same finding, verbatim quote that actually grounds it
4. Escalate	Failure can't be resolved at this level	Replacement insight from a different participant, OR fewer findings

Cards below give the prompt template, acceptable outcomes, and a worked example for each.

Placeholder legend for the templates below: [N] = insight number, [X] = the substantive claim, [SUPPORTED HALF] = the part the transcripts ground, [SPECIFIC ELEMENT] = the exact thing the quote needs to identify, [CLAIM] = the substantive finding being checked. The verb clauses ("Provide a direct verbatim quote", "soften to what the data supports") are load-bearing scaffold; keep them. Everything else can be rephrased.

Move 1: Soften

Trigger: smart-PM or evidence-grounding fails because the framing overreaches the evidence. The claim is half-true. The transcripts support a narrower version.

Prompt template: Insight [N] makes a specific claim about [X]. Provide a direct verbatim quote that grounds [X]. If only the [SUPPORTED HALF] is in the transcripts, soften to what the data supports.

Filled example: Insight 2 makes a specific claim that the original wrong calculation came from Compare Club. Provide a direct quote where the participant identifies the original source. If only the resolution-half is in the transcripts, soften to the supported half.

Acceptable outcomes: the model honestly admits half the claim is unsupported and softens to the supported half. Sometimes (often) the softened version lands sharper than the overreach it replaces, because the supported half is the part with a real mechanism behind it.

Worked example: the original output framed a participant story as evidence that "stalled-for-over-a-decade decisions" could be unblocked by a single agent intervention. The transcript supported a narrower claim. The participant said:

It was Compare Club and they said it looks like the last person gave you the wrong advice.

In your work: any moment where you wrote "X causes Y" but the data only supports Y; soften to Y.

The "decade" framing was inferential; the agent moment was real. Softened, the insight became: correcting prior misinformation re-engages stalled users more powerfully than offering new information or incentives. Same evidence, narrower claim, sharper strategic implication. The original framing would have invited a rebuttal on its weakest flank; the softened version doesn't.

Move 2: Drop

Trigger: the restatement test fails. The "finding" is rewording a pain point or theme already mapped earlier in the same sub-stage, dressed in a strategic-implication wrapper.

Prompt template: The mechanism in this insight is already present in [Pain Point / Theme X]. The strategic-implication wrapper is the only addition. If there is no behaviourally distinct version at this sub-stage, drop.

Filled example: The mechanism in this insight is already in Pain Point: partner-discussions deadlock at decision sub-stage. The strategic-implication wrapper ("Compare Club cannot currently intercept this deadlock") is the only addition. If there is no behaviourally distinct version at this sub-stage, drop.

Acceptable outcomes: the model checks its own output against the upstream lane, confirms the overlap, and drops the insight. You finish the sub-stage with two strong findings instead of three with a restatement risk. Less is more, applied honestly.

Worked example: in the partner-discussions sub-stage, the AI proposed a third insight whose mechanism (a deadlock between partners that Compare Club cannot currently intercept) was already fully captured upstream as a pain point. The strategic-implication wrapper ("Compare Club cannot currently intercept this deadlock") added phrasing, not analysis. Two genuine insights survived. The third was dropped without ceremony. No replacement was forced; nothing fills the gap because nothing should.

In your work: a "new insight" that, when you check, is just rewording an earlier observation; drop without ceremony.

Move 3: Confirm with stronger evidence

Trigger: evidence-grounding fails in a specific way: the substantive claim is right, but the supporting quote attached to it belongs to a different insight. The model picked a near-miss quote.

Prompt template: The [CLAIM] is correct in substance but the supporting quote is for a different insight. Provide a direct verbatim quote where a participant explicitly identifies [SPECIFIC ELEMENT].

Filled example: The trust-shield claim is in the substance of the output but the supporting quote is for a different insight. Provide a direct verbatim quote where a participant explicitly identifies the comparison site as not at fault.

Acceptable outcomes: the model returns specific verbatim quotes that fit the right claim. Same finding, stronger evidence. If no such quote exists, the model says so and you fall back to Move 1 or Move 2.

Worked example: the trust-shield exoneration finding (participants explicitly absolving the comparison site and re-pointing blame at the underlying provider) was substantively correct but the original quote belonged to a separate insight. A second prompt asked for verbatim moments where the participant performed the exoneration directly. The model returned:

It wasn't really the comparison website's fault. It was the insurers themselves… It's nothing to do with the comparison website.

I felt that was sort of individual insurance providers' fault. Rather than a Compare Club website.

In your work: a finding that survives smell tests but the quote attached is from a different paragraph; ask for the right quote.

Two participants, two unprompted exonerations, identical mechanism. The finding stayed; the evidence stack underneath it became something you could quote in a stakeholder room without flinching.

Move 4: Escalate

Trigger: the failure cannot be resolved at this level. Framing is wrong, evidence is missing, and the corpus may not contain anything stronger at this sub-stage. The first three moves have nothing to bite on.

Prompt template: Find a different participant whose experience fits the same slot but with a distinct mechanism. If no such participant exists, output only the surviving insights and accept fewer.

Filled example: Find a different participant whose decision-moment fits the same slot as the bank-details story but with a distinct mechanism. If no such participant exists, output only the surviving insights and accept fewer.

Acceptable outcomes: the model surfaces a fresh insight from a different transcript with a genuinely different mechanism, OR fails honestly and the sub-stage ships with fewer findings. Both are acceptable. Padding is not.

Worked example: in the Decision sub-stage, the AI returned an insight whose evidence quote had already been consumed by the bank-details-moment finding two sub-stages earlier. Move 1 couldn't soften (the framing wasn't the problem); Move 2 didn't apply (the mechanism was distinct); Move 3 had nothing to escalate to within the same evidence pool. Move 4 asked the model to look at a different participant's decision moment with a different mechanism. It surfaced a replacement insight grounded in a fresh transcript. Sometimes Move 4 returns nothing usable; on this run, it landed.

In your work: a finding that won't soften, drop, or confirm; ask the model to look at a different respondent / case / source instead.

💡Soften, Drop, Confirm-with-evidence, Escalate. These are not prompt techniques. They are editorial judgements you make on every output before saving.

The memory layer

The four moves work on a single output. Run the loop across forty sub-stages and a second problem appears: you cannot remember what the model said three hours ago, which quote has already been used, or which version of the prompt produced which artefact. Without a memory layer the loop scales linearly with your willingness to suffer. Memory is what turns it into a system.

Pick your tier

Solo / single project: a vault (Obsidian, Notion, even a Google Doc) is enough. Archive every output with the follow-up that produced it.
Repeated structure: add CSVs as canonical source-of-truth. Regenerate rendered artefact from CSV.
Prompt drift becoming the bottleneck: add a generator script. Regenerate prompts from the canonical structure when it changes.

Each tier above activates one or more rows of the table below: solo = vault only (no rows yet); repeated structure = data/render split + context-lane injection; prompt drift bottleneck = adds auto-generated prompts.

Component	Prevents	Minimum viable version
Data/render split	Layout-change rebuild cost	Two CSVs as source-of-truth; render layer regenerates from them
Auto-generated prompts	Prompt drift across runs	Small Python script reading canonical CSV; outputs prompts per sub-stage
Context-lane injection	Free-floating extractions	Inject prior lane content into next lane's prompt as context

Minimum vault, four lines

output.md frontmatter: participant_id, sub_stage, moment_tag
Filename pattern: P[participant_id]-[sub_stage]-[moment_tag].md (e.g. P03-decision-bank_details.md)
followups.md: rubric flags, prompt sent, response, decision (kept / softened / dropped / replaced)
Duplicate detector: grep -r 'participant_id: P03' vault/ | grep 'moment_tag: bank_details'

Run before saving any new finding.

The cooling-off worked example. In the Post-Purchase sub-stage, the AI returned a fresh-looking finding with a quote attached. The vault flagged that the same participant's same lived moment had been used three sub-stages earlier under a different framing. Without the archive that flag does not exist; the finding ships, the report contains a hidden duplicate, and a stakeholder eventually notices in the worst possible meeting.

With the archive, the trigger is automatic: same participant, same moment, two slots. Move 2 (drop) or Move 4 (escalate to a different participant) becomes the next action. The cost of the memory layer is fifteen minutes of setup; the cost of not having it is a credibility hit that compounds across every project where you reuse the workflow.

⚠️Cell-level content is the substrate. Strategic findings are the answer.

Cross-run convergence (with falsifiability)

Single runs lie convincingly. Two independent runs converging on the same finding lie less often, but only if the independence is real. On the Compare Club project, two passes (different prompts, different context windows, different runs) both surfaced the trust-shield exoneration mechanism without prompting for it. That convergence was the strongest evidence in the report. Mapped onto the independence checklist: different prompt phrasings (no overlapping framing nouns), non-overlapping context windows (different transcript subsets), two different model families on the runs. Three of three. Confirmatory.

The accountant-as-validator pattern was the second case. Run one returned a vague "users consult trusted others"; run two, prompted differently and with no shared context, returned a specific finding about accountants being used as a final-stage validator on quotes the participant had already accepted. Two roads, same destination, no shared map. That kind of convergence is the closest thing to internal validation an AI synthesis can give you.

Valid convergence requires independent prompts, independent context windows, and ideally different model families. Two runs converging on a popular-but-wrong pattern (training-data echo) is the false-positive case. If your runs share any of those independencies, treat the convergence as suggestive, not confirmatory.

For a first pass, just run the same prompt twice in fresh chats and note where they agree. That's enough to start; the rigour comes when you scale.

Falsifiability matters because convergence is rhetorically powerful. "Two runs found this" sounds like proof. It isn't proof if both runs were primed by the same context, the same phrasing, or the same model's prior beliefs about what insurance research participants tend to say. Document your independencies in the same vault that holds the outputs. If a stakeholder asks how confident you are, you want to point at structural facts, not vibes.

Independence checklist (run before claiming convergence)

Prompts share no overlapping framing nouns or leading verbs.
Context windows have zero document overlap.
At least two model families, OR two versions far apart in the same family.

Two of three checks passing = convergence is suggestive. Three of three = convergence is confirmatory. Fewer = treat as a single run.

The 1-of-3 case in practice: same model family, same context window, only the prompt wording differed. Two runs agree, but they're effectively a single run; treat the convergence as you would a single-pass synthesis.

One worked false-positive from the Compare Club run: two passes converged on "users value transparency in pricing" (a generic insurance trope). The independence checklist flagged shared framing nouns ("transparency", "pricing") in both prompts; we treated the convergence as a single run and the finding didn't survive Move 1 (smart-PM said "yeah, makes sense"). Convergence sounds like proof until you check whether the runs were structurally independent.

What "modified meaningfully" actually looks like

The pass-rate above breaks down as roughly 1 in 4 dropped, 1 in 2 modified meaningfully, 1 in 4 shipped clean. That middle bucket is where most of the editorial work lives, and modifications leave a specific trace shape. Three live examples from the Compare Club run. The strategic claims stay with the client; the trace shape is the methodology made visible.

Trace 1: Move 3, confirm with stronger evidence

First-pass: a generic claim about user research behaviour. Test that failed: smart-PM (the claim wasn't surprising). Move applied: prompted the model for specific verbatim moments grounding a particular upstream entry pattern. After: the claim survived but the evidence layer changed. The first-pass generalised; the supported version named the behaviour and pointed at a specific entry pattern the platform's analytics don't see.

Trace 2: Move 1, soften

First-pass: a broad framing about how participants use comparison platforms. Test that failed: evidence-grounding (smart-PM passed but the framing overreached the data). Move applied: narrowed the claim to a specific behavioural pattern at a specific stage. After: the narrower claim is the strategic one; the broad claim is the strategy-slide version. The discipline produced a sharper finding, not a weaker one.

Trace 3: Move 4, escalate

First-pass: weak "users want X" findings surfaced across multiple sub-stages. Test that failed: restatement (the same mechanism kept getting reworded across cells). Move applied: asked the model to find a participant whose underlying mechanism was different, with a specific brief about what "different" meant. After: a participant segment surfaced with a mechanism the original cluster didn't contain. The escalate didn't kill the finding; it forced the model to look somewhere the first-pass hadn't.

Each trace is a small artefact: a failure mode, a targeted move, a sharper claim. The strategic payload stays with the client; the trace shape is what the methodology produces. Single-pass synthesis returns the strategy-slide version of every finding, because the strategy-slide version is the natural endpoint of summarisation. The loop is what gets you past it.

Where to start, this week

Three entry points, ordered by effort. Pick the smallest one you can finish before Friday.

Entry point	Effort	What you'll have by Friday
1. Run the rubric on one existing AI output	30 minutes	A marked-up version of a synthesis you already have, showing which findings survive
2. Try one move on one failure	1 hour	A saved prompt + response pair showing one of the four moves applied
3. Build the minimum vault	1 day	A folder per project with output.md and followups.md per sub-stage

The fourth move (the data/render split with CSVs as source-of-truth) only earns its keep after you've shipped one project through the vault and felt the rebuild cost. Premature CSVs are over-engineering; late CSVs are pure leverage.

If you are early in your career. You don't need transcripts. Use any notes you already have: class notes, a freelance brief, a side-project research interview, even a Reddit thread. Run an AI synthesis on 10-20 paragraphs. Apply the four tests by hand to every finding. Document which survive and why. Save every follow-up question you write and every response you get. That archive is worth more than a polished portfolio piece; it shows you can think against the model, not just with it.

A worked mini-example, from my own notes. I fed an AI 14 paragraphs of scrappy notes from a side-project on freelance designers pricing retainers. It returned six findings. One of them read:

Freelancers consistently underprice retainers because they anchor on hourly rates rather than business value delivered.

Smart-PM test: would a sharp PM read this and ask a hard question? Yes, instantly. "Underprice relative to what benchmark? Anchor compared to what alternative? Consistently across which segment?" None of those answers were in my notes. The notes had two designers who priced low because they were nervous about losing the client, and one who priced high because she'd done it before. That's not "anchoring on hourly rates"; that's a confidence-and-precedent pattern. Move 1 (soften) applied: I rewrote it to "designers without prior retainer experience price defensively, regardless of hourly-rate logic." Sharper, smaller, actually in the data.

Closing coda

The interesting AI work isn't in the prompt; it's in the editorial discipline you wrap around the output. The rubric catches the four common failures, the four moves resolve them honestly, the memory layer prevents the same mistake twice, the convergence test gives you something close to internal validation, and the artefact at the end is something you can defend in a stakeholder room.

That asymmetry is the part that will keep mattering. Prompts are becoming commodity; editorial discipline isn't. The loop is what becomes defensible when the prompt itself is free.

P.S. The open question I'm sitting with: the order-of-application rule (restatement → strategy-slide → evidence-grounding → smart-PM) works because one editor has one ordering. With three editors you get three orderings, and the same finding survives one rubric pass and dies in another. The next post is whether a multi-editor or model-as-second-critic version of this loop can keep the discipline without losing the protocol. If you've ever applied a rubric like this and gotten a different ordering than a colleague, even informally, ping me; I'd love to compare notes.

P.P.S. After this shipped, the loop ran in reverse.

When the editor seat flips (human leads the synthesis, AI applies the rubric) the same four moves work. Tested on a journey-map rollup: candidate stage-level findings synthesised by hand, then a fresh AI run with the same protocol (rubric, then verdicts: CONFIRM / SOFTEN / DROP / DROP+REPLACE / ADD-NEW). The verdicts came back in the same shape the human-editing-AI loop produces: of 47 verdicts across 4 stages, roughly 75% confirms, 20% softer rewordings, one drop, and around ten real gaps surfaced unprompted.

What that suggests: the four moves are direction-agnostic. They are about how synthesis fails (framing overreach, restatement, wrong evidence, unfixable-at-this-level), not about who originated it. Which makes cross-direction convergence a sharper triangulation than cross-run convergence; when AI applied to human picks surfaces the same patterns the human applied to AI output, that's harder to dismiss than two same-direction runs.

The asymmetry that persists: when AI is the editor, you still read its verdicts critically; otherwise you've just shifted the over-trust failure mode one layer up. Practically a two-step loop: human synthesis, AI critique, human re-reads the critique. The third step is what catches AI-editor misses.

Partial answer to the multi-editor question above. Not two humans; one human plus one AI, both running the protocol. The discipline held.

☝Further reading. The closest published cousins to this loop are Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023). Both add a critique-and-revise pass on top of generation; the four moves push further by constraining the revision to four behaviourally distinct moves rather than open-ended self-edit.

Self-Refine: Iterative Refinement with Self-Feedback

Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improv

arXiv.org

Reflexion: Language Agents with Verbal Reinforcement Learning

Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language