The Self-Grading Loop

When the same AI writes both the code and the tests that validate it, the tests do not validate the code. They validate the AI's assumptions about the code. The green check has come unmoored from the thing it was supposed to certify, and the merge gate is still treating it as evidence.

A workflow that engineering teams now run thousands of times a day looks responsible on its face. A developer asks an AI coding agent to add a feature. The agent writes the implementation. The developer, having read about the importance of test coverage, asks the same agent to write the unit tests for what it just produced. The tests run. They pass. Coverage looks good. The pull request shows the satisfying green check, the reviewer glances at the diff, the merge fires. The pipeline did exactly what it was designed to do, and the system shipped a change whose correctness has not been established by anyone or anything. The tests passed because the same intelligence that produced the code also produced the test of the code, and it tested what it believed the code did rather than what the code is supposed to do. The pipeline cannot tell the difference. Neither, increasingly, can the reviewer.

This is the self-grading loop, and it is the quietest load-bearing failure in the current AI-coding workflow. It is not a bug in any specific tool, and it is not solved by a better model. It is a structural property of letting the same agent write the artifact and the test of the artifact, and it turns every test-suite-based merge gate in an AI-heavy codebase into a check that reports only that the agent agrees with itself.

The historical frame

The principle the self-grading loop violates is older than software. In every discipline that depends on verified work, the verifier is structurally separated from the worker, because the entire point of verification is that the verifier does not share the assumptions of the producer. The auditor does not work for the company being audited. The peer reviewer does not co-author the paper they are reviewing. The QA engineer does not pair-program the feature they are testing. These separations exist not because anyone is dishonest, but because shared assumptions are invisible to those who share them, and the most common way a test fails to catch a defect is that the test was constructed from the same mental model as the code, so the test and the defect agree.

Software learned this once already, in the era when developers were criticized for writing tests of their own code that covered only the happy path they had in mind while writing it. The discipline that emerged (code review by someone else, QA functions separate from engineering, adversarial testing) exists precisely to import an outside perspective into the verification step. AI coding agents have, almost overnight, undone that discipline by being so productive that the obvious thing to do is to let them write the code and the tests in the same session. The economics push toward it. The architecture forbids it. Almost no one is yet drawing the line.

What changed: the producer and the verifier merged

For most of software's history, a test was a piece of code written by a human to express a belief about how another piece of code, written possibly by a different human, ought to behave. The two artifacts came from two minds, with two slightly different mental models, and the friction between those models was the mechanism by which defects surfaced. When the test failed unexpectedly, it failed because the author of the test held a different idea about correctness than the author of the code, and the disagreement exposed an assumption that needed resolving.

An AI agent writing both the code and the tests collapses that friction. The agent holds one mental model, its own representation of what the code should do, and emits two artifacts from it. The test is not a check on the code; the test is a restatement of the same assumptions, expressed in a different syntactic frame. When the test passes, what has been demonstrated is that the implementation matches the model the agent used to generate the implementation. Whether that model matches the world, the requirement, or the production environment is a different question, and the passing test does not answer it.

This is not a hypothetical. Practitioners working with AI-generated test suites describe finding 92 percent line coverage, no static-analysis criticals, every test green, and an obvious defect shipped to production, because the tests covered the lines the AI wrote against the behavior the AI thought it was writing, and the actual behavior was different. The coverage metric was honest. The defect was real. Both can be true at the same time because the coverage metric measures execution and the defect concerns intent, and the self-grading loop severs the connection between them.
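A minimal sketch of how coverage and correctness come apart, in Python. The function, the requirement, and the test are all hypothetical, invented here for illustration: assume the requirement says a discount must never push an order total below zero. The implementation forgets the floor, and the test, written from the same mental model, executes every line and asserts only what the implementation already does.

```python
# Hypothetical requirement: a discount must never push the total below zero.

def apply_discount(total: float, discount: float) -> float:
    """Implementation written from the model 'subtract the discount'."""
    return total - discount  # the floor at zero is missing


def test_apply_discount() -> None:
    """Test written from the same model: it restates the subtraction."""
    assert apply_discount(100.0, 20.0) == 80.0
    assert apply_discount(50.0, 50.0) == 0.0


test_apply_discount()                 # passes; every line of apply_discount ran
print(apply_discount(10.0, 25.0))     # prints -15.0, which the requirement forbids
```

The suite is green and line coverage of the function is complete, yet the one input the requirement cares about was never asked.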

The student wrote the exam, then wrote the answer key. The exam passed.

The principle: verification requires external evidence, not self-report

Every governance question this publication has explored eventually reduces to the same structural form: a decision that depends on a record, where the record's reliability depends on it being attributable to a source independent of the thing it certifies. N° 008 examined the gap between SWE-bench scores and what actually merges: the tests said the code worked, the reviewer said the code did not belong. N° 013 examined the gap between climbing capability benchmarks and a flat security pass rate: the benchmarks said the model improved, the security results said it had not. The self-grading loop is the next axis of the same severance, now operating one layer deeper: the gap between a test suite passing and the code being verified, when the test suite was authored by the same intelligence that authored the code.

The fix is not a better test suite, because the problem is not test quality. The problem is provenance. A test suite verifies code only when the test suite originated outside the assumptions of the code, and the modern AI workflow now produces both inside the same assumption set. The structural answer is to introduce an external source of evidence: something that probes the tests themselves, from outside the loop, and asks whether they would actually catch a real defect.

Mutation testing is exactly that probe. The technique is simple in principle and unforgiving in practice: programmatically inject small faults into the code under test (flip a comparison operator, change a constant, invert a boolean), then run the test suite against the mutated code. If the tests still pass, the mutation has survived, which means the tests do not actually exercise the behavior the mutation broke. The test suite is decorative on those lines. It executes them. It does not verify them. The mutation score, the percentage of injected faults the tests catch, is in most pipelines the only number that comes from outside the self-grading loop, and the only one that distinguishes a test suite that validates behavior from a test suite that validates execution.
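The mechanism fits in a few lines. What follows is a toy sketch using Python's standard ast module, not how any particular mutation-testing tool is built: it applies a single mutation operator (turning >= into >) to a hypothetical is_adult function and runs two hypothetical suites against the mutant. Real tools apply many operators and drive the project's actual test runner.

```python
import ast

SOURCE = '''
def is_adult(age):
    return age >= 18
'''


class FlipGtE(ast.NodeTransformer):
    """Mutation operator: replace every `>=` with `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the `is_adult` function it defines."""
    namespace = {}
    exec(compile(ast.fix_missing_locations(tree), "<module>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn) -> bool:
    """Executes the comparison but never probes the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary case: age exactly 18."""
    return weak_suite(fn) and fn(18) is True


original = load(ast.parse(SOURCE))
mutant = load(FlipGtE().visit(ast.parse(SOURCE)))

# Both suites are green on the unmutated code.
assert weak_suite(original) and strong_suite(original)

print("weak suite kills the mutant:  ", not weak_suite(mutant))    # False: it survives
print("strong suite kills the mutant:", not strong_suite(mutant))  # True: it is caught
```

Both suites execute the same line and report the same coverage. Only the mutant tells them apart.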

Run as a hard merge gate, with the score required to exceed a threshold before the pull request can merge, mutation testing is the external evidence the loop has been missing. Run as a passive check that anyone can override, it becomes one more piece of green-check theater that the loop absorbs. The difference is governance, not tooling: whether the score is allowed to be the binding decision, or whether the queue eventually defeats it the way the Faros data shows queues already do.
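What "hard" means can be stated as code. The sketch below is an assumption about how such a gate might be wired, not a description of any CI product: the function names, the 80 percent threshold, and the mutant counts are all hypothetical. The gate returns a process exit code and has no override parameter, and it fails closed when no mutants were generated.

```python
def mutation_score(killed: int, total: int) -> float:
    """Fraction of injected faults the test suite caught."""
    if total == 0:
        # Assumption: no mutants means no evidence, so the gate fails closed.
        return 0.0
    return killed / total


def gate(killed: int, total: int, threshold: float = 0.80) -> int:
    """Return an exit code: 0 lets the merge proceed, 1 blocks it."""
    score = mutation_score(killed, total)
    passed = score >= threshold
    verdict = "pass" if passed else "BLOCK"
    print(f"mutation score {score:.0%} (threshold {threshold:.0%}): {verdict}")
    return 0 if passed else 1


# Hypothetical run: 41 of 60 mutants killed on the changed lines.
exit_code = gate(killed=41, total=60)
# A CI step would hand exit_code to sys.exit so the pipeline stops here.
```

The design choice that matters is the absent parameter: there is no flag by which the author of the change, human or agent, can wave the result through.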

The implications

For engineering teams, the practical instruction is that coverage numbers from an AI-heavy workflow have stopped meaning what coverage numbers used to mean, and any process that gates on coverage alone is gating on a signal the producer is now generating about itself. Differential mutation testing (running mutations only on the lines changed in the current pull request, against the tests the AI just produced) is feasible, fast enough for CI, and the only mechanism in the toolchain that asks the question the green check has stopped asking. The teams that are taking this seriously are placing the mutation score at the merge boundary; the teams that are not are, for the moment, shipping on faith.
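The "differential" part is mostly bookkeeping: work out which lines of the new file the pull request touched, then discard every candidate mutant that falls elsewhere. A rough sketch, assuming a unified diff for a single file; the diff, the file name, and the candidate mutants are hypothetical, and a production tool would also have to handle multi-file diffs and renames.

```python
import re

# Matches a unified-diff hunk header and captures the new-file start line.
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def changed_lines(diff_text: str) -> set:
    """New-file line numbers of lines added by a single-file unified diff."""
    changed = set()
    new_line = None  # stays None until the first hunk header is seen
    for line in diff_text.splitlines():
        match = HUNK.match(line)
        if match:
            new_line = int(match.group(1))
        elif new_line is None:
            continue                      # file headers before the first hunk
        elif line.startswith("+"):
            changed.add(new_line)         # added line: record it and advance
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue                      # removed line or "no newline" marker
        else:
            new_line += 1                 # context line: advance only
    return changed


DIFF = "\n".join([
    "--- a/pricing.py",
    "+++ b/pricing.py",
    "@@ -10,3 +10,5 @@",
    " def apply_discount(total, discount):",
    "     subtotal = total - discount",
    "-    return subtotal",
    "+    if subtotal < 0:",
    "+        return 0.0",
    "+    return subtotal",
])

# Hypothetical candidate mutants, keyed by line number in the new file.
CANDIDATE_MUTANTS = {
    11: "replace '-' with '+'",
    12: "replace '<' with '<='",
    40: "replace 'and' with 'or'",
}

touched = changed_lines(DIFF)
to_run = {line: m for line, m in CANDIDATE_MUTANTS.items() if line in touched}
print(sorted(touched))   # [12, 13, 14]
print(to_run)            # only the mutant on line 12 is left to run
```

Restricting the run this way is what keeps the cost proportional to the size of the change rather than the size of the codebase.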

For the platform and security organizations, the deeper implication is that any merge gate built on signals the AI itself generates is now structurally insufficient. The gate needs at least one input that originated outside the self-grading loop: a mutation score, an adversarial-test outcome, a property-based check, a static analysis from a tool that does not share the agent's mental model of the code. The number of such inputs is a measure of the gate's resilience to the loop. Zero is the failure mode most pipelines are running today and not yet measuring.
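Counting those inputs is a small exercise once each signal carries a record of where it came from. The sketch below is one hypothetical way to model it; the Signal type, its field names, and the example signals are assumptions, and deciding what counts as external remains a human judgment that the code only records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    passed: bool
    external: bool  # True if the producer does not share the agent's model of the code


def external_inputs(signals) -> int:
    """How many gate inputs originated outside the self-grading loop."""
    return sum(1 for s in signals if s.external)


def gate_allows_merge(signals, min_external: int = 1) -> bool:
    """Every signal must pass, and at least `min_external` must be external."""
    return all(s.passed for s in signals) and external_inputs(signals) >= min_external


# Hypothetical pipeline as commonly run: everything green, nothing external.
today = [
    Signal("unit tests (agent-written)", passed=True, external=False),
    Signal("line coverage", passed=True, external=False),
]

# The same pipeline with one input from outside the loop.
hardened = today + [Signal("mutation score", passed=True, external=True)]

print(external_inputs(today), gate_allows_merge(today))        # 0 False
print(external_inputs(hardened), gate_allows_merge(hardened))  # 1 True
```

The first pipeline is all green and still blocked, which is the point: the gate counts independent evidence, not green checks.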

For builders of governance infrastructure, the self-grading loop is the cleanest possible argument for why the merge decision must remain a deliberate human act with externally-attested signals feeding into it, rather than an automated pass-through of green checks. An AI deciding to merge on the basis of tests another AI wrote is the loop closed entirely, and the failure mode is no longer a slow accumulation of subtle defects. It is a fast accumulation, at machine velocity, all certified by green checks that mean nothing because their source is the same as the source of the code they certify.

The closing observation

Verification has never been a property of an artifact alone. It has always been a property of where the artifact comes from, and from whom. The discipline of separating the worker from the verifier predates software by centuries, and it survived in software for the entire era when humans wrote both code and tests because two humans produced the two artifacts. The AI workflow collapsed that separation in the space of two years, and the merge gate has not yet adjusted to a world in which the test of the code shares the lineage of the code itself. Mutation testing is one fix, and the most accessible one, but the principle it embodies is the larger point. The tests have to come from outside the assumptions of the implementation, or they are not tests. They are a second draft of the same mind, agreeing with itself.

A test written by the author of the code is a second draft of the same mind, agreeing with itself. Verification requires evidence from outside the loop.

Addendum, late May 2026: the antipattern in plain sight, and why blast radius forces the layer above

The clearest field example of the self-grading loop in production code is not exotic. It is React.

When an agent is asked to write a feature that fetches data, the overwhelming default is a raw useEffect block: a manual lifecycle hook that issues the request, manages the loading state, handles the error path, and returns the cleanup function. The code compiles, runs, and produces data on the screen. It is also, by 2026, a pattern the React community itself has moved away from in favor of server-state libraries (TanStack Query, RTK Query, SWR) for reasons the agent's code does not surface: silent network leaks on unmount, unhandled retry semantics, race conditions between concurrent fetches, double-fetches under Strict Mode, no shared cache across components requesting the same data. Models trained on large volumes of existing code tend to reproduce its most familiar shapes, whether or not the community has since moved on.

The companion test the agent writes for this code does not catch any of the deferred failure modes. It cannot. The test is grading the same assumption set the implementation expressed: that a successful response with the expected shape is correct behavior. The mind that wrote the code and the mind that wrote the test agree, in the same words, about the same boundary. Both are blind to the same things, in the same way, by construction. A green suite reports success while a memory leak compounds in production, a duplicate request fires on every render, and the next agent to touch the file inherits the antipattern as the local convention.
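The shape of that blindness can be reproduced outside React. The sketch below is an analogue in Python's asyncio, not React code: SearchBox stands in for a component that stores whichever response arrives, and fake_fetch is a hypothetical backend in which the earlier request happens to be the slower one. The happy-path test passes; the race between concurrent fetches appears only when two requests overlap, which the test never arranges.

```python
import asyncio


class SearchBox:
    """Stand-in for a component that stores whichever response arrives."""

    def __init__(self, fetch):
        self.fetch = fetch
        self.results = None

    async def search(self, query: str) -> None:
        # No guard against a stale response overwriting a newer one.
        self.results = await self.fetch(query)


async def fake_fetch(query: str) -> str:
    # Hypothetical backend: the short query is slower than the refined one.
    await asyncio.sleep(0.05 if query == "re" else 0.01)
    return f"results for {query!r}"


async def agent_written_test() -> bool:
    """Happy path only: one request, the expected shape comes back."""
    box = SearchBox(fake_fetch)
    await box.search("react")
    return box.results == "results for 'react'"


async def overlapping_requests() -> str:
    """The user types 're', then 'react', before the first response lands."""
    box = SearchBox(fake_fetch)
    await asyncio.gather(box.search("re"), box.search("react"))
    return box.results


print(asyncio.run(agent_written_test()))     # True: the suite is green
print(asyncio.run(overlapping_requests()))   # the stale 're' response, which arrived last
```

Nothing in the passing test is wrong. It simply never leaves the scenario the implementation was written for.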

This is what the loop looks like at ground level. A common library choice, made by a competent-sounding agent, validated by tests the same agent wrote against assumptions both share. Mutation testing, the structural answer the body of this essay argues for, externalizes the verifier at the unit-test boundary. It remains the right move. It is not the whole answer.

Mutation testing scores the test suite. It does not score the change. The unit-level verifier, once externalized, answers the local question well: do the tests catch behavioral mutations of this code. It does not answer the systemic question: what does this change touch, and what happens to the rest of the system if it breaks. That second question is blast radius, and it cannot be scored from inside the diff. The agent that wrote the change cannot see what it cannot see. By definition, blast radius is a property of the surrounding system, not of the code itself. The verifier that can score it has to live outside the loop and above the diff, with visibility into the services, dependencies, runtime behavior, and compliance surface the change is about to enter.
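One narrow slice of blast radius, the static dependency part, can be sketched as a graph walk. The dependency map below is hypothetical, and the sketch assumes such a map exists and is accurate, which is itself the hard part. It says nothing about runtime behavior or compliance surface, which need other sources of evidence.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the modules in its list.
DEPENDS_ON = {
    "checkout-ui": ["pricing", "auth"],
    "invoicing": ["pricing"],
    "reporting": ["invoicing"],
    "pricing": ["currency"],
    "auth": [],
    "currency": [],
}


def blast_radius(changed: str, depends_on: dict) -> set:
    """Everything that depends on `changed`, directly or transitively."""
    # Invert the map: for each module, who depends on it?
    dependents = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk up the inverted graph.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(blast_radius("pricing", DEPENDS_ON)))
# ['checkout-ui', 'invoicing', 'reporting']
```

None of the three names in that output appears anywhere in a diff to the pricing module, which is why the question cannot be answered from inside the diff.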

The unit-test loop and the change-impact loop are two separate verification surfaces, and the self-grading collapse is two separate failures wearing the same uniform. Mutation testing closes the first. A layer above the merge gate, reading the diff against a model of the system the diff is about to touch, closes the second. The publication will take that second loop up directly in a forthcoming essay. The point of this addendum is to mark that the useEffect case is the everyday illustration of both failures at once: the test agrees with the code, and neither the code nor the test sees the system the change is being shipped into.

Externalizing the verifier closes one loop. The change still needs to be read against the system it is about to enter.

End N° 015