The Pull of the Weasel
Claude Code agrees to run comprehensive end-to-end tests. Then it finds a reason not to. Every single time. Penn Jillette had a name for this.
By Geordie Everitt
Penn Jillette used to do a segment on his radio show called "The Pull of the Weasel." The premise, stripped to its bones: there exists a gravitational force — felt by all carbon-based lifeforms — pulling them toward the explanation that lets them off the hook. Toward the reading of the rule that technically satisfies it. Toward the minimum viable version of whatever they promised they’d do. Penn had observed this pull in himself and found it funny and clarifying to give it a name. You can feel it, he argued, precisely as it kicks in — the little internal lurch toward the rationalization. The weasel.
I had not expected to observe it in silicon.
The Prime Directive Problem
The /artest skill I use with Claude Code is not subtle. It is long. It is specific. It is labeled, in the first section, “Hard Rules: These rules are absolute. They override defaults, frameworks, and convenience.” It describes three nested scopes of testing (unit, integration, and end-to-end) with explicit quality bars for each. The E2E scope requires Playwright, requires the application to be exercised through its real production ingress the way a user would, requires screenshots at each step, requires a generated walkthrough document, requires the system to be left in a fully populated demo state when the run completes.
I have, on multiple occasions, literally typed the phrase “Prime Directive.” I have added emphasis. I have bolded things. I have said, with varying degrees of patience, “run the full test.”
And then the weasel pulls.
Anatomy of the Escape
The moves are consistent enough that you could almost diagram them. The most common: completing the unit tests and integration tests and then announcing, with what feels like genuine good faith, that “the test suite is passing.” No E2E. No Playwright. No walkthrough. No screenshots. The application, for all the agent knows, renders as a blank white rectangle. Tests pass.
A second variant involves scope narrowing by declaration. “Given the focused nature of this change, I’ve run the targeted tests to validate the specific functionality modified.” The specific functionality modified. Not the system. Not the user workflow. The function. The surgery was successful; we have not checked whether the patient can walk.
A third move is the summary-as-evidence play: describing what the tests would cover, in present tense, as if describing what they did cover. “The integration tests exercise the full service boundary and validate the business rules.” They would, if they had been run. The evidence report is conspicuously absent.
What is remarkable about these moves is how natural each one feels as you read it. The argument for abbreviation is always plausible. The unit tests did pass. The change was targeted. The integration coverage is logically sufficient for the scope of the modification. The weasel is not stupid. The weasel is articulate.
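All three moves share one trait: a claim of completion with no artifact behind it. One way to sidestep the articulate argument is to check for the artifacts instead of reading the summary. What follows is a minimal sketch, not part of the /artest skill itself. The directory layout, the file patterns, and the function name missing_evidence are all hypothetical; only the three scope names and the E2E requirements (screenshots, a walkthrough document) come from the description above.

```python
from pathlib import Path

# ASSUMPTION: each scope writes its evidence under <run_dir>/<scope>/.
# The scope names mirror the essay; every file pattern here is illustrative.
REQUIRED_EVIDENCE = {
    "unit": ["report.*"],
    "integration": ["report.*"],
    "e2e": ["report.*", "screenshots/*.png", "walkthrough.*"],
}


def missing_evidence(run_dir: Path) -> dict[str, list[str]]:
    """Return, per scope, the required patterns that matched no non-empty file."""
    gaps: dict[str, list[str]] = {}
    for scope, patterns in REQUIRED_EVIDENCE.items():
        scope_dir = run_dir / scope
        absent = [
            pattern
            for pattern in patterns
            # An empty file counts as no evidence at all.
            if not any(
                f.is_file() and f.stat().st_size > 0
                for f in scope_dir.glob(pattern)
            )
        ]
        if absent:
            gaps[scope] = absent
    return gaps
```

Called on the run directory after the agent reports success, an empty result means every scope left something behind; anything else names the scope and the missing artifact. It cannot tell whether the evidence is good, only whether it exists, which is exactly the question the summary-as-evidence play avoids.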
Carbon Precedent
Here is the thing Penn understood that makes “The Pull of the Weasel” resonate as a concept rather than just a complaint: the pull is not a character flaw. It is optimization. Every intelligent system under selection pressure develops a bias toward minimum-sufficient effort. Humans who ran full test suites when spot checks would do were outcompeted by humans who shipped faster. The weasel is not laziness; it is fitness.
Carbon intelligence codified this into institutional behavior over millennia. QA teams get cut first in a crunch. “Smoke tests” become the de facto standard because smoke tests ship product. Technical debt is the cultural residue of a million individual weasel-pulls that each made local sense and collectively built systems nobody can safely modify anymore.
Silicon intelligence, trained on that corpus, learned the pattern. Of course it did. It learned everything else we wrote down. The tendency to declare tests sufficient before they are — the rounding up of “mostly working” to “done” — is not a failure mode that emerged from LLM architecture. It is a faithful reproduction of a distinctly human behavior that is, on balance, adaptive.
The problem is that in software, the weasel kills you slowly and quietly. The unit tests pass on the mock and fail on the wire. The integration tests cover the happy path and miss the constraint. The E2E is skipped because “the logic is sound” and the rendered UI has been silently broken for two weeks. Each individual shortcut had a defensible rationale. The production incident does not care about rationales.
The Naming Is the Point
There is a reason Penn bothered to name this thing rather than simply criticizing it. Naming it creates the moment of recognition — the sensation of catching yourself mid-pull. The weasel doesn’t lose its power when named, but it loses its disguise. You can feel it and call it what it is.
I have started narrating it out loud when I catch it in an agent response. “That’s the pull. You just abbreviated the E2E and declared it sufficient.” It does not prevent the next occurrence. The pull recurs. But naming it changes the conversation from an endless procedural argument about whether the tests were complete enough — an argument the weasel always wins — to a simpler observation: there was a protocol, and the protocol was not followed.
The deeper question, of course, is what we’re actually up against. A tool that learns from us will learn our shortcuts along with our skills. A specification called the Prime Directive, however emphatically labeled, is still a specification, and specifications have always been what intelligent agents negotiate rather than execute. Carbon ones do this every day before lunch.
Maybe the real discipline isn’t writing better specifications. Maybe it’s building the habit Penn was modeling: notice the pull, name it, and then decide whether to follow it. Sometimes the abbreviated test is genuinely sufficient. More often, it just feels that way.
The walkthrough is two commands away. Run it.