Harness Engineering Is Having Its Moment (Thanks to the Claude Leak)

Author: PRODUCTBOARD
6th April 2026 | AI Product Management, Spark

Every so often, an industry collectively realizes it's been thinking about something the wrong way. For AI products, that moment arrived on March 31, 2026, courtesy of an Anthropic engineer who had a very bad day at work.

On that day, Anthropic accidentally shipped the full source code for Claude Code, roughly 500,000 lines of TypeScript across 1,900 files, inside a routine npm update. Within hours, the repo had 50,000 stars. Engineers cancelled meetings. Slack channels lit up. Someone's manager sent a message that just said "have you seen this." The accidental leak became one of the most-studied codebases in recent memory, and the topic everyone suddenly couldn't stop talking about was harness engineering.

Most of the coverage since has focused on what engineers can learn from it. But if you're a product manager or product leader, there's something more directly useful in this story: a clear picture of what separates AI products that actually work from ones that just seem like they should, and why that gap is increasingly your problem to understand.

Harness engineering is the most important thing in AI right now

Birgitta Böckeler at Thoughtworks put a name to it: Agent = Model + Harness. The harness is everything in an AI agent that isn't the model itself — the guides that steer the agent before it acts, and the sensors that catch problems after. Get those right, and the agent operates reliably. Skip them, and you're essentially hoping the model figures it out on its own.

Böckeler breaks harness engineering into three categories. First, guides: feedforward controls that steer the agent before it acts, like coding conventions, structured prompts, and bootstrap instructions. Second, sensors: feedback controls that catch problems after the agent acts, like linters, type checkers, and test suites tuned for LLM output. Third, a distinction between computational controls (deterministic, fast, cheap) and inferential ones (AI-powered semantic review, slower, more expensive, worth it when you need it).

One thing her research makes clear: unlike human developers, agents genuinely don't mind being micromanaged. More constraints, more checks, and more structure tend to make them perform better, not worse. The instinct to keep things lean and minimal, sensible for human teams, actively works against you here.

For a while, harness engineering was a topic discussed mostly by people who'd already been burned by not doing it. Then Claude Code got leaked, and suddenly everyone was paying attention.

This matters to product managers for a specific reason. The decisions that determine whether an AI product is reliable, accurate, and actually trustworthy aren't purely engineering decisions. They're about what domain knowledge gets encoded into the system, what "correct" looks like for your users, and how much autonomy agents should have at each step. Those are product decisions. Right now, at most companies, engineers are making them by default because PMs haven't been close enough to the infrastructure to weigh in. The Claude Code leak is a good reason to get closer.

People absolutely lost their minds

And honestly, fair enough.

The Claude Code codebase revealed what serious harness engineering looks like in practice. About 40 permission-gated tools covering file operations, bash execution, web fetching, and LSP integration. A 46,000-line query engine handling LLM API calls, token caching, context management, and retry logic. A three-layer memory architecture designed explicitly to fight "context entropy," the phenomenon where agents gradually lose the thread of what they're doing as context windows fill up.

Claude itself is available to anyone with an API key. The model wasn't the revelation. The scaffolding around it, the structural decisions that let the agent operate reliably in production at scale, was what people were actually studying. The engineering community reaction shifted almost immediately from "this is an embarrassing security incident" to "wait, can we talk about this memory architecture for a second?"

The community also discovered 44 unreleased feature flags for features that were built but not yet shipped, and an internal note about 250,000 API calls being wasted daily, apparently fixed with three lines of code. Which is either very reassuring or very relatable, depending on your current relationship with your own codebase.

Claude Code-level harnesses are now table stakes

Anyone quietly proud of their agent infrastructure will find this part awkward.

Everything that made Claude Code's harness impressive, the memory management patterns, the tool permission model, the context window management strategies, is now public curriculum. Half the engineering internet has read it. The other half is being sent links by colleagues with "you need to see this" messages.

Building a harness this good is still genuinely difficult work. But building a generic harness this good is no longer a differentiator. Those patterns will be replicated across the industry within months. Some teams have probably already started. The bar for "acceptable agent infrastructure" just got raised, and it got raised for everyone simultaneously.

Our engineering team at Productboard saw this coming. Just last week, Michael Van Elk published a detailed account of migrating our entire agentic infrastructure to Pydantic AI, driven by exactly the harness engineering concerns the Claude Code leak has now made unavoidable for everyone else: modular composition, full type safety, agent flexibility across contexts, and a firm rule against parallel implementations. 

If you haven't had the harness infrastructure conversation with your engineering team yet, you probably should.

Tailoring the harness to your industry is the hard part

The Claude Code harness is impressive, but it's built for a domain with unusually clean feedback signals. Code either compiles or it doesn't. Tests pass or they fail. A linter either approves the output or flags a violation. Those are deterministic checks you can run automatically on every change.
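A deterministic sensor like that can be a few lines. This sketch (the function name and the "does it parse" rule are illustrative assumptions, not anything from Claude Code) shows why such checks are cheap enough to run on every change:

```python
def deterministic_sensor(source: str) -> bool:
    """Computational control: fast, cheap, and deterministic.
    Passes only if the agent's output parses as valid Python."""
    try:
        compile(source, "<agent-output>", "exec")
        return True
    except SyntaxError:
        return False
```

There is no judgment call anywhere in that function, which is exactly what makes software an unusually friendly domain for sensors.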

Most industries don't have that luxury. The moment you move outside software development, the harness has to start encoding something harder: what "correct" actually means in your domain. That's not a technical problem but a knowledge problem.

A legal research agent needs to understand which jurisdictions are relevant, how to weight conflicting precedents, and when a citation is solid enough to include in an argument versus when it needs flagging for human review. A financial analysis agent needs to know which signals matter for a given type of decision and which are noise. A healthcare agent needs to understand clinical context that no generic model is going to have out of the box. In each case, the gap between "the agent produced something plausible" and "the agent produced something accurate and trustworthy" is filled by domain knowledge encoded into the harness.
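To make the contrast concrete, here is what one tiny piece of a domain-encoded sensor might look like for the legal example. The citation pattern, the jurisdiction rule, and every name here are hypothetical; the point is that the check only works because someone encoded a domain judgment ("citations outside these circuits need human review") into the harness:

```python
import re

# Hypothetical domain rule: only these jurisdictions are trusted
# without review for this matter. This set IS the domain knowledge.
ALLOWED_JURISDICTIONS = {"9th Cir.", "2d Cir."}

def citation_review_flags(draft: str) -> list[str]:
    """Flag case citations from jurisdictions outside the allowed set.
    Matches a simplified federal reporter pattern, e.g. '123 F.3d 456 (9th Cir. 1998)'."""
    flags = []
    for match in re.finditer(r"\d+ F\.\d+d \d+ \(([^)]+) \d{4}\)", draft):
        court = match.group(1)
        if court not in ALLOWED_JURISDICTIONS:
            flags.append(f"Citation from {court!r} needs human review: {match.group(0)}")
    return flags
```

A generic linter can verify the citation's format; only a domain-tuned sensor can say whether it belongs in the argument at all.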

Böckeler's framework describes this as the behavior harness: the layer concerned with whether the agent actually does the right thing in context, not just whether it runs without errors. She identifies it as the least mature of her three categories, and the hardest to build, precisely because it can't be solved with off-the-shelf tooling. You can't download a linter for domain expertise.

This is why Claude Code's harness being public doesn't flatten the competitive landscape as much as it might initially seem. The structural patterns are replicable. The domain knowledge isn't. Generic harness engineering gets you reliability. Domain-specific harness engineering gets you accuracy in context. For most industries, that second part is where the real work begins.

Why we're building Spark the way we are

Everything in this post is part of why Spark exists and how we're approaching it.

Product management is a domain with almost no deterministic feedback signals. There's no compiler, no passing test suite, no linter output that tells you whether a customer insight is well-supported or whether a feature hypothesis is worth acting on. That puts it squarely in the behavior harness problem—the hardest category, the least mature tooling, and the one where generic infrastructure gives you the least help.

So when we set out to build an agentic layer for product management, the question we kept coming back to was: what does the harness need to know? Not just how to run reliably, but what product knowledge has to be embedded in the system itself for the output to actually be trustworthy: how to read conflicting customer signals, what makes a theme substantive versus superficial, and when something warrants a PM's attention versus when it's noise.

That's the problem the Pydantic AI infrastructure migration described in Michael Van Elk and Tomislav Peharda's engineering post was designed to support. Not just a more reliable agent, but one flexible and composable enough to carry real domain knowledge without it becoming a maintenance burden. Spark is what we're building on top of that foundation: an agentic tool for product management where the harness does the knowledge-intensive groundwork, and your judgment gets to do something useful with what comes out.

Three conversations product teams should be having right now

If you're a PM or product leader working on an AI product, the Claude Code leak should prompt three conversations you might not have had yet. 

First: is your engineering team investing in harness infrastructure, or treating it as an afterthought? 

Second: when the generic patterns are replicated everywhere, what domain-specific knowledge will your harness have that others don't? 

Third: who in your organization is making the product decisions about what "correct" means for your agents? Those are product decisions, not just engineering ones.

The harness engineering movement didn't start with the Claude Code leak. But one very unfortunate npm update made it impossible to ignore.

Productboard's Spark is built on a domain-tailored agentic architecture designed for product management. Try it and see what agentic product work looks like in practice.
