The Model Is Only 10%: The Real Lesson of the New SDLC

TL;DR

Google’s May 2026 whitepaper, The New SDLC With Vibe Coding, argues that AI software development is moving from writing code to directing and verifying machine-generated work. The paper claims the model is only a small part of agent performance, while tests, tools, context, sandboxes and oversight carry much of the production risk.

Google’s May 2026 whitepaper The New SDLC With Vibe Coding argues that the main change in software development is a shift from hand-writing code to directing AI systems, with verification, tests and engineering controls now carrying much of the risk for teams using AI-generated code.

The paper, written by Addy Osmani, Shubham Saboo and Sokratis Kartakis, says AI coding agents are now common in professional software work. According to the paper, 85% of professional developers regularly use AI coding agents, 51% use them daily, and about 41% of all new code is AI-generated.

The authors frame AI coding as a spectrum. At one end is casual “vibe coding,” where developers prompt an AI tool, test lightly and accept output that appears to work. At the other end is “agentic engineering,” where AI systems work inside formal specifications, automated tests, evals, CI gates, sandboxes and human architectural review.

The paper’s central claim is that a working coding agent should be judged as the model plus its harness. In the paper’s rough framing, the model accounts for about 10% of the behavior, while the harness — prompts, tools, context rules, hooks, sandboxes, sub-agents and observability — accounts for the rest. The source summary cites a Terminal Bench 2.0 example in which an agent moved from outside the top 30 to the top five by changing only the harness while keeping the same model, and a LangChain experiment that reportedly raised an agent score by 13.7 points through changes to prompts, tools and middleware.
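As a rough illustration of the model-plus-harness framing, the Python sketch below holds a stand-in model fixed and changes only the harness around it. Everything in it is hypothetical: `Harness`, `run_agent`, `stub_model` and the tool name `run_tests` are invented for this example and do not come from the paper or from any real agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: it can only use tools the prompt exposes."""
    if "tool:run_tests" in prompt:
        return "patch + ran tests"
    return "patch (untested)"


@dataclass
class Harness:
    """Illustrative harness: the parts of the agent that are not the model."""
    system_prompt: str
    tools: list[str] = field(default_factory=list)
    context_rules: list[str] = field(default_factory=list)

    def build_prompt(self, task: str) -> str:
        tool_lines = [f"tool:{name}" for name in self.tools]
        return "\n".join([self.system_prompt, *self.context_rules, *tool_lines, task])


def run_agent(model: Callable[[str], str], harness: Harness, task: str) -> str:
    # The model is unchanged between runs; only the harness differs.
    return model(harness.build_prompt(task))


bare = Harness(system_prompt="Fix the bug.")
equipped = Harness(
    system_prompt="Fix the bug. Verify before answering.",
    tools=["run_tests"],
    context_rules=["Only touch files under src/."],
)

task = "Fix the off-by-one error in pagination."
bare_result = run_agent(stub_model, bare, task)
equipped_result = run_agent(stub_model, equipped, task)
print(bare_result)
print(equipped_result)
```

The stub exaggerates the effect on purpose. A real model would not switch behavior on a single string, but the shape is the same: the same callable produces different work depending on what the harness gives it.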

AI Dispatch · Field Notes
Google · Osmani, Saboo & Kartakis · May 2026

The model is only 10%

A Google whitepaper argues software’s biggest shift is from writing code to expressing intent. Its sharpest claim: the model you obsess over is the smallest part of the system — the scaffolding around it does the real work.

A spectrum, not a binary: the differentiator is how outputs get verified

Vibe Coding: casual prompts · “does it seem to work?” · disposable code · high risk
Structured AI-Assisted: detailed prompts + constraints · manual testing · features in real codebases
Agentic Engineering: formal specs · automated tests + evals + CI gates · production scale · low risk

Tests verify the deterministic; evals verify the rest. Without both, it’s vibe coding, however clever the prompt.

The idea worth building your strategy around: Agent = Model + Harness

Model: ~10%
Harness: ~90% (prompts · tools · context · hooks · sandboxes · observability)
The ~90% is your surface area, not the provider’s.

Outside Top 30 → Top 5 on Terminal Bench 2.0 by changing only the harness, with the same model.
“Most agent failures, examined honestly, are configuration failures”: a missing tool, a vague rule, a noisy context.

The economics: it’s a token-cost problem (CapEx vs OpEx)

Vibe Coding: Low CapEx · High OpEx
Looks free, hides debt: token burn (fix-it loops), maintenance tax (AI spaghetti), security remediation. Crosses over to 3–10× more per feature.

Agentic Engineering: High CapEx · Low OpEx
Pay upfront (specs, evals, context), then ship cheaply. Levers: context engineering for first-pass success + intelligent model routing (cheap models for the easy work).

85% of devs use AI coding agents (51% daily)
41% of all new code is AI-generated
~90% of agent behavior is the harness, not the model
+19% longer on some tasks (METR): verification is the cost
The read

The clearest map yet of how serious AI development works — and mostly tool-agnostic. But it’s a Google funnel: the concepts are neutral, the on-ramps point to Gemini, Jules & the ADK. If the harness is 90% and it’s yours, your moat and your costs both live there — so own your scaffolding, route across models, and remember: AI amplifies whatever engineering culture it lands in.

Source: Osmani, Saboo & Kartakis, “The New SDLC With Vibe Coding,” Google (May 2026). Figures are the paper’s own, incl. METR & LangChain. Analysis is the author’s.
thorstenmeyerai.com

Engineering Discipline Becomes the Product

The argument matters because many software teams have treated AI coding gains as a model-selection problem: pick the newest or strongest system and wait for better output. Google’s paper says that framing misses where much of the production risk sits.

If the paper’s view is right, the cost of AI coding will depend less on a single subscription or model benchmark and more on the internal systems teams build around agents. That includes context management, test coverage, eval design, security checks, model routing and review culture. The source analysis describes the trade-off as high upfront engineering cost for “agentic engineering,” compared with lower initial cost but higher operational debt for casual AI coding.
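The trade-off can be made concrete with a small break-even calculation. The Python sketch below assumes a simple linear cost model (total cost = upfront cost + per-feature cost × number of features). The model and every figure in it are illustrative assumptions, not numbers from the paper; the 5× gap in per-feature cost was picked only to sit inside the 3–10× range the source analysis mentions.

```python
import math
from typing import Optional


def total_cost(upfront: float, per_feature: float, features: int) -> float:
    """Linear cost model: a one-time investment plus a recurring cost per feature."""
    return upfront + per_feature * features


def break_even_features(
    vibe_upfront: float,
    vibe_per_feature: float,
    agentic_upfront: float,
    agentic_per_feature: float,
) -> Optional[int]:
    """Smallest feature count at which the agentic setup costs no more than vibe coding.

    Returns None when the agentic setup never catches up.
    """
    saving_per_feature = vibe_per_feature - agentic_per_feature
    if saving_per_feature <= 0:
        return None
    extra_upfront = agentic_upfront - vibe_upfront
    return max(0, math.ceil(extra_upfront / saving_per_feature))


# Hypothetical cost units, chosen only for illustration.
VIBE = {"upfront": 0, "per_feature": 500}        # low CapEx, high OpEx
AGENTIC = {"upfront": 20_000, "per_feature": 100}  # high CapEx, low OpEx

break_even = break_even_features(
    VIBE["upfront"], VIBE["per_feature"],
    AGENTIC["upfront"], AGENTIC["per_feature"],
)
print(break_even)
```

Under these assumed figures the two approaches cost the same at 50 features, and the agentic setup is cheaper beyond that. Real costs are unlikely to be linear, so the sketch shows the shape of the argument, not a forecast.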

The point is also relevant for security and maintainability. AI-generated code can move quickly into real codebases, but the paper argues that deterministic tests and non-deterministic evals are both needed to judge whether an agent’s work is correct, safe and aligned with project goals. Without that structure, the paper says teams may be using a more polished form of vibe coding even when prompts are detailed.
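A minimal Python sketch of that distinction follows. A deterministic test compares one input against one exact expected output. An eval samples a variable output several times, scores each sample against a rubric and compares the average to a threshold. The functions `slugify`, `stub_agent_summary`, `rubric_score` and `run_eval`, along with the keyword rubric and the threshold, are hypothetical stand-ins and not anything the paper specifies.

```python
import random


def slugify(title: str) -> str:
    """Deterministic code under test (a stand-in for agent-written code)."""
    return "-".join(title.lower().split())


def test_slugify() -> bool:
    # Deterministic check: a fixed input must give one exact output.
    return slugify("Hello  New SDLC") == "hello-new-sdlc"


def stub_agent_summary(rng: random.Random) -> str:
    """Hypothetical non-deterministic agent output: wording varies between runs."""
    return rng.choice([
        "Adds pagination and tests for the orders endpoint.",
        "Pagination added to orders endpoint, with tests.",
        "Refactors stuff.",
    ])


def rubric_score(summary: str) -> float:
    """Toy rubric: the fraction of required topics the summary mentions."""
    required = ("pagination", "tests")
    hits = sum(term in summary.lower() for term in required)
    return hits / len(required)


def run_eval(samples: int, threshold: float, seed: int = 0) -> tuple[float, bool]:
    """Score many samples and pass only if the mean score clears the threshold."""
    rng = random.Random(seed)
    scores = [rubric_score(stub_agent_summary(rng)) for _ in range(samples)]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold


print("test passed:", test_slugify())
mean_score, eval_passed = run_eval(samples=50, threshold=0.5)
print("eval passed:", eval_passed)
```

A keyword rubric is the crudest possible scorer. Teams often use graded rubrics or model-based judges instead, but the structure stays the same: sample, score, compare to a threshold.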

How Vibe Coding Became a Spectrum

The term “vibe coding” came from Andrej Karpathy in February 2025 and was used to describe a loose style of AI-assisted programming in which users accept model output, paste errors back into the system and iterate until something works. The Google paper says the term has since been stretched too broadly, covering everything from prototypes to production engineering.

Google’s reframing separates quick experimental use from controlled production use. The paper treats casual prompting as acceptable for disposable scripts, prototypes and early exploration, but it draws a sharper line around high-stakes systems such as payments, infrastructure and security-sensitive code.

The source analysis also says the paper is mostly tool-agnostic in its concepts while still steering readers toward Google’s own ecosystem, including Gemini, Jules and the Agent Development Kit. That makes the paper both a technical argument and a vendor document; its claims about workflows can be assessed separately from its product direction.

“generation is solved; verification, judgment, and direction are the new craft”

— Osmani, Saboo and Kartakis, in the Google whitepaper

Adoption Data Needs Outside Scrutiny

The supplied material does not describe the sampling method behind the 85%, 51% and 41% adoption figures, so readers should treat those numbers as claims reported by the paper rather than independently verified market measurements.

It is also unclear how broadly the benchmark examples apply across companies, languages, codebases and team maturity levels. A harness that improves one agent on Terminal Bench 2.0 or in a LangChain experiment may not produce the same result in a large legacy system or a regulated engineering environment.

The cost picture is also unsettled. The source summary cites METR data showing some tasks took 19% longer with AI assistance, which points to verification as a real cost. The balance between time saved, token spend, review load and downstream maintenance will vary by team and project.

Teams Face Build-or-Buy Choices

The next step for engineering leaders is likely to be a closer audit of the systems around AI coding agents: which tools agents can use, what context they receive, how their output is tested, which evals measure quality, and where humans must approve changes before release.
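One way to picture such an audit is as an explicit release gate. The Python sketch below is a hypothetical policy check, not a mechanism described in the paper: `ChangeReport`, `gate`, the 0.8 eval threshold and the sensitive path patterns are all assumptions made for illustration. The sensitive areas loosely mirror the payments, infrastructure and security-sensitive code the paper singles out.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class ChangeReport:
    """What the pipeline knows about one agent-produced change."""
    changed_files: list[str]
    tests_passed: bool
    eval_score: float      # 0.0 to 1.0, from whatever eval suite the team runs
    human_approved: bool


# Hypothetical policy values; a real team would set its own.
MIN_EVAL_SCORE = 0.8
SENSITIVE_PATTERNS = ["payments/*", "infra/*", "auth/*"]


def gate(report: ChangeReport) -> list[str]:
    """Return the reasons a change is blocked; an empty list means it may ship."""
    reasons = []
    if not report.tests_passed:
        reasons.append("tests failing")
    if report.eval_score < MIN_EVAL_SCORE:
        reasons.append("eval score below threshold")
    touches_sensitive = any(
        fnmatch(path, pattern)
        for path in report.changed_files
        for pattern in SENSITIVE_PATTERNS
    )
    if touches_sensitive and not report.human_approved:
        reasons.append("human approval required for sensitive paths")
    return reasons


docs_change = ChangeReport(["docs/readme.md"], True, 0.9, False)
payments_change = ChangeReport(["payments/refund.py"], True, 0.9, False)
print(gate(docs_change))
print(gate(payments_change))
```

Returning a list of reasons, not a single boolean, keeps the gate auditable: the same output can feed a CI log, a review comment or a dashboard.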

The paper also points to a coming choice between owning the harness internally and relying on vendor-provided agent platforms. If the harness carries most of the behavior, it may become a source of cost control, reliability and competitive advantage. The open question is how much of that layer teams can standardize without locking themselves into a single model provider.

Key Questions

What is the actual news in this story?

Google published a May 2026 whitepaper, The New SDLC With Vibe Coding, arguing that AI-assisted software development now depends more on verification systems and agent harnesses than on the model alone.

What does “the model is only 10%” mean?

It is the paper’s rough framing, not a verified universal measurement. The idea is that prompts, tools, context, sandboxes, tests, evals and observability can shape most of an agent’s real-world behavior.

Does the paper say vibe coding is always bad?

No. The paper treats casual AI coding as useful for prototypes, experiments and disposable scripts. It argues that production systems need stronger controls, including tests, evals, CI gates and human oversight.

What is confirmed and what is claimed?

Confirmed from the supplied material: the paper was published by Google in May 2026 and makes the model-plus-harness argument. The adoption statistics, benchmark gains and cost framing are claims reported by the paper and source summary.

Why should developers and CTOs care?

The paper suggests that AI coding success may depend on engineering process, not only model upgrades. That affects budgets, security review, testing strategy and how teams decide whether to build or buy agent platforms.

Source: Thorsten Meyer AI
