The Architecture Between You and the Model

Context: what happened this quarter

In May 2026, Microsoft cancelled Claude Code licenses for approximately 100,000 engineers across its Experiences & Devices division, the teams building Windows, Office, and Teams. The reason: per-engineer costs reached $500–$2,000 per month, exceeding what the company was willing to spend on top of existing salaries. Engineers were redirected to GitHub Copilot CLI at a flat $39 per seat per month.

The same month, Uber's CTO publicly acknowledged that their AI coding budget was exhausted: "I'm back to the drawing board because the budget I thought I would need is blown away already." Their adoption had jumped from 32% of ~5,000 engineers in February to 95% by spring, with 70% of committed code now AI-generated. Uber imposed a hard $1,500/month cap per engineer in June.

Meanwhile, on the other side of the same coin: a 30-year veteran developer posted on Reddit that his company had informed him he was being replaced by AI tooling. Over a thousand people engaged with the post.

These two stories — "AI costs too much" and "AI replaces humans" — represent the current industry framing. This piece argues both frames are incomplete, and presents a third position with supporting evidence.

The baseline: what AI coding tools actually cost

Before analyzing the debate, it helps to establish the numbers.

Anthropic's official enterprise data shows a median cost of approximately $13 per developer per active day, or $150–$250 per developer per month, with 90% of users staying below $30 per day.

The $500–$2,000/month figures that triggered the Microsoft and Uber decisions represent the top decile: heavy agentic usage with multiple parallel long-running sessions. The gap between the median (about $200/month) and heavy users ($2,000/month) is 10x. This variance is the core of the cost governance problem: not that AI tools are universally expensive, but that unmetered usage can be.

For comparison: GitHub Copilot Enterprise costs $39/seat/month flat regardless of usage volume. The economic question is whether the variable-cost model (pay per token, uncapped potential) delivers proportionally more value than the fixed-cost model (capped capability, predictable budget).

The evidence for positive ROI: what companies did differently

Several organizations achieved measurable positive returns. The pattern across them is consistent.

Money Forward (Japan, 1,800 engineers): 80% adoption, 70% daily usage, 7 hours saved per engineer per week. API endpoint implementation dropped from 2 days to 5 hours. Developer onboarding: 1 week to 1 day. Their approach: a structured adoption program (MEPAR) with formal training, not open access.

Rakuten (Japan): feature time to market dropped from 24 working days to 5, a 79% reduction. In a benchmark test, one engineer directed a 7-hour autonomous refactor across a 12.5-million-line codebase with 99.9% numerical accuracy. The engineer wrote zero lines of code during those hours.

NAB (Australia, 6,000 developers): 5–8x velocity improvement on legacy migrations. A payment application scoped at 4 months was delivered in 3 weeks. Quote from their principal engineer: "We wouldn't have even tried to build this app without [the AI tool]."

AstraZeneca (5,000 developers): 40% velocity increase in pilot, 9–10 additional hours of output per developer per week.

Spotify: 650+ agent-generated pull requests merged into production per month, with up to 90% engineering time savings on complex code migrations.

A controlled study at a 500-engineer fintech found a 32% increase in merged pull requests using matched control groups. At an average developer cost of $180,000/year, the calculated ROI on $39/seat/month licensing was 53–92x depending on assumptions.

What the successful deployments share

Six factors appear consistently across positive-ROI cases:

Governed rollout with structured training, not free-for-all access
Outcome measurement from day one, not just adoption metrics
Task targeting (migrations, refactors, boilerplate, test generation), not general-purpose chat
Role shift, with engineers directing rather than writing: the human provides judgment, the agent provides labor
Spend caps or tiered routing, such as Uber's $1,500/month cap or Microsoft's move to flat-rate tooling
Architecture between the human and the API, not raw, unmediated usage

The last point deserves expansion because it separates the Microsoft outcome from the Rakuten outcome.

The architecture layer: why some deployments compound and others scale linearly

The companies that saw linear cost growth (more usage = proportionally more spend, no efficiency gain) shared a common absence: no system between the engineer and the model. Every interaction started fresh. No accumulated context, no learned patterns, no deterministic shortcuts for solved problems.

The companies that saw logarithmic cost growth (more usage = disproportionately more value per token) invested in an intermediary layer. Current research points to what this layer looks like:

OpenSkill (multi-institution, June 2026): demonstrates agents building reusable skill libraries from documentation without ground-truth supervision. Skills transfer across models. The agent builds its own verifiers anchored to source material, so it learns to check its own work.

Code2LoRA (University of Waterloo, June 2026): encodes repository context into model weights via hypernetwork-generated LoRA adapters, eliminating the token cost of context injection entirely. On benchmarks, this outperforms retrieval-augmented generation by approximately 24 percentage points at zero runtime token overhead.

Hermes Agent (Nous Research, 2026): an open-source framework where agent capabilities persist and accumulate across sessions. Skills are written, versioned, and self-updated when the agent discovers a better execution path during operation.

The shared principle: agents that accumulate procedural knowledge (reusable skills, verified patterns, persistent context) cost less per unit of value over time. Agents without this layer cost the same per unit of value indefinitely. This is the architectural difference between a tool (static cost-to-value ratio) and a system (improving cost-to-value ratio).

The hybrid local/cloud routing model extends this further: at 500,000 monthly requests with 60% routed to local models, one analysis estimates a 56% cost reduction (from $4,000/month to $1,750/month) with equivalent output quality for routable tasks.
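The arithmetic behind that estimate can be checked with a short Python sketch. It assumes cloud spend scales linearly with request count, which the cited analysis may or may not assume; the function names are illustrative. Under that assumption, the quoted figures imply that serving the locally routed 60% costs about $150/month.

```python
def blended_monthly_cost(cloud_only_cost: float, local_share: float, local_cost: float) -> float:
    """Monthly cost after moving `local_share` of requests to local models.

    Assumption (not from the cited analysis): cloud spend scales linearly
    with the number of requests sent to the cloud.
    """
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    remaining_cloud = cloud_only_cost * (1.0 - local_share)
    return remaining_cloud + local_cost


def implied_local_cost(cloud_only_cost: float, local_share: float, blended_cost: float) -> float:
    """Back out what the local tier must cost for the blended figure to hold."""
    return blended_cost - cloud_only_cost * (1.0 - local_share)


baseline = 4000.0   # cloud-only cost per month, from the text
blended = 1750.0    # hybrid cost per month, from the text
local = implied_local_cost(baseline, 0.60, blended)
reduction = 1.0 - blended / baseline

print(f"implied local-tier cost: ${local:,.0f}/month")
print(f"cost reduction: {reduction:.1%}")
```

The useful part is the sensitivity: if the local tier costs more to run than the implied figure (hardware, power, maintenance), the saving shrinks accordingly.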

The safety dimension: why removing humans from the loop fails

The argument for replacing engineers entirely encounters a separate problem beyond cost.

SABER (HKU/CMU/NUS, June 2026) evaluated frontier AI agents, including Claude Opus 4.6, in real project workspaces with legitimate tasks. The finding: a 54.7% harmful violation rate. Of those failures, 47.7% came from "operational misunderstanding": the agent received a reasonable request and selected a dangerous execution path to accomplish it.

This is distinct from the alignment problem the industry focuses on (an agent refusing harmful requests). SABER demonstrates that refusal alignment does not imply execution safety. An agent that passes all safety benchmarks can still choose rm -rf when asked to "clean up the test directory" — not because it's malicious, but because it selected an inappropriate method for a legitimate goal.

The implication for deployment: removing human judgment from the loop doesn't just risk quality degradation. It risks actively harmful execution at non-trivial rates, in ways that current safety frameworks do not reliably prevent.

The engineers who provide value in this landscape are not the ones typing code faster. They are the ones who know which execution decisions require human oversight — and who build systems that enforce those boundaries.
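One way to picture such a boundary is a gate that sits between the agent and the shell and routes destructive commands to a human. The Python sketch below is a minimal illustration, not a complete sandbox and not something SABER proposes: the policy sets `DESTRUCTIVE_COMMANDS` and `DESTRUCTIVE_PAIRS` are hypothetical, and the matching is deliberately conservative, so it will flag some harmless commands. It assumes Python 3.8 or later for the `shlex` behavior used.

```python
import os
import shlex

# Hypothetical policy: command names that always need human sign-off.
DESTRUCTIVE_COMMANDS = {"rm", "dd", "mkfs", "shred"}
# Hypothetical policy: (command, subcommand) pairs that discard state.
DESTRUCTIVE_PAIRS = {("git", "clean"), ("git", "reset")}


def tokenize(command: str) -> list:
    """Split a shell command into words, separating ; && | from the words."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


def requires_human_approval(command: str) -> bool:
    """Return True when the command should be held for human review."""
    try:
        # basename() makes /bin/rm match the same policy entry as rm.
        tokens = [os.path.basename(tok) for tok in tokenize(command)]
    except ValueError:
        return True  # unparseable input (e.g. unbalanced quotes): fail closed
    if any(tok in DESTRUCTIVE_COMMANDS for tok in tokens):
        return True
    # Look at adjacent token pairs to catch subcommands such as `git reset`.
    return any(pair in DESTRUCTIVE_PAIRS for pair in zip(tokens, tokens[1:]))


for cmd in ("ls -la tests", "rm -rf tests/tmp", "pytest -q && /bin/rm -rf build"):
    verdict = "needs approval" if requires_human_approval(cmd) else "auto-run"
    print(f"{cmd!r}: {verdict}")
```

A denylist like this only covers methods someone thought of in advance. Its job here is to show where the human decision point sits, not to claim that the list is sufficient.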

Synthesis: the third position

The debate between "ban AI tools" and "replace engineers with AI tools" shares an assumption: that AI agents have a fixed, knowable value that either justifies or fails to justify their cost.

The evidence suggests a different model. AI agent value is not fixed — it is a function of the architecture surrounding the agent. Without architecture (raw access, no accumulation, no cost routing, no safety boundaries), value is linear and cost scales proportionally. With architecture (skill accumulation, tiered execution, spend governance, human-in-the-loop at decision boundaries), value compounds while cost per unit of output decreases.

The companies that banned AI tools observed the linear case and made a rational decision based on that observation. The companies that fired engineers assumed the AI was already in the compounding case without building the infrastructure to make it so.

Both were wrong about what they were measuring. The question was never "is this tool worth $2,000/month?" It was: "are we building a system where today's usage makes tomorrow's usage cheaper and better?"

The answer to that question is an engineering decision. And it requires engineers.

The cost conversation nobody is having honestly

When an organization looks at token spend and asks "why is this engineer spending $X/month?", the framing assumes spending is the problem. But spending is a symptom. The question that actually matters is: what is the cost per unit of delivered value, and is that ratio improving or deteriorating?

Consider the math. The controlled study at a 500-engineer fintech showed 32% more pull requests merged with AI tooling. At an average fully-loaded engineer cost of $180,000/year, a 32% velocity gain represents $57,600 of additional output per engineer per year. Against a tool cost of $200–$500/month ($2,400–$6,000/year), that's a 9.6–24x return on investment. Even at heavy usage ($1,500/month, $18,000/year), a 32% velocity gain still returns 3.2x.
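The same calculation as a small Python sketch, so the inputs can be swapped for an organization's own numbers. It inherits the simplification used above, namely that a 32% increase in merged pull requests is worth 32% of an engineer's fully loaded cost; `roi_multiple` is an illustrative name.

```python
def roi_multiple(annual_engineer_cost: float, velocity_gain: float, monthly_tool_cost: float) -> float:
    """Added output per year divided by tool cost per year.

    Simplification carried over from the text: a velocity gain of X is
    valued at X times the engineer's fully loaded annual cost.
    """
    if monthly_tool_cost <= 0:
        raise ValueError("monthly_tool_cost must be positive")
    added_output = annual_engineer_cost * velocity_gain
    annual_tool_cost = monthly_tool_cost * 12
    return added_output / annual_tool_cost


ENGINEER_COST = 180_000  # fully loaded, per year (from the text)
VELOCITY_GAIN = 0.32     # 32% more merged pull requests (from the text)

for monthly in (200, 500, 1500):
    multiple = roi_multiple(ENGINEER_COST, VELOCITY_GAIN, monthly)
    print(f"${monthly}/month tool cost: {multiple:.1f}x")
```

The multiple is inversely proportional to tool cost, which is why the heavy-usage case stays positive while still falling well below the median case.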

AstraZeneca measured 9–10 additional productive hours per developer per week. Against a 40-hour week, that's a capacity increase of roughly 22–25%. Rakuten's 79% faster delivery timeline means projects that would have occupied a team for about a month ship in about a week, freeing that team for the next project nearly four weeks earlier.

But here's what differentiates sophisticated adoption from expensive waste:

The engineer running up a high token bill writing basic CRUD endpoints: that's the linear case. More tokens, proportionally more code, no accumulation. The ROI is positive but flat.

The engineer who builds evaluation frameworks, designs cost routing between model tiers, creates reusable skill templates, implements early-exit patterns that prevent unnecessary compute, and establishes safety boundaries that catch the 47.7% of agent failures that come from operational misunderstanding: that's the compounding case. Their token spend today makes every subsequent interaction across the organization cheaper and safer.

The first engineer is using a tool. The second engineer is building infrastructure. Both cost tokens. One delivers linear value. The other delivers exponential value. Cutting access to both because the bill is the same is like cancelling AWS because some teams have idle EC2 instances — you lose the infrastructure teams along with the waste.

The sophisticated response to high token spend is not "ban it." It's "show me what you're building that makes next month cheaper than this month." If the engineer can't answer that question, the spend is waste. If they can — if they can demonstrate systematic cost reduction, accumulating capabilities, and measurable velocity gains — the spend is investment.

What front-tier AI adoption actually looks like

The engineers operating at the frontier of AI-augmented development share a pattern that distinguishes them from both the "ban it" camp and the "just use it more" camp:

They think in systems, not sessions. Each interaction is designed to make the next interaction cheaper or unnecessary. Friction captured today becomes a script tomorrow. A script means zero tokens for that class of problem forever. The mindset is not "how do I use more AI" — it's "how do I need less AI for the same outcome."

They actively engineer cost down as a first-class concern:

Tiered model routing: lightweight models for bounded tasks, frontier models only for genuine reasoning. The same principle as choosing the right database for the workload: nobody runs every query against a distributed cluster when SQLite would suffice. Empirically, hybrid local/cloud architectures achieve approximately 56% cost reduction at equivalent quality for routable tasks.
Prompt caching: Anthropic's own infrastructure offers up to 90% cost reduction on repeated context. Notion achieved 90% infrastructure cost reduction and 85% latency reduction by engineering their cache strategy: not by using less AI, but by using it more intelligently.
Script-first execution: any operation that produces the same output given the same input should be a deterministic script, not an LLM call. Scripts cost zero tokens. The engineering discipline is identifying which operations are genuinely intelligence-requiring versus which are data-gathering that was lazily delegated to an expensive model.
Early-exit patterns: before running an expensive analysis, check cheaply whether there's work to do. A 30-second health check that prevents a 10-minute full scan when the system is already healthy is pure cost avoidance — the same principle as short-circuit evaluation in code.
Progressive decomposition: tasks that start as frontier-model problems get systematically downgraded. First run: Opus reasons through the problem. Once the pattern is understood, Sonnet handles it. Once the pattern is codified, a script handles it. The trajectory is always: intelligence → automation → zero cost.
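The routing ladder in the list above (script first, then a lightweight model, then a frontier model), together with the early-exit check, can be sketched roughly as follows. This is an illustrative Python sketch rather than any vendor's API: `SCRIPTS`, `BOUNDED_TASKS`, the task names, and the tier labels are all hypothetical, and the function only decides which tier should handle a task instead of calling a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Route:
    tier: str                     # "script", "small-model", or "frontier-model"
    result: Optional[str] = None  # filled in only when a script answered directly


# Hypothetical registry: task kinds that have been codified into scripts.
SCRIPTS: Dict[str, Callable[[str], str]] = {
    "count_lines": lambda text: str(len(text.splitlines())),
}

# Hypothetical set: task kinds understood well enough for a lightweight model.
BOUNDED_TASKS = {"rename_symbol", "write_docstring"}


def route(task_kind: str, payload: str) -> Route:
    """Pick the cheapest tier that can handle the task."""
    handler = SCRIPTS.get(task_kind)
    if handler is not None:
        return Route("script", handler(payload))  # zero tokens spent
    if task_kind in BOUNDED_TASKS:
        return Route("small-model")
    return Route("frontier-model")


def scan_if_needed(is_healthy: Callable[[], bool], full_scan: Callable[[], str]) -> Optional[str]:
    """Early exit: run the cheap check first, skip the expensive scan if healthy."""
    if is_healthy():
        return None
    return full_scan()


print(route("count_lines", "a\nb\nc"))
print(route("rename_symbol", "old_name -> new_name"))
print(route("design_schema", "multi-tenant billing"))
```

Progressive decomposition then amounts to moving a task kind down the ladder over time: from the default frontier tier into `BOUNDED_TASKS`, and eventually into `SCRIPTS`.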

This is FinOps applied to AI: not cost-cutting, but cost-engineering. The spend goes down over time because the system learns which problems don't need expensive intelligence. This is measurable: track cost-per-deliverable month over month. If the ratio is flat, you're using a tool. If it's declining, you're building infrastructure.
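A minimal sketch of that measurement in Python. What counts as a "deliverable" (a merged pull request, a shipped ticket) is a choice each team has to make, the 5% tolerance is an arbitrary illustrative threshold, and the monthly figures below are made-up sample data, not results from any company mentioned here.

```python
def cost_per_deliverable(token_spend: float, deliverables: int) -> float:
    """Token spend divided by the number of deliverables in the same period."""
    if deliverables <= 0:
        raise ValueError("need at least one deliverable to compute a ratio")
    return token_spend / deliverables


def classify_trend(months: list, tolerance: float = 0.05) -> str:
    """Compare the last month's ratio with the first month's.

    `months` is a list of (token_spend, deliverables) pairs in time order.
    Changes smaller than `tolerance` (as a fraction) count as flat.
    """
    if len(months) < 2:
        raise ValueError("need at least two months to see a trend")
    ratios = [cost_per_deliverable(spend, count) for spend, count in months]
    change = (ratios[-1] - ratios[0]) / ratios[0]
    if change < -tolerance:
        return "declining"
    if change > tolerance:
        return "rising"
    return "flat"


# Made-up sample data: (monthly token spend in dollars, deliverables shipped).
tool_user = [(6000, 40), (6150, 41), (6300, 42)]
infra_builder = [(6000, 40), (6000, 50), (5400, 60)]

print("tool user:", classify_trend(tool_user))
print("infrastructure builder:", classify_trend(infra_builder))
```

Comparing only the first and last month is the simplest possible trend test. A real dashboard would likely smooth over more periods before drawing conclusions.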

And there's a category of token spend that isn't about writing code at all — it's about building the evaluation systems that tell you whether AI-generated code is safe to ship.

The eval investment: spending tokens to prevent shipping blind

When an AI agent generates code that goes into production, someone has to verify it works. Without evaluation infrastructure, that someone is a human reviewer reading every line — which defeats the purpose of the tool. With evaluation infrastructure, you have automated quality gates that measure whether the AI's output meets the bar before it reaches users.

Building eval systems consumes tokens. You run candidates against golden datasets. You compare outputs across model versions. You measure regression across prompt changes. You test whether the agent degrades gracefully under edge cases or fails catastrophically, as SABER found agents doing at a 54.7% harmful violation rate in real workspaces.

This is the same investment pattern as CI/CD. Nobody questions why engineering teams spend compute on running test suites — because the alternative is shipping untested code to production. AI evaluation is the test suite for AI-generated output. The tokens spent on eval are not "using AI" — they're building the infrastructure that makes AI usage trustworthy.
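To make the test-suite analogy concrete, here is a minimal Python sketch of a golden-dataset gate. The `GOLDEN` cases, the slug-making task, and the two candidate functions are invented for illustration; in practice the candidate would wrap a model call or a piece of AI-generated code, and the comparison would usually be richer than exact string equality.

```python
from typing import Callable, List, Tuple

GoldenCase = Tuple[str, str]  # (input, expected output)

# Invented golden dataset for a hypothetical "make a URL slug" task.
GOLDEN: List[GoldenCase] = [
    ("Hello World", "hello-world"),
    ("  Trim me ", "trim-me"),
    ("already-slug", "already-slug"),
]


def pass_rate(candidate: Callable[[str], str], golden: List[GoldenCase]) -> float:
    """Fraction of golden cases the candidate reproduces exactly."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = 0
    for given, expected in golden:
        try:
            if candidate(given) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(golden)


def gate(candidate: Callable[[str], str], golden: List[GoldenCase], threshold: float = 1.0) -> bool:
    """Allow the candidate through only if it meets the pass-rate threshold."""
    return pass_rate(candidate, golden) >= threshold


def good_candidate(text: str) -> str:
    return "-".join(text.lower().split())


def buggy_candidate(text: str) -> str:
    return text.lower().replace(" ", "-")  # mishandles leading/trailing spaces


print("good candidate ships:", gate(good_candidate, GOLDEN))
print("buggy candidate ships:", gate(buggy_candidate, GOLDEN))
```

Each production failure that gets added to `GOLDEN` as a new case makes the gate stricter, which is the accumulation the surrounding argument depends on.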

The organizations that skip this step are the ones that will eventually ship a $47,000 runaway loop, or a database deletion triggered by a legitimate-sounding request, to production. The tokens spent preventing that outcome are not cost; they're insurance with compounding returns. Every eval run improves the golden dataset. Every failure caught in eval is a failure that didn't reach production. Every regression detected early is a regression that didn't require an incident response.

The straightforward answer to "why are you spending this many tokens?" connects directly to what leadership measures:

Ship speed: eval infrastructure is what allows AI-generated code to reach production without manual line-by-line review. Without it, every AI-generated PR requires the same human review as a human-written PR, eliminating the speed advantage entirely. With it, trusted code ships faster because the quality gate is automated. Spotify ships 650+ agent-generated PRs per month; that's only possible because they have evaluation infrastructure saying "this is safe."

Incident prevention: a single production incident can cost more than a year of token spend. Estimates of the median cost of a critical production incident at enterprise scale range from $5,600 to $9,000 per minute of downtime. One prevented incident pays for months of eval infrastructure. The tokens spent catching failures like those behind SABER's 54.7% harmful violation rate before they reach customers are not R&D cost; they're risk mitigation with calculable ROI.

Team velocity multiplier: eval systems don't just protect one engineer's output. They establish quality baselines that every engineer on the team deploys against. One person builds the eval; the entire team ships faster and safer because of it. This is the difference between individual productivity (using AI to write more code) and organizational leverage (building infrastructure that multiplies everyone's AI-assisted output).

Revenue protection: when an AI agent handles customer-facing workflows (and 11% of Uber's live backend updates already ship with no human in the loop), the eval system is what stands between the agent and the customer. It's quality assurance for the product, not a personal development tool.

The reframe: token spend on eval is not "engineer using expensive tool." It's "engineer building quality infrastructure that protects revenue, prevents incidents, and multiplies team velocity." Every dollar spent here has a measurable downstream impact on the metrics leadership already tracks.

That's not consumption. That's capital expenditure on production safety.

They build safety boundaries because they've read the research. The SABER finding, that even the best models choose harmful execution paths 54.7% of the time in real workspaces, is not a reason to avoid AI. It's a reason to invest in the architecture that catches those failures before they reach production. This is exactly what senior engineers do with any powerful tool: build guardrails proportional to the blast radius.

They contribute to organizational capability, not just personal output. The difference between a senior engineer using AI to write code faster and a staff engineer using AI to build systems that make the entire team faster is the difference between consuming budget and creating leverage.

This is the answer to "why should we keep paying for your token usage": because what I'm building reduces the token cost for every subsequent interaction, catches failures that would otherwise reach production, and establishes patterns that other engineers can adopt without independently discovering them.

The question to ask an engineer with high token spend is not "can you use less?" It is: "what have you built this month that didn't exist last month, and how does it reduce cost or risk going forward?" If the answer is "nothing — I'm just using it as a faster keyboard," then yes, the spend needs governance. If the answer demonstrates compounding capability — that's not an expense. That's R&D.

What leadership gets right when they get this right

The organizations showing positive ROI on AI tooling share something that isn't in the engineering: leadership that understood what they were investing in.

NAB's 6,000-developer rollout didn't happen because individual engineers asked for licenses. It happened because leadership recognized that the competitive landscape was shifting and made a deliberate bet on structured adoption, with training, measurement, and governance from day one. The result: 5–8x velocity and projects shipped that the team "wouldn't have even tried" without the tooling.

Money Forward's 80% adoption didn't come from grassroots experimentation. It came from a formal program (MEPAR) with executive sponsorship, structured onboarding, and outcome measurement. The result: 7 hours per engineer per week returned to the organization.

Rakuten's 79% delivery speedup required leadership that was willing to invest in the transition: not just approving tool access, but reshaping how engineering teams operate. Their next step: 24 parallel AI sessions per feature, with engineers directing rather than typing.

The pattern: leadership that treats AI adoption as a strategic capability investment — with governance, measurement, and a thesis about what compounds — gets exponential returns. Leadership that treats it as a cost line to manage gets exactly what Microsoft got: a tool ban and a return to the previous state.

The uncomfortable truth is that banning AI tools feels like cost discipline, but it's actually competitive retreat. Every month without compounding infrastructure is a month where competitors who invested are pulling ahead — accumulating skills, eval datasets, safety patterns, and team capabilities that don't exist yet in organizations that cut access.

The leaders who will look brilliant in 12 months are the ones who right now are asking their engineers: "What are you building that compounds? How do I measure it? What do you need to go faster?" — not "Can you use fewer tokens?"

The question that separates the two outcomes

Are you solving the same problem again tomorrow?

Or are you making it so tomorrow's version of the problem is cheaper, faster, and safer — without anyone needing to remember why?

Every engineering decision falls on one side of this line. The code that ships and is forgotten. The system that ships and teaches the next system how to ship better. The token spent on output. The token spent on infrastructure that reduces the next thousand tokens to zero.

The companies that will still be accelerating in 12 months are the ones where someone asked this question early enough. And then designed for the answer.

The ones that didn't will look at the same budget line and see only cost. They'll cut. And they'll spend the next year wondering why everything feels slower while their competitors feel faster.

The drill doesn't get better at drilling. The garden, tended well, feeds itself. But a garden needs someone who saw the garden before the first seed was planted.
