
AI Was Supposed to Cut Costs — Now It’s Burning Budgets Faster Than Salaries

The pitch was clean: replace headcount with compute, cut costs, move faster. Nobody stress-tested what happens when the compute bill outgrows the payroll it was supposed to eliminate.

That’s where some companies find themselves now. Uber’s CTO Praveen Neppalli Naga already burned through his entire 2026 AI budget — on token costs alone — before the year hit its midpoint, according to The Information. Bryan Catanzaro, Nvidia’s VP of applied deep learning, put it plainly to Axios: for his team, compute costs now exceed what they spend on the people writing the code.

That’s not a humble brag. That’s a warning.

The Token Economy Has No Price Ceiling

When AI labs first came for enterprise budgets, the argument was arithmetic: one API call costs fractions of a cent; a developer costs six figures a year. The math made the pitch easy. But token consumption at scale doesn’t behave the way anyone modeled — especially with agentic workflows that chain dozens of model calls together.

The problem isn’t just volume. It’s recursion. In 2026, a specific failure mode has emerged that engineers are calling Recursive Token Loops — agentic systems that get stuck cycling through the same subtasks, revalidating outputs, re-querying context, burning hundreds or thousands of dollars in a single session before anyone notices. There’s no manager to tell the agent it’s wasting money. The meter just runs.
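A minimal guard against that failure mode is to fingerprint each tool call and refuse repeats past a threshold. Here is a sketch in Python; the `LoopGuard` class, the fingerprinting scheme, and the repeat limit are all illustrative assumptions, not part of any particular agent framework:

```python
import hashlib
from collections import Counter

class LoopGuard:
    """Halt an agent when it repeats the same subtask too many times.

    Illustrative sketch: real frameworks would hook this into the
    tool-dispatch layer rather than call it manually.
    """
    def __init__(self, max_repeats: int = 3):
        self.seen = Counter()
        self.max_repeats = max_repeats

    def check(self, tool_name: str, arguments: str) -> bool:
        """Return False once this exact call has run too often."""
        key = hashlib.sha256(f"{tool_name}:{arguments}".encode()).hexdigest()
        self.seen[key] += 1
        return self.seen[key] <= self.max_repeats

guard = LoopGuard(max_repeats=3)
assert guard.check("validate_output", '{"file": "a.py"}')      # 1st call: allowed
assert guard.check("validate_output", '{"file": "a.py"}')      # 2nd: allowed
assert guard.check("validate_output", '{"file": "a.py"}')      # 3rd: allowed
assert not guard.check("validate_output", '{"file": "a.py"}')  # 4th: loop, stop
```

Exact-match fingerprinting is deliberately crude: it catches the "revalidate the same output forever" pattern but not loops that vary their arguments slightly on each pass.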

Anthropic has already adjusted its pricing to account for a demand spike. Meanwhile, Amos Bar-Joseph, CEO of Swan AI, went viral on LinkedIn boasting about his Anthropic bill as proof of ambition — “scaling with intelligence, not headcount,” he wrote, as if a large invoice were itself a metric for a post-labor economy.

The CFO’s Choice vs. The Engineer’s Choice

There’s a quiet lab war playing out inside this cost pressure — and it’s less about performance than about burn rate.

Claude Code currently sits at roughly 87.6% on SWE-bench Verified, making it the premium option for complex refactors and multi-file reasoning. But it uses approximately four times more tokens than Codex (GPT-5.4) for equivalent tasks. A standard Express.js refactor costs around $180 in tokens on Claude Code versus roughly $45 on Codex. That $135 delta — multiplied across hundreds of daily tasks — is what’s now driving procurement conversations.

The table below captures where that tradeoff actually matters in practice:

| Task | Claude Code | Codex (GPT-5.4) | Recommended |
| --- | --- | --- | --- |
| Legacy Codebase Migration | ~$180/task · 94% accuracy | ~$48/task · 71% accuracy | Claude Code: mistakes here cost more than tokens |
| Unit Test Generation | ~$22/task · 91% accuracy | ~$6/task · 88% accuracy | Codex: marginal accuracy gap doesn’t justify 3.7x cost |
| Boilerplate / Scaffolding | ~$14/task · 89% accuracy | ~$4/task · 86% accuracy | Codex: near-identical output at a fraction of the cost |
| Multi-file Architecture Refactor | ~$210/task · 92% accuracy | ~$55/task · 67% accuracy | Claude Code: reasoning depth is load-bearing here |
| Docstring / Comment Generation | ~$9/task · 90% accuracy | ~$2.50/task · 88% accuracy | Codex: pure formatting work, no premium justified |

CFOs don’t care about benchmark rankings. They care about that column on the right. An OpenAI investor told Axios they see the rising token cost problem as a direct tailwind, positioning Codex as the frugal frontier option. The competition among labs is no longer just who builds the smartest model — it’s who can deliver capable output without igniting a client’s budget in an already turbulent layoffs landscape.

Surviving Companies Are Already Routing Around the Problem

The enterprises quietly winning this budget war aren’t spending less — they’re routing smarter.

The playbook: keep frontier models for high-reasoning tasks — architecture decisions, complex debugging, nuanced drafting. Route everything else — boilerplate, formatting, classification, summarization — to locally hosted Small Language Models via distillation. Token burn drops dramatically. Output quality stays intact where it matters.
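In code, that routing playbook can be as simple as a lookup that defaults to the cheap tier and escalates only flagged high-reasoning work. The model names, per-token prices, and task categories below are illustrative assumptions, not real price sheets:

```python
# Route tasks to a cheap local SLM by default; escalate only
# high-reasoning categories to a frontier model.
FRONTIER = {"name": "frontier-model", "usd_per_1k_tokens": 0.015}
LOCAL_SLM = {"name": "local-slm", "usd_per_1k_tokens": 0.0004}

HIGH_REASONING = {"architecture", "complex_debugging", "multi_file_refactor"}

def route(task_type: str) -> dict:
    """Pick the model tier for a task category; cheap tier is the default."""
    return FRONTIER if task_type in HIGH_REASONING else LOCAL_SLM

def session_cost(tasks: list[tuple[str, int]]) -> float:
    """Estimated spend for a list of (task_type, token_count) pairs."""
    return sum(route(t)["usd_per_1k_tokens"] * tokens / 1000
               for t, tokens in tasks)

workload = [("boilerplate", 40_000), ("summarization", 25_000),
            ("architecture", 60_000)]
print(f"${session_cost(workload):.2f}")  # prints $0.93
```

Sending the same 125k-token workload entirely to the frontier tier would cost about $1.88 under these made-up prices; the point is the ratio, not the absolute numbers.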

This isn’t a theoretical workaround. It’s becoming standard practice among engineering teams serious about the AI profitability gap now showing up on quarterly calls.

The Kill-Switch Every Agentic Stack Needs

Human-in-the-Loop budget triggers aren’t a philosophy — they’re three lines of logic that every agentic workflow should have running by default:

if session.token_spend > BUDGET_THRESHOLD:
    pause_agent()  # stop the workflow before the next model call
    notify_human(f"Session exceeded ${BUDGET_THRESHOLD}. Authorize to continue.")
    await human_approval()  # resume only on explicit sign-off

Simple. But almost nobody is shipping it. The engineering teams that do have cut runaway session costs by 60–80% without meaningfully slowing output. The ones that don’t are the ones filing incident reports about four-figure single-session burns.
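For teams that want that trigger as a self-contained unit rather than three inline lines, a runnable sketch might look like the following. The class name and callback hook are hypothetical, standing in for whatever pause/notify machinery a given stack uses:

```python
class BudgetGuard:
    """Pause an agent session once token spend crosses a threshold.

    Illustrative sketch: `on_pause` is a placeholder for real
    alerting (Slack, PagerDuty, an approval queue).
    """
    def __init__(self, budget_usd: float, on_pause=print):
        self.budget_usd = budget_usd
        self.spend_usd = 0.0
        self.paused = False
        self.on_pause = on_pause

    def record(self, tokens: int, usd_per_1k: float) -> None:
        """Account for one model call; trip the guard past budget."""
        if self.paused:
            raise RuntimeError("session paused: awaiting human approval")
        self.spend_usd += tokens * usd_per_1k / 1000
        if self.spend_usd > self.budget_usd:
            self.paused = True
            self.on_pause(f"Session exceeded ${self.budget_usd:.2f}. "
                          "Authorize to continue.")

    def approve(self) -> None:
        """Human sign-off resumes the session."""
        self.paused = False

guard = BudgetGuard(budget_usd=50.0)
guard.record(tokens=2_000_000, usd_per_1k=0.015)  # $30.00, still running
guard.record(tokens=2_000_000, usd_per_1k=0.015)  # $60.00 total: pauses
assert guard.paused
```

The key design choice is that the guard fails closed: once tripped, every further call raises until a human calls `approve()`, so a looping agent cannot spend its way past the threshold.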

This is the gap that’s quietly creating a new role inside engineering orgs: the AI Orchestrator. Not a prompt engineer. Not a data scientist. Someone whose job is specifically to manage model routing decisions, set and monitor budget triggers, audit token spend against output value, and own the infrastructure that keeps agentic workflows from becoming autonomous money fires. Job postings for this title have grown sharply in early 2026, as companies recalibrate what it actually costs to run AI at scale — and realize they need humans to manage the machines managing the work.

It’s one of the stranger reversals of the “replace headcount with compute” thesis: you eventually need to hire someone to watch the compute.

The Metric Nobody Is Using Yet

Companies are still measuring AI spend the wrong way. “Cost per seat” made sense for SaaS. It makes no sense for inference. The unit that actually matters in 2026 is Cost per Successful PR — how much token spend did it take to produce a merged, reviewed, production-ready pull request? That reframes the entire conversation from budget line to output accountability.
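Computing the metric is trivial; the discipline is in the denominator, which counts only merged work. A sketch with made-up numbers:

```python
def cost_per_successful_pr(token_spend_usd: float, merged_prs: int) -> float:
    """Token dollars divided by merged, production-ready PRs only.

    Abandoned or reverted PRs inflate apparent AI productivity,
    but they never enter this denominator.
    """
    if merged_prs == 0:
        return float("inf")
    return token_spend_usd / merged_prs

# Two teams with identical spend look very different on this metric:
assert cost_per_successful_pr(9_000.0, 150) == 60.0   # $60 per merged PR
assert cost_per_successful_pr(9_000.0, 50) == 180.0   # $180 per merged PR
```

Both teams burned $9,000 in tokens; only the merge rate separates a defensible line item from a misallocation.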

Without that metric, AI spending is just a faster way to misallocate. And the productivity paradox is real — token spend is climbing while measurable output gains remain stubbornly hard to quantify at the org level.

What the Reckoning Actually Looks Like

Brad Owens, VP of digital labor strategy at workforce orchestration firm Asymbl, told Axios the internal conversation is already shifting: “The tone is shifting a bit more into what is the true value of a worker… human or digital?”

That’s a sharper question than it sounds. Labor is expensive — but labor comes with self-limiting behavior. A developer who spends four hours going in circles eventually stops, asks someone, or escalates. An agentic workflow doesn’t. Gartner projects worldwide IT spending will hit $6.31 trillion in 2026 — up 13.5% from last year — and shareholders answering to quarterly earnings calls will eventually demand proof the spend is generating returns, not just generating tokens.

The companies most exposed aren’t the ones that deployed AI. They’re the ones who deployed AI without building the feedback loops to understand what they’re getting back. That gap — between token spend and verified output value — is where entry-level knowledge work is quietly disappearing while the underlying economics get harder to justify at scale.

The AI bill isn’t going down. The question is whether the kill switch is in place before the next session runs.

Related: Claude vs ChatGPT (2026): Which AI Actually Wins for Coding, Writing & Automation?

