May 20, 2026

Why Coding Agents Are Getting More Expensive (And How To Fix It)

Prompt Caching, Idle Sessions, and the Real Cost of a Million-Token Window

Coding agents like Claude Code and Cursor now have context windows that support up to a million tokens. While larger contexts are useful, they are also the reason your API costs are increasing and you are hitting usage limits faster than before.

If your $20 Pro subscription feels like it covers less ground lately, or you are running into rate limits early in the day, it comes down to how these tools manage context under the hood.

Why prompt caching matters so much

The economics of long-context models rely heavily on prompt caching. Providers like Anthropic discount cached input tokens by about 90 percent [2]. This discount is what makes a million-token window financially viable.
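To put that discount in rough numbers, here is a minimal sketch of the arithmetic. The base price is a placeholder chosen for illustration, not a quoted rate, and the function name is hypothetical. The only figure taken from the text is the roughly 90 percent discount on cached input.

```python
# Illustrative arithmetic only. BASE_PRICE_PER_MTOK is an assumed placeholder,
# not a quoted provider rate.
BASE_PRICE_PER_MTOK = 3.00      # assumed USD per million uncached input tokens
CACHE_READ_MULTIPLIER = 0.10    # cached reads at ~10% of base, i.e. a ~90% discount


def input_cost(tokens, cached_fraction,
               base_price=BASE_PRICE_PER_MTOK,
               read_multiplier=CACHE_READ_MULTIPLIER):
    """Estimate the input cost of one request, given the share served from cache."""
    cached = tokens * cached_fraction
    uncached = tokens - cached
    billable = uncached + cached * read_multiplier
    return billable * base_price / 1_000_000


if __name__ == "__main__":
    full_miss = input_cost(1_000_000, cached_fraction=0.0)
    full_hit = input_cost(1_000_000, cached_fraction=1.0)
    print(f"full miss: ${full_miss:.2f}, full hit: ${full_hit:.2f}")
```

Under these assumptions a fully cached million-token prompt costs about a tenth of an uncached one. The sketch ignores any premium charged for writing to the cache, which would widen the gap further.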

However, caching requires exact prefix matching. As Simon Willison has noted, if your prompt is 99 percent identical to the previous one but the very first token has changed, the cache breaks [3]. Anthropic's own documentation confirms that the cache is matched sequentially from the start of the prompt: any change before a cache breakpoint invalidates everything that follows.

This becomes an issue when agents use naive keyword searches to dump dozens of raw source files into the context window. The result is a volatile prompt: editing a single line in any of those files changes the prefix. Agents also periodically summarize conversation history to manage context limits, which shifts the prefix again. Each time this happens, everything after the change is a cache miss, and a change near the start of the prompt is effectively a full miss.
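The prefix rule can be illustrated with a toy model. This is a simplified sketch, not how any provider implements caching: it treats a prompt as a list of opaque chunks and counts how many leading chunks two prompts share. Real caches work at provider-defined breakpoints, and the chunk names here are made up.

```python
def cached_prefix_length(previous, current):
    """Count the leading chunks two prompts share; only these could be cache hits."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return shared


# Hypothetical prompt chunks, for illustration only.
previous = ["system", "file_a_v1", "file_b", "question_1"]

# Appending a new message keeps the whole earlier prompt as a reusable prefix.
appended = previous + ["question_2"]

# Editing one file early in the prompt invalidates everything after it.
edited_early = ["system", "file_a_v2", "file_b", "question_1", "question_2"]

if __name__ == "__main__":
    print(cached_prefix_length(previous, appended))      # 4
    print(cached_prefix_length(previous, edited_early))  # 1
```

In the second case, `file_b` and `question_1` are unchanged but still miss, because they sit after the edited chunk.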

What an idle session costs

The impact of these cache misses adds up quickly. Boris Cherny from the Claude Code team at Anthropic recently explained this on Hacker News:

"Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache... The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss... In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users" [1].

If you step away for an hour, your first prompt when you return will burn a massive chunk of your daily allowance before the response even comes back.
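The idle-session effect described in the quote can be sketched as a simple threshold. The one-hour TTL below is an assumption taken from the ">1 hour" in the quote, and the function name is hypothetical. Actual cache lifetimes depend on the provider and the client configuration.

```python
CACHE_TTL_SECONDS = 60 * 60  # assumed one-hour cache lifetime, per the quote above


def tokens_written_to_cache(context_tokens, new_tokens, idle_seconds,
                            ttl_seconds=CACHE_TTL_SECONDS):
    """Estimate how many tokens the next request must write to the cache.

    If the session idled past the TTL, the whole context is a miss and must be
    rewritten; otherwise only the newly added message is written.
    """
    if idle_seconds > ttl_seconds:
        return context_tokens + new_tokens
    return new_tokens


if __name__ == "__main__":
    # Same 900k-token session, same 200-token follow-up, different idle times.
    print(tokens_written_to_cache(900_000, 200, idle_seconds=5 * 60))
    print(tokens_written_to_cache(900_000, 200, idle_seconds=90 * 60))
```

With these inputs the short pause writes only the new message, while the 90-minute pause rewrites the entire context, which matches the "more than 900k tokens written to cache all at once" scenario in the quote.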

This isn't theoretical

Developers are already tracking this issue. In Claude Code issue #46829, "Cache TTL silently regressed... causing quota and cost inflation," users analyzed their session logs and found a 20 to 32 percent increase in cache creation costs, alongside a spike in quota consumption for users who rarely hit limits before [4].

When the cache drops, you pay full price for hundreds of thousands of input tokens on every request that misses. An agent that churns through raw, uncached source code to find an answer can drain a daily compute budget in hours.

What Carrick does differently

This is why we built Carrick. The solution is not to load thousands of lines of source code just to find a single route or type definition.

Instead of dumping files into the context window, Carrick provides a pre-computed context layer via MCP. When an agent needs to know how to construct a request body for a specific endpoint, it doesn't need to load the router tree and its dependencies. It queries Carrick.

Carrick returns the resolved mount graph and compiler-grade types. What normally takes 50,000 tokens of raw source code is condensed into about 500 tokens of structured data.

Keeping the prompt small keeps the prefix stable, which preserves the cache. For some workflows we have seen token savings of up to 95 percent*, allowing your usage limits to actually last throughout the day. By shifting the heavy lifting from the agent's context window to a dedicated cache, you stop wasting tokens on raw codebase traversal.

* Measured on semantic lookups across three TypeScript microservices, then extrapolated to a 50-source-file baseline. Keyword-friendly queries sit toward the low end of the range; the gap widens with codebase size and the number of repos searched.

References

  1. Boris Cherny (Anthropic), comment on An update on recent Claude Code quality reports, Hacker News. news.ycombinator.com/item?id=47880089
  2. Anthropic, Prompt caching, Anthropic Documentation. docs.anthropic.com
  3. Simon Willison, writing on prompt caching mechanics. simonwillison.net
  4. Claude Code issue #46829, Cache TTL silently regressed... causing quota and cost inflation. github.com/anthropics/claude-code/issues/46829