May 7, 2026 · Anton Grishko
Cheap agents — four moves that keep our token bill from eating us
TL;DR — Default agentic-DevOps setups burn tokens like firewood. Four stacking moves cut our Anthropic bill ~10x: graph-backed MCP instead of context dumps, the Caveman skill loaded first, per-session tool loading, and an orchestrator with short-lived workers. Skip move 1 and the rest barely register.
The default is expensive
The default way to run an AI agent on infrastructure: dump the Terragrunt repo into the prompt, ask the question, get a 4,000-token reply that begins with "Great question! Let me start by analyzing..." and ends with "Let me know if you'd like me to elaborate on any of these points!"
Multiply that by a dozen developers, each running half a dozen agent sessions a day, and the Anthropic bill becomes a four-figure line item somebody has to defend in the budget review.
We spent the last few months tuning this on customer repos. Four moves, in roughly the order of how much they save:
Move 1: graph-backed MCP, not context dumps
The biggest input-token sink is the repo itself. A 500-module Terragrunt monorepo is roughly a million tokens stuffed into context. Most of it is irrelevant to the question at hand.
The fix: expose the repo — and the live cluster, and the Terraform state, and the ArgoCD apps, and the docs — as a graph behind an MCP server. We wrote about this in One graph, every source — what kuberly-graph sees now. Instead of pre-loading context, the agent asks for what it needs:
agent: owners(Deployment kuberly-web-cms)
mcp:   ArgoCD App kuberly-web; ExternalSecret kuberly-web-secrets
agent: depends_on(Deployment kuberly-web-cms)
mcp:   ConfigMap kuberly-web-config; Service kuberly-web-cms; ...
agent: docs_for(Deployment kuberly-web-cms)
mcp:   runbook: "drain admin sessions first"
       postmortem 2026-02-14: "image bump broke webhook keys"
For "what does this change touch" questions, input drops from ~200k tokens (whole repo dump) to ~3–5k (six structured graph queries). That's a 40x cut or better on the most expensive part of the prompt. For the architecture, see Knowledge graphs are the missing piece.
The catch: the agent has to know to use the graph instead of asking for files. We solve it with system-prompt instructions and — more importantly — by not preloading files. If the agent has to actively call a tool to get information, it does so sparingly. If you preload, it ignores the tool.
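To make the query-instead-of-dump idea concrete, here is a minimal Python sketch. It is not the kuberly-graph implementation: `RepoGraph`, its `add` and `query` methods, and the in-memory storage are hypothetical stand-ins, and the sample edges are copied from the exchange above. The point it illustrates is that each call returns only the neighbors for one relation on one node, so the answer stays small however large the repo is.

```python
from collections import defaultdict


class RepoGraph:
    """Tiny in-memory stand-in for a graph-backed MCP server (hypothetical)."""

    def __init__(self) -> None:
        # (relation, node) -> list of neighboring nodes
        self._edges = defaultdict(list)

    def add(self, relation: str, node: str, target: str) -> None:
        self._edges[(relation, node)].append(target)

    def query(self, relation: str, node: str) -> list:
        # Return only the neighbors for one relation, never the whole repo.
        return list(self._edges.get((relation, node), []))


graph = RepoGraph()
node = "Deployment kuberly-web-cms"
graph.add("owners", node, "ArgoCD App kuberly-web")
graph.add("owners", node, "ExternalSecret kuberly-web-secrets")
graph.add("depends_on", node, "ConfigMap kuberly-web-config")
graph.add("depends_on", node, "Service kuberly-web-cms")
graph.add("docs_for", node, 'runbook: "drain admin sessions first"')

for relation in ("owners", "depends_on", "docs_for"):
    print(relation, "->", graph.query(relation, node))
```

A real server would back `query` with a graph store and expose each relation as an MCP tool; the sketch only shows the shape of the exchange.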
Move 2: caveman first
Output tokens are smaller in volume than input but still expensive, and most of them are filler. "I'll start by analyzing...", "Let me know if you'd like...", "Here's a brief summary...", the breathless tour through what was just done. Nobody reads it. We pay for it anyway.
Caveman is a Claude Code skill that fixes this. Ten rules, written the way the title implies, including:
- No filler phrases. No "Great question."
- Execute before explaining.
- No preamble. No postamble.
- No tool announcements ("I'll now use the Read tool to...").
- Errors are things to fix, not narrate.
- Code speaks for itself; comments only when non-obvious.
We force it to load first — it's the first skill in our Claude Code settings.json skill list. That ordering matters: skills loaded later inherit the persona of skills loaded earlier, and caveman's persona is dominant. If you load caveman after a more verbose skill, the verbose one wins.
Output tokens drop 60–75% on structured coding tasks. The session-wide impact is smaller (output is roughly a quarter of session tokens; input dominates): call it a free 5–10% cut on the total bill. The side effect is that responses tend to be better, not worse. The standard reference for context bloat is Liu et al., which shows that bloated input context hurts reasoning; its findings concern long inputs, so treat the parallel with terse output as suggestive rather than proven. The agent isn't dumber when it's terse. It's more focused.
Move 3: smart tool loading
MCP tool definitions are surprisingly expensive. Each tool description, with its parameters and examples, costs 200–500 tokens of system prompt. A naive setup that registers every tool from every MCP server burns 5–10k tokens before the agent has done anything.
Worse, the agent gets distracted. Give it 40 tools and it spends extra reasoning tokens deciding which one to call. Give it the 6 tools relevant to the current task and it picks fast.
We load tools by session shape:
Session shape          Tools loaded
─────────────          ────────────
"investigate alert"    kuberly-monitor only (loki, prom, tempo, k8s_events)
"edit IaC"             kuberly-graph + git tools
"fan-out worker"       kuberly-graph + claim_task
"deploy review"        kuberly-graph + kuberly-monitor (read-only subset)
Each session profile is a different settings.json with a different MCP allowlist. We launch the agent with the right profile based on what the user asked for. The orchestrator picks the profile when it spawns a worker. The dynamic-loading architecture is in Teaching an Agent to Think in Graphs.
Saving: 5–10k tokens of system prompt per session, plus 10–20% fewer reasoning tokens because the agent isn't choosing between irrelevant tools.
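The profile lookup and the overhead estimate can be sketched in a few lines of Python. This is an illustration, not our launcher: `PROFILES`, `allowlist_for`, and `definition_overhead` are hypothetical names, the server names come from the session-shape table, the 200–500 tokens per tool is the range quoted above, and the tool counts in the usage lines are made up.

```python
# MCP server allowlist per session shape, mirroring the table in the text.
PROFILES = {
    "investigate alert": ["kuberly-monitor"],
    "edit IaC": ["kuberly-graph", "git"],
    "fan-out worker": ["kuberly-graph", "claim_task"],
    "deploy review": ["kuberly-graph", "kuberly-monitor (read-only subset)"],
}

# Token cost of one tool definition, as a (low, high) range from the text.
TOKENS_PER_TOOL = (200, 500)


def allowlist_for(shape: str) -> list:
    """Return the allowlist for a session shape; unknown shapes raise KeyError."""
    return PROFILES[shape]


def definition_overhead(tool_count: int) -> tuple:
    """Estimate system-prompt tokens spent on tool definitions, as (low, high)."""
    low, high = TOKENS_PER_TOOL
    return tool_count * low, tool_count * high


print(allowlist_for("investigate alert"))
print("20 tools registered:", definition_overhead(20))  # hypothetical naive setup
print("4 tools registered: ", definition_overhead(4))   # loki, prom, tempo, k8s_events
```

Failing loudly on an unknown shape is deliberate in this sketch: falling back to "load everything" would quietly bring the overhead back.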
Move 4: orchestrator + short-lived workers
The fan-out shape from When one agent isn't enough — fanning out work across an agent fleet is also a token-saving pattern. Compare:
Naive: one agent, 8 clusters, all in context
  Round 1:      context = 50k  (initial repo state)
  Round 5:      context = 180k (8 clusters × growing per-cluster work)
  Round 10:     context = 400k (everything everywhere all at once)
  Total:        ~2–3M tokens
Fleet: orchestrator + 8 short-lived workers
  Orchestrator: context = 8k (plan, not work)
  Each worker:  context = 20–30k (own cluster only)
  Total:        ~250–400k tokens
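That's not 2x; it's 5–10x. Every round re-sends the accumulated context, so a single agent carrying all 8 clusters pays again and again for a context that grows with every cluster's work, and total tokens grow roughly quadratically with the amount of work. Partition the work and that growth stays small. Each worker sees its slice and only its slice. Failures don't pollute siblings. When a worker is done, its context dies with it.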
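A toy cost model shows where the gap comes from. The sketch below assumes every step re-sends the whole accumulated context, that each step adds a fixed number of tokens, and that there is no prompt caching. All constants are invented for illustration and are not meant to reproduce the figures above; `session_input_tokens` is a hypothetical helper.

```python
def session_input_tokens(start_context: int, steps: int, growth_per_step: int) -> int:
    """Total input tokens billed when each step re-sends the accumulated context."""
    return sum(start_context + i * growth_per_step for i in range(steps))


# Illustrative constants (assumptions, not measurements).
CLUSTERS = 8
STEPS_PER_CLUSTER = 5
GROWTH_PER_STEP = 2_000      # tokens each step appends to the context
NAIVE_START = 50_000         # single agent starts with a large shared context
WORKER_START = 20_000        # each worker starts with only its own slice
ORCHESTRATOR_TOTAL = 8_000   # the orchestrator holds the plan, not the work

# One agent handles every cluster in a single, ever-growing session.
naive_total = session_input_tokens(
    NAIVE_START, CLUSTERS * STEPS_PER_CLUSTER, GROWTH_PER_STEP
)

# Each worker runs a short session of its own, then its context is discarded.
fleet_total = ORCHESTRATOR_TOTAL + CLUSTERS * session_input_tokens(
    WORKER_START, STEPS_PER_CLUSTER, GROWTH_PER_STEP
)

print(f"naive: {naive_total:,} tokens")
print(f"fleet: {fleet_total:,} tokens")
print(f"ratio: {naive_total / fleet_total:.1f}x")
```

In this model the naive total grows with the square of the number of steps in one session, while the fleet total grows with the square of the steps per worker, multiplied by the number of workers. More clusters widen the gap.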
What we don't do
- Don't truncate the transcript mid-session. The agent loses thread, starts repeating questions, costs more than the truncation saved.
- Don't compress past tool outputs. We tried summarizing old tool calls to save space; the summarizer drops the one detail the agent needed two turns later.
- Don't share context across workers. Even read-only sharing creates subtle pollution. Each worker is an island.
- Don't disable thinking tokens. Thinking is the cheapest token type per unit of work it produces. Cutting it makes the agent worse and barely saves anything.
The actual savings
For a developer running ~30 agent sessions a day on a customer's IaC repo, the four moves stack roughly like this (illustrative — your mileage will vary by repo size and task mix):
Stage                              Tokens vs. baseline    Effect of this move
─────                              ───────────────────    ───────────────────
Default                            100%                   baseline
+ graph-backed MCP                 ~25%                   4x cut, mostly input
+ caveman                          ~22%                   roughly 10% off the remaining total, all from output
+ smart tool loading               ~20%                   a further 5–10%, from the system prompt
+ orchestrator on parallel work    ~10%                   5–10x on the parallel slice
Roughly 10x fewer tokens on average sessions, more on parallel work. The Anthropic bill that used to be a quarterly conversation is now mostly invisible, which is what good infrastructure should be.
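The stacking is multiplication, which a few lines of Python make explicit. The shares below are the illustrative figures from the table above; `step_factors` is a hypothetical helper that recovers each move's multiplier from consecutive rows.

```python
# Share of baseline tokens remaining after each move (illustrative table values).
remaining = [
    ("default", 1.00),
    ("graph-backed MCP", 0.25),
    ("caveman", 0.22),
    ("smart tool loading", 0.20),
    ("orchestrator on parallel work", 0.10),
]


def step_factors(shares):
    """Per-move multiplier: share after the move divided by share before it."""
    return [
        (name, after / before)
        for (_, before), (name, after) in zip(shares, shares[1:])
    ]


factors = step_factors(remaining)

overall = 1.0
for name, factor in factors:
    overall *= factor  # moves compound by multiplication, not addition
    print(f"{name}: x{factor:.2f} of the previous total")

print(f"overall: {overall:.2f} of baseline, a {1 / overall:.0f}x reduction")
```

Read this way, the later moves are modest multipliers applied to whatever the first move leaves behind.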
What you can run today
Caveman is a one-line install for Claude Code. The graph + MCP pattern is what every Kuberly customer gets by default. Smart tool loading is a settings.json discipline — split your profiles by session shape and launch the agent with the right one. The orchestrator pattern needs the small fan-out MCP server we covered in the agent-fleet post.
None of these are research ideas. All four ship. The compounding is the point: each move stacks on the previous, and the effect is multiplicative, not additive. Skip move 1 and the others save you almost nothing. Do all four and the bill stops being a thing you think about.
Further reading
- Anthropic — Building effective agents — patterns and anti-patterns.
- Anthropic prompt caching — the cheapest savings you can stack on top.
- Model Context Protocol — the standard the tools live under.
- Caveman skill — drop-in Claude Code skill.
- Lost in the Middle (Liu et al.) — empirical evidence context bloat hurts reasoning.
- Teaching an Agent to Think in Graphs — dynamic tool loading.
- When one agent isn't enough — the orchestrator pattern.
Want your agent bill cut without losing capability? Talk to us.