There is a quiet inefficiency running beneath most Claude-powered workflows, and it is costing teams real money. Every time your agent queries a database, reads a log file, or fetches an API response, the LLM receives everything — hundreds of rows when it needed three, a thousand log entries when two contained the error, 50 KB of nested JSON when the relevant payload was 200 bytes. You pay for every single token, whether Claude needed it or not.
That is the problem Headroom was built to fix.
— Tejas Chopra, Headroom creator
What Is Headroom?
Headroom is an open-source Context Optimisation Layer that sits transparently between your application and Claude (or any LLM provider). Instead of letting raw tool outputs, log dumps, and retrieval chunks flood the context window unchecked, Headroom intercepts that data, intelligently compresses it, and forwards only what the model actually needs.
The project ships in three integration modes, so you can adopt it without rewriting your stack:
| Mode | How it works | Best for |
|---|---|---|
| Library | Import Headroom directly into your Python code and compress payloads before you build your prompt | Custom agent pipelines, LangChain, LangGraph |
| Proxy | Set one environment variable; Headroom intercepts all traffic to the LLM — zero code changes | Claude Code, Cursor, Codex, Aider — any tool you use today |
| MCP Server | Exposes headroom_compress, headroom_retrieve, and headroom_stats as MCP tools | Multi-agent workflows, Claude Code with MCP clients |
The Numbers That Matter
Headroom publishes benchmark figures from real workloads. Compression rates vary by content type, and some data compresses far better than others, but the headline range is consistent across independent reports.
Log files and repetitive tool outputs sit at the top end of the reported range: a 45,000-token result compresses to around 4,500, roughly a 90% reduction. Structured search results and RAG chunks compress less aggressively but still deliver meaningful savings.
How Headroom Actually Compresses Context
Headroom does not summarise your data the way a naive LLM call might. It applies a layered statistical and semantic analysis pipeline:
1 · Statistical filtering
When Headroom sees a database result set, it runs anomaly detection. Rows within two standard deviations of the mean for every field are candidates for compression. Errors, outliers, and boundary values are always kept in full — they are almost always what the model needs.
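To make the two-standard-deviation rule concrete, here is a minimal sketch in Python. It is not Headroom's code: the function name `split_rows`, the `threshold` parameter, and the sample data are all assumptions made for illustration, and it only handles numeric fields.

```python
from statistics import mean, pstdev

def split_rows(rows, fields, threshold=2.0):
    """Hypothetical helper: return (kept, compressible).

    A row is kept in full if any listed field lies more than `threshold`
    standard deviations from that field's mean; otherwise it becomes a
    candidate for compression.
    """
    stats = {}
    for f in fields:
        values = [r[f] for r in rows]
        stats[f] = (mean(values), pstdev(values))

    kept, compressible = [], []
    for row in rows:
        is_outlier = any(
            sd > 0 and abs(row[f] - mu) > threshold * sd
            for f, (mu, sd) in stats.items()
        )
        (kept if is_outlier else compressible).append(row)
    return kept, compressible

# Twenty ordinary rows plus one extreme latency value (made-up data).
rows = [{"id": i, "latency_ms": 100 + (i % 3)} for i in range(20)]
rows.append({"id": 20, "latency_ms": 5000})

kept, compressible = split_rows(rows, ["latency_ms"])
```

In this toy data set only the 5000 ms row lands in `kept`; the twenty near-identical rows are the ones a compressor could safely condense.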
2 · BM25 + embedding relevance scoring
Each chunk is scored against the user’s actual query using BM25 keyword matching and embedding similarity. Only the highest-scoring chunks pass through uncompressed. The rest are condensed or dropped.
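The keyword half of this step can be sketched with a small, self-contained Okapi BM25 scorer. This is an illustration under stated assumptions, not Headroom's implementation: `bm25_scores` and `top_chunks` are hypothetical names, tokenisation is plain whitespace splitting, the `k1` and `b` values are common defaults, and the embedding-similarity half is left out entirely.

```python
import math
from collections import Counter

def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25 (whitespace tokens)."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_chunks(query, chunks, keep=1):
    """Return the `keep` highest-scoring chunks, best first."""
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:keep]]

chunks = [
    "INFO request served in 12 ms",
    "ERROR database connection timeout after 30 s",
    "INFO cache warmed for tenant 7",
]
scores = bm25_scores("database timeout error", chunks)
best = top_chunks("database timeout error", chunks)
```

Here only the second chunk shares any terms with the query, so it is the one that would pass through uncompressed; the two INFO lines score zero and become candidates for condensing or dropping.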
3 · Reversible compression (CCR)
This is the feature that separates Headroom from lossy summarisation. If Claude needs a specific item that was compressed, Headroom can restore the full original record on demand. You get aggressive compression without permanently discarding data — the model can always ask for more.
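One way to picture reversible compression is a store that keeps every original record and hands the model a short stub carrying a reference. The sketch below is an assumption-laden illustration, not Headroom's CCR implementation: the `ReversibleStore` class, its `compress` and `retrieve` methods, and the `_ref` field are all invented here.

```python
import hashlib
import json

class ReversibleStore:
    """Hypothetical store: keeps full originals, emits short stubs."""

    def __init__(self):
        self._originals = {}

    def compress(self, record, keep_fields):
        # Serialise deterministically so identical records share one key.
        raw = json.dumps(record, sort_keys=True)
        key = hashlib.sha256(raw.encode()).hexdigest()[:12]
        self._originals[key] = raw
        # The stub carries only the chosen fields plus a reference.
        stub = {f: record[f] for f in keep_fields if f in record}
        stub["_ref"] = key
        return stub

    def retrieve(self, ref):
        # Restore the complete original record on demand.
        return json.loads(self._originals[ref])

store = ReversibleStore()
record = {"id": 42, "status": "failed", "trace": "x" * 500}

stub = store.compress(record, keep_fields=["id", "status"])
restored = store.retrieve(stub["_ref"])
```

The stub is what would go into the prompt; if the model later asks for the full trace, a retrieval call with the reference returns the record unchanged.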
The Hidden Win: Prompt Caching That Actually Works
Anthropic’s prompt caching is one of the most underused cost levers available: it can cut input token costs by up to 90% on repeated calls. But it has a critical limitation in practice: a cache hit requires the prefix of your prompt to be identical across requests. Dynamic tool outputs, timestamps, and session IDs change that prefix constantly, so the cache rarely hits.
Headroom includes a component called CacheAligner that stabilises these dynamic elements before they reach the LLM. By normalising the volatile portions of your context, CacheAligner ensures your system prompt prefix stays consistent across calls — turning prompt caching from a theoretical feature into a real billing line item.
If you use Claude Code with MCP tools and see your token usage spike on every new session, CacheAligner is specifically worth investigating. Many teams report that the caching fix alone recovers more cost than the compression itself.
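The normalisation idea can be sketched in a few lines of Python. This is not CacheAligner's actual logic: the function `align_prefix`, the two regular expressions, and the placeholder labels are assumptions chosen for illustration, and real prompts contain many more kinds of volatile content.

```python
import re

# Assumed patterns for two common volatile values.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
)

def align_prefix(system_prompt):
    """Swap volatile values for fixed placeholders.

    Returns (stable_prompt, volatile), where `volatile` lists the
    (label, original_value) pairs that were removed.
    """
    volatile = []

    def stash(label):
        def _sub(match):
            volatile.append((label, match.group(0)))
            return f"<{label}>"
        return _sub

    stable = TIMESTAMP.sub(stash("TIMESTAMP"), system_prompt)
    stable = UUID.sub(stash("SESSION_ID"), stable)
    return stable, volatile

a, vol_a = align_prefix(
    "Session 3f2b8c1e-9a4d-4e6f-8b2a-1c5d7e9f0a3b started "
    "2025-01-01T10:00:00Z. You are a helpful agent."
)
b, vol_b = align_prefix(
    "Session 7a1c9e2d-4b6f-4a8c-9d0e-2f3a5b7c9d1e started "
    "2025-01-02T11:30:45Z. You are a helpful agent."
)
```

Both calls yield the same stable string, so the prefix can be cached once and reused. In this sketch the extracted values would need to be re-attached after the cached portion (for example in the latest user turn) if the model still needs them; how Headroom handles that is not covered here.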
Bonus Feature: headroom learn
One of Headroom’s less-discussed capabilities is a command that analyses your historical Claude Code conversation logs, identifies patterns in failed tool calls, and writes corrections back into your CLAUDE.md and MEMORY.md files. Subsequent sessions start with those learnings pre-loaded — the same context that previously cost thousands of tokens to re-establish is injected efficiently from the file, not re-transmitted in full from the conversation history.
Multi-Agent Workflows and Shared Memory
As more teams run parallel AI agents — Claude for one task, a separate coding agent for another — context duplication becomes a serious problem. Each agent discovers the same project structure, the same documentation, the same prior decisions independently.
Headroom addresses this with a shared compressed context store. Multiple agents can read from and write to the same store, and automatic deduplication ensures that no agent re-processes context another has already compressed.
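As a rough picture of how deduplication in such a store might work, the following Python sketch keys compressed context by a hash of the original text. It does not use Headroom's API: `SharedContextStore`, `put`, `get`, and the toy `first_line` compressor are all hypothetical names invented for this example.

```python
import hashlib
import threading

class SharedContextStore:
    """Hypothetical store shared by several agents in one process."""

    def __init__(self, compress):
        self._compress = compress      # any callable: str -> str
        self._cache = {}
        self._lock = threading.Lock()
        self.compress_calls = 0        # counts real compression work

    def put(self, text):
        """Compress `text` once; later calls with the same text reuse it."""
        key = hashlib.sha256(text.encode()).hexdigest()
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._compress(text)
                self.compress_calls += 1
        return key

    def get(self, key):
        with self._lock:
            return self._cache[key]

def first_line(text):
    # Stand-in "compressor" for the demo: keep only the first line.
    return text.splitlines()[0]

store = SharedContextStore(first_line)
readme = "Project: billing-service\nLong description...\n"

key_a = store.put(readme)   # agent A discovers the README
key_b = store.put(readme)   # agent B discovers the same README
```

Both agents receive the same key and the compressor runs only once, which is the property the shared store is meant to provide. A real deployment would also need persistence and cross-process access, which this sketch ignores.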
Getting Started in Under Five Minutes
The proxy mode requires no code changes at all. Install Headroom, set one environment variable, and your existing tools start compressing immediately.
For LangChain users, Headroom ships a HeadroomChatModel wrapper that drops in wherever you currently use a standard chat model. The library mode gives you direct access to the compression API if you want finer control over what gets compressed and what passes through untouched.
To validate that compression is preserving answer quality, Headroom ships with built-in evaluation tooling that runs against standard QA benchmarks.
Typical eval results show 95%+ accuracy preservation alongside 40–90% token reduction — numbers you can show to a finance team or a sceptical engineering lead.
Who Benefits Most?
| Workflow | Primary waste source | Expected reduction |
|---|---|---|
| Claude Code / Cursor heavy users | Verbose tool output, repeated project context | 70–90% |
| RAG-powered applications | Over-retrieved chunks, low-relevance documents | 60–85% |
| Log analysis agents | Thousands of log lines, only a handful relevant | 80–95% |
| Database-querying agents | Full result sets returned when 2–3 rows needed | 75–92% |
| Multi-turn chat applications | Growing conversation history resent each turn | 40–65% |
The Context Window Is Not Free
At $3 per million input tokens for Claude Sonnet (and $6 per million beyond 200K context), token waste is a direct hit to your operating margin. The instinct is to celebrate ever-larger context windows as unlimited capability — but the practical reality is that every redundant token costs money, slows inference, and occupies space that could hold something the model actually needs.
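A quick back-of-the-envelope calculation shows what this means in practice. The sketch below uses the $3 per million rate and the 45,000 to 4,500 token example quoted in this article; the figure of 1,000 calls per day is a made-up workload, and the helper `input_cost_usd` is a name introduced here.

```python
PRICE_PER_MILLION_INPUT_USD = 3.00  # Sonnet input rate quoted above

def input_cost_usd(tokens, price_per_million=PRICE_PER_MILLION_INPUT_USD):
    """Cost in USD of sending `tokens` input tokens."""
    return tokens * price_per_million / 1_000_000

raw_tokens = 45_000         # uncompressed tool result (article's example)
compressed_tokens = 4_500   # after compression (article's example)
calls_per_day = 1_000       # hypothetical workload

raw_daily = input_cost_usd(raw_tokens) * calls_per_day
compressed_daily = input_cost_usd(compressed_tokens) * calls_per_day
daily_saving = raw_daily - compressed_daily

print(f"raw: ${raw_daily:.2f}/day, compressed: ${compressed_daily:.2f}/day")
```

Under these assumptions the raw payload costs about $135 per day in input tokens and the compressed one about $13.50, before any effect from prompt caching or the higher long-context rate.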
Headroom is the first well-documented open-source tool to treat this as an engineering problem rather than a prompting problem. The compression is reversible, the accuracy loss is measurable and small, and the integration surface is minimal. For any team running agents or Claude Code at scale, it is worth benchmarking on your own workloads before your next billing cycle.
The project is on GitHub at github.com/chopratejas/headroom.