Jun 16, 2026 (8 min read)

Benchmarking Context Optimization

By Quotient Labs

How Idle Compression Works

As a Claude Code session grows, the context window keeps carrying everything: plans, file reads, search results, tool outputs, edits, test runs, et cetera. Some of this is useful information. Some of it is not. Some only matters for a few turns and then becomes irrelevant baggage.

None of it is compressed until you type /clear or /compact. Until then, you're billed for all of it, and billed again as input tokens on every subsequent turn in the session.

The catch is that we can't simply compact your conversation after every request. Anthropic's prompt caching charges cache writes at 12.5x the rate of cache reads ($6.25/Mtok vs $0.50/Mtok on Opus). If we rewrote the context every time you sent a message, every turn would be a cache miss billed at the write rate. Unless we cut the token count by at least 92% every single time (1 - 0.50/6.25 = 0.92), which can't be done without losing accuracy, we'd cost you more money than we saved.
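To make the 92% figure concrete, here is a minimal sketch of the break-even arithmetic. It assumes only the two Opus prices quoted above and deliberately ignores everything else on the bill (the new user message, output tokens, and the cost of running compression itself). The function names are ours, for illustration only.

```python
# Prices quoted above for Opus, in dollars per million tokens.
CACHE_WRITE_PER_MTOK = 6.25
CACHE_READ_PER_MTOK = 0.50


def turn_cost_cached(tokens: int) -> float:
    """Cost of re-sending an unchanged prefix that is still cached (cache read)."""
    return tokens / 1_000_000 * CACHE_READ_PER_MTOK


def turn_cost_rewritten(tokens: int, reduction: float) -> float:
    """Cost of sending a prefix shrunk by `reduction`, billed as a cache write."""
    return tokens * (1 - reduction) / 1_000_000 * CACHE_WRITE_PER_MTOK


def break_even_reduction() -> float:
    """Smallest reduction at which a rewrite costs no more than a cache read."""
    return 1 - CACHE_READ_PER_MTOK / CACHE_WRITE_PER_MTOK
```

Under these assumptions `break_even_reduction()` comes out to 0.92, and halving a 100,000-token prefix on a warm cache still costs several times more than leaving it alone.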

Fortunately, there's a silver lining. Anthropic's prompt cache has a 300-second TTL, so after 5 minutes of inactivity (which happens reasonably often for most users) your previous cache is invalidated entirely. The next turn is billed at the cache write rate before the context can count as a cache hit again. Any optimization we apply after those 5 minutes of inactivity, but before the next message repopulates the cache, therefore saves you money on the cache write in the first message after expiry, and the savings compound across all subsequent cache reads in that session.
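The compounding effect can be sketched with a simplified cost model. This is an illustration under loud assumptions, not our billing logic: the context size stays fixed after the idle period, the cache never expires again, and only the cached prefix is priced. All names are hypothetical.

```python
# Prices quoted above for Opus, in dollars per million tokens.
CACHE_WRITE_PER_MTOK = 6.25
CACHE_READ_PER_MTOK = 0.50


def cost_after_idle(context_tokens: int, later_turns: int) -> float:
    """One cache write on the first turn after expiry, then one cache read per later turn."""
    write = context_tokens / 1_000_000 * CACHE_WRITE_PER_MTOK
    reads = later_turns * context_tokens / 1_000_000 * CACHE_READ_PER_MTOK
    return write + reads


def savings_from_idle_compression(
    context_tokens: int, reduction: float, later_turns: int
) -> float:
    """Dollars saved by shrinking the context by `reduction` while the cache is cold."""
    compressed_tokens = round(context_tokens * (1 - reduction))
    return cost_after_idle(context_tokens, later_turns) - cost_after_idle(
        compressed_tokens, later_turns
    )
```

In this model, a 200,000-token context compressed by 25% saves about $0.31 on the first turn alone and about $0.56 once 10 further turns have read the smaller cache.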

We run idle compression in exactly this window, compressing the older part of the conversation while keeping the freshest context raw.
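As a rough illustration of the timing and the old/fresh split, consider the sketch below. It is not our implementation: the `Message` type, the `keep_recent` cutoff, and the choice to treat the cache as expired at exactly 300 seconds are all assumptions made for the example.

```python
from dataclasses import dataclass

CACHE_TTL_SECONDS = 300  # the TTL quoted above


@dataclass
class Message:
    role: str
    text: str


def cache_has_expired(
    last_activity: float, now: float, ttl: float = CACHE_TTL_SECONDS
) -> bool:
    """True once the session has been idle for at least `ttl` seconds."""
    return now - last_activity >= ttl


def split_for_compression(
    messages: list[Message], keep_recent: int
) -> tuple[list[Message], list[Message]]:
    """Return (older messages to compress, fresh tail to keep raw)."""
    if keep_recent <= 0:
        # Nothing is kept raw; guard against messages[:-0], which would be empty.
        return list(messages), []
    return messages[:-keep_recent], messages[-keep_recent:]
```

A scheduler built on this would only touch the older slice once `cache_has_expired` returns true, so no warm cache is ever overwritten.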

Our compression process is not just an LLM summary call (unlike Claude Code's /compact…). A summary may make for a pleasant overview, but the important context is often in the small details: the exact test command that failed, the one file still left to edit, the user's chosen option after a tradeoff, the warning not to repeat an approach that already failed, or the difference between "this has been verified" and "this still needs to be checked."

Instead, we break the session context into episodes (semantically coherent blocks) and, for each one, focus on preserving the working state a future coding agent actually needs to pick up where the session left off. The compressed context is designed to make re-entry fast: instead of forcing the next model call to infer the current state from scattered fragments, it states what's been done, what remains, what constraints still apply, what has been (and still needs to be) verified, and what behaviors should be avoided. Unfortunately, we can't share more of the details. Sorry!

Benchmarking Idle Compression

Benchmarking idle compression is not simply a question of evaluating whether a summary captures the 'main points' of a task. Instead, we do our best to estimate semantic loss on real, multi-turn coding sessions.

Dataset

We evaluated six saved coding-agent sessions from the SALT-NLP/SWE-chat corpus, a dataset of coding-agent conversations over real software repositories.

We intentionally evaluated both tool-heavy and mixed sessions, because real Claude Code work (especially when more autonomous) is often dominated by file reads, searches, command output, and edits rather than prose alone.

Repo	Session type	Tool share	Prose chars
entireio/cli	tool-heavy	85.88%	53,738
nylanalyn/jeeves	mixed	41.01%	128,818
adhishthite/anthropic-clio-impl	tool-heavy	91.73%	9,644
entireio/cli	tool-heavy	75.08%	145,401
anchoo2kewl/SpringSpark	mixed	64.49%	87,714
jeevanpillay/dual	mixed	53.39%	38,838

Total prose evaluated: 464,153 characters.

Continuation Items

For each original session, we had Claude Haiku generate a fixed set of continuation items from the uncompressed context. These are facts a future coding agent might need to keep working:

the active goal
the current implementation state
the next action
constraints from the user
verification results
blockers
decisions already made
exact files, commands, counts or names (when exactness matters)
mistakes or dead ends to avoid repeating

We also had Haiku label each item as supporting, critical, or exact-required. Exact-required means a broad paraphrase is not enough: the concrete command, file, count, path, or decision must be preserved verbatim.
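One way to picture a labeled continuation item is as a small record. The sketch below is purely illustrative: the class and field names are hypothetical and are not the schema our pipeline uses.

```python
from dataclasses import dataclass
from enum import Enum


class Importance(Enum):
    SUPPORTING = "supporting"
    CRITICAL = "critical"
    EXACT_REQUIRED = "exact-required"


@dataclass(frozen=True)
class ContinuationItem:
    kind: str  # one of the item types listed above, e.g. "next action"
    statement: str  # the fact as extracted from the uncompressed context
    importance: Importance


def needs_verbatim_match(item: ContinuationItem) -> bool:
    """Exact-required items must survive compression word for word."""
    return item.importance is Importance.EXACT_REQUIRED


# Made-up example item, not taken from the evaluated sessions.
example = ContinuationItem(
    kind="verification result",
    statement="The unit test suite passes after the last edit.",
    importance=Importance.CRITICAL,
)
```

Keeping the importance label on the item itself makes it easy to report support separately for critical and exact-required items.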

Judging

We then had an independent Claude Haiku instance judge whether the compressed context supported each continuation item for a future agent picking up the work, assigning one of five labels: fully_supported, actionably_supported, partially_supported, missing, or contradicted.

We count the labels fully_supported and actionably_supported as successes when reporting our baseline numbers, and everything else as failures.
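The success rule above reduces to a simple tally. The following sketch shows how a support rate could be computed from a list of judge labels. The function name and the input format are assumptions for illustration, and the example labels in the usage note are made up.

```python
LABELS = (
    "fully_supported",
    "actionably_supported",
    "partially_supported",
    "missing",
    "contradicted",
)
SUCCESS_LABELS = frozenset({"fully_supported", "actionably_supported"})


def support_rate(labels: list[str]) -> float:
    """Fraction of judged items labeled fully_supported or actionably_supported."""
    if not labels:
        raise ValueError("no labels to score")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    successes = sum(1 for label in labels if label in SUCCESS_LABELS)
    return successes / len(labels)
```

For instance, four items labeled fully_supported, actionably_supported, partially_supported, and missing would score 0.5. Filtering the list down to critical or exact-required items first gives the per-category rates.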

Results

The launch version preserves the information needed to continue the sessions while still removing most of the old prose. (We keep most of our space-optimized tool results verbatim, since they're already much more compact than the raw Claude Code versions and often contain relevant context.)

Metric	Result
Overall support	92.59%
Critical-item support	93.94%
Exact-required support	89.19%
Hard loss	1.85%
Prose kept	20.14%
Prose removed	79.86%
Total byte reduction	22.8%

This behavior is exactly what we want. Old prose gets cut down aggressively, but the working state survives. A future coding agent can still recover the goal, constraints, verification state, next actions, et cetera, while we remove ~80% of compressible prose and ~23% of the overall older context. Just remember that the 23% applies to every turn, not just one: the smaller context is billed again on each message after compression runs, so the total savings keep adding up over the rest of the session!
