Jun 16, 2026 · 8 min read

Benchmarking Context Optimization

By Quotient Labs

How Idle Compression Works

As a Claude Code session grows, the context window keeps carrying everything: plans, file reads, search results, tool outputs, edits, test runs, et cetera. Some of this is useful information. Some of it is not. Some only matters for a few turns and then becomes irrelevant baggage.

But none of it is compressed until you type /clear or /compact; you're billed for everything (and re-billed for input tokens on each subsequent turn in the session).

The catch is that we also can't just compact your conversation after every request. Anthropic's prompt caching charges cache writes at 12.5x the rate of cache reads ($6.25/Mtok vs $0.50/Mtok on Opus). If we overwrote the context every time you sent a message, every turn would be a cache miss, billed at the write rate. Unless we cut the token count by at least 92% every single time (which can't be done without losing accuracy), we'd cost you more money than we saved you.
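The 92% figure follows directly from the two prices quoted above. Here is a minimal sketch of that arithmetic, assuming a simplified model in which the whole context is either read from cache or rewritten in full; the function names are illustrative and output tokens are ignored.

```python
# Prices quoted in the post for Opus, in dollars per million tokens.
CACHE_WRITE_PER_MTOK = 6.25
CACHE_READ_PER_MTOK = 0.50


def break_even_reduction(write_price: float, read_price: float) -> float:
    """Fraction of tokens a rewrite must remove to cost the same as a cache hit.

    Leaving N tokens untouched costs N * read_price (cache hit).
    Rewriting them down to (1 - r) * N tokens costs (1 - r) * N * write_price
    (cache miss). Setting the two equal gives r = 1 - read_price / write_price.
    """
    return 1.0 - read_price / write_price


def turn_cost(tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of billing `tokens` at the given per-million-token price."""
    return tokens / 1_000_000 * price_per_mtok


if __name__ == "__main__":
    r = break_even_reduction(CACHE_WRITE_PER_MTOK, CACHE_READ_PER_MTOK)
    print(f"break-even reduction: {r:.0%}")

    # Hypothetical 100k-token context: cache hit vs. rewriting it at half the size.
    print("cache hit:", turn_cost(100_000, CACHE_READ_PER_MTOK))
    print("rewrite at 50% size:", turn_cost(50_000, CACHE_WRITE_PER_MTOK))
```

Under these assumptions, even halving a 100k-token context on every turn costs several times more than leaving it cached, which is why per-turn compaction loses money.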

Fortunately, there's a silver lining. Anthropic's prompt cache has a 300-second TTL, so after 5 minutes of inactivity (which happens reasonably often for most users) your previous cache is invalidated entirely. On the next turn, the whole context is rebilled at the cache write rate before it can count as a cache hit again. That means any optimization we do after 5 minutes of inactivity, but before the next message re-creates the cache, saves you money on cache writes (in the first message after the TTL expires), and the savings compound across all subsequent cache reads in that session.
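To make the compounding concrete, here is a rough sketch of what a smaller context saves once the cache has expired. It assumes the context size stays fixed for the rest of the session and that every later turn is a cache hit; the token counts and turn count in the example are hypothetical, not measurements.

```python
# Prices quoted in the post for Opus, in dollars per million tokens.
CACHE_WRITE_PER_MTOK = 6.25
CACHE_READ_PER_MTOK = 0.50


def cost_after_idle(context_tokens: int, later_turns: int) -> float:
    """Cost of one cache write after TTL expiry plus `later_turns` cache reads."""
    write = context_tokens / 1_000_000 * CACHE_WRITE_PER_MTOK
    reads = later_turns * context_tokens / 1_000_000 * CACHE_READ_PER_MTOK
    return write + reads


def idle_compression_savings(
    context_tokens: int, compressed_tokens: int, later_turns: int
) -> float:
    """Dollars saved by shrinking the context before the post-idle cache write."""
    return cost_after_idle(context_tokens, later_turns) - cost_after_idle(
        compressed_tokens, later_turns
    )


if __name__ == "__main__":
    # Hypothetical session: 200k tokens compressed to 150k, then 10 more turns.
    print(idle_compression_savings(200_000, 150_000, later_turns=10))
```

In this toy case the saving is about $0.56: roughly $0.31 on the first write after the idle period and $0.25 spread over the ten later reads.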

We run idle compression in exactly this window, compressing the older part of the conversation while keeping the freshest context raw.
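As an illustration of the timing and the old/fresh split only, here is a minimal sketch. The `Message` type, the `keep_recent` cutoff, and both function names are hypothetical, and the post does not say how the boundary between older and fresh context is actually chosen.

```python
from dataclasses import dataclass

CACHE_TTL_SECONDS = 300  # TTL quoted in the post


@dataclass
class Message:
    role: str
    text: str


def cache_has_expired(
    last_activity: float, now: float, ttl: float = CACHE_TTL_SECONDS
) -> bool:
    """True once the session has been idle for at least `ttl` seconds."""
    return now - last_activity >= ttl


def split_for_compression(
    messages: list[Message], keep_recent: int
) -> tuple[list[Message], list[Message]]:
    """Return (older, fresh): `older` may be compressed, `fresh` is kept raw.

    `keep_recent` is an assumed fixed-size cutoff used here for illustration.
    """
    if keep_recent <= 0:
        return list(messages), []
    return list(messages[:-keep_recent]), list(messages[-keep_recent:])


if __name__ == "__main__":
    history = [Message("user", f"turn {i}") for i in range(5)]
    if cache_has_expired(last_activity=0.0, now=420.0):
        older, fresh = split_for_compression(history, keep_recent=2)
        print(len(older), "messages to compress;", len(fresh), "kept raw")
```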

Our compression process is not just an LLM summary call (unlike Claude Code's /compact…), because while that may make for a pleasant overview read, the important context is often in the small details: the exact test command that failed, the one file still left to edit, the user's chosen option after a tradeoff, the warning not to repeat an approach that already failed, or the difference between "this has been verified" and "this still needs to be checked."

Instead, we break the session context into episodes (semantically coherent blocks), and for each block we focus on preserving the working state a future coding agent actually needs to pick up where the last one left off. The compressed context is designed to make re-entry fast: instead of forcing the next model call to infer the current state from scattered fragments, our optimized context focuses on what's been done, what remains, what constraints still apply, what has been (and needs to be) verified, and what behaviors should be avoided. Unfortunately, we can't share more of the details. Sorry!
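The kinds of working state listed above can be pictured as a small record per episode. The sketch below is purely illustrative: the `EpisodeState` class, its field names, and the rendered text layout are assumptions made for this example, not the actual internal format, which the post does not disclose.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeState:
    """Hypothetical per-episode record of the state a future agent needs."""

    title: str
    done: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    verified: list[str] = field(default_factory=list)
    needs_verification: list[str] = field(default_factory=list)
    avoid: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the non-empty sections as plain text for the next model call."""
        sections = [
            ("Done", self.done),
            ("Remaining", self.remaining),
            ("Constraints", self.constraints),
            ("Verified", self.verified),
            ("Needs verification", self.needs_verification),
            ("Avoid", self.avoid),
        ]
        lines = [f"Episode: {self.title}"]
        for heading, items in sections:
            if items:
                lines.append(f"{heading}:")
                lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example content.
    state = EpisodeState(
        title="Fix flaky login test",
        done=["Reproduced the failure locally"],
        remaining=["Update the retry helper"],
        avoid=["Raising the timeout; it was already tried and did not help"],
    )
    print(state.render())
```

Keeping "verified" and "needs verification" as separate fields mirrors the distinction the post draws between "this has been verified" and "this still needs to be checked."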

Benchmarking Idle Compression

Benchmarking idle compression is not simply a question of evaluating whether a summary captures the 'main points' of a task. Instead, we do our best to estimate semantic loss on real, multi-turn coding sessions.

Dataset

We evaluated six saved coding-agent sessions from the SALT-NLP/SWE-chat corpus, a dataset of coding-agent conversations over real software repositories.

We intentionally evaluated both tool-heavy and mixed sessions, because real Claude Code work (especially when more autonomous) is often dominated by file reads, searches, command output, and edits rather than prose alone.

| Repo | Session type | Tool share | Prose chars |
| --- | --- | --- | --- |
| entireio/cli | tool-heavy | 85.88% | 53,738 |
| nylanalyn/jeeves | mixed | 41.01% | 128,818 |
| adhishthite/anthropic-clio-impl | tool-heavy | 91.73% | 9,644 |
| entireio/cli | tool-heavy | 75.08% | 145,401 |
| anchoo2kewl/SpringSpark | mixed | 64.49% | 87,714 |
| jeevanpillay/dual | mixed | 53.39% | 38,838 |

Total prose evaluated: 464,153 characters.

Continuation Items

For each original session, we had Claude Haiku generate a fixed set of continuation items from the uncompressed context. These are facts a future coding agent might need to keep working:

  • the active goal
  • the current implementation state
  • the next action
  • constraints from the user
  • verification results
  • blockers
  • decisions already made
  • exact files, commands, counts or names (when exactness matters)
  • mistakes or dead ends to avoid repeating

We also had Haiku label each item as supporting, critical, or exact-required (meaning a broad paraphrase is not enough, and the concrete command, file, count, path, or decision must be preserved verbatim).
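One way to picture a continuation item is as a small typed record. This is a sketch only: the class and member names are assumptions, while the nine categories and three importance labels are taken from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    ACTIVE_GOAL = "active goal"
    IMPLEMENTATION_STATE = "current implementation state"
    NEXT_ACTION = "next action"
    USER_CONSTRAINT = "constraint from the user"
    VERIFICATION_RESULT = "verification result"
    BLOCKER = "blocker"
    DECISION = "decision already made"
    EXACT_DETAIL = "exact file, command, count or name"
    DEAD_END = "mistake or dead end to avoid"


class Importance(Enum):
    SUPPORTING = "supporting"
    CRITICAL = "critical"
    EXACT_REQUIRED = "exact-required"  # paraphrase is not enough


@dataclass(frozen=True)
class ContinuationItem:
    category: Category
    text: str
    importance: Importance


if __name__ == "__main__":
    # Made-up example item; the command is illustrative only.
    item = ContinuationItem(
        category=Category.EXACT_DETAIL,
        text="Failing command: pytest tests/test_auth.py -k login",
        importance=Importance.EXACT_REQUIRED,
    )
    print(item)
```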

Judging

We then had an independent Claude Haiku instance judge whether the compressed context supported each continuation item for a future agent picking up the work, assigning one of five labels: fully_supported, actionably_supported, partially_supported, missing, or contradicted.

We count the labels fully_supported and actionably_supported as successes when reporting our baseline numbers, and everything else as failures.
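The success rule above reduces to a few lines of code. The sketch below assumes the judge's output has already been parsed into plain label strings; the function names and the toy data are illustrative, not taken from the actual evaluation harness.

```python
SUCCESS_LABELS = {"fully_supported", "actionably_supported"}
ALL_LABELS = SUCCESS_LABELS | {"partially_supported", "missing", "contradicted"}


def support_rate(labels: list[str]) -> float:
    """Share of items judged fully_supported or actionably_supported."""
    if not labels:
        raise ValueError("no labels to score")
    unknown = set(labels) - ALL_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return sum(label in SUCCESS_LABELS for label in labels) / len(labels)


def support_rate_by_importance(
    judgements: list[tuple[str, str]],
) -> dict[str, float]:
    """Group (importance, label) pairs by importance and score each group."""
    grouped: dict[str, list[str]] = {}
    for importance, label in judgements:
        grouped.setdefault(importance, []).append(label)
    return {imp: support_rate(labels) for imp, labels in grouped.items()}


if __name__ == "__main__":
    # Toy judgements, not real results.
    toy = [
        ("critical", "fully_supported"),
        ("critical", "missing"),
        ("supporting", "actionably_supported"),
        ("exact-required", "partially_supported"),
    ]
    print(support_rate([label for _, label in toy]))
    print(support_rate_by_importance(toy))
```

Note that partially_supported counts as a failure under this rule, so the reported rates are deliberately strict.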

Results

The launch version preserves the information needed to continue the sessions while still removing most of the old prose. (We keep most of our space-optimized tool results verbatim, since they're already much more compact than the raw Claude Code versions and often contain relevant context.)

| Metric | Result |
| --- | --- |
| Overall support | 92.59% |
| Critical-item support | 93.94% |
| Exact-required support | 89.19% |
| Hard loss | 1.85% |
| Prose kept | 20.14% |
| Prose removed | 79.86% |
| Total byte reduction | 22.8% |

This behavior is exactly what we want. Old prose gets cut down aggressively, but the working state survives. A future coding agent can still recover the goal, constraints, verification state, next actions, et cetera, while we remove ~80% of compressible prose and ~23% of the overall older context. Just remember: that ~23% is not a one-time saving. Every message after compression runs reads the smaller context again, so the dollar savings keep accumulating for the rest of the session!