Jun 14, 2026 · 5 min read

Compressing Prose

By Quotient Labs

Some Claude Code turns are mostly tool calls (see our other benchmarks). Others are prose: status updates, implementation summaries, planning reports, and text-heavy back-and-forth with the user.

These prose turns cost tokens too. They also become part of the conversation history, meaning that they keep costing money on later turns through cache reads.

Fermat has a small prose compression model that targets this specific surface. We won't get into the implementation details, other than to say that it isn't an LLM. The model drops low-value tokens while preserving the factual content needed by future agents, and it keeps file names and paths, tables, and code blocks verbatim. This happens at write time: when Claude emits an eligible assistant message, Fermat runs the compression model and updates the message before its tokens are billed again as cached input. That saves you money on all subsequent turns without invalidating Anthropic's cache.
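To make the "keeps costing money" point concrete, here is a rough back-of-the-envelope sketch. It assumes the whole message is read from the prompt cache once on every later turn. The function name and the example numbers are purely illustrative, not measurements or pricing.

```python
def cache_read_tokens_saved(message_tokens: int, reduction: float, later_turns: int) -> float:
    """Tokens no longer read from cache after compressing one message.

    Assumption (illustrative): the full message is re-read from the
    prompt cache exactly once on each later turn of the conversation.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return message_tokens * reduction * later_turns


# Hypothetical numbers: a 400-token summary, compressed by 10%,
# that stays in history for 30 more turns.
saved = cache_read_tokens_saved(400, 0.10, 30)
```

Multiplying the result by your cache-read price per token gives a rough dollar figure for that one message.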

Benchmarking on domain-relevant prose

For test data, we used the train split of the conversations config of the SALT-NLP/SWE-chat dataset. SWE-chat is useful because it contains real, domain-relevant coding agent sessions: back-and-forth, implementation notes, test results, file paths, decisions, constraints, and debugging summaries (rather than, for example, the often-cited but generic MeetingBank dataset).

We selected assistant prose with the following filters:

role == "assistant"
turn_type == "assistant_response"
conversational turns only
200 to 2500 characters
>= 25 words
mostly ASCII text
not raw JSON
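The filters above could be expressed as a single predicate along these lines. This is a sketch under assumptions: the `text` field name, the 90% ASCII threshold, and the JSON check are illustrative choices rather than the exact rules used, and the "conversational turns only" filter is not modeled because it depends on dataset metadata.

```python
import json


def is_raw_json(text: str) -> bool:
    """True if the whole text parses as a JSON object or array."""
    stripped = text.strip()
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False
    return True


def ascii_fraction(text: str) -> float:
    """Share of characters that are ASCII (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def is_eligible(record: dict, min_ascii: float = 0.9) -> bool:
    """Apply the selection filters to one record.

    `record["text"]` and `min_ascii` are assumed names/values.
    """
    text = record.get("text", "")
    if record.get("role") != "assistant":
        return False
    if record.get("turn_type") != "assistant_response":
        return False
    if not 200 <= len(text) <= 2500:   # 200 to 2500 characters
        return False
    if len(text.split()) < 25:         # >= 25 words
        return False
    if ascii_fraction(text) < min_ascii:
        return False
    return not is_raw_json(text)
```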

We used a streaming shuffle with fixed seeds, and extracted 500 random samples from the dataset.
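For readers unfamiliar with the idea, a seeded streaming shuffle can be sketched with a fixed-size buffer, as below. This is a generic illustration, not the pipeline's actual code: the function names and the buffer size are assumptions.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], seed: int, buffer_size: int = 1000) -> Iterator[T]:
    """Approximately shuffle a stream without loading it all into memory."""
    rng = random.Random(seed)  # fixed seed makes the order reproducible
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


def sample_stream(items: Iterable[T], n: int, seed: int, buffer_size: int = 1000) -> List[T]:
    """Take the first n items of the shuffled stream."""
    return list(islice(buffered_shuffle(items, seed, buffer_size), n))
```

Note that when the buffer is much smaller than the stream, the first n items are drawn mostly from the start of the stream, so the buffer size affects how random the sample is.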

For each sample, we:

Sent the original assistant text to our compression model
Stored the compressed output
Measured character reduction
Estimated token reduction, using tiktoken's cl100k_base encoding as a proxy tokenizer
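The two measurements can be sketched as follows. The helper names are illustrative. `tiktoken.get_encoding` and `Encoding.encode` are real tiktoken calls, but tiktoken is a third-party package that has to be installed, so it is imported lazily here.

```python
from typing import Callable


def reduction(original: str, compressed: str, measure: Callable[[str], int] = len) -> float:
    """Fraction of the original removed, under the given size measure.

    With the default `len`, this is character reduction.
    """
    before = measure(original)
    if before == 0:
        return 0.0
    return 1.0 - measure(compressed) / before


def make_token_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Build a token counter backed by tiktoken (imported only when called)."""
    import tiktoken  # third-party: pip install tiktoken

    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats strings like "<|endoftext|>" as plain text
    # instead of raising an error.
    return lambda text: len(enc.encode(text, disallowed_special=()))
```

Usage would look like `reduction(original, compressed)` for characters and `reduction(original, compressed, measure=make_token_counter())` for the token proxy.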

Fact cards

Prompting an LLM to rate compression on a 1-5 scale is subjective and error-prone. Instead, we take an approach based on fact cards.

For each original sample, we had Claude Haiku generate 3-6 atomic fact cards. Each fact is by construction objective, self-contained, checkable with a yes/no answer, and labeled as either critical or supporting depending on the importance of the information it references.

Some examples of facts are:

File paths that were changed
Test commands that passed or failed
A selected implementation option
A constraint the agent must preserve
A deployment destination
A line-number-sensitive claim

For our 500 samples, Claude generated 2,213 facts, of which 1,171 were labeled as critical.

We then used Claude Haiku to judge whether each fact was preserved in the compressed output text. The judge had to answer with a binary present=true or present=false. This gives us a much stricter (and more realistic) metric than simply asking "how good does this summary sound?"
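Once every fact has a binary verdict, the retention numbers are simple counts. The sketch below shows one way to aggregate them. The `FactVerdict` structure and its field names are hypothetical, and a sample with no critical facts is assumed to count as having all critical facts preserved.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FactVerdict:
    sample_id: str   # which original sample the fact came from
    fact: str        # the atomic, yes/no-checkable statement
    critical: bool   # True for critical, False for supporting
    present: bool    # the judge's binary answer on the compressed text


def summarize(verdicts: Iterable[FactVerdict]) -> dict:
    """Aggregate per-fact verdicts into fact-level and sample-level counts."""
    by_sample = defaultdict(list)
    for v in verdicts:
        by_sample[v.sample_id].append(v)

    all_verdicts = [v for vs in by_sample.values() for v in vs]
    critical = [v for v in all_verdicts if v.critical]
    return {
        "facts_present": sum(v.present for v in all_verdicts),
        "facts_total": len(all_verdicts),
        "critical_present": sum(v.present for v in critical),
        "critical_total": len(critical),
        # A sample counts only if every one of its facts survived.
        "samples_all_facts": sum(
            all(v.present for v in vs) for vs in by_sample.values()
        ),
        "samples_all_critical": sum(
            all(v.present for v in vs if v.critical) for vs in by_sample.values()
        ),
        "samples_total": len(by_sample),
    }
```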

Results

On the 500 samples, the compression model removes about 9.5% of text. While this isn't very much in raw terms, it reflects the information-dense nature of most agentic coding chats. (For comparison, we ran the same compression model on a 1000-word creative story written by Claude and saw a 32.3% text reduction, which is why the choice of eval dataset matters.)

The results are as follows:

Metric	Count	Retention
Facts present	2,201/2,213	99.5%
Critical facts present	1,164/1,171	99.4%

Samples in which all facts were preserved: 489/500 (97.8%)
Samples in which all critical facts were preserved: 494/500 (98.8%)

Since this leads to only about a 10% text reduction, prose compression is no substitute for larger sources of savings. But it does mean we can shave a little off your bill at every turn with functionally no cost to agent performance.
