Compressing Prose
Some Claude Code turns are mostly tool calls (see our other benchmarks). Others are prose: status updates, implementation summaries, planning reports, and text-heavy back-and-forth with the user.
These prose turns cost tokens too. They also become part of the conversation history, meaning that they keep costing money on later turns through cache reads.
Fermat has a small prose compression model that targets exactly this surface. We won't get into the implementation details, other than to say that it isn't an LLM. The model drops low-value tokens while preserving the factual content needed by future agents, and it keeps file names/paths, tables, and code blocks verbatim. This happens at write time: when Claude emits an eligible assistant message, Fermat runs the compression model and updates the message before the tokens are re-billed as cache input, saving you money on all subsequent turns without invalidating Anthropic's cache.
Benchmarking on domain-relevant prose
For test data, we used the SALT-NLP/SWE-chat dataset (conversations config, train split). SWE-chat is useful because it contains real, domain-relevant coding-agent session content: back-and-forth, implementation notes, test results, file paths, decisions, constraints, and debugging summaries. This makes it a better fit than, for example, the often-cited but generic MeetingBank dataset.
We selected assistant prose with the following filters:
- role == "assistant"
- turn_type == "assistant_response"
- conversational turns only
- 200 to 2500 characters
- >= 25 words
- mostly ASCII text
- not raw JSON
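The filters above can be sketched as a single predicate. This is a minimal illustration, not the code we ran: the text field name (`content`), the 0.9 ASCII threshold, and the "parses as JSON" check are all assumptions, and the "conversational turns only" criterion is left out because the list does not spell out how it is decided.

```python
import json
from typing import Mapping

MIN_CHARS, MAX_CHARS = 200, 2500
MIN_WORDS = 25
ASCII_THRESHOLD = 0.9  # assumption: "mostly ASCII" is not quantified in the text


def ascii_ratio(text: str) -> float:
    """Fraction of characters in `text` that are ASCII."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def looks_like_raw_json(text: str) -> bool:
    """True if the whole message parses as a JSON object or array."""
    stripped = text.strip()
    if not stripped.startswith(("{", "[")):
        return False
    try:
        json.loads(stripped)
    except ValueError:
        return False
    return True


def is_eligible(row: Mapping[str, str], text_field: str = "content") -> bool:
    """Apply the role, turn type, length, word count, ASCII, and JSON filters.

    `text_field` is a hypothetical column name for the message body.
    """
    text = row.get(text_field) or ""
    return (
        row.get("role") == "assistant"
        and row.get("turn_type") == "assistant_response"
        and MIN_CHARS <= len(text) <= MAX_CHARS
        and len(text.split()) >= MIN_WORDS
        and ascii_ratio(text) >= ASCII_THRESHOLD
        and not looks_like_raw_json(text)
    )
```

A predicate like this can be applied row by row to a streamed dataset, so the full corpus never has to be held in memory.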
We used a streaming shuffle with fixed seeds, and extracted 500 random samples from the dataset.
For each sample, we:
- Sent the original assistant text to our compression model
- Stored the compressed output
- Measured character reduction
- Estimated token-proxy reduction with tiktoken (cl100k_base)
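As a rough sketch of the two measurements, the same reduction formula can be applied with either a character count or a token count. The function names here are hypothetical, and whether the headline number is a per-sample mean or a pooled ratio is not something this sketch assumes.

```python
from typing import Callable


def reduction(original: str, compressed: str,
              count: Callable[[str], int] = len) -> float:
    """Fractional size reduction under `count` (characters by default)."""
    before = count(original)
    if before == 0:
        return 0.0
    return 1.0 - count(compressed) / before


def tiktoken_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Build a token counter backed by tiktoken (third-party, imported lazily)."""
    import tiktoken

    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() makes strings such as "<|endoftext|>" encode as
    # ordinary text instead of raising an error.
    return lambda text: len(enc.encode(text, disallowed_special=()))


if __name__ == "__main__":
    original = "I went ahead and updated the file src/app.py so that the tests pass."
    compressed = "Updated src/app.py; tests pass."
    print(f"character reduction: {reduction(original, compressed):.1%}")
    # Token-proxy reduction (requires tiktoken to be installed):
    # print(f"{reduction(original, compressed, tiktoken_counter()):.1%}")
```

The token figure is only a proxy: cl100k_base is an OpenAI encoding, so it approximates rather than reproduces the token counts Claude is billed on.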
Fact cards
Prompting an LLM to rate compression on a 1-5 scale is subjective and error-prone. Instead, we take an approach based on fact cards.
For each original sample, we had Claude Haiku generate 3-6 atomic fact cards. Each fact is by construction objective, self-contained, checkable with a yes/no answer, and labeled as either critical or supporting depending on the importance of the information it references.
Some examples of facts are:
- File paths that were changed
- Test commands that passed or failed
- A selected implementation option
- A constraint the agent must preserve
- A deployment destination
- A line-number-sensitive claim
For our 500 samples, Claude Haiku generated 2,213 facts, of which 1,171 were labeled as critical.
We then used Claude Haiku to judge whether each fact was preserved in the compressed output text. The judge had to give a binary answer: present=true or present=false. This gives us a much stricter (and more realistic) metric than simply asking "how good does this summary sound?"
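To make the scoring concrete, here is a minimal sketch of how binary judge verdicts could be rolled up into retention metrics. The `FactVerdict` record and `summarize` function are hypothetical names for illustration; the sketch assumes each verdict carries its sample ID, its critical/supporting label, and the judge's present flag.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FactVerdict:
    sample_id: int  # which sample the fact card came from
    critical: bool  # True for "critical", False for "supporting"
    present: bool   # the judge's binary answer for the compressed text


def summarize(verdicts: Iterable[FactVerdict]) -> Dict[str, float]:
    """Aggregate per-fact verdicts into fact-level and sample-level rates."""
    by_sample: Dict[int, List[FactVerdict]] = defaultdict(list)
    for v in verdicts:
        by_sample[v.sample_id].append(v)

    all_facts = [v for vs in by_sample.values() for v in vs]
    critical = [v for v in all_facts if v.critical]

    def rate(items: List[FactVerdict]) -> float:
        # An empty set of facts counts as fully retained.
        return sum(v.present for v in items) / len(items) if items else 1.0

    n = len(by_sample)
    all_kept = sum(all(v.present for v in vs) for vs in by_sample.values())
    critical_kept = sum(
        all(v.present for v in vs if v.critical) for vs in by_sample.values()
    )
    return {
        "fact_retention": rate(all_facts),
        "critical_retention": rate(critical),
        "samples_all_facts": all_kept / n if n else 1.0,
        "samples_all_critical": critical_kept / n if n else 1.0,
    }
```

Reporting both fact-level and sample-level rates matters because one sample can lose several facts: the sample-level rate shows how many messages were affected at all.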
Results
On the 500 samples, the compression model removes about 9.5% of text. While this isn't very much in raw terms, it reflects the information-dense nature of most agentic coding chats. (For comparison, we ran the same compression model on the output of asking Claude to write a 1000-word creative story and saw a 32.3% text reduction. That's why the choice of eval dataset is important!)
The results are as follows:
| Metric | Count | Retention |
|---|---|---|
| Facts present | 2,201/2,213 | 99.5% |
| Critical facts present | 1,164/1,171 | 99.4% |
- Samples in which all facts were preserved: 489/500 (97.8%)
- Samples in which all critical facts were preserved: 494/500 (98.8%)
Since this leads to only a ~10% text reduction, prose compression isn't a substitute for other cost savings. But it does mean we can shave a little off your bill at every turn with functionally no cost to agent performance.