We Measured It — How Many Tokens a Memory Layer Actually Saves Your Coding Agent
By The LLMtoMD team
Everyone selling an "AI memory layer" throws around token-savings numbers. Most are hand-waved. So we measured ours on a real document and we're showing our work — including the caveats that make the number honest.
The setup
We took an actual 50-page functional requirements & system design document (an FRD for an aviation data platform) and converted it to clean Markdown with LLMtoMD. Measured:
- 143,134 characters
- 22,070 words
- 35,783 tokens
- split into 203 retrievable chunks
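As a quick sanity check, the averages below are derived only from the four counts above. This is a minimal sketch: the helper name `ratios` is illustrative, and because token counts vary by tokenizer, the same ratios on other documents will differ.

```python
# Figures reported above for the converted FRD.
CHARS = 143_134
WORDS = 22_070
TOKENS = 35_783
CHUNKS = 203


def ratios(chars: int, words: int, tokens: int, chunks: int) -> dict[str, float]:
    """Derive per-unit averages from the raw counts."""
    return {
        "chars_per_token": chars / tokens,
        "tokens_per_word": tokens / words,
        "tokens_per_chunk": tokens / chunks,
    }


stats = ratios(CHARS, WORDS, TOKENS, CHUNKS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

The figures work out to roughly 4 characters per token, about 1.6 tokens per word, and an average chunk of about 176 tokens.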
Then we asked one ordinary question a coding agent building this system would ask: "What are the authentication and authorization requirements for the platform?" — two different ways.
Way 1 — no memory layer (load the whole doc)
The default move: put the FRD in the agent's context so it can "read" it. To answer that one question, the model carries the entire 35,783-token document.
Way 2 — memory layer (retrieve only what's needed)
With the document stored in LLMtoMD, the agent runs a semantic search and pulls back only the relevant passages. For that question it retrieved the top 5 chunks — about 1,180 tokens — and those chunks contained the exact answer: the IAM requirements (FR-IAM-001 through 010), RBAC + ABAC authorization, multi-factor auth for privileged actions, OIDC/SAML federation. The agent then synthesized a cited answer from that slice.
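The retrieval step can be sketched as a top-k search by cosine similarity. This is an illustration of the general technique, not LLMtoMD's implementation: the chunk ids, the three-dimensional vectors, and the per-chunk token counts are all made-up toy values, and a real system would get its vectors from an embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float], int]], k: int = 5):
    """Return the k chunks most similar to the query, best first.

    Each chunk is (chunk_id, embedding_vector, token_count).
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return ranked[:k]


# Hypothetical chunks with toy embeddings (assumptions, not real data).
chunks = [
    ("iam-requirements", [0.9, 0.1, 0.0], 240),
    ("rbac-abac", [0.8, 0.2, 0.1], 230),
    ("flight-data-schema", [0.0, 0.9, 0.3], 180),
    ("mfa-policy", [0.85, 0.05, 0.1], 220),
    ("deployment-topology", [0.1, 0.2, 0.9], 200),
]
query = [1.0, 0.0, 0.0]  # stands in for the embedded auth question

hits = top_k(query, chunks, k=3)
hit_ids = [chunk_id for chunk_id, _, _ in hits]
context_tokens = sum(tokens for _, _, tokens in hits)
print(hit_ids, context_tokens)
```

Only the selected chunks go into the agent's context, so the token cost is the sum of their sizes rather than the size of the whole document.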
The result
| Approach | Tokens for the question |
|---|---|
| Load the whole FRD | 35,783 |
| Retrieve only what's relevant | ≈ 1,180 |
| Reduction | ≈ 96.7% |
For that question, the memory layer used ~30× fewer tokens — and lost nothing, because the retrieved slice held the complete answer.
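The arithmetic behind those two figures, as a short check using only the numbers reported above:

```python
FULL_DOC_TOKENS = 35_783   # whole FRD in context
RETRIEVED_TOKENS = 1_180   # top 5 chunks (approximate)

reduction = 1 - RETRIEVED_TOKENS / FULL_DOC_TOKENS
ratio = FULL_DOC_TOKENS / RETRIEVED_TOKENS

print(f"reduction: {reduction:.1%}")
print(f"ratio: {ratio:.1f}x")
```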
Now scale it to a real build session
One question is the conservative case. Coding agents run long — and without a memory layer, that 35,783-token document tends to sit in context across many turns:
- Without: keep the FRD in context across a 20-turn build → 35,783 × 20 ≈ 716,000 tokens of document context.
- With: retrieve per question, say 10 lookups × ~1.2k ≈ 12,000 tokens.
That's a ~98% reduction in document-context tokens over the session.
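The session estimate generalizes to other document sizes and session lengths. A minimal sketch, where the function name `session_tokens` and its parameters are illustrative, and the 20 turns, 10 lookups, and 1,200 tokens per lookup are the scenario assumptions from above rather than measurements:

```python
def session_tokens(doc_tokens: int, turns: int, lookups: int, tokens_per_lookup: int):
    """Compare document-context tokens over a session.

    stuffed:   the full document is carried on every turn
    retrieved: only the retrieved slices are carried, once per lookup
    """
    stuffed = doc_tokens * turns
    retrieved = lookups * tokens_per_lookup
    return stuffed, retrieved, 1 - retrieved / stuffed


stuffed, retrieved, reduction = session_tokens(35_783, 20, 10, 1_200)
print(f"{stuffed:,} vs {retrieved:,} tokens -> {reduction:.1%} reduction")
```

Plug in your own document size and lookup count to see how the reduction moves. It shrinks when the document is small or when most turns need a lookup.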
The honest caveat
Two things keep this number from being a lie:
- Prompt caching narrows the cost gap. Many model APIs can cache a stable block of context that is re-sent each turn and bill the cached portion at a fraction of the normal input price. So the dollar savings are smaller than the raw token reduction. The token-count reduction is real; the cost reduction is real but smaller.
- The bigger wins aren't even the tokens. Not blowing your context window on a large codebase, and an agent that stays consistent with the spec instead of drifting, matter more than the line-item savings on most projects.
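To see how caching narrows the gap, here is a rough cost model for the 20-turn scenario. It is a sketch under loud assumptions: `CACHED_READ_FRACTION = 0.10` is a hypothetical discount (check your provider's pricing), the first turn is billed at full price, every later turn hits the cache, and cache-write surcharges and cache expiry are ignored.

```python
def doc_cost_units(doc_tokens: int, turns: int, cached_read_fraction: float) -> float:
    """Document cost in full-price token equivalents.

    The first turn pays full price; each later turn pays only the
    cached-read fraction of the document's tokens.
    """
    return doc_tokens * (1 + (turns - 1) * cached_read_fraction)


CACHED_READ_FRACTION = 0.10  # assumption: cached reads cost 10% of normal input

stuffed_uncached = 35_783 * 20                                        # raw tokens sent
stuffed_cached = doc_cost_units(35_783, 20, CACHED_READ_FRACTION)     # billed equivalent
retrieved = 10 * 1_200                                                # retrieval, full price

token_reduction = 1 - retrieved / stuffed_uncached
cost_reduction = 1 - retrieved / stuffed_cached
print(f"token reduction: {token_reduction:.1%}")
print(f"cost reduction with caching: {cost_reduction:.1%}")
```

Under these assumptions the token reduction stays at about 98% while the cost reduction drops to about 88%: still large, but smaller, which is the point of the caveat.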
Why it works
Two reasons the retrieved slice beats the whole document:
- Clean Markdown. We measured the document after converting it to structured Markdown, with headings, tables, and lists intact. Feed an agent a raw PDF instead and it spends tokens (and accuracy) fighting layout noise.
- Retrieval, not stuffing. The agent asks for what it needs over MCP and gets a focused, cited answer — instead of re-reading 35,000 tokens to find one table.
The takeaway
On a real 50-page spec, giving the coding agent a memory layer cut the tokens for a question by ~97%, and the document context across a session by ~98%, without losing the answer, because retrieval returned the exact relevant requirements with citations.
That's the whole point of Give Your AI Coding Agent a Memory: stop re-sending the spec, retrieve it.
Measure it on your own docs: convert your first document free, connect your coding tool, and watch the token counter. See how it connects to your tools.