June 11, 2026 · 3 min read · AI coding agents, tokens, benchmark, MCP, RAG

We Measured It — How Many Tokens a Memory Layer Actually Saves Your Coding Agent

By the LLMtoMD team

Everyone selling an "AI memory layer" throws around token-savings numbers. Most are hand-waved. So we measured ours on a real document and we're showing our work — including the caveats that make the number honest.

The setup

We took a real 50-page functional requirements and system design document (an FRD, or functional requirements document, for an aviation data platform) and converted it to clean Markdown with LLMtoMD. After conversion, it measured:

  • 143,134 characters
  • 22,070 words
  • 35,783 tokens
  • split into 203 retrievable chunks
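If you want to compare your own documents against these figures, a rough version of the measurement fits in a few lines of Python. This is a minimal sketch, not the tooling behind the numbers above. It assumes the converted Markdown is already in a string, counts words by whitespace, and approximates tokens with the common rule of thumb of about 4 characters per token instead of a model tokenizer. (Applied to this FRD's 143,134 characters, the heuristic gives 35,783, but a model-specific tokenizer may give a different count.) The `doc_stats` and `chunk_markdown` helpers, and the heading-based splitting with a `max_chars` cap, are hypothetical illustrations, not LLMtoMD's chunker.

```python
import re

def doc_stats(markdown: str) -> dict:
    """Rough size metrics for a Markdown document."""
    chars = len(markdown)
    words = len(markdown.split())
    # Heuristic only: ~4 characters per token for English text.
    # A model-specific tokenizer will usually give a somewhat different number.
    approx_tokens = chars // 4
    return {"chars": chars, "words": words, "approx_tokens": approx_tokens}

def chunk_markdown(markdown: str, max_chars: int = 800) -> list[str]:
    """Split on Markdown headings, then cap each section at max_chars.

    Hypothetical chunking strategy for illustration; real systems often add
    overlap and split on sentence or token boundaries instead.
    """
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        # Break oversized sections into fixed-size pieces; skip empty ones.
        for start in range(0, len(section), max_chars):
            piece = section[start:start + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks

sample = "# Security\nUsers sign in with OIDC.\n\n## Roles\nRBAC plus ABAC.\n"
print(doc_stats(sample))
print(len(chunk_markdown(sample)))  # 2
```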

Then we took one ordinary question that a coding agent building this system would ask, "What are the authentication and authorization requirements for the platform?", and answered it two different ways.

Way 1 — no memory layer (load the whole doc)

The default move: put the FRD in the agent's context so it can "read" it. To answer that one question, the model carries the entire 35,783-token document.

Way 2 — memory layer (retrieve only what's needed)

With the document stored in LLMtoMD, the agent runs a semantic search and pulls back only the relevant passages. For this question it retrieved the top 5 chunks, about 1,180 tokens in total, and those chunks contained the exact answer: the IAM requirements (FR-IAM-001 through 010), RBAC plus ABAC authorization, multi-factor authentication for privileged actions, and OIDC/SAML federation. The agent then synthesized a cited answer from that slice.

The result

Tokens needed to answer the question:

  • Load the whole FRD: 35,783
  • Retrieve only what's relevant: ≈ 1,180

Reduction: ≈ 96.7%
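Both headline figures follow from the two token counts alone. A small helper makes the arithmetic explicit; the name `token_savings` is just for illustration.

```python
def token_savings(full_doc_tokens: int, retrieved_tokens: int) -> tuple[float, float]:
    """Return (fractional reduction, how many times fewer tokens)."""
    reduction = 1 - retrieved_tokens / full_doc_tokens
    times_fewer = full_doc_tokens / retrieved_tokens
    return reduction, times_fewer

reduction, times_fewer = token_savings(35_783, 1_180)
print(f"{reduction:.1%} reduction, {times_fewer:.0f}x fewer tokens")
# prints "96.7% reduction, 30x fewer tokens"
```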

For that question, the memory layer used about 30× fewer tokens and lost nothing, because the retrieved slice held the complete answer.

Now scale it to a real build session

A single question is the conservative case. Coding agents run long sessions, and without a memory layer that 35,783-token document tends to stay in context across many turns:

  • Without: keep the FRD in context across a 20-turn build → 35,783 × 20 ≈ 716,000 tokens of document context.
  • With: retrieve per question, say 10 lookups × ~1,200 tokens ≈ 12,000 tokens.

That's a ~98% reduction in document-context tokens over the session.
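The session figure is a simple model rather than a measurement, so it is easy to rerun with your own numbers. This sketch assumes, as the bullets above do, that without a memory layer the whole document is re-sent on every turn, and that with one each lookup retrieves about 1,200 tokens. The function and parameter names are hypothetical. Using the exact 35,783-token count gives 715,660 tokens rather than the rounded ~716,000.

```python
def session_tokens(doc_tokens: int, turns: int, lookups: int, tokens_per_lookup: int):
    """Document-context tokens for one session, with and without retrieval.

    Simplified model: without a memory layer the whole document is re-sent on
    every turn; with one, only the retrieved chunks are sent per lookup.
    """
    without = doc_tokens * turns
    with_retrieval = lookups * tokens_per_lookup
    return without, with_retrieval, 1 - with_retrieval / without

without, with_retrieval, reduction = session_tokens(35_783, turns=20, lookups=10, tokens_per_lookup=1_200)
print(without, with_retrieval, f"{reduction:.1%}")
# prints "715660 12000 98.3%"
```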

The honest caveat

Two things keep this number from being a lie:

  1. Prompt caching narrows the cost gap. Many model APIs can cache a stable block of context that is re-sent each turn and bill re-reads of it at a fraction of the normal input price. So while the token-count reduction is real, the dollar savings are real but smaller.
  2. The bigger wins aren't the tokens at all. On most projects, not blowing your context window on a large codebase, and having an agent that stays consistent with the spec instead of drifting, matter more than the line-item savings.
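Here is a rough way to see why caching shrinks the dollar gap. The sketch assumes, hypothetically, that after the first turn the cached document is billed at 10% of the normal input rate (some providers price cache reads around that level, but check your own provider's rates, cache-write surcharges, and cache lifetimes), and that retrieved chunks are never cached because they change from question to question. Costs are expressed in full-price token equivalents, so no real prices are involved.

```python
def full_doc_cost(doc_tokens: int, turns: int, cache_read_multiplier: float) -> float:
    """Full-price token equivalents for re-sending a cached document every turn.

    Assumption: the first turn pays full price; later turns read the document
    from cache at cache_read_multiplier times the normal input rate.
    Cache-write surcharges and cache expiry are ignored.
    """
    return doc_tokens + (turns - 1) * doc_tokens * cache_read_multiplier

def retrieval_cost(lookups: int, tokens_per_lookup: int) -> float:
    # Retrieved chunks differ per question, so assume they are never cached.
    return float(lookups * tokens_per_lookup)

doc_tokens, turns = 35_783, 20
cached = full_doc_cost(doc_tokens, turns, cache_read_multiplier=0.1)
uncached = full_doc_cost(doc_tokens, turns, cache_read_multiplier=1.0)
retrieved = retrieval_cost(lookups=10, tokens_per_lookup=1_200)

print(f"token reduction: {1 - retrieved / uncached:.1%}")
print(f"cost reduction with caching: {1 - retrieved / cached:.1%}")
```

Under these assumptions the token reduction stays at about 98%, while the cost reduction falls to roughly 88%: still large, but smaller, which is the gap the first caveat describes.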

Why it works

Two reasons the retrieved slice beats the whole document:

  • Clean Markdown. We measured the document after converting it to structured Markdown, with headings, tables, and lists intact. Feed an agent a raw PDF instead and it spends tokens (and accuracy) fighting layout noise.
  • Retrieval, not stuffing. The agent asks for what it needs over MCP (the Model Context Protocol) and gets a focused, cited answer instead of re-reading roughly 36,000 tokens to find one table.
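Under the hood, an MCP client asks a server to run a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds that message; it does not open a connection. The tool name `search_documents` and its `query` and `top_k` arguments are hypothetical, since a real server advertises its own tool names and input schemas through `tools/list`.

```python
import json

def make_tool_call(request_id: int, query: str, top_k: int = 5) -> str:
    """Build a JSON-RPC 2.0 'tools/call' request in the shape MCP uses.

    The tool name 'search_documents' and its arguments are hypothetical;
    a real server lists its own tools in response to 'tools/list'.
    """
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_documents",
            "arguments": {"query": query, "top_k": top_k},
        },
    }
    return json.dumps(message)

print(make_tool_call(1, "What are the authentication and authorization requirements for the platform?"))
```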

The takeaway

On a real 50-page spec, giving the coding agent a memory layer cut the tokens needed to answer a question by ~97% and, by our estimate, the document context across a session by ~98%, while keeping the answer complete and better grounded, because retrieval returns the exact relevant requirements with citations.

That's the whole point of giving your AI coding agent a memory: stop re-sending the spec and retrieve it instead.


Measure it on your own docs: convert your first document free, connect your coding tool, and watch the token counter.
