June 8, 2026 · 2 min read · PDF, Markdown, how-to, RAG

How to Convert a PDF to Markdown for LLMs (The Right Way)

By The LLMtoMD team

If you're feeding documents to an LLM — for RAG, an agent, or a fine-tune — Markdown is the format you want. It's plain text (so it embeds cleanly) but keeps the structure models rely on: headings, lists, and tables. The question is how to get there from a PDF without destroying that structure on the way.

Here's the practical guide.

Why PDF → Markdown is harder than it looks

A PDF describes where ink goes on a page, not what the content means. So any conversion has to reconstruct meaning from layout — and that's where most tools fail:

Tables flatten into ungrouped numbers, severing rows from columns.
Multi-column pages scramble into the wrong reading order.
Headings lose their hierarchy, so structure disappears.
Scanned PDFs are just images of text — basic extractors return nothing usable.
Charts and diagrams get dropped entirely.

A quick copy-paste or a one-line library call gives you text, but rarely text an LLM can reason over. (We go deeper on the downstream damage in "Can ChatGPT Read a PDF?")
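To make the reading-order failure concrete, here is a minimal sketch in Python. It assumes a toy page where every word already carries an (x, y) position and the two columns split at a known x value; the word list, the `split_x` threshold, and both function names are made up for illustration, and real layout analysis is considerably more involved.

```python
# Toy page: each word has the (x, y) position a PDF would give it.
# y grows downward; the left column sits at x < 300, the right at x >= 300.
words = [
    {"text": "Left-1", "x": 50, "y": 100},
    {"text": "Right-1", "x": 350, "y": 100},
    {"text": "Left-2", "x": 50, "y": 120},
    {"text": "Right-2", "x": 350, "y": 120},
]


def naive_order(words):
    """Read strictly top-to-bottom, then left-to-right, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w["y"], w["x"]))
    return [w["text"] for w in ordered]


def column_aware_order(words, split_x=300):
    """Read the whole left column first, then the whole right column."""
    left = [w for w in words if w["x"] < split_x]
    right = [w for w in words if w["x"] >= split_x]
    return [
        w["text"]
        for column in (left, right)
        for w in sorted(column, key=lambda w: w["y"])
    ]


print(naive_order(words))         # interleaves the two columns line by line
print(column_aware_order(words))  # finishes the left column before the right
```

The naive sort stitches together lines from unrelated columns, which is the scrambling described above; the column-aware version only works here because the split point is given, whereas a real converter has to detect it.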

What good Markdown output looks like

A well-converted PDF preserves the document's shape. A financial table should come out as a real Markdown table:

## Revenue

| Year | Revenue | Notes          |
| ---- | ------- | -------------- |
| 2024 | 1,240   | see appendix B |
| 2023 | 980     |                |

…not as Revenue 2024 1,240 2023 980. Headings stay headings (##), lists stay lists, and reading order matches what a human sees. That's the difference between an LLM answering correctly and an LLM guessing.
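As a rough illustration of what "a real Markdown table" means mechanically, the sketch below renders headers and rows as a padded pipe table. The function name `to_markdown_table` is hypothetical and this is not how LLMtoMD builds its output; it only shows the target shape once rows and columns have been recovered.

```python
def to_markdown_table(headers, rows):
    """Render headers and rows as a pipe table with padded columns."""
    table = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]

    def fmt(cells):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    separator = "| " + " | ".join("-" * w for w in widths) + " |"
    return "\n".join([fmt(table[0]), separator] + [fmt(r) for r in table[1:]])


print(to_markdown_table(
    ["Year", "Revenue", "Notes"],
    [["2024", "1,240", "see appendix B"], ["2023", "980", ""]],
))
```

Fed the values from the example above, this prints the same four-line table. The hard part of conversion is not this rendering step but recovering which numbers belong to which row and column in the first place.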

How to convert a PDF to Markdown with LLMtoMD

Sign in and open the converter (or use the API).
Drop in your PDF — or paste a URL to a PDF online.
Pick options if needed — choose OCR languages for non-English scans, or leave the defaults.
Get clean Markdown back — layout-aware tables, preserved headings, OCR and AI vision for scanned pages and diagrams.
Use it — copy the Markdown, download it, or export RAG-ready JSONL chunks straight into your vector database.

That's it. The hard parts — column detection, table reconstruction, OCR, vision for diagrams — happen automatically.
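For a sense of what heading-based chunks in JSONL could look like, here is a small sketch that splits Markdown at each heading and writes one JSON object per line. The `heading` and `text` field names, the function names, and the sample document are assumptions for illustration; the article does not describe the actual schema of LLMtoMD's export, so treat this as a stand-in rather than its format.

```python
import json
import re

HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")  # ATX headings such as "## Revenue"


def chunk_by_heading(markdown):
    """Start a new chunk at every heading line.

    Simplification: does not skip '#' lines inside fenced code blocks.
    """
    chunks, heading, body = [], None, []

    def flush():
        text = "\n".join(body).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(1), []
        else:
            body.append(line)
    flush()
    return chunks


def to_jsonl(chunks):
    """Serialize chunks as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(chunk) for chunk in chunks)


sample = (
    "## Revenue\n\n"
    "| Year | Revenue |\n| ---- | ------- |\n| 2024 | 1,240   |\n\n"
    "## Notes\n\nSee appendix B.\n"
)
print(to_jsonl(chunk_by_heading(sample)))
```

Splitting at headings keeps each table together with the heading that explains it, which is only possible because the conversion preserved the headings.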

Doing it programmatically

For pipelines, one API call starts the conversion:

curl -X POST https://api.llmtomd.com/v1/convert \
  -H "X-API-Key: $LLMTOMD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/report.pdf"}'

Poll the job, then fetch the Markdown — or the chunked export — when it's done. See the API & MCP docs for the full surface.
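The same request can be sketched in Python using only the standard library. The endpoint, headers, and JSON body come from the curl example above; everything about the response and the polling step is not documented in this article, so the sketch keeps polling generic: `poll_until` is a hypothetical helper that takes whatever status check the API docs specify as a callable.

```python
import json
import time
import urllib.request

API_URL = "https://api.llmtomd.com/v1/convert"  # endpoint from the curl example


def build_convert_request(pdf_url, api_key):
    """Build the same POST request as the curl example, without sending it."""
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def poll_until(check, is_done, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call check() until is_done(result) is true; give up after max_attempts.

    check and is_done are supplied by the caller because the job-status
    endpoint and response fields are not described in this article.
    """
    for _ in range(max_attempts):
        result = check()
        if is_done(result):
            return result
        sleep(interval)
    raise TimeoutError("job did not finish in time")
```

To send the request, pass it to `urllib.request.urlopen` with the key read from the `LLMTOMD_API_KEY` environment variable, then plug the status call from the API docs into `poll_until`. Capping the number of attempts keeps a stuck job from blocking a pipeline indefinitely.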

Other formats

Working with more than PDFs? The same pipeline handles every common format: see how to convert Word documents, PowerPoint decks, and audio and video.

Convert your first PDF free → Try LLMtoMD and read the Markdown before you trust the answer.
