How to Convert a PDF to Markdown for LLMs (The Right Way)
By The LLMtoMD team
If you're feeding documents to an LLM — for RAG, an agent, or a fine-tune — Markdown is the format you want. It's plain text (so it embeds cleanly) but keeps the structure models rely on: headings, lists, and tables. The question is how to get there from a PDF without destroying that structure on the way.
Here's the practical guide.
Why PDF → Markdown is harder than it looks
A PDF describes where ink goes on a page, not what the content means. So any conversion has to reconstruct meaning from layout — and that's where most tools fail:
- Tables flatten into ungrouped numbers, severing rows from columns.
- Multi-column pages scramble into the wrong reading order.
- Headings lose their hierarchy, so structure disappears.
- Scanned PDFs are just images of text — basic extractors return nothing usable.
- Charts and diagrams get dropped entirely.
A quick copy-paste or a one-line library call gives you text, but rarely text an LLM can reason over. (We cover the downstream damage in more depth in "Can ChatGPT Read a PDF?")
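To make the reading-order failure concrete, here is a minimal Python sketch. It is an illustration only, not how LLMtoMD or any particular extractor works: the `Word` type, the coordinates, and the known `column_split` position are all made up for the example, and real converters have to detect columns from the layout itself.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the text run, in page units
    y: float  # distance from the top of the page


def row_then_column(word):
    """Sort key: top-to-bottom first, then left-to-right."""
    return (word.y, word.x)


def naive_order(words):
    """Read straight across the page, ignoring columns entirely."""
    return [w.text for w in sorted(words, key=row_then_column)]


def column_aware_order(words, column_split):
    """Read the whole left column first, then the whole right column.

    column_split is an assumed, already-known x position of the gutter.
    """
    left = [w for w in words if w.x < column_split]
    right = [w for w in words if w.x >= column_split]
    ordered = sorted(left, key=row_then_column) + sorted(right, key=row_then_column)
    return [w.text for w in ordered]


# A made-up two-column page: each column holds one two-line sentence.
page = [
    Word("Revenue grew", 50, 100), Word("Costs fell", 320, 100),
    Word("in 2024.", 50, 120), Word("in 2023.", 320, 120),
]

print(" ".join(naive_order(page)))
# Revenue grew Costs fell in 2024. in 2023.
print(" ".join(column_aware_order(page, column_split=300)))
# Revenue grew in 2024. Costs fell in 2023.
```

The naive version interleaves the two columns line by line, which is the scrambling described in the list above.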
What good Markdown output looks like
A well-converted PDF preserves the document's shape. A financial table should come out as a real Markdown table:
## Revenue
| Year | Revenue | Notes |
| ---- | ------- | -------------- |
| 2024 | 1,240 | see appendix B |
| 2023 | 980 | |
…not as Revenue 2024 1,240 2023 980. Headings stay headings (##), lists stay lists, and reading order matches what a human sees. That's the difference between an LLM answering correctly and an LLM guessing.
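If you already have the rows as data, the target shape is easy to produce. The sketch below is a small, hypothetical helper (`to_markdown_table` is not part of any library mentioned here) that renders rows as a Markdown table and escapes pipe characters inside cells; it does not pad columns for alignment, which Markdown renderers do not require.

```python
def to_markdown_table(header, rows):
    """Render a header and data rows as a Markdown table, one line per row."""
    def line(cells):
        # Escape literal pipes so cell text cannot break the column layout.
        escaped = [str(cell).replace("|", "\\|") for cell in cells]
        return "| " + " | ".join(escaped) + " |"

    separator = line(["---"] * len(header))
    return "\n".join([line(header), separator] + [line(row) for row in rows])


table = to_markdown_table(
    ["Year", "Revenue", "Notes"],
    [[2024, "1,240", "see appendix B"], [2023, "980", ""]],
)
print(table)
```

Each row stays on its own line with its cells in column order, so the link between "2024" and "1,240" survives in the plain text.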
How to convert a PDF to Markdown with LLMtoMD
- Sign in and open the converter (or use the API).
- Drop in your PDF — or paste a URL to a PDF online.
- Pick options if needed — choose OCR languages for non-English scans, or leave the defaults.
- Get clean Markdown back — layout-aware tables, preserved headings, OCR and AI vision for scanned pages and diagrams.
- Use it — copy the Markdown, download it, or export RAG-ready JSONL chunks straight into your vector database.
That's it. The hard parts — column detection, table reconstruction, OCR, vision for diagrams — happen automatically.
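For a sense of what chunked JSONL built from Markdown can look like, here is a rough Python sketch that splits a document at its headings and writes one JSON object per line. The field names (`heading`, `text`) and the splitting rule are assumptions made for illustration; they are not LLMtoMD's actual export schema, and the simple heading match would also fire on `#` lines inside fenced code.

```python
import json
import re

# ATX-style Markdown heading: one to six '#' characters, then whitespace.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into chunks, starting a new chunk at each heading."""
    current = {"heading": None, "lines": []}
    sections = [current]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = {"heading": match.group(1).strip(), "lines": []}
            sections.append(current)
        else:
            current["lines"].append(line)

    chunks = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        # Skip the empty preamble that exists when the document opens with a heading.
        if section["heading"] is not None or text:
            chunks.append({"heading": section["heading"], "text": text})
    return chunks


def to_jsonl(chunks):
    """Serialize chunks as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(chunk, ensure_ascii=False) for chunk in chunks)


doc = (
    "## Revenue\n"
    "| Year | Revenue |\n"
    "| --- | --- |\n"
    "| 2024 | 1,240 |\n"
    "\n"
    "## Outlook\n"
    "Growth is expected to continue.\n"
)
print(to_jsonl(chunk_by_heading(doc)))
```

Splitting at headings keeps each table together with the heading that gives it meaning, which is only possible when the conversion preserved both.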
Doing it programmatically
For pipelines, a single API call starts the conversion:
curl -X POST https://api.llmtomd.com/v1/convert \
  -H "X-API-Key: $LLMTOMD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/report.pdf"}'
That call starts a conversion job. Poll the job until it finishes, then fetch the Markdown (or the chunked export). See the API & MCP docs for the full API.
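The same submit-then-poll flow can be sketched in Python using only the standard library. The endpoint, the `X-API-Key` header, and the JSON body come from the curl call; everything about the job itself is an assumption, because this article does not show the response format. For that reason the status check is a function you supply (`get_status` and `is_done` are hypothetical names), and the `"done"` value in the usage note is a placeholder.

```python
import json
import time
import urllib.request

CONVERT_URL = "https://api.llmtomd.com/v1/convert"  # endpoint from the curl call


def build_convert_request(pdf_url, api_key):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        CONVERT_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def poll_until_done(get_status, is_done, interval_s=2.0, timeout_s=300.0,
                    sleep=time.sleep):
    """Call get_status() until is_done(status) is true or the timeout expires.

    get_status and is_done are supplied by the caller, since the job
    status endpoint and its fields are not described in this article.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if is_done(status):
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError("conversion job did not finish in time")
        sleep(interval_s)
```

To send the request, pass it to `urllib.request.urlopen(...)` and read the job details from the response as described in the API docs. A call such as `poll_until_done(check_job, lambda s: s == "done")` then waits for completion, where `check_job` is your own function that queries the job and `"done"` stands in for whatever the real completion status is.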
Other formats
Working with more than PDFs? The same pipeline handles every common format: see how to convert Word documents, PowerPoint decks, and audio and video.
Convert your first PDF free: try LLMtoMD and read the Markdown before you trust the answer.