Why Your RAG Bot Hallucinates — and the Ingestion Mistake Almost Everyone Makes
By The LLMtoMD team
You built a retrieval-augmented generation (RAG) system. You picked a good model, wrote careful prompts, tuned your chunk size. And yet your bot still confidently invents answers, cites the wrong section, or returns a table as a wall of scrambled numbers.
Here's the uncomfortable truth: most RAG hallucinations don't start at the model. They start at ingestion.
The mistake: garbage text in, garbage answers out
A RAG pipeline is only as good as the text it retrieves. And for most teams, that text is created by dumping a PDF (or a scanned contract, or a slide deck) through a quick extraction step that produces something like this:
Revenue 2024 1,240 2023 980 Q1
Q2 Q3 Q4 Notes see appendix B
Was that a table? A list? Three separate paragraphs? Your embedding model has no idea — and neither will your LLM at answer time. When the retriever pulls that chunk, the model does what models do with ambiguous context: it fills the gap with a plausible guess. That guess is your hallucination.
The failure is invisible because it happens before the part of the stack you've been debugging. You've been tuning the engine while the fuel was contaminated.
Why "just extract the text" fails
The naive approach — pull raw text out of the file and chunk it — breaks in entirely predictable ways:
- Tables collapse. Rows and columns lose their structure, so "2024" and "1,240" stop being related. The model retrieves numbers with no idea what they mean.
- Headings vanish. Without `#` structure, the retriever can't tell a section title from body text, so chunks lose their anchor.
- Reading order scrambles. Multi-column PDFs, footnotes, and sidebars get interleaved into nonsense.
- Scanned pages return nothing — or worse, garbled OCR that looks like text but means nothing.
- Charts and diagrams disappear entirely. The single most information-dense element on the page is silently dropped.
Each of these is a place where the model is later forced to guess.
The fix: treat ingestion as a first-class step
The reliable pattern is to convert every source — PDF, DOCX, PPTX, XLSX, images, even audio and video — into clean, structured Markdown before it ever reaches your chunker. Markdown is the sweet spot for LLMs: it preserves headings, lists, tables, and emphasis in a format models were heavily trained on, while staying plain-text simple for embeddings.
Done properly, that scrambled table above becomes:
## Revenue
| Year | Revenue | Notes          |
| ---- | ------- | -------------- |
| 2024 | 1,240   | see appendix B |
| 2023 | 980     |                |
Now the relationship between "2024" and "1,240" is explicit in the text itself. The retriever returns it intact. The model has nothing to guess about. The hallucination never happens.
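To make the idea concrete, here is a minimal sketch of the last step of such a conversion: serializing rows that have already been recovered from a document into a Markdown table. The function name `to_markdown_table` and its arguments are hypothetical, chosen for illustration; recovering the rows from a PDF in the first place is the hard part and is not shown.

```python
def to_markdown_table(headers, rows):
    """Render headers and rows as a pipe-delimited Markdown table."""

    def cell(value):
        # Escape pipes so cell text cannot break the column structure.
        return str(value).replace("|", "\\|").strip()

    lines = ["| " + " | ".join(cell(h) for h in headers) + " |"]
    lines.append("| " + " | ".join("---" for _ in headers) + " |")
    for row in rows:
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [""] * (len(headers) - len(row))
        lines.append("| " + " | ".join(cell(c) for c in padded) + " |")
    return "\n".join(lines)


table = to_markdown_table(
    ["Year", "Revenue", "Notes"],
    [["2024", "1,240", "see appendix B"], ["2023", "980"]],
)
print(table)
```

Each value now sits in a row with its year and under its column header, so a chunk containing this text carries the relationship with it.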
The same principle extends past tables: real heading structure gives your chunker natural boundaries, preserved reading order keeps context coherent, and using AI vision to describe charts and diagrams means the densest part of your documents finally makes it into the index instead of being thrown away.
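As an illustration of headings acting as chunk boundaries, the sketch below splits Markdown at each heading and records the heading path for every chunk. It is one simple approach under stated assumptions, not a description of any particular product's chunker: the name `split_by_headings` and the `path`/`text` keys are hypothetical, only `#`-style headings are recognized, and there is no size limit on chunks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker; headings inside fenced code are ignored


def split_by_headings(markdown):
    """Split Markdown into one chunk per section, keeping the heading path."""
    chunks, path, body = [], [], []

    def flush():
        # Emit the lines collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(title for _, title in path), "text": text})
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = (
    "# Report\nIntro text.\n"
    "## Revenue\n| Year | Revenue |\n| --- | --- |\n| 2024 | 1,240 |\n"
    "## Notes\nSee appendix B."
)
chunks = split_by_headings(doc)
for chunk in chunks:
    print(chunk["path"], "->", repr(chunk["text"][:30]))
```

Storing the heading path alongside the text lets you prepend it to the chunk before embedding, so a table retrieved on its own still says which section it came from.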
A quick checklist for your pipeline
Before you blame the model again, audit your ingestion:
- Open a converted chunk and read it. If you can't tell what it means, your embedding model can't either.
- Spot-check a table. Are rows and columns still aligned, or is it a number soup?
- Check a scanned page. Did you get clean text, or empty/garbled output?
- Look for your figures. Is there any representation of charts and diagrams in the text?
- Confirm heading structure survived. Can you see the document's outline in the Markdown?
If any of those fail, you've found your hallucination source — and it isn't the model.
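Parts of this checklist can be roughly automated. The sketch below flags three symptoms: empty chunks, "number soup" (many numeric tokens with no table markup), and Unicode replacement characters left by bad decoding, plus a document-level check for surviving headings. The function names and the thresholds (8 tokens, 25% numeric) are assumptions picked for illustration and would need tuning on real data; this supplements reading chunks yourself and does not replace it.

```python
import re

NUMERIC = re.compile(r"[\d.,%$()+-]+")


def audit_chunk(text):
    """Return a list of warnings for one converted chunk (heuristics only)."""
    tokens = text.split()
    if not tokens:
        return ["empty: extraction or OCR may have returned nothing"]

    warnings = []
    has_table = any(line.lstrip().startswith("|") for line in text.splitlines())
    numeric = [t for t in tokens if NUMERIC.fullmatch(t) and any(c.isdigit() for c in t)]
    # Many bare numbers without table markup suggests a flattened table.
    if len(tokens) >= 8 and len(numeric) / len(tokens) > 0.25 and not has_table:
        warnings.append("number soup: many numeric tokens but no table structure")
    # U+FFFD appears when bytes could not be decoded into real characters.
    if "\ufffd" in text:
        warnings.append("garbled: contains Unicode replacement characters")
    return warnings


def has_outline(markdown):
    """True if at least one '#'-style heading survived conversion."""
    return any(re.match(r"^#{1,6}\s+\S", line) for line in markdown.splitlines())


scrambled = "Revenue 2024 1,240 2023 980 Q1\nQ2 Q3 Q4 Notes see appendix B"
print(audit_chunk(scrambled))
print(has_outline(scrambled))
```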
Get clean ingestion without building it yourself
Robust document conversion is its own engineering problem — layout-aware PDF parsing, OCR, vision models for diagrams, audio transcription, web crawling. That's exactly what LLMtoMD does: point it at any file or URL and get back clean, AI-ready Markdown, plus a ready-to-use RAG export (chunked JSONL with embeddings) you can drop straight into your vector database. Your documents never train anyone's model and source files auto-delete — see our security practices for the details.
Fix the fuel, and a surprising number of "model problems" simply disappear.
Stop debugging hallucinations that start at ingestion. Convert your first document free and read a chunk for yourself. See plans and pricing when you're ready to scale.