June 8, 2026 · 3 min read · Tags: training data, fine-tuning, LLM, datasets

Garbage In, Garbage Out: Why Your Fine-Tune Underperformed

By The LLMtoMD team

You did everything right. You chose a solid base model, set up your training run carefully, swept the hyperparameters. And the fine-tuned model came out... mediocre. Confused on exactly the documents it was supposed to master.

Before you blame the model or the method, look at what you actually fed it. Because the oldest rule in machine learning is still the one that quietly kills most projects: garbage in, garbage out.

The corpus is the model's ceiling

A fine-tune learns the patterns in its training data. If that data is clean, consistent, and well-structured, the model has a real signal to learn from. If it's a pile of text scraped out of PDFs with collapsed tables, broken reading order, and OCR noise, the model learns that — noise, fragments, and all.

No learning rate recovers signal that was destroyed before training started. The corpus is the ceiling on how good your model can get, and most teams set that ceiling far lower than they realize — at the ingestion step they never looked at.

What "garbage" actually looks like

Open a few random samples from your training set and you'll often find:

Tables turned to number soup: rows and columns severed, so a cell like "Q3 / 1,240" becomes adjacent tokens with nothing tying the label to the value.
Scrambled reading order from multi-column PDFs, with sentences interleaved into nonsense.
OCR artifacts from scanned sources: "rn" read as "m", dropped characters, phantom line breaks.
Lost structure — no headings, no list boundaries, so the model never learns the document's shape.
Inconsistent formatting across sources, so the same kind of content looks different every time and the model can't generalize.

Each of these is a place where the model is learning from corrupted examples — and you won't see it on a loss curve. You'll see it later, as a model that's subtly worse than it should be.
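Some of these failure modes leave measurable traces, so a rough script can point you at the samples worth reading first. The sketch below is only a heuristic under stated assumptions: the function name `garbage_signals`, the three signals, and the idea that "mostly numeric tokens" hints at a flattened table are illustrative choices, not validated thresholds. It does not replace reading the text yourself.

```python
import re

# Tokens made only of digits and numeric punctuation, e.g. "1,240" or "12%".
NUMERIC_TOKEN = re.compile(r"[\d.,%$/-]+")


def garbage_signals(text: str) -> dict:
    """Return rough, heuristic indicators of extraction damage for one sample.

    The signals are illustrative assumptions, not validated thresholds.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    tokens = text.split()
    if not lines or not tokens:
        return {"numeric_token_ratio": 0.0, "short_line_ratio": 0.0, "has_heading": False}

    numeric = sum(
        1 for t in tokens
        if NUMERIC_TOKEN.fullmatch(t) and any(c.isdigit() for c in t)
    )
    short = sum(1 for ln in lines if len(ln.split()) <= 3)

    return {
        # High values can indicate a table flattened into "number soup".
        "numeric_token_ratio": numeric / len(tokens),
        # Many very short lines can indicate phantom line breaks or column fragments.
        "short_line_ratio": short / len(lines),
        # No Markdown heading at all can indicate lost structure.
        "has_heading": any(ln.lstrip().startswith("#") for ln in lines),
    }


if __name__ == "__main__":
    flattened = "1,240 3,100 88 Q3\n12 14\n"
    print(garbage_signals(flattened))
```

Sorting a corpus by `numeric_token_ratio` or `short_line_ratio` and opening the worst offenders is one way to decide where to look. Clean documents with real tables or short bullet lists will also score high, so treat the numbers as a reading list, not a verdict.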

The fix: clean, consistent, structured data — at scale

High-quality training data isn't exotic. It's uniform: every source, whatever its original format, converted into clean, consistently structured text that preserves tables, headings, and reading order. The same principles that make documents readable for RAG make them good training data.

For building a corpus, three things matter:

Consistency across formats — PDFs, Office docs, images, audio, and web pages all become uniformly structured Markdown, so your dataset isn't a patchwork.
Scale and automation — convert in bulk over an API, or watch a storage prefix so new documents are processed automatically as your corpus grows.
Structure you can label — clean Markdown (plus structured extraction into fields) gives you data you can turn into supervised examples, not just a text blob.

Spend your effort here and the downstream training does more with less — because it's finally learning from signal instead of rubble.
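To make "structure you can label" concrete, here is a minimal sketch of turning already-converted Markdown into heading-scoped records that could later become supervised examples. Everything named here is an assumption for illustration: the function `markdown_to_records`, the record fields `source`, `heading`, and `text`, and the choice to split on ATX headings (`#` through `######`). It does not call any conversion API, and it does not special-case `#` lines inside fenced code blocks.

```python
import json
import re

# An ATX heading: one to six "#" characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def _emit(records: list, source: str, heading, body: list) -> None:
    """Append one record if the collected body contains any text."""
    text = "\n".join(body).strip()
    if text:
        records.append({"source": source, "heading": heading, "text": text})


def markdown_to_records(markdown: str, source: str) -> list:
    """Split a Markdown document into one record per heading-scoped section.

    Text before the first heading is kept with heading=None.
    """
    records = []
    heading = None
    body = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            _emit(records, source, heading, body)  # close the previous section
            heading, body = match.group(2), []
        else:
            body.append(line)
    _emit(records, source, heading, body)  # close the final section
    return records


if __name__ == "__main__":
    doc = "# Report\nIntro text.\n\n## Q3\n| Quarter | Revenue |\n|---|---|\n| Q3 | 1,240 |\n"
    for record in markdown_to_records(doc, source="report.md"):
        print(json.dumps(record))  # one JSON object per line (JSONL)
```

Because the table survives as Markdown inside the "Q3" record, the label and the value stay together, which is exactly what a flattened extraction loses.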

Before your next run

Run one quick audit: pull ten random documents from your corpus and read the extracted text. If you can't follow them — if the tables are gone, the order is scrambled, the scans are noise — your model is training on exactly that, and it will perform exactly that well.
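The audit takes only a few lines to script. This sketch assumes your extracted text lives as `.md` files under one directory; the function name `sample_for_audit`, the file extension, and the preview length are illustrative assumptions you should adapt to your own corpus layout.

```python
import random
import sys
from pathlib import Path
from typing import Optional


def sample_for_audit(
    corpus_dir: str,
    k: int = 10,
    seed: Optional[int] = None,
    preview_chars: int = 1500,
) -> list:
    """Pick up to k random extracted documents and return (path, preview) pairs."""
    # Sort first so a fixed seed gives the same sample on every run.
    paths = sorted(p for p in Path(corpus_dir).rglob("*.md") if p.is_file())
    rng = random.Random(seed)
    chosen = rng.sample(paths, min(k, len(paths)))
    return [
        (p, p.read_text(encoding="utf-8", errors="replace")[:preview_chars])
        for p in chosen
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python audit.py path/to/corpus
    for path, preview in sample_for_audit(sys.argv[1]):
        print(f"\n===== {path} =====\n{preview}")
```

Leave `seed` unset for a fresh sample each time, or fix it when you want a teammate to read the same ten documents.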

LLMtoMD turns messy real-world documents, whatever their format, into the clean, consistent, structured Markdown that good training data is made of, at the scale a corpus needs.

Fix the data, not just the model. Convert your first documents free, and read a sample before you train on it.
