RAG is not a product. It's an architecture.
Most enterprises treat retrieval-augmented generation like a thing you install. It isn't. It's a system, and most of its failure modes live in places that never show up in a demo.
- "RAG" is not a product to install. It's an architecture pattern with many failure modes.
- Most enterprise RAG efforts fail at the chunking, retrieval-evaluation, or citation layer — not the model.
- Building reliable RAG requires investment in infrastructure that never shows up in the demo.
Most teams treat RAG as plumbing. It's not.
RAG isn't "dump documents into a vector store and query against them." It's a system with at least seven distinct stages: source ingestion, normalization, chunking, embedding, retrieval, re-ranking, and citation. Each stage has its own failure modes, and a problem at any one of them collapses the quality of the whole.
Treating it as plumbing produces a demo that works on the curated documents you fed it and falls apart on the messy ones the business actually has.
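To make those stage boundaries concrete, here is a toy, dependency-free walkthrough. Every function is a deliberately crude stand-in and every name is illustrative: a real system swaps in a document parser, an embedding model, a vector index, a cross-encoder re-ranker, and an LLM.

```python
import re

def ingest(sources):
    # 1. source ingestion: files, wikis, tickets, exports
    return list(sources)

def normalize(doc):
    # 2. normalization: encoding, whitespace, boilerplate removal
    return re.sub(r"\s+", " ", doc).strip()

def chunk(doc, size=40):
    # 3. chunking: naive fixed window (the strategy the next section picks apart)
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # 4. embedding: token overlap as a crude stand-in for a dense vector
    return set(re.findall(r"[a-z0-9%.]+", text.lower()))

def retrieve(question, chunks, k=3):
    # 5. retrieval: top-k chunks by overlap with the question
    q = embed(question)
    return sorted(chunks, key=lambda c: len(q & embed(c)), reverse=True)[:k]

def rerank(question, candidates, k=1):
    # 6. re-ranking: a no-op here, and the stage most demos skip entirely
    return candidates[:k]

def cite(answer, supporting_chunks):
    # 7. citation: tie the answer back to the chunks that supported it
    return f"{answer} [sources: {len(supporting_chunks)} chunk(s)]"

docs = [normalize(d) for d in ingest(["  Q3 commentary:  consumer   staples were discussed in the regional rotation. "])]
chunks = [c for d in docs for c in chunk(d)]
question = "What was our consumer staples exposure in Q3?"
context = rerank(question, retrieve(question, chunks))
print(cite("(an LLM answer would be generated from this context)", context))
```

Each of those stand-ins is a place where a real deployment can quietly fail, which is the point of the exercise.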
Where enterprise RAG actually breaks.
In our experience, three places: chunking strategy, retrieval drift, and citation hallucination. Naive fixed-window chunking destroys context across paragraph and section boundaries. Retrieval that worked at launch slowly degrades as the corpus grows or shifts — and nobody notices until users do. And the model will happily produce a citation to a document that doesn't actually support the claim, because the embedding said the document was relevant.
None of these break dramatically. They produce subtly wrong answers in plausible-looking outputs. Which is the worst kind of failure.
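The chunking failure is the easiest to see in miniature. Here is a contrived comparison; the report text, window size, and function names are made up for illustration, not taken from any real system.

```python
report = (
    "Section 4: Sector positioning.\n"
    "We reduced cyclicals over the quarter and rotated toward defensives.\n"
    "Consumer staples exposure at quarter end was 4.1% of the portfolio.\n"
    "\n"
    "Section 5: Outlook.\n"
    "We expect volatility to persist into Q4."
)

def fixed_window(text, size=10):
    # Naive chunking: split every `size` words, ignoring sentences and sections.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def by_section(text):
    # Boundary-aware chunking: keep each section intact by splitting on blank lines.
    return [part.strip() for part in text.split("\n\n") if part.strip()]

for c in fixed_window(report):
    print(repr(c))
# One chunk ends "...Consumer staples exposure at quarter end" and the next begins
# "was 4.1% of the portfolio. Section 5: ...": the figure is orphaned from its label
# and glued onto an unrelated section.

for c in by_section(report):
    print(repr(c))
# The exposure figure stays attached to the sentence that gives it meaning.
```

Boundary-aware splitting isn't free either (long sections still need a secondary split), but it stops numbers from drifting away from the sentences that define them.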
Citation hallucination is the subtlest of the three, so it's worth a concrete example.
- Question (analyst): "What was our exposure to consumer staples in Q3?"
- Retrieved chunk: a paragraph from a Q3 commentary that mentions consumer staples three times in a regional rotation discussion. Embedding similarity is high, but the chunk doesn't actually contain an exposure number.
- Hallucinated answer: "Per the Q3 commentary, consumer staples exposure was approximately 7.2% [cite: Q3-Commentary-pg-14]."
- Reality: the document never says 7.2%. The model invented the number, attached a citation that looks legitimate, and the analyst pastes it into a client deck.

Catching this requires a re-ranking step that scores whether the retrieved chunk actually answers the question, plus an attribution check that verifies the cited claim appears in the cited source. Neither is part of a default RAG demo.
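A minimal sketch of the attribution side, assuming the cited chunk's text is available at answer time. Exact string matching on percentage figures is a deliberately crude stand-in for a proper entailment check, and the names and sample text are illustrative.

```python
import re

def claim_is_supported(answer: str, cited_text: str) -> bool:
    # Attribution check: every percentage figure the answer attributes to a source
    # must literally appear in the text of the chunk it cites.
    figures = re.findall(r"\d+(?:\.\d+)?%", answer)
    return all(fig in cited_text for fig in figures)

cited_text = (
    "Consumer staples featured in the regional rotation we executed during the quarter; "
    "consumer staples screened well on momentum, and consumer staples valuations look stretched."
)
answer = "Per the Q3 commentary, consumer staples exposure was approximately 7.2% [cite: Q3-Commentary-pg-14]."

print(claim_is_supported(answer, cited_text))  # False: 7.2% appears nowhere in the cited chunk
```

Even a check this crude would have flagged the answer above before it reached the analyst.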
What to invest in instead of the demo.
- A retrieval evaluation harness that measures precision, recall, and citation correctness on a real test set, rebuilt as the corpus changes (a minimal sketch follows this list).
- A re-ranking step that filters retrieved chunks before they ever hit the model.
- Structured citations the user can verify.
- Drift monitoring, so you find out retrieval quality has dropped before the user does.
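A minimal sketch of such a harness, assuming a hand-labeled test set that maps questions to the chunk IDs that genuinely answer them. `retrieve` and `answer_with_citations` stand in for whatever retrieval and generation interfaces the system actually exposes; they are assumptions, not a particular framework.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    relevant_chunk_ids: set  # hand-labeled: the chunks that genuinely answer the question

def evaluate(cases, retrieve, answer_with_citations, k=5):
    # retrieve(question, k) -> list of chunk IDs (assumed interface)
    # answer_with_citations(question) -> (answer_text, list of cited chunk IDs) (assumed interface)
    precisions, recalls, citations_ok = [], [], []
    for case in cases:
        retrieved = set(retrieve(case.question, k))
        hits = retrieved & case.relevant_chunk_ids
        precisions.append(len(hits) / k)
        recalls.append(len(hits) / len(case.relevant_chunk_ids))

        _, cited = answer_with_citations(case.question)
        # Citation correctness: the answer cites something, and everything it cites
        # is a chunk the labelers marked as genuinely relevant.
        citations_ok.append(bool(cited) and set(cited) <= case.relevant_chunk_ids)

    n = len(cases)
    return {
        "precision@k": sum(precisions) / n,
        "recall@k": sum(recalls) / n,
        "citation_correctness": sum(citations_ok) / n,
    }

# Toy usage with stub components, just to show the shape of the report.
cases = [EvalCase("What was consumer staples exposure in Q3?", {"q3-commentary-p14"})]
print(evaluate(
    cases,
    retrieve=lambda q, k: ["q3-commentary-p14", "q2-letter-p03"],
    answer_with_citations=lambda q: ("4.1%", ["q3-commentary-p14"]),
    k=2,
))
```

Rerun it every time the corpus, the chunking strategy, or the embedding model changes; that is what turns retrieval drift from a user complaint into a metric on a dashboard.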
None of this is glamorous. All of it is the difference between a system that ships and one that produces interesting output for a quarter and gets shelved.
BizzSoftware designs, builds, secures, and runs the internal applications your teams work in every day — with AI features built in.