Standard RAG reduces hallucination — it does not eliminate it. For clinical decision support, where a confabulated drug interaction or a wrong dosage can harm patients, "reduced" is not good enough. Here is how we built a system where every answer is traceable to a source, or the system refuses to answer.
Large language models are fluent, confident, and wrong in ways that are very hard to detect. In most domains, a plausible-sounding incorrect answer is a nuisance. In clinical settings, it can be dangerous. A doctor relying on AI-generated clinical decision support who cannot distinguish a hallucinated guideline from a real one is worse off than a doctor using no AI at all — at least without AI, uncertainty is explicit.
Retrieval-Augmented Generation (RAG) was supposed to solve this. The premise is that a model whose responses are grounded in retrieved documents cannot hallucinate facts that are not in those documents. In practice, standard RAG reduces hallucination significantly but does not eliminate it. Models still synthesize across multiple documents in ways that introduce errors. They still occasionally ignore the retrieved evidence in favor of parametric memory. And they produce responses that sound equally confident whether the evidence is strong or weak.
Building ExplainRAG-FC-AS — a citation-grounded clinical decision support system — forced us to confront each of these failure modes directly.
The fundamental tension in clinical RAG is between fluency and traceability. LLMs are optimized for the former. Clinical settings demand the latter. A response that cannot be traced to a specific source document, with a specific confidence level, provides no basis for a clinician to evaluate its reliability.
We designed ExplainRAG-FC-AS around a single guiding principle that we call "No Source, No Result." Every claim in a system response must be attributable to a specific retrieved document. If the retrieval system cannot find relevant supporting evidence with sufficient confidence, the system declines to generate a response — and says so explicitly, explaining what it could not find.
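The gate this principle implies can be pictured with a minimal sketch. It is an illustration only, not the production code: the `Passage` and `Answer` types, the `answer_or_refuse` helper, the `generate` callback, and the 0.75 cut-off are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical cut-off; a real value has to be calibrated on clinical data.
MIN_RETRIEVAL_SCORE = 0.75


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # retrieval similarity in [0, 1], higher is better


@dataclass
class Answer:
    text: Optional[str]                  # None when the system refuses
    sources: List[str]                   # doc_ids backing the answer
    refusal_reason: Optional[str] = None


def answer_or_refuse(
    query: str,
    retrieved: List[Passage],
    generate: Callable[[str, List[Passage]], str],
) -> Answer:
    """Generate only when at least one passage clears the confidence bar."""
    supported = [p for p in retrieved if p.score >= MIN_RETRIEVAL_SCORE]
    if not supported:
        best = max((p.score for p in retrieved), default=0.0)
        return Answer(
            text=None,
            sources=[],
            refusal_reason=(
                f"No source scored at least {MIN_RETRIEVAL_SCORE:.2f} for "
                f"{query!r}; the best match scored {best:.2f}."
            ),
        )
    # `generate` stands in for the LLM call and only sees supported passages.
    return Answer(
        text=generate(query, supported),
        sources=[p.doc_id for p in supported],
    )


def fake_generate(query: str, passages: List[Passage]) -> str:
    """Stand-in for the generator, used only to exercise the gate."""
    return f"Based on {passages[0].doc_id}: {passages[0].text}"


weak = [Passage("guideline-A", "Loosely related text.", 0.41)]
strong = [Passage("guideline-B", "Placeholder guideline passage.", 0.88)]

refused = answer_or_refuse("example query", weak, fake_generate)
answered = answer_or_refuse("example query", strong, fake_generate)
```

Here `refused.text` is `None` and `refused.refusal_reason` states what was missing, while `answered.sources` lists the documents the answer rests on. The design point is that refusal is an ordinary return value with an explanation, not an exception or an empty string.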
"A system that acknowledges uncertainty is more trustworthy than one that fills uncertainty with fluent fabrication. In clinical AI, the ability to say 'I don't know' is a feature, not a limitation."
Standard vector-search RAG retrieves the top-k most semantically similar documents to a query. In a large clinical knowledge base, this can surface documents that are individually relevant but thematically inconsistent — covering different aspects of a topic, different patient populations, or even contradictory guidelines from different clinical bodies.
We addressed this with a pre-retrieval clustering step. Before query time, we cluster the knowledge base using K-Means on sentence embeddings, grouping documents by thematic coherence. At query time, retrieval is performed within the most relevant cluster — ensuring that the documents surfaced are not just individually relevant but mutually coherent. This significantly improves the quality of the context window passed to the LLM, reducing the risk of the model synthesizing across contradictory sources.
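The query-time half of this step can be sketched in a few lines. The sketch assumes the offline K-Means run has already produced one centroid per cluster and a cluster label per document; the toy 2-D vectors, the document ids, and the `retrieve_in_cluster` helper are made up for illustration, and plain squared Euclidean distance stands in for a real vector index.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, the metric K-Means itself minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_in_cluster(
    query_vec: Vector,
    centroids: List[Vector],        # one per cluster, from offline K-Means
    doc_vecs: Dict[str, Vector],    # doc_id -> embedding
    doc_cluster: Dict[str, int],    # doc_id -> cluster index
    k: int = 3,
) -> Tuple[int, List[Tuple[str, float]]]:
    """Pick the nearest cluster, then rank only that cluster's documents."""
    best_cluster = min(
        range(len(centroids)),
        key=lambda c: sq_dist(query_vec, centroids[c]),
    )
    scored = [
        (doc_id, sq_dist(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
        if doc_cluster[doc_id] == best_cluster
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return best_cluster, scored[:k]


# Toy 2-D "embeddings" with two thematic clusters.
centroids = [(1.0, 0.1), (0.1, 1.0)]
doc_vecs = {
    "topic-a-doc-1": (0.9, 0.2),
    "topic-a-doc-2": (1.0, 0.0),
    "topic-b-doc-1": (0.2, 0.9),
    "mixed-doc": (0.7, 0.6),  # partly relevant, but assigned to cluster 1
}
doc_cluster = {
    "topic-a-doc-1": 0,
    "topic-a-doc-2": 0,
    "topic-b-doc-1": 1,
    "mixed-doc": 1,
}

cluster, hits = retrieve_in_cluster((0.95, 0.15), centroids, doc_vecs, doc_cluster)
hit_ids = [doc_id for doc_id, _ in hits]
```

The query lands in cluster 0, so `hit_ids` contains only the two cluster-0 documents. `mixed-doc` is excluded even though a global top-k search might have surfaced it, which is exactly the trade the clustering step makes: some recall is given up in exchange for a mutually coherent context window.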
The second layer, applied after cluster-constrained retrieval and draft generation, runs Natural Language Inference (NLI) on candidate responses before they are shown to users. We use RoBERTa-MNLI to evaluate each factual claim in a generated draft against the retrieved source documents. Claims that the NLI model classifies as contradicted by the sources are flagged and either removed or presented with an explicit uncertainty marker.
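The control flow of such a verification pass might look like the sketch below. The NLI model is abstracted behind a callable that returns one of the three MNLI labels, so the example runs without model weights; `toy_nli` is a deliberately crude stand-in, and `check_claims`, `render`, the verdict names, and the handling of neutral claims (kept, but recorded as unverified) are assumptions of this sketch rather than details from the system described here.

```python
from dataclasses import dataclass
from typing import Callable, List

# An NLI function maps (premise, hypothesis) to one of:
# "entailment", "neutral", "contradiction".
NLIFn = Callable[[str, str], str]


@dataclass
class CheckedClaim:
    claim: str
    verdict: str  # "supported", "unverified", or "contradicted"


def check_claims(claims: List[str], passages: List[str], nli: NLIFn) -> List[CheckedClaim]:
    """Compare every claim against every retrieved passage."""
    checked = []
    for claim in claims:
        labels = [nli(passage, claim) for passage in passages]
        if "contradiction" in labels:
            verdict = "contradicted"  # any contradicting source wins
        elif "entailment" in labels:
            verdict = "supported"
        else:
            verdict = "unverified"
        checked.append(CheckedClaim(claim, verdict))
    return checked


def render(checked: List[CheckedClaim], drop_contradicted: bool = False) -> List[str]:
    """Either remove contradicted claims or keep them with a marker."""
    lines = []
    for item in checked:
        if item.verdict == "contradicted":
            if drop_contradicted:
                continue
            lines.append(f"[UNCERTAIN: conflicts with retrieved sources] {item.claim}")
        else:
            lines.append(item.claim)
    return lines


def toy_nli(premise: str, hypothesis: str) -> str:
    """Toy stand-in for a real NLI model; only useful for this demo."""
    p, h = premise.lower(), hypothesis.lower()
    if p == h:
        return "entailment"
    if p.replace(" not ", " ") == h.replace(" not ", " "):
        return "contradiction"
    return "neutral"


passages = [
    "Drug X is not recommended in group A.",
    "Drug Y requires dose adjustment in group B.",
]
claims = [
    "Drug X is recommended in group A.",
    "Drug Y requires dose adjustment in group B.",
    "Drug Z is the preferred option.",
]

checked = check_claims(claims, passages, toy_nli)
verdicts = [c.verdict for c in checked]
```

With these placeholder sentences the verdicts come out as contradicted, supported, and unverified, in that order. Swapping `toy_nli` for a wrapper around an actual entailment classifier leaves the rest of the flow unchanged.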
This independent check is what distinguishes our approach from standard RAG. Rather than trusting the LLM to faithfully represent the retrieved evidence, we verify that faithfulness with a separate model trained specifically for textual entailment. The two models serve different roles: GPT-4 for fluent generation, RoBERTa-MNLI for factual verification. Neither model is asked to do both jobs.
Every sentence in the final response is linked to the specific source passage that supports it. The interface shows not just that a response is "grounded in retrieved evidence" but which specific document, section, and passage supports each claim. Clinicians can follow a link directly to the source — a clinical guideline, a journal article, an institutional protocol — and verify the claim themselves.
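A data structure along the following lines makes that guarantee structural rather than a matter of convention. This is a sketch under assumptions: the `Citation` and `CitedSentence` types, the `render_response` helper, the example.org links, and the placeholder texts are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Citation:
    document: str  # title of the guideline, article, or protocol
    section: str
    passage: str   # the exact supporting text
    url: str       # hypothetical deep link to the source


@dataclass(frozen=True)
class CitedSentence:
    text: str
    citation: Citation  # required: a sentence cannot exist without a source


def render_response(sentences: List[CitedSentence]) -> str:
    """Number each sentence and list its source underneath the answer."""
    body = " ".join(f"{s.text} [{i}]" for i, s in enumerate(sentences, start=1))
    refs = [
        f"[{i}] {s.citation.document}, {s.citation.section}: "
        f'"{s.citation.passage}" <{s.citation.url}>'
        for i, s in enumerate(sentences, start=1)
    ]
    return body + "\n\n" + "\n".join(refs)


# Placeholder content only; none of this is real clinical guidance.
response = [
    CitedSentence(
        "First placeholder statement.",
        Citation("Example Guideline", "Section 4.2",
                 "Supporting passage one.", "https://example.org/guideline#4-2"),
    ),
    CitedSentence(
        "Second placeholder statement.",
        Citation("Example Protocol", "Appendix B",
                 "Supporting passage two.", "https://example.org/protocol#b"),
    ),
]

rendered = render_response(response)
```

Because `citation` is a required field, an uncited sentence cannot be constructed at all; the generation pipeline has to either attach a source or drop the sentence before rendering.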
Technical stack: FAISS for vector search, K-Means clustering for thematic coherence, RoBERTa-MNLI for fact verification, GPT-4 for generation, Streamlit for the clinical interface. Query time: under 10 seconds end-to-end, including NLI verification.
We evaluated ExplainRAG-FC-AS in a blind comparison with a standard RAG baseline and a no-RAG LLM. Clinical evaluators — physicians and clinical researchers — were asked to assess the responses for accuracy, reliability, and trustworthiness without knowing which system generated each response.
Ninety percent of evaluators preferred ExplainRAG-FC-AS responses over the alternatives. The primary driver was not accuracy per se, since the standard RAG system also produced accurate responses much of the time, but verifiability. Clinicians reported that knowing they could trace each claim to a source changed how they felt about relying on the system. The ability to verify was itself a trust-building feature, even when verification was not performed.
Implementing the "No Source, No Result" policy was the most technically and philosophically challenging aspect of the project. Every threshold we set for the confidence score required explicit decisions about what type of error was worse: false confidence (answering without sufficient evidence) or false refusal (declining to answer when evidence exists but is weak).
In clinical settings, we consistently chose to err toward false refusal. A system that says "I cannot find reliable evidence for this query — please consult a specialist" is more useful than one that generates a fluent but under-supported answer. This required extensive calibration of the retrieval confidence threshold and the NLI entailment threshold — and ongoing monitoring in deployment to detect cases where the system was refusing queries that clinicians judged it should be able to answer.
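The trade-off between the two error types can be made concrete with a small calibration sketch. It assumes a reviewer-labeled validation set of (retrieval score, evidence-was-sufficient) pairs; the scores, the candidate thresholds, and the `error_rates` and `pick_threshold` helpers are invented for illustration and are not the calibration procedure or data used in the project.

```python
from typing import List, Sequence, Tuple

# (retrieval_score, evidence_is_sufficient) as judged by reviewers.
Labeled = Tuple[float, bool]


def error_rates(data: List[Labeled], threshold: float) -> Tuple[float, float]:
    """Return (false_confidence_rate, false_refusal_rate) at a threshold."""
    insufficient = [score for score, ok in data if not ok]
    sufficient = [score for score, ok in data if ok]
    # False confidence: answering although the evidence was insufficient.
    false_conf = sum(s >= threshold for s in insufficient) / max(len(insufficient), 1)
    # False refusal: declining although the evidence was sufficient.
    false_ref = sum(s < threshold for s in sufficient) / max(len(sufficient), 1)
    return false_conf, false_ref


def pick_threshold(
    data: List[Labeled],
    candidates: Sequence[float],
    max_false_confidence: float = 0.0,
) -> Tuple[float, float, float]:
    """Lowest threshold whose false-confidence rate stays within the cap.

    Raising the threshold can only increase false refusals, so the lowest
    acceptable threshold is also the one that refuses least.
    """
    for t in sorted(candidates):
        false_conf, false_ref = error_rates(data, t)
        if false_conf <= max_false_confidence:
            return t, false_conf, false_ref
    raise ValueError("no candidate threshold meets the false-confidence cap")


# Invented validation data for the example.
validation = [
    (0.92, True), (0.85, True), (0.78, True), (0.74, False),
    (0.66, True), (0.60, False), (0.41, False), (0.30, False),
]
candidates = [0.5, 0.6, 0.7, 0.8, 0.9]

strict = pick_threshold(validation, candidates)            # zero tolerance
lenient = pick_threshold(validation, candidates, 0.25)     # tolerate 25%
```

On this toy data the zero-tolerance setting selects a threshold of 0.8 and refuses half of the answerable queries, while tolerating a 25% false-confidence rate lowers the threshold to 0.7 and halves the refusals. Erring toward false refusal corresponds to choosing the strict end of that range and then monitoring how often clinicians disagree with a refusal.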
The lesson is that citation-first RAG is not just an architecture decision. It is a design philosophy that requires explicit choices about what the system will and will not do, and those choices must be validated with the clinical users who will depend on the system.
At Solyntra, where I applied these principles to SaaS knowledge systems, we extended the architecture with metadata-aware ingestion — tagging documents with freshness, source authority, and clinical jurisdiction — and multi-agent orchestration for complex queries that require synthesizing evidence across multiple domains. The core principle remained constant: every response must be traceable, or the system declines to respond.
For clinical RAG to move from research prototype to deployable clinical tool, it needs robust audit trails, version control for knowledge bases, and integration with clinical workflow systems. The citation-first approach provides the foundation for all of these — because a system designed for traceability from the start is a system that can be audited, updated, and trusted.