Executive summary: the engineering problem of the DORA era
Rolling out large language models (LLMs) in European financial services creates a basic clash: GenAI is probabilistic by design, while financial regulation demands deterministic control, evidence, and repeatability. As institutions moved toward full enforcement of the Digital Operational Resilience Act (DORA) in January 2025, the industry ran into a very practical engineering requirement: convert “creative” generative capabilities into defensible outputs that can survive scrutiny from internal audit, the European Central Bank (ECB), and national competent authorities.
The core failure mode is hallucination: a model producing plausible-sounding but factually incorrect statements. In a regulated environment, that’s not just a UX issue. It can become an operational risk event, a data integrity breach, and a control failure under DORA’s first pillar (ICT risk management). That’s why the engineering priority shifts away from maximising fluency and speed toward maximising faithfulness (groundedness) and traceability (auditability).
Retrieval-Augmented Generation (RAG) has become the de facto baseline for reducing hallucinations by grounding answers in approved enterprise data. But naive RAG (vector search + stuffing chunks into a prompt) is nowhere near sufficient for audit expectations. Regulators increasingly want you to show not only that an answer is correct, but how it was produced, which sources were used, why those sources were selected, and how the system behaves under stress and change.
This is exactly the gap we focus on at Intellectum Lab. We build document-based GenAI systems for banks, asset managers, and insurers where the output has to be trusted, traced, and defended. Our approach is “audit-ready by design”: every answer has evidence you can show to an auditor, measurable quality scores you can monitor in production, and governance that supports third-party oversight and exit strategies. Under the hood, that approach is operationalised through our control layer, Intellectum Lab AI Control, which captures per-request audit trails, enforces pre-/post-guardrails, monitors quality drift, and keeps architectures model-agnostic so exit readiness is real, not theoretical.
This article breaks down the engineering playbook for building “defensible answers” under DORA and ECB supervision expectations: architecture beyond naive RAG, evaluation governance with hard metrics (Faithfulness, Answer Relevancy), and audit logging schemas that enable reproducibility.
1) The regulatory crucible: DORA and the ECB’s new posture on AI
Europe’s fintech regulatory landscape has shifted from “best practice guidance” toward enforceable, directly applicable requirements. DORA is a regulation, not a directive, so it applies consistently across EU member states. Its goal is straightforward: ensure financial entities can withstand, respond to, and recover from ICT-related disruptions. For AI systems, the implication is equally straightforward: the era of “black box deployments” is ending.
1.1 DORA’s five pillars and what they mean for GenAI
DORA is structured around five pillars. Each one forces specific architectural and governance decisions for GenAI and RAG.
1) ICT risk management
AI models must be treated as ICT assets. Model drift and hallucinations become data integrity risks that require mitigation strategies and a defined risk appetite.
2) Incident reporting
A systematic hallucination that drives an incorrect customer outcome or compliance decision isn’t “a bad answer”. In many cases, it becomes a reportable ICT-related incident.
3) Operational resilience testing
It’s not just penetration testing anymore. GenAI systems need stress tests for prompt injection, adversarial retrieval manipulation, and “red teaming” of the entire workflow.
4) Third-party risk management
If you rely on proprietary models (for example, via a cloud provider), you inherit third-party concentration, subcontracting, and oversight challenges. DORA pushes this from “procurement paperwork” into core operational design.
5) Information sharing
Institutions are increasingly expected to share threat intelligence, including new attack vectors targeting LLM systems (jailbreak patterns, prompt injection techniques, data exfiltration methods).
The unifying idea is white-box governance. In RAG terms, this means every generated claim should be traceable to a specific retrieved fragment, which itself must map to a specific document version. When we build systems for regulated finance teams, we treat this not as a reporting layer bolted on later, but as part of the architecture from day one: it’s the only way the “answer” becomes defensible.
1.2 The ECB’s revised Internal Models Guide (2025): explainability and reproducibility
In July 2025, the ECB released a revised Internal Models Guide that, for the first time, explicitly addresses machine learning methods. The key point: it formalises expectations that ML models fall under Model Risk Management (MRM) principles, even when they’re not “capital models” in the classic sense.
Two mandates matter most for GenAI:
- Explainability: you must be able to explain what drove the output.
- Reproducibility: you must be able to reproduce and verify model-driven decisions.
Reproducibility is non-trivial for LLMs because generation is inherently non-deterministic when sampling is used. In regulated deployments, this is why production systems often enforce deterministic or near-deterministic configurations (temperature ~0.0 for factual tasks), version control prompts and retrieval logic, and log enough context to reconstruct the exact state of the system at inference time.
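As a minimal sketch of what "reproducible configuration" can look like in practice (the class, field names, and model identifier here are illustrative, not a specific provider's API), the idea is to pin sampling parameters, hash the system prompt and configuration, and record those hashes with every request so the inference-time state can be reconstructed later:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class GenerationConfig:
    """Pinned, near-deterministic settings for factual tasks."""
    model_id: str = "provider/model-2025-03"   # illustrative identifier, not a real endpoint
    temperature: float = 0.0
    top_p: float = 1.0
    max_tokens: int = 1024
    seed: int = 42                              # only if the provider supports seeding

def config_fingerprint(system_prompt: str, config: GenerationConfig) -> dict:
    """Hashes the prompt and config so the exact inference state can be logged
    and later compared against what is actually running in production."""
    prompt_hash = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    config_hash = hashlib.sha256(
        json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {"system_prompt_version": prompt_hash, "model_config_hash": config_hash}

if __name__ == "__main__":
    cfg = GenerationConfig()
    print(config_fingerprint("Answer only from the supplied evidence.", cfg))
```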
At Intellectum Lab, this is also why we insist on measured quality in production, not one-off validation. A system that was “validated last quarter” but can’t detect silent provider changes, retrieval drift, or degradation will struggle under audit pressure.
1.3 DORA Article 30: the contractual firewall
Article 30 is often the most operationally painful requirement because it reaches into contracts with ICT providers. For teams using managed LLM services (Azure OpenAI, Bedrock, Vertex AI) or hosted vector databases, it forces hard questions:
- Data localisation and sovereignty: where is data processed and stored?
- SLA definitions: what availability, latency, and throughput are guaranteed?
- Right to audit: can the institution (and regulators) audit the provider?
- Exit strategies: can you migrate without a rewrite?
This is not theoretical. Under DORA, “we could switch in a week” doesn’t count unless you’ve built for it. In our architectures, model-agnosticism is a first-class constraint: Intellectum Lab AI Control sits between your applications and model providers so that switching providers is technically feasible without rewriting the business logic.
2) Architecture for defensibility: beyond naive RAG
To satisfy these requirements, RAG has to evolve from “search + answer” into a Defensible Answer Engine. Naive RAG fails in predictable ways:
- it breaks document structure,
- it injects irrelevant context (retrieval noise),
- it allows the model to infer links that do not exist (contextual hallucinations),
- and it provides weak evidence trails that are hard to verify.
2.1 Why standard chunking fails
Most RAG pipelines chunk by character count or token count (for example, 500 tokens). That destroys semantic and structural integrity. In finance and regulation, structure carries meaning: tables, footnotes, section numbering, annexes, multi-page clauses, and definitions that bind later sections.
Defensible systems require citation-aware chunking:
- Layout-aware parsing: detect headings, tables, lists, headers/footers before chunking (so you don’t split meaning across chunks).
- Metadata injection per chunk, including:
- source_document_id (immutable hash such as SHA-256 for the PDF)
- page_number
- bounding_box (coordinates on the page)
- version_timestamp (when this document version was indexed)
This metadata is not “nice to have”. It is what makes citation verification possible. If you can’t highlight the exact paragraph in the original PDF that supports a claim, you can’t convincingly defend it during audit. In practice, this is the difference between “the model said so” and “here is Clause 7.2, page 18, version dated 2025-03-10, and here is how it was retrieved”.
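A minimal sketch of what per-chunk metadata can look like in code (the field names mirror the list above; the structure and sample values are illustrative and not tied to any particular parser):

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class EvidenceChunk:
    """One retrievable unit of evidence with the metadata needed for citation checks."""
    chunk_id: str
    source_document_id: str      # immutable hash of the source PDF (e.g., SHA-256)
    page_number: int
    bounding_box: tuple[float, float, float, float]  # (x0, y0, x1, y1) coordinates on the page
    version_timestamp: str       # when this document version was indexed (ISO 8601)
    text: str

def document_id(pdf_bytes: bytes) -> str:
    """Derives the immutable document identifier from the file content itself."""
    return hashlib.sha256(pdf_bytes).hexdigest()

# Illustrative usage with placeholder content:
chunk = EvidenceChunk(
    chunk_id="doc42-p18-c03",
    source_document_id=document_id(b"%PDF-1.7 ... raw bytes of the indexed file ..."),
    page_number=18,
    bounding_box=(72.0, 340.5, 523.0, 410.0),
    version_timestamp=datetime(2025, 3, 10, tzinfo=timezone.utc).isoformat(),
    text="Clause 7.2: The provider shall notify the institution of any subcontracting change...",
)
```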
2.2 The Retriever – Generator – Verifier triad
A defensible architecture adds a third component to the classic Retriever–Generator pair: a Verifier.
- Retriever: retrieves top-k candidate chunks. In regulated workflows, hybrid retrieval (keyword + vector) is often necessary so exact regulatory terms (like “Article 30”) don’t get lost in semantic approximation.
- Generator: the LLM synthesises an answer and produces citations referencing retrieved chunk IDs. The prompt should explicitly forbid using non-provided context: “Answer only from the supplied evidence. If it’s not in the context, say you don’t know”.
- Verifier (post-processing): a deterministic layer or smaller model (often an NLI-style classifier) checks whether each generated claim is actually supported by the retrieved evidence. If support is below a threshold, the claim is suppressed, flagged, or the entire response falls back to a safe behaviour.
In production systems we build, verification is not a research experiment. It is a control mechanism. It turns hallucination from an unpredictable failure into a detectable event with an audit trail.
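To make the triad concrete, here is a minimal orchestration sketch. The retriever, generator, and verifier are passed in as callables because the concrete implementations (hybrid search engine, LLM client, NLI classifier) vary by stack; every name and threshold here is illustrative, not a prescribed implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiedAnswer:
    text: str
    citations: list[str]          # chunk IDs supporting the answer
    faithfulness: float
    fell_back: bool

FALLBACK_MESSAGE = (
    "I found relevant documents, but I can't generate a sufficiently "
    "verified answer. Please review the sources directly."
)

def answer_with_verification(
    question: str,
    retrieve: Callable[[str], list[dict]],          # returns chunks with 'chunk_id' and 'text'
    generate: Callable[[str, list[dict]], dict],    # returns {'text': ..., 'citations': [...]}
    verify: Callable[[str, list[dict]], float],     # returns a faithfulness score in [0, 1]
    threshold: float = 0.9,
) -> VerifiedAnswer:
    """Retriever -> Generator -> Verifier, with a safe fallback when support is weak."""
    chunks = retrieve(question)
    draft = generate(question, chunks)
    score = verify(draft["text"], chunks)
    if score < threshold:
        return VerifiedAnswer(FALLBACK_MESSAGE, [], score, fell_back=True)
    return VerifiedAnswer(draft["text"], draft["citations"], score, fell_back=False)
```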
2.3 Retrieval noise and cross-encoder re-ranking
A major driver of hallucinations in RAG is retrieval noise: you retrieve “nearby but wrong” chunks and the model tries to make them fit the question. If the query is “credit risk”, and the retriever pulls “market risk” text because of shared terminology, the generator may blend concepts.
A defensible pipeline typically uses cross-encoder re-ranking:
- retrieve a larger candidate set (e.g., 50 chunks),
- score relevance with a cross-encoder,
- pass only the top 3–5 chunks to the LLM.
This reduces the model’s temptation to improvise and makes answers more aligned with evidence.
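A sketch of the re-ranking stage, assuming the sentence-transformers library and a generic first-stage retriever (the model name is one common public cross-encoder; both it and the candidate format are illustrative choices):

```python
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[dict], keep: int = 5) -> list[dict]:
    """Scores (query, chunk) pairs with a cross-encoder and keeps only the best chunks.

    `candidates` is the larger first-stage result set (e.g., 50 chunks),
    each a dict with at least a 'text' field.
    """
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, c["text"]) for c in candidates]
    scores = model.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:keep]]
```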
This is also one of the reasons we push continuous monitoring in production. Retrieval quality drifts as document corpora grow and change. If you don’t measure recall/precision over time, you’ll only discover drift when someone escalates a bad answer to Risk or Compliance.
3) Evaluation governance: the mathematics of trust
In a regulated environment, “it looks right” is not validation. DORA and ECB expectations push teams toward quantitative evaluation: thresholds, KPIs, regression suites, and monitoring. This is where evaluation governance becomes its own discipline.
At Intellectum Lab, this is operationalised as part of delivery, not a research add-on. We define golden datasets and KPIs with the client, run regression suites before every change, and monitor quality continuously in production (with alerting when drift occurs). The goal is simple: you can’t defend what you can’t measure.
3.1 Faithfulness: measuring hallucinations
Faithfulness (often called groundedness) measures whether the answer is supported by retrieved context. It’s not “text similarity.” It’s logical entailment.
A typical approach:
- Decompose the answer into atomic statements
Let the answer be A. Break it into statements S={s1,s2,…,sn}.
Example: “Einstein was born in Germany on March 20.” becomes:
- s1: “Einstein was born in Germany”.
- s2: “Einstein was born on March 20”.
- Verify each statement against retrieved context C
For each si, check whether C ⇒ si (entailment). Set vi = 1 if the statement is entailed, else vi = 0.
- Score faithfulness
Faithfulness = (v1 + v2 + … + vn) / n, i.e. the fraction of statements in the answer that are supported by the retrieved context; an answer where nine of ten statements are entailed scores 0.9.
In strict environments, a faithfulness score below a high threshold (for example, 0.90–0.95 depending on risk appetite and use case) should trigger control actions: warnings, fallback responses, or human review workflows.
3.2 Answer relevancy: does the answer actually address the question?
An answer can be fully grounded and still useless if it doesn’t address the user’s intent. Answer relevancy measures semantic alignment with the question.
A common pattern:
- have an evaluator generate potential questions that could lead to answer A,
- compare embeddings of those generated questions to the original query Qo,
- compute average cosine similarity.
Low relevancy often points to retrieval issues or prompt issues (the model producing a summary instead of a direct answer, or focusing on the wrong section).
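A sketch of this pattern. The question-generation step is left abstract (it would call whatever LLM the stack uses), while the embedding and cosine-similarity part assumes the sentence-transformers library; the model name and function signatures are illustrative:

```python
from typing import Callable
from sentence_transformers import SentenceTransformer, util

def answer_relevancy(
    original_query: str,
    answer: str,
    generate_questions: Callable[[str, int], list[str]],  # LLM call: questions the answer could address
    n_questions: int = 3,
) -> float:
    """Average cosine similarity between the original query and questions
    reverse-engineered from the answer."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    generated = generate_questions(answer, n_questions)
    query_emb = model.encode(original_query, convert_to_tensor=True)
    gen_embs = model.encode(generated, convert_to_tensor=True)
    similarities = util.cos_sim(query_emb, gen_embs)   # shape: (1, n_questions)
    return float(similarities.mean())
```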
3.3 Scaling evaluation: from lab to production
One reason evaluation fails in real deployments is that it’s too slow or too expensive to run on everything. That’s why many teams combine heavyweight evaluation during development with lightweight production scoring (fine-tuned classifiers, sampling strategies, or tiered evaluation).
In our production work, the practical requirement is coverage: you want visibility into 100% of interactions (or as close as possible), not a tiny offline sample. DORA’s logic pushes toward continuous monitoring: if you can’t demonstrate ongoing oversight, you end up doing forensic work under pressure when something goes wrong.
4) Logging and audit: engineering the digital trail
The principle is blunt: if it isn’t in the logs, it didn’t happen. For GenAI under DORA and ECB expectations, standard application logs are not enough. You need to capture the full chain of custody for each answer.
This is a major design focus in Intellectum Lab AI Control: per-request audit trails that make interactions reconstructable step by step, including prompts, retrieval paths, model versions, configuration, and policy controls applied.
4.1 An audit-ready AI log schema
To support reproducibility, logs must capture the system state at inference time. A practical JSON schema includes:
Trace context
- trace_id (UUID for the end-to-end interaction)
- user_id (linked to RBAC)
- timestamp_utc (millisecond precision)
Inputs and state
- query_text (raw user prompt)
- system_prompt_version (hash of system instructions)
- model_config (temperature, top-p, penalties, etc.)
Retrieval evidence
- retrieved_chunks[], each including:
  - chunk_id
  - document_hash
  - page_number
  - similarity_score
  - content_snippet (what was actually sent)
Evaluation scorecard
- faithfulness_score
- relevance_score
- jailbreak_detection (boolean / score)
- other policy checks (PII detection, restricted topics)
Output + citations
- generated_text
- citations[] mapped to text spans and chunk IDs
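As a sketch, a single audit log entry following this schema could be assembled like this (all identifiers and values are illustrative placeholders, not a normative schema):

```python
import json
import uuid
from datetime import datetime, timezone

log_entry = {
    # Trace context
    "trace_id": str(uuid.uuid4()),
    "user_id": "analyst-0142",
    "timestamp_utc": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
    # Inputs and state
    "query_text": "What notification duties apply to subcontracting changes?",
    "system_prompt_version": "sha256:9f2c...",
    "model_config": {"model_id": "provider/model-2025-03", "temperature": 0.0, "top_p": 1.0},
    # Retrieval evidence
    "retrieved_chunks": [
        {
            "chunk_id": "doc42-p18-c03",
            "document_hash": "sha256:77ab...",
            "page_number": 18,
            "similarity_score": 0.83,
            "content_snippet": "Clause 7.2: The provider shall notify...",
        }
    ],
    # Evaluation scorecard
    "faithfulness_score": 0.96,
    "relevance_score": 0.91,
    "jailbreak_detection": False,
    "pii_detected": False,
    # Output + citations
    "generated_text": "Under Clause 7.2, the provider must notify the institution...",
    "citations": [{"chunk_id": "doc42-p18-c03", "answer_span": [6, 16]}],
}

print(json.dumps(log_entry, indent=2))
```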
This isn’t only for audits. It also makes incident response realistic. When something goes wrong, you can reconstruct what happened without guesswork.
4.2 The Register of Information: mapping real dependencies
DORA requires a Register of Information for ICT third-party arrangements. For GenAI + RAG, this register must reflect which models, services, vector DBs, moderation systems, and orchestration layers support which critical functions.
If you use a hosted vector database and a separate LLM provider, both are ICT providers. A robust audit trail should connect trace_id to the actual endpoints used, enabling concentration risk assessment and subcontracting visibility.
This is another reason a control plane approach matters: it centralises dependency mapping across multi-vendor stacks instead of scattering it across product teams.
5) Third-party risk management and “critical” providers
Under DORA, critical third-party ICT providers can come under direct oversight by EU supervisory authorities. This changes vendor due diligence from a checkbox exercise into an engineering constraint.
5.1 AI vendor due diligence under DORA
A DORA-aligned due diligence process should cover:
- Resilience testing and red teaming: does the vendor test against prompt injection and adversarial abuse? Can they provide evidence?
- Sub-outsourcing transparency: which subcontractors are involved in processing?
- Data retention and training: are customer prompts used for training? Regulated institutions typically need a hard “no”, contractually enforced.
- Operational change controls: how are model updates communicated? Silent changes create validation gaps.
5.2 Exit strategy for RAG: engineering portability
Exit strategy is where most GenAI pilots fail under scrutiny, because portability wasn’t designed in.
Practical exit readiness includes:
- model-agnostic prompting (avoid tuning so tightly to one model that switching breaks behaviour)
- standardized vector storage + re-embedding paths
- an abstraction layer (LLM gateway) so backends can be swapped without rewriting applications
This is the architectural intent behind Intellectum Lab AI Control: keep “provider switching” technically feasible and testable. Under DORA, an exit strategy is not a PDF document; it’s a scenario you exercise.
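A minimal sketch of the abstraction-layer idea: application code depends only on a narrow gateway interface, and each provider is wrapped in an adapter behind it. The class names, method signatures, and the `complete` SDK call are illustrative stand-ins, not the actual API of Intellectum Lab AI Control or any specific provider:

```python
from typing import Protocol

class LLMGateway(Protocol):
    """The only surface application code is allowed to depend on."""
    def generate(self, system_prompt: str, user_prompt: str, context: list[str]) -> str: ...

class ProviderAAdapter:
    """Wraps one managed LLM service behind the gateway interface."""
    def __init__(self, client):                       # provider SDK client injected here
        self._client = client

    def generate(self, system_prompt: str, user_prompt: str, context: list[str]) -> str:
        # Translate the neutral request into this provider's specific call format.
        prompt = system_prompt + "\n\nEvidence:\n" + "\n".join(context) + "\n\n" + user_prompt
        return self._client.complete(prompt)          # illustrative SDK method

def answer_question(gateway: LLMGateway, question: str, evidence: list[str]) -> str:
    """Business logic only ever sees the gateway, so swapping providers
    means writing a new adapter, not rewriting this function."""
    return gateway.generate(
        "Answer only from the supplied evidence. If it's not in the context, say you don't know.",
        question,
        evidence,
    )
```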
6) Operational resilience and incident response for AI systems
DORA strongly emphasises incident reporting. In GenAI, the definition of “incident” expands.
6.1 What counts as an AI incident?
Examples include:
- model poisoning (malicious data injected into the retrieval corpus)
- systematic hallucination (pipeline defect causing widespread incorrect outputs)
- prompt injection success (controls bypassed, sensitive data exposed or unsafe behaviour triggered)
6.2 Circuit breakers and fail-safe behaviour
Defensible systems must prefer safety over helpfulness.
Two patterns we implement:
- pre-generation guardrails: classify and block malicious or policy-violating requests (PII, restricted topics, data exfiltration attempts).
- post-generation fallback: if faithfulness drops below a safety threshold, do not show an unverified answer. Instead, return a safe message like:
“I found relevant documents, but I can’t generate a sufficiently verified answer. Please review the sources directly”.
Auditors care about this. A system that gracefully refuses is often more defensible than a system that always answers.
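Wired together, the two patterns can act as a request-level circuit breaker. The sketch below is illustrative: the classifier callable, category names, and thresholds are assumptions, and a real deployment would plug in dedicated moderation and PII-detection services:

```python
from dataclasses import dataclass
from typing import Callable

SAFE_REFUSAL = "This request can't be processed under the current usage policy."
SAFE_FALLBACK = (
    "I found relevant documents, but I can't generate a sufficiently "
    "verified answer. Please review the sources directly."
)

@dataclass
class GuardrailDecision:
    allowed: bool
    reason: str          # logged to the audit trail, never shown raw to end users

def pre_generation_guardrail(query: str, classify: Callable[[str], dict]) -> GuardrailDecision:
    """Blocks policy-violating requests before any retrieval or generation happens."""
    labels = classify(query)     # e.g., {"prompt_injection": 0.02, "pii_request": 0.91, ...}
    for category, score in labels.items():
        if score >= 0.8:
            return GuardrailDecision(allowed=False, reason=f"blocked:{category}")
    return GuardrailDecision(allowed=True, reason="passed")

def post_generation_gate(answer: str, faithfulness: float, threshold: float = 0.9) -> str:
    """Prefers a safe refusal over an unverified answer."""
    return answer if faithfulness >= threshold else SAFE_FALLBACK
```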
7) Deep dive: implementing Faithfulness in practice
Faithfulness is the core “defensibility” metric because it directly targets hallucinations.
A practical algorithmic flow:
Step 1: statement extraction
Prompt or rule-based extraction breaks the answer into atomic claims to avoid “mostly correct” masking a critical error.
Step 2: NLI verification
For each claim s, compare it against the retrieved context C = {c1, …, ck}.
Use an NLI model that outputs entailment/neutral/contradiction. Mark the claim as verified if at least one retrieved chunk entails it with a probability above a defined threshold.
Step 3: aggregation
Aggregate the per-claim verdicts into the faithfulness score: verified claims divided by total claims (the same calculation as in section 3.1).
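A sketch of steps 2 and 3, with the NLI model hidden behind a simple callable (the `entailment_probability` function and the 0.8 threshold are stand-ins for whatever natural language inference classifier and risk appetite the stack actually uses):

```python
from typing import Callable

def faithfulness_score(
    claims: list[str],
    context_chunks: list[str],
    entailment_probability: Callable[[str, str], float],  # P(entailment | premise, hypothesis)
    threshold: float = 0.8,
) -> tuple[float, list[bool]]:
    """Marks each claim as verified if at least one chunk entails it strongly enough,
    then aggregates: faithfulness = verified claims / total claims."""
    verdicts: list[bool] = []
    for claim in claims:
        supported = any(
            entailment_probability(chunk, claim) >= threshold
            for chunk in context_chunks
        )
        verdicts.append(supported)
    score = sum(verdicts) / len(claims) if claims else 0.0
    return score, verdicts
```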
Strict vs. lax faithfulness: why strict wins under DORA
A subtle but important point: sometimes the model answers correctly from pretraining even when the retrieved context doesn’t include the fact. In a regulated setting, that’s dangerous. Regulatory details can be outdated, jurisdiction-dependent, or subtly wrong.
For regulated finance workflows, the safe posture is strict faithfulness:
- answer only from the provided evidence,
- if evidence is missing, say “I don’t know” and point to what was retrieved.
This is less “magical,” but much more defensible.
8) Synthetic data and stress testing: finding failures before users do
DORA requires resilience testing. For GenAI, that means you cannot wait for real users to discover hallucinations. You need systematic stress testing.
8.1 Building a “golden dataset” efficiently
A golden dataset contains triples: (Question, Context, True Answer). Manual creation is expensive, so many teams use synthetic generation:
- sample a chunk from the corpus,
- generate a difficult question answerable only from that chunk,
- generate the ideal answer grounded in the chunk,
- optionally have another model critique the Q/A pair for clarity.
Then run your RAG system against this dataset and classify failures:
- wrong chunk retrieved → retrieval failure (fix embeddings, hybrid search, re-ranking)
- right chunk retrieved but wrong answer → generation/verification failure (fix prompts, reduce temperature, strengthen verifier)
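A sketch of the generation loop and the failure triage, with the LLM calls kept abstract (the callables, dataclass fields, and sample counts are illustrative stand-ins for whatever generation and critique prompts the team actually uses):

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenExample:
    question: str
    context_chunk_id: str
    true_answer: str

def build_golden_dataset(
    corpus: list[dict],                                   # chunks with 'chunk_id' and 'text'
    generate_question: Callable[[str], str],              # LLM: hard question answerable only from the chunk
    generate_answer: Callable[[str, str], str],           # LLM: ideal answer grounded in the chunk
    n_examples: int = 100,
    seed: int = 7,
) -> list[GoldenExample]:
    """Samples chunks and derives (Question, Context, True Answer) triples from them."""
    rng = random.Random(seed)
    dataset = []
    for chunk in rng.sample(corpus, k=min(n_examples, len(corpus))):
        question = generate_question(chunk["text"])
        answer = generate_answer(question, chunk["text"])
        dataset.append(GoldenExample(question, chunk["chunk_id"], answer))
    return dataset

def classify_failure(expected_chunk_id: str, retrieved_ids: list[str], answer_correct: bool) -> str:
    """Separates retrieval failures from generation/verification failures."""
    if expected_chunk_id not in retrieved_ids:
        return "retrieval_failure"       # fix embeddings, hybrid search, re-ranking
    if not answer_correct:
        return "generation_failure"      # fix prompts, reduce temperature, strengthen verifier
    return "pass"
```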
At Intellectum Lab, golden datasets and regression suites aren’t optional “nice to have.” They’re part of how we make quality measurable and stable across deployments, especially when document corpora evolve and providers update models.
9) Strategic conclusion: engineering trust, not demos
The intersection of DORA and GenAI is forcing the field to grow up. We’re moving from a demo phase, where smooth chat was enough, into an industrial phase where reliability, auditability, and safety are the product.
For engineering teams, that changes the job. Building GenAI for regulated finance turns you into a mix of software architect, model risk practitioner, and forensic accountant. The code you write now must be readable to auditors later.
This is the philosophy behind how we work at Intellectum Lab:
- Discovery → Build → Run so governance and quality are defined before implementation.
- Audit-ready by design so every answer has evidence, traceability, and version history.
- Measured quality in production so drift is detected early, not discovered in an incident.
- Model-agnostic architecture and exit readiness so DORA requirements are engineered, not “documented.”
In the DORA era, the competitive advantage won’t be “who has the coolest chatbot”. It will be who can produce answers that are fast and defensible, with proof that stands up to auditors and regulators. That’s how you engineer trust.