Project Overview

Reinforcing Retrieval-Augmented Generation with Reasoning-Guided Queries & Verifiable Output

Abstract: We present Reinforcing RAG, a reasoning‑guided, verification‑oriented framework that combines query complexity analysis, query decomposition, multi‑hop retrieval, evidence‑bounded generation, cross‑model verification, and an iterative decision loop. Evaluated on HotpotQA, it improves exact match, token F1, and answer coverage over a naive RAG baseline, and it records a traceable evidence chain for every answer.

Keywords: Retrieval-Augmented Generation, Multi-Hop QA, Verification, Iterative RAG, Transparency, HotpotQA
Tech Stack: Python · Sentence‑BERT · FAISS · BM25 · Reciprocal Rank Fusion · OpenAI API · Dual‑Judge Verification · FastAPI (demo)
Project Code: fyp25037 · COMP4801 Final Year Project

Team

Lian Tuzhi (3036065159) ltzhenry@connect.hku.hk
Chen Borun (3036052190) u3605219@connect.hku.hk

Supervisor

Prof. Chao Huang
Department of Computer Science, HKU

Project Objectives

O1 – Reasoning‑aware retrieval: Automatically analyze query complexity and decompose multi‑hop questions into sub‑queries.
O2 – Iterative multi‑hop retrieval: Enable multiple retrieval hops with hybrid dense+sparse (RRF) and evidence accumulation.
O3 – Evidence‑bounded generation with verification: Force answers to be grounded; use dual‑judge cross‑model verification (faithfulness, completeness, consistency).
O4 – Decision loop & transparency: Iterative refinement (refine query / retrieve more / regenerate) and a full audit chain.
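
The hybrid dense+sparse fusion named in O2 can be illustrated with Reciprocal Rank Fusion. This is a minimal sketch rather than the project's implementation: the function name rrf_fuse, the document IDs, and the smoothing constant k=60 (a commonly used default) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a dense (FAISS) and a sparse (BM25) retriever.
dense = ["d3", "d1", "d7"]
sparse = ["d1", "d9", "d3"]
fused = rrf_fuse([dense, sparse])
print(fused)
```

Documents that both retrievers rank highly (d1 and d3 here) rise to the top, and because only ranks are used, the dense and sparse scores never need to be put on a common scale.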

Methodology

Query Analysis & Decomposition: Heuristic complexity score + LLM‑based splitting into 2–4 sub‑queries (rule‑based fallback).
Multi‑Hop Hybrid Retriever: Dense (Sentence‑BERT + FAISS) and sparse (BM25) fusion with Reciprocal Rank Fusion. Up to 3 hops with keyword injection.
Evidence Aggregator: Quality scoring (similarity, hop penalty, length) + token‑budget selection; coverage metric for redundancy detection.
Answer Generator: “Never refuse” design – always produces an answer; quality judged by verifier. Fallback to parametric knowledge if no evidence.
Cross‑Model Verifier: Two independent LLM judges score faithfulness, completeness, consistency; confidence = mean − divergence penalty.
Decision Engine & Iterative Loop: Heuristic/LLM decisions; loop continues until confidence ≥ threshold or max iterations reached.
Transparency Chain: JSON log of every analysis, decomposition, retrieval hop, generation, verification, and decision.
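
The verifier's rule "confidence = mean − divergence penalty" can be sketched as follows. The equal weighting of the three criteria, the absolute-difference definition of divergence, the penalty weight of 0.5, and the names JudgeScores and fused_confidence are all assumptions made for illustration; the project's actual formula may differ.

```python
from dataclasses import dataclass


@dataclass
class JudgeScores:
    """Scores in [0, 1] from one LLM judge (hypothetical container)."""
    faithfulness: float
    completeness: float
    consistency: float

    def overall(self) -> float:
        # Assumption: the three criteria are weighted equally.
        return (self.faithfulness + self.completeness + self.consistency) / 3.0


def fused_confidence(a: JudgeScores, b: JudgeScores, penalty_weight: float = 0.5):
    """Return (confidence, divergence) for two independent judges."""
    mean = (a.overall() + b.overall()) / 2.0
    # Assumption: divergence is the absolute gap between the judges.
    divergence = abs(a.overall() - b.overall())
    confidence = mean - penalty_weight * divergence
    # Clip so the result stays a valid confidence value.
    return max(0.0, min(1.0, confidence)), divergence


judge_a = JudgeScores(faithfulness=0.9, completeness=0.8, consistency=0.7)
judge_b = JudgeScores(faithfulness=0.7, completeness=0.6, consistency=0.5)
confidence, divergence = fused_confidence(judge_a, judge_b)
print(round(confidence, 3), round(divergence, 3))
```

Under these assumptions, two judges that agree keep their mean score, while disagreement lowers the confidence and makes another iteration more likely.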

System Architecture

Reinforcing RAG Pipeline Architecture — 1. Query Analysis → 2. Decomposition → 3. Multi‑Hop Retrieval (dense+sparse) → 4. Evidence Aggregation → 5. Answer Generation → 6. Cross‑Model Verification → 7. Decision Engine & Iterative Loop (refine/retrieve more/regenerate). Full transparency chain recorded.
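
The control flow of steps 3 to 7 can be sketched with stubbed components. Everything here is illustrative: the function run_iterative_loop, the stub retriever, generator, verifier, and decision function, the threshold of 0.7, the limit of 3 iterations, and the action names refine_query and regenerate are assumptions, not the project's code.

```python
def run_iterative_loop(query, retrieve, generate, verify, decide,
                       threshold=0.7, max_iterations=3):
    """Retrieve, generate, verify, then decide whether to try again."""
    evidence, chain = [], []
    answer, confidence = "", 0.0
    need_retrieval = True
    for iteration in range(1, max_iterations + 1):
        if need_retrieval:
            evidence = evidence + retrieve(query)  # accumulate evidence across hops
        answer = generate(query, evidence)
        confidence = verify(query, evidence, answer)
        chain.append({"step": "verification", "iteration": iteration,
                      "confidence": confidence})
        # Stop when the answer is trusted or the iteration budget is spent.
        if confidence >= threshold or iteration == max_iterations:
            break
        action = decide(confidence, evidence)
        chain.append({"step": "decision", "iteration": iteration, "action": action})
        if action == "refine_query":
            query = query + " (refined)"  # placeholder for a real rewrite
            need_retrieval = True
        elif action == "retrieve_more":
            need_retrieval = True
        else:  # "regenerate": reuse the evidence already collected
            need_retrieval = False
    return answer, confidence, chain


# Stub components so the sketch runs without an index or an LLM.
def retrieve(q):
    return [f"passage about {q}"]

def generate(q, ev):
    return f"answer from {len(ev)} passages"

def verify(q, ev, ans):
    return min(1.0, 0.4 + 0.2 * len(ev))  # more evidence, higher confidence

def decide(conf, ev):
    return "retrieve_more"


answer, confidence, chain = run_iterative_loop(
    "Who directed the film?", retrieve, generate, verify, decide)
print(answer, confidence, [entry["step"] for entry in chain])
```

With these stubs the first pass falls below the threshold, the decision step requests more evidence, and the second pass clears the threshold and ends the loop.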

Benchmark on HotpotQA (Naive RAG vs. Reinforcing RAG)

Metric	N=20 Naive	N=20 RR	N=50 Naive	N=50 RR	N=100 Naive	N=100 RR
Exact Match	0.500	0.600	0.400	0.460	0.430	0.470
Token F1	0.573	0.673	0.520	0.605	0.618	0.658
Answer Coverage	0.558	0.658	0.512	0.601	0.629	0.671
Avg Confidence*	0.590	0.474	0.576	0.522	0.754	0.743
Avg Iterations	1.00	2.05	1.00	1.94	1.00	1.63
Avg Latency (s)	2.48	39.38	2.50	40.11	—	24.92
Fallback Rate	0.00	0.50	0.06	0.46	0.00	0.22

*Confidence is not directly comparable across the two systems: naive confidence is the top retrieval similarity, whereas Reinforcing RAG (RR) confidence is the dual‑judge fused score. RR improves exact match, token F1, and answer coverage at every sample size. Its fallback rate drops from 50% at N=20 to 22% at N=100.
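
For reference, exact match and token F1 are commonly computed as in the sketch below, which follows the usual SQuAD-style answer normalization (lowercasing, stripping punctuation and articles). It is an assumption that the project's evaluation script normalizes answers the same way; answer coverage is omitted because its definition is not given here.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("The Eiffel Tower in Paris", "Eiffel Tower")
print(em, round(f1, 3))
```

The second example shows why token F1 sits above exact match in the table: a prediction that contains the gold answer plus extra words earns partial credit but no exact-match credit.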

Key Contributions

Modular RAG pipeline with hybrid retrieval + dual‑judge verification.
Iterative loop with three recovery actions (refine, retrieve more, regenerate).
Full transparency chain for auditability.
Empirical gains on HotpotQA at N=100, relative to naive RAG: +9.3% EM, +6.5% Token F1, +6.7% Coverage (absolute gains of 0.040, 0.040, and 0.042).

Transparency Chain Example

[
  {"step": "query_analysis", "complexity":0.65, "type":"multi-part"},
  {"step": "decomposition", "subqueries":["What is DNA replication?",...]},
  {"step": "retrieval_hop1", "evidence_count":5},
  {"step": "verification", "confidence":0.72, "divergence":0.08},
  {"step": "decision", "action": "retrieve_more"}
]

Every step is logged, so each answer can be traced back to the analysis, retrieval, verification, and decision that produced it.
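
A log of this shape can be produced with a few lines of Python. The sketch below is illustrative only: the class TransparencyChain and its methods are hypothetical names, and the field values are copied from the example above.

```python
import json


class TransparencyChain:
    """Append-only record of pipeline steps (hypothetical helper)."""

    def __init__(self):
        self.steps = []

    def log(self, step, **fields):
        # Each entry carries the step name plus arbitrary step-specific fields.
        self.steps.append({"step": step, **fields})

    def to_json(self):
        return json.dumps(self.steps, indent=2)


chain = TransparencyChain()
chain.log("query_analysis", complexity=0.65, type="multi-part")
chain.log("retrieval_hop1", evidence_count=5)
chain.log("verification", confidence=0.72, divergence=0.08)
chain.log("decision", action="retrieve_more")
serialized = chain.to_json()
print(serialized)
```

Keeping the log as a plain list of dictionaries means it can be written to disk per query and later replayed or rendered by the demo without extra tooling.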

Project Milestones

Phase	Milestone	Deliverable
Phase 1 (Weeks 1–4)	Literature review, dataset prep, baseline RAG	Indexed HotpotQA, naive RAG baseline
Phase 2 (Weeks 5–8)	Query analyzer, decomposer, hybrid retriever	Multi‑hop retrieval module + evidence aggregator
Phase 3 (Weeks 9–11)	Generator, dual‑judge verifier, decision engine	Iterative loop prototype, confidence scoring
Phase 4 (Weeks 12–13)	Full integration, benchmarking, transparency chain	CLI + web demo, benchmark script
Final	Report writing, evaluation, submission	Final report, code, slides, video

Deliverables

Code & Models: Full Python pipeline (planner, retriever, aggregator, generator, verifier, decision engine). Configuration via .env and config.py.
Demo: Lightweight web interface (Flask/FastAPI) with transparency chain visualization.
Datasets & Index: HotpotQA distractor setting, FAISS index, BM25 index. Reproducible evaluation scripts.
Reports: Interim report, final report (PDF), and experiment logs.
Presentation: Slides and demo video for final defense.

Limitations

Latency ~25–40s due to multiple LLM calls.
Evaluated only on HotpotQA; domain adaptation needed.
Fixed evidence budget may truncate long answers.
Verification is answer‑level only; claim‑level verification is planned.
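
The fixed evidence budget limitation can be seen in a greedy selection sketch. The function select_evidence, the whitespace token count (a stand-in for a real tokenizer), the quality scores, and the budget of 12 tokens are assumptions for illustration, not the project's aggregator.

```python
def select_evidence(passages, token_budget):
    """Greedily keep the highest-scoring passages that fit the budget.

    passages: list of (text, quality_score) pairs.
    Returns the selected texts and the number of tokens used.
    """
    selected, used = [], 0
    for text, _score in sorted(passages, key=lambda p: p[1], reverse=True):
        cost = len(text.split())  # assumption: whitespace tokens approximate model tokens
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected, used


passages = [
    ("Paris is the capital of France.", 0.9),
    ("The Eiffel Tower was completed in 1889 for the World's Fair.", 0.8),
    ("France is in Europe.", 0.4),
]
selected, used = select_evidence(passages, token_budget=12)
print(selected, used)
```

In this example the second-best passage is skipped because it does not fit, which is the kind of truncation that a fixed budget can cause for questions needing long supporting evidence.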

Future Work

Claim‑level verification & V‑score.
Adaptive iteration budget & early stopping.
Specialized small judge models for efficiency.
Integration with structured knowledge bases.
User‑in‑the‑loop transparency & deployment.

Contact

For questions or collaboration, please contact Chen Borun or Lian Tuzhi. We welcome discussions on verifiable RAG and multi‑hop reasoning.
