Project Overview
Reinforcing Retrieval-Augmented Generation with Reasoning-Guided Queries & Verifiable Output
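Abstract: We present Reinforcing RAG, a reasoning‑guided and verification‑oriented framework that integrates query complexity analysis, decomposition, multi‑hop retrieval, evidence‑bounded generation, cross‑model verification, and an iterative decision loop. Evaluated on HotpotQA samples of 20, 50, and 100 questions, it improves exact match, token F1, and answer coverage over a naive RAG baseline at every sample size (for example, exact match rises from 0.430 to 0.470 at N=100), and it logs every pipeline step for evidence traceability. These gains come at the cost of higher latency.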
- Keywords: Retrieval-Augmented Generation, Multi-Hop QA, Verification, Iterative RAG, Transparency, HotpotQA
- Tech Stack: Python · Sentence‑BERT · FAISS · BM25 · Reciprocal Rank Fusion · OpenAI API · Dual‑Judge Verification · FastAPI (demo)
- Project Code: fyp25037 · COMP4801 Final Year Project
Team
- Lian Tuzhi (3036065159) ltzhenry@connect.hku.hk
- Chen Borun (3036052190) u3605219@connect.hku.hk
Supervisor
Prof. Chao Huang
Department of Computer Science, HKU
Project Objectives
- O1 – Reasoning‑aware retrieval: Automatically analyze query complexity and decompose multi‑hop questions into sub‑queries.
- O2 – Iterative multi‑hop retrieval: Enable multiple retrieval hops with hybrid dense+sparse (RRF) and evidence accumulation.
- O3 – Evidence‑bounded generation with verification: Force answers to be grounded; use dual‑judge cross‑model verification (faithfulness, completeness, consistency).
- O4 – Decision loop & transparency: Iterative refinement (refine query / retrieve more / regenerate) and a full audit chain.
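The hybrid dense+sparse fusion named in O2 can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the project's code: the function name, the document ids, and the constant k=60 (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings: one from the dense retriever, one from BM25.
dense_ranking = ["d3", "d1", "d2"]
sparse_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Because RRF works on ranks rather than raw scores, the dense similarity values and BM25 scores never need to be put on a common scale.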
Methodology
- Query Analysis & Decomposition: Heuristic complexity score + LLM‑based splitting into 2–4 sub‑queries (rule‑based fallback).
- Multi‑Hop Hybrid Retriever: Dense (Sentence‑BERT + FAISS) and sparse (BM25) fusion with Reciprocal Rank Fusion. Up to 3 hops with keyword injection.
- Evidence Aggregator: Quality scoring (similarity, hop penalty, length) + token‑budget selection; coverage metric for redundancy detection.
- Answer Generator: “Never refuse” design – always produces an answer; quality judged by verifier. Fallback to parametric knowledge if no evidence.
- Cross‑Model Verifier: Two independent LLM judges score faithfulness, completeness, consistency; confidence = mean − divergence penalty.
- Decision Engine & Iterative Loop: Heuristic/LLM decisions; loop continues until confidence ≥ threshold or max iterations reached.
- Transparency Chain: JSON log of every analysis, decomposition, retrieval hop, generation, verification, and decision.
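The verifier's "confidence = mean − divergence penalty" rule and the loop's stopping condition can be sketched as follows. The source does not give the exact formula, so several details here are assumptions: each judge's score is the mean of its three dimension scores, divergence is the absolute gap between the two judges, and the penalty weight (0.5), threshold (0.7), and iteration cap (3) are hypothetical values. The `step` callable is a stand-in for one retrieve, generate, and verify pass.

```python
from statistics import mean

DIMENSIONS = ("faithfulness", "completeness", "consistency")


def judge_score(scores):
    """Collapse one judge's three dimension scores (each in [0, 1]) to one number."""
    return mean(scores[d] for d in DIMENSIONS)


def combined_confidence(judge_a, judge_b, penalty_weight=0.5):
    """Return (confidence, divergence) for two judges.

    Assumed form: mean of the two judge scores minus a penalty
    proportional to how much the judges disagree.
    """
    a, b = judge_score(judge_a), judge_score(judge_b)
    divergence = abs(a - b)
    confidence = max(0.0, (a + b) / 2 - penalty_weight * divergence)
    return confidence, divergence


def run_loop(step, threshold=0.7, max_iterations=3):
    """Repeat step(iteration) until confidence >= threshold or the cap is hit.

    step is a hypothetical callable returning (answer, judge_a, judge_b).
    """
    for iteration in range(1, max_iterations + 1):
        answer, judge_a, judge_b = step(iteration)
        confidence, _ = combined_confidence(judge_a, judge_b)
        if confidence >= threshold:
            break
    return answer, confidence, iteration
```

In a full pipeline, the decision engine would pick a recovery action (refine query, retrieve more, or regenerate) between iterations; that choice is omitted from this sketch.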
System Architecture

Benchmark on HotpotQA (Naive RAG vs. Reinforcing RAG, abbreviated RR; N = number of questions)
| Metric | N=20 Naive | N=20 RR | N=50 Naive | N=50 RR | N=100 Naive | N=100 RR |
|---|---|---|---|---|---|---|
| Exact Match | 0.500 | 0.600 | 0.400 | 0.460 | 0.430 | 0.470 |
| Token F1 | 0.573 | 0.673 | 0.520 | 0.605 | 0.618 | 0.658 |
| Answer Coverage | 0.558 | 0.658 | 0.512 | 0.601 | 0.629 | 0.671 |
| Avg Confidence* | 0.590 | 0.474 | 0.576 | 0.522 | 0.754 | 0.743 |
| Avg Iterations | 1.00 | 2.05 | 1.00 | 1.94 | 1.00 | 1.63 |
| Avg Latency (s) | 2.48 | 39.38 | 2.50 | 40.11 | — | 24.92 |
| Fallback Rate | 0.00 | 0.50 | 0.06 | 0.46 | 0.00 | 0.22 |
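The table does not state how Exact Match and Token F1 are computed. A minimal sketch, assuming the SQuAD-style answer normalization commonly used for HotpotQA evaluation (lowercasing, stripping punctuation and English articles), is shown below; the project's own evaluation script may differ.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction of "Paris, France" against the gold answer "Paris" scores 0 on exact match but 2/3 on token F1, which is why the Token F1 row sits above the Exact Match row throughout the table.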
Key Contributions
- Modular RAG pipeline with hybrid retrieval + dual‑judge verification.
- Iterative loop with three recovery actions (refine, retrieve more, regenerate).
- Full transparency chain for auditability.
- Empirical gains on HotpotQA at N=100, relative to the naive baseline: +9.3% EM (0.430 → 0.470), +6.5% Token F1 (0.618 → 0.658), +6.7% Coverage (0.629 → 0.671).
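The N=100 percentages in the last bullet match relative improvements over the naive baseline, not absolute point differences. The short check below recomputes them from the benchmark table; the variable and function names are illustrative only.

```python
# (naive, Reinforcing RAG) scores at N=100, copied from the benchmark table.
results_n100 = {
    "Exact Match": (0.430, 0.470),
    "Token F1": (0.618, 0.658),
    "Answer Coverage": (0.629, 0.671),
}


def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100


gains = {
    metric: round(relative_gain(naive, rr), 1)
    for metric, (naive, rr) in results_n100.items()
}
print(gains)
```

The corresponding absolute differences are about 0.04 on each metric.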
Transparency Chain Example
[
{"step": "query_analysis", "complexity":0.65, "type":"multi-part"},
{"step": "decomposition", "subqueries":["What is DNA replication?",...]},
{"step": "retrieval_hop1", "evidence_count":5},
{"step": "verification", "confidence":0.72, "divergence":0.08},
{"step": "decision", "action": "retrieve_more"}
]
Every step is logged, so any answer can be traced back through the analysis, retrieval, verification, and decision records that produced it.
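A log in the format shown above could be produced by a small append-only recorder. The sketch below is an assumption about how such a chain might be built; the class name `TransparencyChain` and its methods are hypothetical and not taken from the project code.

```python
import json


class TransparencyChain:
    """Append-only list of step records that serializes to JSON."""

    def __init__(self):
        self.steps = []

    def log(self, step, **fields):
        """Record one pipeline step with arbitrary JSON-serializable fields."""
        self.steps.append({"step": step, **fields})

    def to_json(self):
        """Serialize the whole chain, preserving insertion order."""
        return json.dumps(self.steps, indent=2)


chain = TransparencyChain()
chain.log("query_analysis", complexity=0.65, type="multi-part")
chain.log("retrieval_hop1", evidence_count=5)
chain.log("verification", confidence=0.72, divergence=0.08)
chain.log("decision", action="retrieve_more")
print(chain.to_json())
```

Keeping the chain as a flat list of dictionaries means it can be written to disk after each query and replayed step by step in a visualization.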
Project Milestones
| Phase | Milestone | Deliverable |
|---|---|---|
| Phase 1 (Weeks 1–4) | Literature review, dataset prep, baseline RAG | Indexed HotpotQA, naive RAG baseline |
| Phase 2 (Weeks 5–8) | Query analyzer, decomposer, hybrid retriever | Multi‑hop retrieval module + evidence aggregator |
| Phase 3 (Weeks 9–11) | Generator, dual‑judge verifier, decision engine | Iterative loop prototype, confidence scoring |
| Phase 4 (Weeks 12–13) | Full integration, benchmarking, transparency chain | CLI + web demo, benchmark script |
| Final | Report writing, evaluation, submission | Final report, code, slides, video |
Deliverables
- Code & Models: Full Python pipeline (planner, retriever, aggregator, generator, verifier, decision engine). Configuration via .env and config.py.
- Demo: Lightweight web interface (Flask/FastAPI) with transparency chain visualization.
- Datasets & Index: HotpotQA distractor setting, FAISS index, BM25 index. Reproducible evaluation scripts.
- Reports: Interim report, final report (PDF), and experiment logs.
- Presentation: Slides and demo video for final defense.
Limitations
- Average latency of roughly 25–40 s per query (versus about 2.5 s for naive RAG) due to multiple LLM calls.
- Evaluated only on HotpotQA; domain adaptation needed.
- Fixed evidence budget may truncate long answers.
- Verification operates on whole answers only; claim‑level verification is planned.
Future Work
- Claim‑level verification & V‑score.
- Adaptive iteration budget & early stopping.
- Specialized small judge models for efficiency.
- Integration with structured knowledge bases.
- User‑in‑the‑loop transparency & deployment.
Contact
For questions or collaboration, please contact Chen Borun or Lian Tuzhi. We welcome discussions on verifiable RAG and multi‑hop reasoning.