Project Overview

Reinforcing Retrieval-Augmented Generation with Reasoning-Guided Queries & Verifiable Output

Abstract: We present Reinforcing RAG, a reasoning‑guided and verification‑oriented framework that integrates query complexity analysis, decomposition, multi‑hop retrieval, evidence‑bounded generation, cross‑model verification, and an iterative decision loop. Evaluated on HotpotQA subsets of 20, 50, and 100 questions, it improves evidence traceability, answer coverage, and exact match over a naive RAG baseline (for example, exact match rises from 0.430 to 0.470 at N=100).

  • Keywords: Retrieval-Augmented Generation, Multi-Hop QA, Verification, Iterative RAG, Transparency, HotpotQA
  • Tech Stack: Python · Sentence‑BERT · FAISS · BM25 · Reciprocal Rank Fusion · OpenAI API · Dual‑Judge Verification · FastAPI (demo)
  • Project Code: fyp25037 · COMP4801 Final Year Project

Team

Supervisor

Prof. Chao Huang
Department of Computer Science, HKU

Project Objectives

  • O1 – Reasoning‑aware retrieval: Automatically analyze query complexity and decompose multi‑hop questions into sub‑queries.
  • O2 – Iterative multi‑hop retrieval: Enable multiple retrieval hops with hybrid dense+sparse (RRF) and evidence accumulation.
  • O3 – Evidence‑bounded generation with verification: Force answers to be grounded; use dual‑judge cross‑model verification (faithfulness, completeness, consistency).
  • O4 – Decision loop & transparency: Iterative refinement (refine query / retrieve more / regenerate) and a full audit chain.
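The hybrid dense+sparse fusion named in O2 can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name, the constant k=60 (a commonly used default), and the document IDs are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["d3", "d1", "d7"]   # hypothetical Sentence-BERT + FAISS ranking
sparse_hits = ["d1", "d9", "d3"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only ranks, the dense similarity scores and the BM25 scores never need to be normalized onto a common scale.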

Methodology

  • Query Analysis & Decomposition: Heuristic complexity score + LLM‑based splitting into 2–4 sub‑queries (rule‑based fallback).
  • Multi‑Hop Hybrid Retriever: Dense (Sentence‑BERT + FAISS) and sparse (BM25) fusion with Reciprocal Rank Fusion. Up to 3 hops with keyword injection.
  • Evidence Aggregator: Quality scoring (similarity, hop penalty, length) + token‑budget selection; coverage metric for redundancy detection.
  • Answer Generator: “Never refuse” design – always produces an answer; quality judged by verifier. Fallback to parametric knowledge if no evidence.
  • Cross‑Model Verifier: Two independent LLM judges score faithfulness, completeness, consistency; confidence = mean − divergence penalty.
  • Decision Engine & Iterative Loop: Heuristic/LLM decisions; loop continues until confidence ≥ threshold or max iterations reached.
  • Transparency Chain: JSON log of every analysis, decomposition, retrieval hop, generation, verification, and decision.
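The Evidence Aggregator step above can be sketched as a greedy selection under a token budget. This is only an illustration: the quality formula, the hop penalty of 0.1, the whitespace token count, and the passage fields are all assumptions, and length enters here only through each passage's token cost.

```python
def select_evidence(passages, token_budget=200, hop_penalty=0.1):
    """Greedy token-budget selection over quality-scored passages.

    Each passage is a dict with 'text', 'similarity' (0..1) and 'hop' (1-based).
    """
    def quality(p):
        # Assumed scoring: similarity minus a penalty for each hop after the first.
        return p["similarity"] - hop_penalty * (p["hop"] - 1)

    selected, used = [], 0
    for p in sorted(passages, key=quality, reverse=True):
        cost = len(p["text"].split())  # whitespace tokens as a rough proxy
        if used + cost <= token_budget:
            selected.append(p)
            used += cost
    return selected

passages = [
    {"text": "a b c d", "similarity": 0.90, "hop": 2},
    {"text": "e f g", "similarity": 0.85, "hop": 1},
    {"text": "h i j k l", "similarity": 0.50, "hop": 1},
]
chosen = select_evidence(passages, token_budget=8)
print([p["text"] for p in chosen])
```

A greedy pass like this skips any passage that would overflow the budget rather than truncating it.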

System Architecture

Reinforcing RAG Pipeline Architecture
1. Query Analysis → 2. Decomposition → 3. Multi‑Hop Retrieval (dense+sparse) → 4. Evidence Aggregation → 5. Answer Generation → 6. Cross‑Model Verification → 7. Decision Engine & Iterative Loop (refine/retrieve more/regenerate). Full transparency chain recorded.
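The control flow of steps 3 to 7 can be sketched as a loop over pluggable components. The function signatures, the action names, the threshold of 0.7, and the stub components below are hypothetical; they only illustrate how the loop stops once confidence reaches the threshold or the iteration limit is hit.

```python
def run_loop(question, retrieve, generate, verify, decide,
             threshold=0.7, max_iterations=3):
    """Iterate retrieve -> generate -> verify -> decide until confident."""
    query, evidence, chain = question, [], []
    action = "initial"  # the first pass always retrieves
    answer, confidence = None, 0.0
    for iteration in range(1, max_iterations + 1):
        if action != "regenerate":
            evidence = evidence + retrieve(query)  # accumulate evidence
        answer = generate(question, evidence)
        confidence = verify(question, answer, evidence)
        chain.append({"iteration": iteration, "action": action,
                      "confidence": confidence})
        if confidence >= threshold:
            break
        # decide returns the next action and the (possibly refined) query
        action, query = decide(query, confidence)
    return answer, confidence, chain

# Stub components standing in for the real retriever, generator, and judges.
def retrieve(query):
    return [f"passage for: {query}"]

def generate(question, evidence):
    return f"answer from {len(evidence)} passages"

def verify(question, answer, evidence):
    return min(1.0, 0.4 * len(evidence))

def decide(query, confidence):
    return "retrieve_more", query

answer, confidence, chain = run_loop(
    "Who directed the film?", retrieve, generate, verify, decide)
print(answer, confidence, len(chain))
```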

Benchmark on HotpotQA (Naive RAG vs. Reinforcing RAG)

Metric            | N=20 Naive | N=20 RR | N=50 Naive | N=50 RR | N=100 Naive | N=100 RR
Exact Match       | 0.500      | 0.600   | 0.400      | 0.460   | 0.430       | 0.470
Token F1          | 0.573      | 0.673   | 0.520      | 0.605   | 0.618       | 0.658
Answer Coverage   | 0.558      | 0.658   | 0.512      | 0.601   | 0.629       | 0.671
Avg Confidence*   | 0.590      | 0.474   | 0.576      | 0.522   | 0.754       | 0.743
Avg Iterations    | 1.00       | 2.05    | 1.00       | 1.94    | 1.00        | 1.63
Avg Latency (s)   | 2.48       | 39.38   | 2.50       | 40.11   | (missing)   | 24.92
Fallback Rate     | 0.00       | 0.50    | 0.06       | 0.46    | 0.00        | 0.22
*Naive confidence = top retrieval similarity; RR confidence = dual‑judge fused score, so the two confidence columns are not directly comparable. Reinforcing RAG improves exact match, token F1, and answer coverage at every sample size. With the iterative loop, the RR fallback rate is 22% at N=100, down from 50% at N=20.
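The dual‑judge fused score can be sketched from the rule stated in the methodology (confidence = mean − divergence penalty). The penalty weight of 0.5, the use of absolute difference as the divergence, the clamping to [0, 1], and the example scores are assumptions for illustration.

```python
from statistics import mean

CRITERIA = ("faithfulness", "completeness", "consistency")

def fuse_confidence(judge_a, judge_b, penalty_weight=0.5):
    """Return (confidence, divergence) from two judges' criterion scores."""
    score_a = mean(judge_a[c] for c in CRITERIA)
    score_b = mean(judge_b[c] for c in CRITERIA)
    divergence = abs(score_a - score_b)  # assumed divergence measure
    confidence = mean([score_a, score_b]) - penalty_weight * divergence
    return max(0.0, min(1.0, confidence)), divergence

judge_a = {"faithfulness": 0.9, "completeness": 0.8, "consistency": 0.7}
judge_b = {"faithfulness": 0.7, "completeness": 0.6, "consistency": 0.5}
confidence, divergence = fuse_confidence(judge_a, judge_b)
print(round(confidence, 3), round(divergence, 3))
```

Under these assumptions, two judges that agree contribute their plain mean, while disagreement lowers the fused score in proportion to the gap between them.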

Key Contributions

  • Modular RAG pipeline with hybrid retrieval + dual‑judge verification.
  • Iterative loop with three recovery actions (refine, retrieve more, regenerate).
  • Full transparency chain for auditability.
  • Empirical gains on HotpotQA at N=100, relative to naive RAG: +9.3% EM, +6.5% Token F1, +6.7% Coverage.
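The percentages quoted for N=100 are relative improvements over the naive baseline, not absolute score differences. A quick check using the N=100 values from the benchmark table (the variable names are illustrative):

```python
# N=100 scores copied from the benchmark table.
naive = {"EM": 0.430, "Token F1": 0.618, "Coverage": 0.629}
rr = {"EM": 0.470, "Token F1": 0.658, "Coverage": 0.671}

gains = {}
for metric in naive:
    absolute = rr[metric] - naive[metric]
    relative = 100 * absolute / naive[metric]
    gains[metric] = round(relative, 1)
    print(f"{metric}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

The absolute gains are about 0.04 on each metric, which corresponds to the quoted relative gains because the baseline scores lie between 0.43 and 0.63.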

Transparency Chain Example

[
  {"step": "query_analysis", "complexity":0.65, "type":"multi-part"},
  {"step": "decomposition", "subqueries":["What is DNA replication?",...]},
  {"step": "retrieval_hop1", "evidence_count":5},
  {"step": "verification", "confidence":0.72, "divergence":0.08},
  {"step": "decision", "action": "retrieve_more"}
]

Every step is logged, which supports reproducibility and lets a reader audit how each answer was produced.
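A chain like the one above could be produced by a small recorder that appends one entry per pipeline step and serializes the list to JSON. The class and method names are hypothetical; only the field names are taken from the example.

```python
import json

class TransparencyChain:
    """Append-only log of pipeline steps, serializable to JSON."""

    def __init__(self):
        self.steps = []

    def record(self, step, **fields):
        # Each entry keeps the step name plus any step-specific fields.
        self.steps.append({"step": step, **fields})

    def to_json(self):
        return json.dumps(self.steps, indent=2)

chain = TransparencyChain()
chain.record("query_analysis", complexity=0.65, type="multi-part")
chain.record("verification", confidence=0.72, divergence=0.08)
chain.record("decision", action="retrieve_more")
print(chain.to_json())
```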

Project Milestones

Phase                 | Milestone                                          | Deliverable
Phase 1 (Weeks 1–4)   | Literature review, dataset prep, baseline RAG      | Indexed HotpotQA, naive RAG baseline
Phase 2 (Weeks 5–8)   | Query analyzer, decomposer, hybrid retriever       | Multi‑hop retrieval module + evidence aggregator
Phase 3 (Weeks 9–11)  | Generator, dual‑judge verifier, decision engine    | Iterative loop prototype, confidence scoring
Phase 4 (Weeks 12–13) | Full integration, benchmarking, transparency chain | CLI + web demo, benchmark script
Final                 | Report writing, evaluation, submission             | Final report, code, slides, video

Deliverables

  • Code & Models: Full Python pipeline (planner, retriever, aggregator, generator, verifier, decision engine). Configuration via .env and config.py.
  • Demo: Lightweight web interface (Flask/FastAPI) with transparency chain visualization.
  • Datasets & Index: HotpotQA distractor setting, FAISS index, BM25 index. Reproducible evaluation scripts.
  • Reports: Interim report, final report (PDF), and experiment logs.
  • Presentation: Slides and demo video for final defense.

Limitations

  • Average latency of roughly 25–40 s per query, due to multiple LLM calls (naive RAG takes about 2.5 s).
  • Evaluated only on HotpotQA; domain adaptation needed.
  • Fixed evidence budget may truncate long answers.
  • Verification operates at the answer level only; claim‑level verification is planned.

Future Work

  • Claim‑level verification & V‑score.
  • Adaptive iteration budget & early stopping.
  • Specialized small judge models for efficiency.
  • Integration with structured knowledge bases.
  • User‑in‑the‑loop transparency & deployment.

Contact

For questions or collaboration, please contact Chen Borun or Lian Tuzhi. We welcome discussions on verifiable RAG and multi‑hop reasoning.