Reasoning-Enhanced RAG Pipeline

A sophisticated Retrieval-Augmented Generation (RAG) system that combines graph-based knowledge representation, multi-step reasoning, and advanced verification mechanisms to answer scientific queries with high accuracy.


Quick Start

Prerequisites

  • Google Colab account (recommended) or local Python 3.8+
  • GROQ API key (free at https://console.groq.com)
  • GPU recommended but not required

Installation & Setup

  1. Open in Google Colab

    • Upload the notebook or create new cells
  2. Install Dependencies (Cell 1)

    pip install -q groq qdrant-client sentence-transformers scikit-learn kagglehub
  3. Configure API Key (Cell 4)

    • Go to Colab → Secrets (🔑 icon on left sidebar)
    • Add secret: GROQ_API_KEY = your_api_key_here
  4. Run All Cells

    • Runtime → Run all
    • Wait for dataset download (~2-3 minutes first time)
    • Pipeline will automatically initialize
  5. Query the System (Cell 12)

    • Modify test_queries list
    • Run to get answers with citations

Tech Stack & Concepts

  • Core Framework: Agentic RAG, Graph-Enhanced Retrieval (GraphRAG)
  • LLMs & Inference: Llama 3.3 70B, Groq API
  • Vector Database: Qdrant (In-memory)
  • Reasoning Implementation: Chain-of-Thought (CoT), Self-RAG, NLI Verification
  • Graph Processing: NetworkX, Semantic Graph Construction
  • Libraries: sentence-transformers, scikit-learn, kagglehub

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    USER QUERY                                │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: RETRIEVAL PLANNING                                 │
│ ├─ Query Analysis (LLM)                                     │
│ ├─ Sub-query Generation                                     │
│ └─ Strategy Selection (sequential/parallel)                 │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: GRAPH-ENHANCED RETRIEVAL                           │
│ ├─ Vector Search (Qdrant + SentenceTransformer)            │
│ ├─ Query Expansion                                          │
│ ├─ Cross-Encoder Reranking                                  │
│ ├─ Knowledge Graph Walk (optional)                          │
│ └─ NLI Verification (optional)                              │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: KNOWLEDGE AGGREGATION                              │
│ ├─ Multi-document Synthesis                                 │
│ ├─ Chain-of-Thought Reasoning (LLM)                         │
│ ├─ Confidence Scoring                                       │
│ └─ Progressive Outline Building                             │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 4: PROBABILISTIC FUSION                               │
│ ├─ Weighted Answer Merging                                  │
│ ├─ Confidence-based Filtering                               │
│ └─ Beam Search Aggregation (LLM)                             │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 5: ANSWER GENERATION & VERIFICATION                   │
│ ├─ Final Answer Synthesis (LLM)                             │
│ ├─ Citation Formatting                                      │
│ ├─ Self-RAG Reflection                                      │
│ └─ Quality Validation                                       │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
              FINAL ANSWER

Section 3 Compliance

| Section 3 Concept | Implemented? | Evidence in Codebase |
| --- | --- | --- |
| Pre-Retrieval Reasoning | Yes | AdvancedRetrievalPlanner breaks queries into sub-queries and decides on a strategy (sequential vs. parallel). |
| Reasoning-During-Retrieval | Yes | KnowledgeGraphBuilder and retrieve_with_graph_reasoning let the system "walk" the graph to find non-obvious related documents. |
| Post-Retrieval Reasoning | Yes | DeepVerificationEngine performs NLI (Natural Language Inference) to verify that retrieved documents logically entail the hypothesis. |
| Reasoning-Enhanced Generation | Yes | ProbabilisticFusionEngine and the final synthesis step use Chain-of-Thought (CoT) and self-reflection (checking that the answer is "supported" and "relevant") before outputting the result. |

Core Components

1. Data Structures (Cell 5)

ReasoningType (Enum)

Defines different reasoning strategies:

  • CHAIN_OF_THOUGHT - Sequential step-by-step reasoning
  • GRAPH_OF_THOUGHT - Multi-path graph exploration
  • SELF_REFLECTION - Answer validation and critique
  • NLI_VERIFICATION - Natural Language Inference checking
  • PROBABILISTIC_FUSION - Weighted answer combination

ReasoningStep (Dataclass)

Tracks each reasoning iteration:

- step_id: Unique identifier
- query: Current sub-query
- reasoning_type: Strategy used
- retrieved_docs: Source documents
- synthesized_fact: Generated answer
- confidence: Score (0.0-1.0)
- supporting_evidence: Text snippets
- verification_score: NLI score
- graph_path: Navigation history

RetrievalPlan (Dataclass)

Query decomposition strategy:

- query: Original question
- sub_queries: Broken down questions
- retrieval_strategy: "sequential" or "parallel"
- expected_hops: Graph traversal depth
- reasoning_path: Step descriptions
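
A minimal sketch of these structures, with field names taken from the lists above (types and defaults are assumptions, not the repository's exact definitions):

from dataclasses import dataclass, field
from enum import Enum, auto

class ReasoningType(Enum):
    CHAIN_OF_THOUGHT = auto()
    GRAPH_OF_THOUGHT = auto()
    SELF_REFLECTION = auto()
    NLI_VERIFICATION = auto()
    PROBABILISTIC_FUSION = auto()

@dataclass
class ReasoningStep:
    step_id: str
    query: str
    reasoning_type: ReasoningType
    retrieved_docs: list = field(default_factory=list)
    synthesized_fact: str = ""
    confidence: float = 0.0
    supporting_evidence: list = field(default_factory=list)
    verification_score: float = 0.0
    graph_path: list = field(default_factory=list)

@dataclass
class RetrievalPlan:
    query: str
    sub_queries: list = field(default_factory=list)
    retrieval_strategy: str = "sequential"  # or "parallel"
    expected_hops: int = 1
    reasoning_path: list = field(default_factory=list)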

2. Knowledge Graph Builder (Cell 6)

Purpose: Creates semantic connections between documents

Key Functions:

extract_entities(text)

  • Input: Document text
  • Output: List of capitalized entities (names, terms)
  • Method: Simple pattern matching (capitalization + length > 3); see the sketch below
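
A minimal sketch of this heuristic (the exact pattern is an assumption; only the capitalization and length checks are documented):

import re

def extract_entities(text: str) -> list:
    # Naive heuristic: capitalized tokens longer than 3 characters, deduplicated
    candidates = re.findall(r'\b[A-Z][A-Za-z]+\b', text)
    return sorted({tok for tok in candidates if len(tok) > 3})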

build_graph_from_documents(documents)

  • Process:
    1. Extract entities from each document
    2. Generate embeddings using SentenceTransformer
    3. Create nodes with content, entities, embeddings
    4. Build edges between nodes sharing entities
  • Output: NetworkX directed graph (see the sketch below)
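
A sketch of the construction, assuming extract_entities from above and the all-MiniLM-L6-v2 embedder; the edge weight is assumed to be the size of the shared-entity set:

import networkx as nx
from itertools import permutations
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def build_graph_from_documents(documents: list) -> nx.DiGraph:
    graph = nx.DiGraph()
    for i, text in enumerate(documents):
        graph.add_node(i, content=text,
                       entities=set(extract_entities(text)),
                       embedding=embedder.encode(text))
    # Connect every pair of nodes that shares at least one entity
    for i, j in permutations(graph.nodes, 2):
        shared = graph.nodes[i]['entities'] & graph.nodes[j]['entities']
        if shared:
            graph.add_edge(i, j, weight=len(shared))
    return graph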

walk_on_graph(start_nodes, max_hops=3)

  • Purpose: Multi-hop document discovery
  • Algorithm:
    1. Start from initial retrieved documents
    2. Explore neighbors up to max_hops away
    3. Prioritize high-weight connections (entity overlap)
  • Use Case: Finding related information not in initial search

Example:

Query: "H. pylori and cancer"
Initial Doc: "H. pylori causes infection"
Graph Walk → Finds: "Infections lead to cancer risk"
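
A sketch of the walk itself, assuming the weighted NetworkX digraph built above: breadth-first expansion that visits the strongest connections first.

def walk_on_graph(graph, start_nodes, max_hops=3):
    visited = set(start_nodes)
    frontier = list(start_nodes)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            # Prioritize high-weight edges (largest entity overlap)
            neighbors = sorted(graph.successors(node),
                               key=lambda n: graph[node][n]['weight'],
                               reverse=True)
            for n in neighbors:
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return visited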

3. Retrieval Planner (Cell 7)

Purpose: Decomposes complex queries into sub-queries

Class: AdvancedRetrievalPlanner

create_retrieval_plan(query)

  • Process:
    1. Send query to LLM (Llama 3.3 70B)
    2. Request JSON structured plan
    3. Parse sub-queries, strategy, expected hops
  • Fallback: Single-query direct retrieval

Example Output:

{
  "sub_queries": [
    "What is H. pylori?",
    "How does H. pylori cause cancer?"
  ],
  "retrieval_strategy": "sequential",
  "expected_hops": 2,
  "reasoning_path": ["Define pathogen", "Explain mechanism"]
}
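
A sketch of the planning call, assuming the groq SDK and the llama-3.3-70b-versatile model id (the notebook's exact prompt is not reproduced here):

import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def create_retrieval_plan(query: str) -> dict:
    prompt = (
        "Decompose the question into sub-queries. Respond with JSON keys: "
        "sub_queries, retrieval_strategy, expected_hops, reasoning_path.\n"
        f"Question: {query}"
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        # Fallback: single-query direct retrieval
        return {"sub_queries": [query], "retrieval_strategy": "sequential",
                "expected_hops": 1, "reasoning_path": ["direct retrieval"]}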

4. Verification Engine (Cell 8)

Purpose: Validates answer accuracy and relevance

Class: DeepVerificationEngine

verify_entailment(premise, hypothesis)

  • Input:
    • Premise: Retrieved document text
    • Hypothesis: Query or claim
  • Process: LLM checks if premise supports hypothesis
  • Output:
    {
      "entails": true/false,
      "score": 0.0-1.0,
      "label": "entailment/neutral/contradiction"
    }
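
A matching sketch of the entailment check, reusing the Groq client and json import from the planner sketch above (the prompt wording is again an assumption):

def verify_entailment(premise: str, hypothesis: str) -> dict:
    prompt = (
        "Does the premise entail the hypothesis? Respond with JSON keys: "
        "entails (true/false), score (0.0-1.0), "
        "label (entailment/neutral/contradiction).\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}"
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic verification
    )
    return json.loads(resp.choices[0].message.content)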

self_rag_reflect(generated_text, evidence)

  • Purpose: Self-critique mechanism from Self-RAG paper
  • Checks:
    1. Relevant: Does answer address query?
    2. Supported: Is answer backed by evidence?
    3. Useful: Is answer informative?
  • Special Case: Recognizes "not found" statements as valid
  • Output: Scores + list of issues

5. Fusion Engine (Cell 9)

Purpose: Combines multiple answers into coherent response

Class: ProbabilisticFusionEngine

beam_aggregate(sub_queries, answers)

  • Algorithm:
    1. Filter answers by confidence threshold (>0.3)
    2. Weight by confidence scores
    3. Send to LLM for fusion
  • Output: Combined answer with overall confidence (see the sketch below)
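
A sketch of the non-LLM half of this method, assuming each answer is a dict with "text" and "confidence" keys; fuse_with_llm is a hypothetical stand-in for the fusion call:

def beam_aggregate(sub_queries, answers, threshold=0.3):
    # 1. Drop answers below the confidence threshold
    kept = [a for a in answers if a['confidence'] > threshold]
    if not kept:
        return {"fused_answer": "", "confidence": 0.0}
    # 2. Normalize confidences into fusion weights
    total = sum(a['confidence'] for a in kept)
    weighted = [(a['confidence'] / total, a['text']) for a in kept]
    # 3. fuse_with_llm (hypothetical) sends the weighted answers to the LLM
    fused_text = fuse_with_llm(sub_queries, weighted)
    return {"fused_answer": fused_text, "confidence": total / len(kept)}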

progressive_aggregation(reasoning_steps)

  • Purpose: Build hierarchical outline
  • Process: Extract key points from each step
  • Output: Structured outline with confidence per point

6. Hybrid Retriever (Cell 10)

Purpose: Core retrieval system with multiple strategies

Class: EnhancedHybridRetriever

Models Used:

  • Embedder: all-MiniLM-L6-v2 (384 dimensions)
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
  • Vector DB: Qdrant (in-memory)

retrieve_with_graph_reasoning(query, top_k, use_graph)

Step-by-Step Process:

  1. Query Expansion

    query_expanded = f"{query} research study evidence"
    • Adds context terms to improve recall
  2. Vector Search

    search_results = qdrant.query_points(
        query=query_vector,
        limit=20  # Over-retrieve for reranking
    )
    • Cosine similarity search
    • Returns 20 candidates
  3. Cross-Encoder Reranking

    pairs = [[query, text] for text in candidates]
    rerank_scores = reranker.predict(pairs)
    • More accurate than embeddings
    • Scores each query-document pair
    • Sorts by rerank score
  4. Graph Enhancement (optional)

    start_nodes = top_3_results  # top-3 reranked documents (no extra list wrapper)
    expanded = kg_builder.walk_on_graph(start_nodes)
    • Adds related documents
    • Increases diversity
  5. Final Selection

    • Returns top_k documents
    • Includes rerank scores

7. Complete Pipeline (Cell 11)

Class: CompleteReasoningRAGPipeline

Main Method: process_query_with_all_techniques(query, enable_graph, enable_nli)


STAGE 1: Retrieval Planning

plan = self.planner.create_retrieval_plan(query)

Actions:

  • LLM analyzes query complexity
  • Generates sub-queries if needed
  • Determines retrieval strategy

Output Example:

Strategy: sequential
Sub-queries: 1

STAGE 2: Graph-Enhanced Retrieval

For each sub-query:

retrieved = self.retriever.retrieve_with_graph_reasoning(
    sub_query, top_k=5, use_graph=True
)

Actions:

  1. Vector search (20 candidates)
  2. Rerank to top 5
  3. Optional: NLI verification of top result
  4. Create ReasoningStep with:
    • Retrieved documents
    • Synthesized fact (via _synthesize_with_cot)
    • Confidence score
    • Supporting evidence

STAGE 3: Knowledge Aggregation

progressive_result = self.fusion_engine.progressive_aggregation(
    reasoning_steps
)

Actions:

  • Build outline from all steps
  • Calculate completeness (% of steps processed)
  • Prepare for fusion

Output:

{
  "outline": [
    {"point": "H. pylori is a pathogen...", "confidence": 0.8},
    {"point": "It causes cancer by...", "confidence": 0.7}
  ],
  "completeness": 1.0
}

STAGE 4: Probabilistic Fusion

fused = self.fusion_engine.beam_aggregate(
    plan.sub_queries, answers
)

Actions:

  1. Weight answers by confidence
  2. Filter low-confidence (<0.3)
  3. LLM combines into coherent text
  4. Calculate fusion confidence

Output:

{
  "fused_answer": "Combined text...",
  "confidence": 0.6
}

STAGE 5: Final Answer Generation

final_answer = self._generate_final_answer(
    query, reasoning_steps, fused
)

Process:

  1. Source Collection

    all_sources = []
    for step in reasoning_steps:
        for doc in step.retrieved_docs[:2]:
            all_sources.append(doc['text'][:400])
  2. Prompt Construction

    Question: {query}
    Sources: [Source 1], [Source 2]...
    
    Instructions:
    - Use ONLY provided sources
    - Cite as [1], [2], etc.
    - State if information not found
    
  3. LLM Generation

    • Model: Llama 3.3 70B
    • Temperature: 0.1 (deterministic)
    • Max tokens: 600
  4. Quality Validation

    if 'not found' in answer:
        confidence = 0.2  # Lower for missing info
    elif not sources_support_answer:
        confidence = 0.3  # Penalize unsupported claims

STAGE 6: Self-RAG Reflection

reflection = self.verifier.self_rag_reflect(
    final_answer, all_evidence
)

Checks:

  • ✅ Relevant: Answer addresses query?
  • ✅ Supported: Claims backed by evidence?
  • ✅ Useful: Provides value to user?

Output:

{
  "relevant": false,
  "supported": false,
  "useful": true,
  "overall_score": 0.3,
  "issues": ["Missing key mechanism details"]
}

📚 Dataset & Indexing (Cell 11)

SciFact Dataset

  • Source: Kaggle (via kagglehub)
  • Size: 5,183 scientific abstracts
  • Domain: Biomedical research papers
  • Format: CSV with columns: doc_id, title, abstract, structured

Text Cleaning

def clean_text(text):
    if not text or text == 'nan':
        return ''
    text = ' '.join(text.split())  # Remove extra whitespace
    if len(text) < 100:  # Filter very short texts
        return ''
    return text

Indexing Process

  1. Initialize Qdrant

    from qdrant_client import QdrantClient

    qdrant_client = QdrantClient(":memory:")
    • In-memory vector database
    • Fast for <10K documents
  2. Create Collection

    vectors_config = VectorParams(
        size=384,  # MiniLM embedding dimension
        distance=Distance.COSINE
    )
  3. Generate Embeddings

    from qdrant_client.models import PointStruct

    for i, doc in enumerate(corpus):
        text = clean_text(doc['abstract'])
        vector = embedding_model.encode(text)
        # Upsert into the collection ("scifact" is an illustrative name)
        qdrant_client.upsert("scifact", points=[
            PointStruct(id=i, vector=vector.tolist(), payload={"text": text})])
  4. Build Knowledge Graph

    • Scrolls through first 500 documents
    • Extracts entities
    • Creates graph edges

Final Stats:

✅ Indexed 5,183 documents
✅ Graph: 500 nodes, 79 edges

Query Execution Flow

Example: "How does cagPAI-positive H. pylori affect AID expression?"

1. Planning (1 API call)

→ LLM: Analyze query
← Response: Single complex query, sequential strategy

2. Retrieval

→ Vector DB: Search for "cagPAI H. pylori AID expression research study evidence"
← Results: 20 candidates
→ Reranker: Score all pairs
← Top 5 documents (including target paper at index 3354)

3. NLI Verification (1 API call)

→ LLM: Does top document support query?
← Score: 0.50 (moderate support)

4. Synthesis (1 API call)

→ LLM: Answer using these 3 sources
← "cagPAI-positive H. pylori induces aberrant AID expression via IκB kinase"
   Confidence: 0.6

5. Fusion (1 API call)

→ LLM: Combine answers (only 1 sub-query here, so pass-through)
← Fused answer with 0.6 confidence

6. Final Answer (1 API call)

→ LLM: Generate answer with citations from 5 sources
← "Infection with cagPAI-positive H. pylori induces aberrant expression 
    of activation-induced cytidine deaminase (AID) via the IκB kinase [1]."
   Confidence: 1.0

7. Reflection (1 API call)

→ LLM: Validate answer against evidence
← Relevant: False, Supported: False, Score: 0.3
   (stricter than the actual quality; a known issue)

Total: 6 API calls, ~3,400 tokens


Key Algorithms & Strategies

1. Query Expansion

query_expanded = f"{query} research study evidence"
  • Why: Improves recall by adding context
  • Tradeoff: May reduce precision
  • Solution: Reranking filters noise

2. Two-Stage Retrieval

Stage 1: Fast vector search (retrieve 20)
Stage 2: Slow cross-encoder rerank (score all, keep 5)
  • Benefits: Best of both worlds - speed + accuracy
  • Cost: 20 reranker inferences per query (a combined sketch follows)
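
A condensed sketch of both stages together (the collection name and payload layout are assumptions):

from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer('all-MiniLM-L6-v2')
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def two_stage_retrieve(query, qdrant, top_k=5, limit=20):
    # Stage 1: cheap dense search, over-retrieve `limit` candidates
    hits = qdrant.query_points(
        collection_name="scifact",  # illustrative name
        query=embedder.encode(query).tolist(),
        limit=limit,
    ).points
    texts = [h.payload['text'] for h in hits]
    # Stage 2: expensive cross-encoder scoring, keep the best top_k
    scores = reranker.predict([[query, t] for t in texts])
    ranked = sorted(zip(scores, texts), key=lambda p: p[0], reverse=True)
    return ranked[:top_k]

Reranking only the 20 candidates keeps the cross-encoder cost bounded while recovering most of its accuracy gain.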

3. Temperature Control

# Planning: temperature=0.2 (creative decomposition)
# Synthesis: temperature=0.1 (faithful to sources)
# Fusion: temperature=0.2 (balanced combination)

4. Confidence Cascading

Retrieval score (0-1)
  → Synthesis confidence (0-1)
    → Fusion confidence (0-1)
      → Final confidence (0-1)

Lower scores propagate downstream
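
The notebook does not spell out the combination rule; one minimal, conservative reading (an assumption) is that no stage can report more confidence than it inherited:

# Assumption: min-propagation, so one weak stage caps the final score
final_confidence = min(retrieval_score, synthesis_confidence,
                       fusion_confidence, answer_confidence)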

5. Progressive Outlining

  • Builds answer incrementally
  • Each step adds to outline
  • Tracks completeness metric

Performance Metrics

Accuracy Indicators

  1. Confidence Score

    • Range: 0.0 - 1.0
    • Source: LLM self-assessment
    • High (>0.8): Strong source support
    • Low (<0.3): Weak/no evidence
  2. Self-RAG Score

    • Range: 0.0 - 1.0
    • Components: Relevant + Supported + Useful
    • Known Issue: Often underestimates quality (correct answers frequently score around 0.3)
  3. Completeness

    • Percentage of sub-queries answered
    • Always 100% for single-query plans

Resource Usage

  • API Calls: 5-6 per query (simple), 15-20 (complex multi-step)
  • Tokens: ~3,000-5,000 per query
  • Time: 5-10 seconds per query
  • Memory: ~2GB for full dataset in RAM

Configuration Options

When Calling Pipeline

result = rag_pipeline.process_query_with_all_techniques(
    query="Your question",
    enable_graph=True,   # Use knowledge graph walking
    enable_nli=False     # Skip NLI verification (saves API calls)
)

Adjustable Parameters

In EnhancedHybridRetriever:

top_k=5           # Number of documents to retrieve
limit=20          # Candidates before reranking
max_hops=3        # Graph walk depth

In CompleteReasoningRAGPipeline:

temperature=0.1   # LLM creativity (lower = more deterministic)
max_tokens=600    # Answer length limit

In APICounter:

max_calls=100     # Stop after N API calls

Known Issues & Limitations

1. Low Self-RAG Scores

  • Issue: Reflection often marks correct answers as unsupported (scoring them around 0.3)
  • Cause: Overly strict verification prompts
  • Impact: Misleading quality metric
  • Solution: Ignore Self-RAG score, trust Confidence instead

2. Graph Building Limited

  • Issue: Only first 500 docs used for graph
  • Cause: Performance optimization
  • Impact: May miss distant connections
  • Solution: Increase the scroll limit from limit=500 to limit=2000

3. Entity Extraction Naive

  • Issue: Simple capitalization-based
  • Cause: No NER model used
  • Impact: Misses lowercase entities, over-includes
  • Solution: Use spaCy or Transformers NER

4. No Caching

  • Issue: Same query reprocesses everything
  • Cause: No cache implementation
  • Impact: Wastes API calls
  • Solution: Add an LRU cache for embeddings/answers (see the sketch below)
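
A minimal sketch, assuming the embedding_model from the indexing cell; functools.lru_cache keys on the text string, so repeated queries skip re-encoding:

from functools import lru_cache

@lru_cache(maxsize=2048)
def embed_cached(text: str):
    # Cache hit for previously seen text; otherwise encode and store
    return embedding_model.encode(text)

The same pattern can wrap final-answer generation, keyed on the query string.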

Advanced Customization

Add New Query Type

# In AdvancedRetrievalPlanner.create_retrieval_plan()

if "compare" in query.lower():
    return RetrievalPlan(
        sub_queries=[
            f"What is {entity1}?",
            f"What is {entity2}?",
            f"Compare {entity1} and {entity2}"
        ],
        retrieval_strategy="parallel",
        expected_hops=2
    )

Change Reranking Model

# In EnhancedHybridRetriever.__init__()

self.reranker = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')
# Faster, slightly less accurate

Add BM25 Sparse Retrieval

import numpy as np
from rank_bm25 import BM25Okapi

# Tokenize corpus
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

# Hybrid retrieval: weighted sum of per-document dense and sparse scores
# (assumes dense_scores holds one Qdrant similarity per document)
vector_scores = np.asarray(dense_scores)
bm25_scores = bm25.get_scores(query.split())  # numpy array, one per doc
combined = 0.7 * vector_scores + 0.3 * bm25_scores

References & Citations

Papers Implemented:

  1. Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs (arXiv:2507.09477)
    • Note: The full PDF (2507.09477v2.pdf) is included in this repository.
  2. Self-RAG: Self-Reflective Retrieval-Augmented Generation
  3. Graph-of-Thought: Graph-Based Reasoning for Complex Questions
  4. Fusion-in-Decoder: Probabilistic Answer Fusion

Models Used:

  • Llama 3.3 70B (via GROQ API)
  • all-MiniLM-L6-v2 (Sentence Transformers)
  • ms-marco-MiniLM-L-6-v2 (Cross-Encoder)

Libraries:

  • Qdrant: Vector database
  • NetworkX: Graph operations
  • SentenceTransformers: Embeddings

🤝 Contributing & Support

Common Issues:

  1. "API Key not found"

    • Add GROQ_API_KEY to Colab secrets
  2. "Collection not found"

    • Run Cell 11 (dataset loading) before queries
  3. "Out of memory"

    • Reduce corpus_subset size from 5183 to 1000
  4. "Answers are incorrect"

    • Check if target document is in indexed subset
    • Increase corpus_subset size
    • Lower clean_text threshold to 50 chars

Example Usage

# Simple query
result = rag_pipeline.process_query_with_all_techniques(
    query="What causes gastric cancer?",
    enable_graph=False,  # Faster
    enable_nli=False     # Fewer API calls
)

print(result['final_answer']['answer'])
# → "Infection with Helicobacter pylori is a risk factor..."

# Complex multi-hop query
result = rag_pipeline.process_query_with_all_techniques(
    query="Explain the molecular pathway from H. pylori infection to cancer",
    enable_graph=True,   # Multi-hop reasoning
    enable_nli=True      # Verify each step
)

print(result['reflection']['overall_score'])
# → 0.3 (ignore this, it's often wrong)

print(result['final_answer']['confidence'])
# → 0.8 (trust this instead)

Learning Outcomes

After working through this pipeline, you will understand:

  • Modern RAG architecture (5-stage)
  • Vector databases (Qdrant)
  • Semantic search (embeddings + reranking)
  • Knowledge graphs for retrieval
  • LLM prompting strategies
  • Answer fusion techniques
  • Self-critique mechanisms
  • Citation generation

Version: 1.0
Last Updated: January 2026
License: MIT
