
🧬 BioBrief: Verticalized Agentic RAG for Clinical Trials


BioBrief is a specialized "Sovereign AI" system designed to solve the vocabulary mismatch problem in clinical trial recruitment. Unlike generic search engines, BioBrief acts as an intelligent agent that reads patient profiles in natural language, retrieves relevant studies using domain-specific embeddings, and strictly validates eligibility criteria using Large Language Models (LLMs).

Status: Production Prototype (MVP)

Architecture: Retrieval-Augmented Generation (RAG) + Agentic Reasoning


Project Overview

The Problem

Clinical trial recruitment is broken. The official database (ClinicalTrials.gov) relies on exact keyword and boolean matching, so a search for "Heart Attack" misses trials labeled "Myocardial Infarction". Patients and general practitioners struggle to navigate dense medical jargon, and roughly 80% of trials are delayed for lack of participants.

The Solution: BioBrief

BioBrief is a Verticalized Agentic RAG System.

  • Verticalized: It uses BioBERT (trained on PubMed), not generic BERT. It "understands" that Keytruda and Pembrolizumab are the same drug.
  • Agentic: It doesn't just retrieve documents. It uses a Reasoning Engine to "read" the inclusion/exclusion criteria and assign a confidence score (High, Medium, Low) based on the patient's specific history.
  • Sovereign: It is model-agnostic. The privacy-aware architecture allows switching between Cloud (OpenAI/Gemini) and Local (Ollama/Llama 3) models instantly.

Key Features

  • Streaming ETL Pipeline: Capable of ingesting the massive 10GB+ ClinicalTrials.gov dataset on standard hardware (16GB RAM) without crashing, using ijson stream parsing.
  • Semantic Search: Uses Cosine Similarity via HNSW indexing to find conceptually similar trials, not just keyword matches.
  • Structured AI Output: Forces LLMs to output strict JSON, preventing hallucinations and ensuring consistent UI rendering.
  • Multi-Provider Support: Built with the Factory Pattern, allowing hot-swapping between:
    • Google Gemini 2.0 Flash (Speed & Context)
    • OpenAI GPT-4o (Reasoning)
    • Local Llama 3.1 / Mistral-Nemo (Privacy & Cost)
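As a sketch, the Factory Pattern behind this hot-swapping might look like the following. Class names, the `make_llm` function, and the default model strings are illustrative assumptions, not the project's actual `src/llm_engine.py` API:

```python
# Hypothetical sketch of a provider factory; names are illustrative.
from dataclasses import dataclass


@dataclass
class LLMClient:
    """Minimal common interface every provider must satisfy."""
    model: str

    def complete(self, prompt: str) -> str:
        raise NotImplementedError


class GeminiClient(LLMClient):
    def complete(self, prompt: str) -> str:
        return f"[gemini:{self.model}] {prompt}"  # real call would hit google-genai


class OpenAIClient(LLMClient):
    def complete(self, prompt: str) -> str:
        return f"[openai:{self.model}] {prompt}"


class LocalClient(LLMClient):
    def complete(self, prompt: str) -> str:
        return f"[local:{self.model}] {prompt}"  # e.g. Ollama's OpenAI-compatible API


PROVIDERS = {
    "gemini": (GeminiClient, "gemini-2.0-flash"),
    "openai": (OpenAIClient, "gpt-4o"),
    "local":  (LocalClient,  "mistral-nemo"),
}


def make_llm(provider: str) -> LLMClient:
    """Factory: the rest of the app never imports a provider SDK directly."""
    cls, default_model = PROVIDERS[provider]
    return cls(model=default_model)


llm = make_llm("gemini")
print(llm.complete("hello"))  # routed to the Gemini client
```

The point of the pattern is that swapping providers changes one string (e.g. an environment variable), not any call sites.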

Architecture

BioBrief adheres to modern "AI Engineering" standards, moving away from simple scripts to modular, scalable micro-components.

| Layer | Technology | Standard / Justification |
|---|---|---|
| Ingestion | ijson (Streaming) | Generator pipelines: we treat data as a stream, not a batch, so 500k+ studies can be processed without loading the 10GB file into RAM. |
| Embeddings | BioBERT (PubMedBERT) | Small specialized models: instead of massive generic models, a tiny, medically trained model (running on CPU) gives superior domain accuracy. |
| Storage | ChromaDB | Vector-native: uses HNSW (Hierarchical Navigable Small World) graphs for sub-millisecond semantic retrieval. |
| Brain | Gemini / Llama 3 | Constrained generation: "JSON Mode" forces the LLM to act as a structured data extractor, not a chatbot. |
| Interface | Streamlit | Data apps: separates frontend state management from backend inference logic. |

Tech Stack

  • Language: Python 3.13
  • Frontend: Streamlit
  • Vector DB: ChromaDB (Persistent)
  • Embeddings: sentence-transformers (pritamdeka/S-PubMedBert-MS-MARCO)
  • LLM Integration: google-genai (v1.0), openai, ollama
  • Data Processing: ijson (Iterative JSON parser), pandas

Installation & Setup

1. Clone the Repository

git clone https://github.com/irtazaakram/biobrief.git
cd biobrief

2. Install Dependencies

pip install -r requirements.txt

3. Download the Dataset (Critical Step)

This project processes the full ClinicalTrials.gov dataset (10GB+). You must download it manually as it is ignored by Git.

  1. Go to ClinicalTrials.gov.
  2. Click "Download".
  3. Place the downloaded ctg-studies.json in the root folder of this project.

4. Configure Environment

Rename .env.example to .env and configure your preferred AI provider.

For Google Gemini (Recommended):

LLM_PROVIDER=gemini
GOOGLE_API_KEY=your_actual_api_key

For Local Privacy (Ollama):

LLM_PROVIDER=local
LOCAL_LLM_URL=http://localhost:11434/v1
LOCAL_MODEL_NAME=mistral-nemo
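For illustration, the provider switch driven by these variables could be read roughly like this. This is a hedged sketch, not the project's actual startup code; the real handling lives in src/llm_engine.py:

```python
import os

# Stand-in for loading .env (a real app would typically use python-dotenv).
os.environ.setdefault("LLM_PROVIDER", "local")

provider = os.getenv("LLM_PROVIDER", "gemini").lower()

if provider == "local":
    # Ollama exposes an OpenAI-compatible endpoint at this URL by default.
    base_url = os.getenv("LOCAL_LLM_URL", "http://localhost:11434/v1")
    model = os.getenv("LOCAL_MODEL_NAME", "mistral-nemo")
else:
    # Hypothetical default model per cloud provider.
    model = {"gemini": "gemini-2.0-flash", "openai": "gpt-4o"}[provider]

print(provider, model)  # e.g. "local mistral-nemo"
```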

5. Build the Knowledge Base (One-Time Setup)

Run the ingestion pipeline. This reads the 10GB JSON stream, filters for "Recruiting" trials, creates embeddings, and saves them to ./data/chroma_db.

python src/ingestion.py

Note: This may take 10-20 minutes depending on your CPU.

6. Run the Application

streamlit run app.py

How It Works

1. Ingestion (src/ingestion.py)

We implement a Lazy Loading pattern. The script opens a file handle to the 10GB JSON but reads only one study object at a time.

  • It extracts: Title, Summary, Conditions, and Eligibility Criteria.
  • It concatenates them into a rich Semantic Document.
  • It vectorizes them using BioBERT (on CPU to avoid Mac MPS crashes).
  • It commits them to ChromaDB in batches of 50 to optimize I/O.
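The batching step above can be sketched with a plain generator. Here a fake stream stands in for `ijson.items(...)` so the snippet runs without the 10GB dataset; the function names are illustrative:

```python
# Sketch of the lazy-loading + batching pattern: only `size` studies
# are ever held in memory at once.
from itertools import islice
from typing import Iterable, Iterator


def batched(stream: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield fixed-size batches without materializing the whole stream."""
    it = iter(stream)
    while batch := list(islice(it, size)):
        yield batch


def fake_study_stream(n: int) -> Iterator[dict]:
    # Stand-in for: ijson.items(open("ctg-studies.json", "rb"), "item")
    for i in range(n):
        yield {"nct_id": f"NCT{i:08d}", "status": "RECRUITING"}


commits = 0
for batch in batched(fake_study_stream(125), size=50):
    commits += 1  # a collection.add(...) call to ChromaDB would go here

print(commits)  # 3 batches: 50 + 50 + 25
```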

2. Retrieval (src/retrieval.py)

When a user searches "Child with Type 1 Diabetes":

  • The query is converted to a vector [0.02, 0.91, ...].
  • ChromaDB performs a Cosine Similarity Search to find trials with the closest vector distance.
  • This retrieves relevant trials even if they use different words (e.g., "Juvenile Onset Diabetes").
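A toy illustration of why this works, with hand-made 3-dimensional vectors standing in for the 768-dimensional BioBERT embeddings (the numbers here are invented for the example):

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 for identical direction, 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for two indexed trials.
docs = {
    "Juvenile Onset Diabetes trial": [0.05, 0.90, 0.10],
    "Hip replacement study":         [0.80, 0.10, 0.40],
}
query = [0.02, 0.91, 0.08]  # vector for "Child with Type 1 Diabetes"

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # the diabetes trial wins despite zero keyword overlap
```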

3. Agentic Validation (src/llm_engine.py)

A RAG system is only as good as its facts, so we don't simply dump the raw search results on the user.

  • We pass the Patient Profile and the Trial Criteria to the LLM.
  • We use a Zero-Shot Prompt with strict JSON formatting rules.
  • The LLM acts as a "Virtual Oncologist," logically checking if the patient meets the specific inclusion/exclusion rules (Age, Stage, Prior Treatments).
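A minimal sketch of this validation step, assuming a hypothetical prompt template and response schema; the exact wording and schema in src/llm_engine.py may differ:

```python
import json

# Illustrative zero-shot prompt; the real template is an assumption here.
PROMPT_TEMPLATE = """You are a clinical trial eligibility checker.
Patient profile:
{patient}

Trial eligibility criteria:
{criteria}

Respond with ONLY a JSON object of this exact shape:
{{"eligible": true|false, "confidence": "High"|"Medium"|"Low", "reason": "<one sentence>"}}"""


def validate_verdict(raw: str) -> dict:
    """Reject any LLM reply that does not match the strict schema we asked for."""
    verdict = json.loads(raw)
    assert isinstance(verdict.get("eligible"), bool)
    assert verdict.get("confidence") in {"High", "Medium", "Low"}
    assert isinstance(verdict.get("reason"), str)
    return verdict


# Simulated LLM reply; a real call would go through the provider factory.
reply = '{"eligible": false, "confidence": "High", "reason": "Patient is 12; trial requires adults 18+."}'
verdict = validate_verdict(reply)
print(verdict["confidence"])  # High
```

Parsing and checking the JSON before rendering is what keeps hallucinated or free-form replies out of the UI.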

Project Structure

BioBrief/
├── .env                 # API Keys and Config
├── ctg-studies.json     # (Ignored) 10GB Dataset
├── app.py               # Main UI
├── requirements.txt     # Dependencies
├── src/
│   ├── ingestion.py     # Streaming ETL Pipeline
│   ├── llm_engine.py    # Multi-LLM Factory
│   └── retrieval.py     # Semantic Search Logic
└── data/                # Vector Database Storage
