BioBrief is a specialized "Sovereign AI" system designed to solve the vocabulary mismatch problem in clinical trial recruitment. Unlike generic search engines, BioBrief acts as an intelligent agent that reads patient profiles in natural language, retrieves relevant studies using domain-specific embeddings, and strictly validates eligibility criteria using Large Language Models (LLMs).
Status: Production Prototype (MVP)
Architecture: Retrieval-Augmented Generation (RAG) + Agentic Reasoning
Clinical trial recruitment is broken. The official database (ClinicalTrials.gov) relies on exact Boolean matching (e.g., searching "Heart Attack" misses trials labeled "Myocardial Infarction"). Patients and general practitioners struggle to navigate dense medical jargon, contributing to roughly 80% of trials being delayed due to lack of participants.
BioBrief is a Verticalized Agentic RAG System.
- Verticalized: It uses `BioBERT` (trained on PubMed), not generic BERT. It "understands" that Keytruda and Pembrolizumab are the same drug.
- Agentic: It doesn't just retrieve documents. It uses a Reasoning Engine to "read" the inclusion/exclusion criteria and assign a confidence score (`High`, `Medium`, `Low`) based on the patient's specific history.
- Sovereign: It is model-agnostic. The privacy-aware architecture allows switching between cloud (OpenAI/Gemini) and local (Ollama/Llama 3) models instantly.
- Streaming ETL Pipeline: Capable of ingesting the massive 10GB+ ClinicalTrials.gov dataset on standard hardware (16GB RAM) without crashing, using `ijson` stream parsing.
- Semantic Search: Uses Cosine Similarity via HNSW indexing to find conceptually similar trials, not just keyword matches.
- Structured AI Output: Forces LLMs to output strict JSON, preventing hallucinations and ensuring consistent UI rendering.
- Multi-Provider Support: Built with the Factory Pattern, allowing hot-swapping between:
- Google Gemini 2.0 Flash (Speed & Context)
- OpenAI GPT-4o (Reasoning)
- Local Llama 3.1 / Mistral-Nemo (Privacy & Cost)
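The provider hot-swapping above can be sketched with a minimal factory. This is an illustrative sketch, not the project's actual `llm_engine.py`: the class and function names (`make_llm`, `GeminiClient`, `LocalClient`) are hypothetical, and the `complete` bodies stand in for real SDK calls.

```python
# Hypothetical sketch of a multi-provider LLM factory (Factory Pattern).
# Real implementations would call google-genai, openai, or ollama here.
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Common interface every provider must satisfy."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class GeminiClient(LLMClient):
    def complete(self, prompt: str) -> str:
        # Placeholder for a google-genai API call.
        return "gemini response"


class LocalClient(LLMClient):
    def complete(self, prompt: str) -> str:
        # Placeholder for an Ollama-hosted model call.
        return "local response"


def make_llm(provider: str) -> LLMClient:
    """Resolve the LLM_PROVIDER env value to a concrete client."""
    registry = {"gemini": GeminiClient, "local": LocalClient}
    try:
        return registry[provider]()
    except KeyError:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")
```

Because every client shares one interface, swapping providers is a one-line config change rather than a code change.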
BioBrief adheres to modern "AI Engineering" standards, moving away from simple scripts to modular, scalable micro-components.
| Layer | Technology | Standard / Justification |
|---|---|---|
| Ingestion | `ijson` (Streaming) | Generator Pipelines: We treat data as a stream, not a batch. This allows processing 500k+ studies without loading the 10GB file into RAM. |
| Embeddings | BioBERT (PubMedBERT) | Small Specialized Models (SSMs): Instead of massive generic models, we use a small, medically trained model (running on CPU) for superior domain accuracy. |
| Storage | ChromaDB | Vector Native: Uses HNSW (Hierarchical Navigable Small World) graphs for sub-millisecond semantic retrieval. |
| Brain | Gemini / Llama 3 | Constrained Generation: We use "JSON Mode" to force the LLM to act as a structured data extractor, not a chatbot. |
| Interface | Streamlit | Data Apps: Separates frontend state management from backend inference logic. |
- Language: Python 3.13
- Frontend: Streamlit
- Vector DB: ChromaDB (Persistent)
- Embeddings: `sentence-transformers` (`pritamdeka/S-PubMedBert-MS-MARCO`)
- LLM Integration: `google-genai` (v1.0), `openai`, `ollama`
- Data Processing: `ijson` (iterative JSON parser), `pandas`
```bash
git clone https://github.com/irtazaakram/biobrief.git
cd biobrief
pip install -r requirements.txt
```
This project processes the full ClinicalTrials.gov dataset (10GB+). You must download it manually as it is ignored by Git.
- Go to ClinicalTrials.gov.
- Click "Download".
- Place `ctg-studies.json` in the root folder of this project.
Rename `.env.example` to `.env` and configure your preferred AI provider.
For Google Gemini (Recommended):
```env
LLM_PROVIDER=gemini
GOOGLE_API_KEY=your_actual_api_key
```
For Local Privacy (Ollama):
```env
LLM_PROVIDER=local
LOCAL_LLM_URL=http://localhost:11434/v1
LOCAL_MODEL_NAME=mistral-nemo
```
Run the ingestion pipeline. This reads the 10GB JSON stream, filters for "Recruiting" trials, creates embeddings, and saves them to `./data/chroma_db`.
```bash
python src/ingestion.py
```
Note: This may take 10-20 minutes depending on your CPU.
```bash
streamlit run app.py
```
We implement a Lazy Loading pattern. The script opens a file handle to the 10GB JSON but reads only one study object at a time.
- It extracts: `Title`, `Summary`, `Conditions`, and `Eligibility Criteria`.
- It concatenates them into a rich Semantic Document.
- It vectorizes them using BioBERT (on CPU to avoid Mac MPS crashes).
- It commits them to ChromaDB in batches of 50 to optimize I/O.
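Two of the steps above, building the semantic document and committing in batches of 50, can be sketched with plain Python generators. This is a hedged illustration, not the project's `ingestion.py`: the field names in `build_document` and the helper names are assumptions, and in the real pipeline the study stream would come from `ijson` rather than an in-memory iterable.

```python
# Sketch of the batched, lazy ingestion steps (assumed helper names).
from itertools import islice
from typing import Iterable, Iterator


def build_document(study: dict) -> str:
    """Concatenate key fields into one semantic document for embedding.

    Field names here are illustrative; the real dataset uses the
    ClinicalTrials.gov schema.
    """
    fields = ("title", "summary", "conditions", "eligibility")
    return "\n".join(str(study.get(f, "")) for f in fields)


def batched(studies: Iterable, size: int = 50) -> Iterator[list]:
    """Yield lists of `size` studies so vector-DB commits happen in chunks.

    Keeps memory flat: only one batch is materialized at a time.
    """
    it = iter(studies)
    while batch := list(islice(it, size)):
        yield batch
```

In the real pipeline, each batch would be embedded with BioBERT and written to ChromaDB before the next batch is read, which is what keeps the 10GB file out of RAM.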
When a user searches "Child with Type 1 Diabetes":
- The query is converted to a vector `[0.02, 0.91, ...]`.
- ChromaDB performs a Cosine Similarity Search to find trials with the closest vector distance.
- This retrieves relevant trials even if they use different words (e.g., "Juvenile Onset Diabetes").
A RAG system is only as good as its facts. We don't just dump the search results.
- We pass the Patient Profile and the Trial Criteria to the LLM.
- We use a Zero-Shot Prompt with strict JSON formatting rules.
- The LLM acts as a "Virtual Oncologist," logically checking if the patient meets the specific inclusion/exclusion rules (Age, Stage, Prior Treatments).
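The "strict JSON" contract above only pays off if malformed LLM output is rejected before it reaches the UI. A minimal validation sketch, assuming a hypothetical verdict schema (`eligible`, `confidence`, `reasoning` are illustrative field names, not necessarily the project's actual schema):

```python
# Sketch: validate the LLM's JSON verdict instead of trusting free text.
import json

# Hypothetical schema for the eligibility verdict.
SCHEMA_KEYS = {"eligible", "confidence", "reasoning"}
CONFIDENCE_LEVELS = {"High", "Medium", "Low"}


def parse_match(raw: str) -> dict:
    """Parse and validate the LLM's JSON output; raise on any deviation."""
    verdict = json.loads(raw)  # raises if the model emitted non-JSON text
    missing = SCHEMA_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"LLM output missing keys: {sorted(missing)}")
    if verdict["confidence"] not in CONFIDENCE_LEVELS:
        raise ValueError("confidence must be High, Medium, or Low")
    return verdict
```

Any response that fails validation can be retried or discarded, so a hallucinated or chatty answer never renders as a trial match.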
```
BioBrief/
├── .env                 # API Keys and Config
├── ctg-studies.json     # (Ignored) 10GB Dataset
├── app.py               # Main UI
├── requirements.txt     # Dependencies
├── src/
│   ├── ingestion.py     # Streaming ETL Pipeline
│   ├── llm_engine.py    # Multi-LLM Factory
│   └── retrieval.py     # Semantic Search Logic
└── data/                # Vector Database Storage
```