BioBrief is a specialized "Sovereign AI" system designed to solve the vocabulary mismatch problem in clinical trial recruitment. Unlike generic search engines, BioBrief acts as an intelligent agent that reads patient profiles in natural language, retrieves relevant studies using domain-specific embeddings, and strictly validates eligibility criteria using Large Language Models (LLMs).
Status: Production Prototype (MVP)
Architecture: Retrieval-Augmented Generation (RAG) + Agentic Reasoning
Clinical trial recruitment is broken. The official database (ClinicalTrials.gov) relies on exact Boolean matching (e.g., searching "Heart Attack" misses trials labeled "Myocardial Infarction"). Patients and general practitioners struggle to navigate dense medical jargon, contributing to roughly 80% of trials being delayed due to lack of participants.
BioBrief is a Verticalized Agentic RAG System.
- Verticalized: It uses `BioBERT` (trained on PubMed), not generic BERT. It "understands" that Keytruda and Pembrolizumab are the same drug.
- Agentic: It doesn't just retrieve documents. It uses a Reasoning Engine to "read" the inclusion/exclusion criteria and assign a confidence score (`High`, `Medium`, `Low`) based on the patient's specific history.
- Sovereign: It is model-agnostic. The privacy-aware architecture allows switching between cloud (OpenAI/Gemini) and local (Ollama/Llama 3) models instantly.
- Streaming ETL Pipeline: Capable of ingesting the massive 10GB+ ClinicalTrials.gov dataset on standard hardware (16GB RAM) without crashing, using `ijson` stream parsing.
- Semantic Search: Uses Cosine Similarity via HNSW indexing to find conceptually similar trials, not just keyword matches.
- Structured AI Output: Forces LLMs to output strict JSON, preventing hallucinations and ensuring consistent UI rendering.
- Multi-Provider Support: Built with the Factory Pattern, allowing hot-swapping between:
- Google Gemini 2.0 Flash (Speed & Context)
- OpenAI GPT-4o (Reasoning)
- Local Llama 3.1 / Mistral-Nemo (Privacy & Cost)
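The provider hot-swapping above can be sketched with a minimal factory. This is an illustrative sketch, not the project's actual `llm_engine.py`: the class and function names (`make_llm`, `GeminiClient`, `LocalClient`) are hypothetical, and the `complete` bodies stand in for real SDK calls.

```python
# Hypothetical sketch of a multi-provider LLM factory (Factory Pattern).
# Real implementations would call google-genai, openai, or ollama here.
from abc import ABC, abstractmethod


class LLMClient(ABC):
    """Common interface every provider must satisfy."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class GeminiClient(LLMClient):
    def complete(self, prompt: str) -> str:
        # Placeholder for a google-genai API call.
        return "gemini response"


class LocalClient(LLMClient):
    def complete(self, prompt: str) -> str:
        # Placeholder for an Ollama-hosted model call.
        return "local response"


def make_llm(provider: str) -> LLMClient:
    """Resolve the LLM_PROVIDER env value to a concrete client."""
    registry = {"gemini": GeminiClient, "local": LocalClient}
    try:
        return registry[provider]()
    except KeyError:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")
```

Because every client shares one interface, swapping providers is a one-line config change rather than a code change.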
BioBrief adheres to modern "AI Engineering" standards, moving away from simple scripts to modular, scalable micro-components.
| Layer | Technology | Standard / Justification |
|---|---|---|
| Ingestion | `ijson` (Streaming) | Generator Pipelines: We treat data as a stream, not a batch. This allows processing 500k+ studies without loading the 10GB file into RAM. |
| Embeddings | BioBERT (PubMedBERT) | Small Specialized Models (SSMs): Instead of massive generic models, we use a small, medically trained model (running on CPU) for superior domain accuracy. |
| Storage | ChromaDB | Vector Native: Uses HNSW (Hierarchical Navigable Small World) graphs for sub-millisecond semantic retrieval. |
| Brain | Gemini / Llama 3 | Constrained Generation: We use "JSON Mode" to force the LLM to act as a structured data extractor, not a chatbot. |
| Interface | Streamlit | Data Apps: Separates frontend state management from backend inference logic. |
- Language: Python 3.13
- Frontend: Streamlit
- Vector DB: ChromaDB (Persistent)
- Embeddings: `sentence-transformers` (`pritamdeka/S-PubMedBert-MS-MARCO`)
- LLM Integration: `google-genai` (v1.0), `openai`, `ollama`
- Data Processing: `ijson` (iterative JSON parser), `pandas`
```bash
git clone https://github.com/irtazaakram/biobrief.git
cd biobrief
pip install -r requirements.txt
```
This project processes the full ClinicalTrials.gov dataset (10GB+). You must download it manually as it is ignored by Git.
- Go to ClinicalTrials.gov.
- Click "Download".
- Place `ctg-studies.json` in the root folder of this project.
Rename `.env.example` to `.env` and configure your preferred AI provider.
For Google Gemini (Recommended):
```env
LLM_PROVIDER=gemini
GOOGLE_API_KEY=your_actual_api_key
```
For Local Privacy (Ollama):
```env
LLM_PROVIDER=local
LOCAL_LLM_URL=http://localhost:11434/v1
LOCAL_MODEL_NAME=mistral-nemo
```
Run the ingestion pipeline. This reads the 10GB JSON stream, filters for "Recruiting" trials, creates embeddings, and saves them to `./data/chroma_db`.
```bash
python src/ingestion.py
```
Note: This may take 10-20 minutes depending on your CPU.
```bash
streamlit run app.py
```
We implement a Lazy Loading pattern. The script opens a file handle to the 10GB JSON but reads only one study object at a time.
- It extracts: `Title`, `Summary`, `Conditions`, and `Eligibility Criteria`.
- It concatenates them into a rich Semantic Document.
- It vectorizes them using BioBERT (on CPU to avoid Mac MPS crashes).
- It commits them to ChromaDB in batches of 50 to optimize I/O.
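Two of the steps above, building the semantic document and committing in batches of 50, can be sketched with plain Python generators. This is a hedged illustration, not the project's `ingestion.py`: the field names in `build_document` and the helper names are assumptions, and in the real pipeline the study stream would come from `ijson` rather than an in-memory iterable.

```python
# Sketch of the batched, lazy ingestion steps (assumed helper names).
from itertools import islice
from typing import Iterable, Iterator


def build_document(study: dict) -> str:
    """Concatenate key fields into one semantic document for embedding.

    Field names here are illustrative; the real dataset uses the
    ClinicalTrials.gov schema.
    """
    fields = ("title", "summary", "conditions", "eligibility")
    return "\n".join(str(study.get(f, "")) for f in fields)


def batched(studies: Iterable, size: int = 50) -> Iterator[list]:
    """Yield lists of `size` studies so vector-DB commits happen in chunks.

    Keeps memory flat: only one batch is materialized at a time.
    """
    it = iter(studies)
    while batch := list(islice(it, size)):
        yield batch
```

In the real pipeline, each batch would be embedded with BioBERT and written to ChromaDB before the next batch is read, which is what keeps the 10GB file out of RAM.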
When a user searches "Child with Type 1 Diabetes":
- The query is converted to a vector `[0.02, 0.91, ...]`.
- ChromaDB performs a Cosine Similarity Search to find trials with the closest vector distance.
- This retrieves relevant trials even if they use different words (e.g., "Juvenile Onset Diabetes").
A RAG system is only as good as its facts. We don't just dump the search results.
- We pass the Patient Profile and the Trial Criteria to the LLM.
- We use a Zero-Shot Prompt with strict JSON formatting rules.
- The LLM acts as a "Virtual Oncologist," logically checking if the patient meets the specific inclusion/exclusion rules (Age, Stage, Prior Treatments).
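The "strict JSON" contract above only pays off if malformed LLM output is rejected before it reaches the UI. A minimal validation sketch, assuming a hypothetical verdict schema (`eligible`, `confidence`, `reasoning` are illustrative field names, not necessarily the project's actual schema):

```python
# Sketch: validate the LLM's JSON verdict instead of trusting free text.
import json

# Hypothetical schema for the eligibility verdict.
SCHEMA_KEYS = {"eligible", "confidence", "reasoning"}
CONFIDENCE_LEVELS = {"High", "Medium", "Low"}


def parse_match(raw: str) -> dict:
    """Parse and validate the LLM's JSON output; raise on any deviation."""
    verdict = json.loads(raw)  # raises if the model emitted non-JSON text
    missing = SCHEMA_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"LLM output missing keys: {sorted(missing)}")
    if verdict["confidence"] not in CONFIDENCE_LEVELS:
        raise ValueError("confidence must be High, Medium, or Low")
    return verdict
```

Any response that fails validation can be retried or discarded, so a hallucinated or chatty answer never renders as a trial match.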
```
BioBrief/
├── .env                 # API Keys and Config
├── ctg-studies.json     # (Ignored) 10GB Dataset
├── app.py               # Main UI
├── requirements.txt     # Dependencies
├── src/
│   ├── ingestion.py     # Streaming ETL Pipeline
│   ├── llm_engine.py    # Multi-LLM Factory
│   └── retrieval.py     # Semantic Search Logic
└── data/                # Vector Database Storage
```