State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLAVA-MED and BIOMEDGPT, primarily depend on scaling model size and data volume, with training driven largely by autoregressive objectives. However, we reveal that this approach can lead to weak vision-language alignment, making these models overly dependent on costly instruction-following data.
To address this, we introduce EXGRA-MED, a novel multi-graph alignment framework that jointly aligns images, instruction responses, and extended captions in the latent space, advancing semantic grounding and cross-modal coherence. To scale to large LLMs (e.g., LLaMa-7B), we develop an efficient end-to-end training scheme using black-box gradient estimation, enabling fast and scalable optimization.
โ Reveals the data inefficiency of autoregressive modeling โ LLaVA-Med exhibits a significant performance drop when pre-trained on limited data, even after full fine-tuning on downstream tasks.
โ Matches LLaVA-Medโs performance on Medical VQA using only 10% of the pre-training data, demonstrating the data efficiency of EXGRA-MED.
โ Surpasses several SOTA medical multi-modal LLMs when pre-trained on the full PMC-15M dataset (100%) with LLaMA-7B, across diverse tasks:
- (i) Medical Visual Question Answering (VQA)
- (ii) Medical Visual Chatbot
- (iii) Zero-shot Image Classification (as a VQA task)
- ๐ฃ News
- ๐ฆ Model Checkpoints
- ๐ ๏ธ Requirements and Installation
- ๐ Project Structure
- ๐ Dataset Configuration Files
- ๐๏ธ Extended Instructions Generation
- Pre-Training on Two Stages
- ๐ง Fine-tuning on VQA Tasks
- ๐ Evaluation
- ๐ฌ Data Efficiency Demonstration (10% vs 40%)
- ๐ Citation
- [Dec 2025] ๐ฃ The paper has been accepted at NeurIPS 2025!
- [Jun 2025] ๐ Initial codebase release (preprocessing + VQA fine-tuning).
- [Jun 2025] ๐งฉ Checkpoints for EXGRA-MED + DCI and three VQA fine-tuned models now available.
- [Jun 2025] ๐ Evaluation scripts and demo for the data-efficiency benchmark for VQA are online.
- Coming Soon ๐ง Evaluation Scripts for Medical Visual Chatbot and Zero-shot Image Classification.
- Coming Soon ๐ง ExGra-Med checkpoints are trained at large-scale data with 2.5M instruction tuning samples from MedTrinity-25M (10%).
| Model | Description | ๐ค Download Link |
|---|---|---|
llava-med-10 |
LLaVa-Med (10% pre-trained PMC-15M) | Link |
llava-med-40 |
LLaVa-Med (40% pre-trained PMC-15M) | Link |
exgra-med-10 |
ExGra-Med (10% pre-trained PMC-15M) | Link |
exgra-med-40 |
ExGra-Med (40% pre-trained PMC-15M) | Link |
exgra-med |
Our base EXGRA-MED model (100% pre-trained PMC-15M) | Link |
exgra-med-dci |
EXGRA-MED + DCI-enhanced version | Link |
exgra-med-dci-vqa-rad |
Fine-tuned on VQA-RAD | Link |
exgra-med-dci-slake |
Fine-tuned on SLAKE | Link |
exgra-med-dci-pathvqa |
Fine-tuned on PATH-VQA | Link |
Before starting the finetuning/inference/evaluation, download our finetuned checkpoints.
Download Checkpoints
cd pretrained/
# pip install -U huggingface_hub
# Download MERGE-Group/llava-med-10
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/llava-med-10 --local-dir llava-med-10
# Download MERGE-Group/llava-med-40
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/llava-med-40 --local-dir llava-med-40
# Download MERGE-Group/exgra-med-10
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-10 --local-dir exgra-med-10
# Download MERGE-Group/exgra-med-40
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-40 --local-dir exgra-med-40
# Download MERGE-Group/exgra-med
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med --local-dir exgra-med
# Download MERGE-Group/exgra-med-dci
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci --local-dir exgra-med-dci
# Download MERGE-Group/exgra-med-dci-vqa-rad
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci-vqa-rad --local-dir /exgra-med-dci-vqa-rad
# Download MERGE-Group/exgra-med-dci-slake
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci-slake --local-dir /exgra-med-dci-slake
# Download MERGE-Group/exgra-med-dci-pathvqa
huggingface-cli download --resume-download --local-dir-use-symlinks False MERGE-Group/exgra-med-dci-pathvqa --local-dir /exgra-med-dci-pathvqa
Basic Dependencies:
- Python >= 3.10
- Pytorch
- CUDA driver
Note: Please check your CUDA driver to install a proper version of PyTorch. For instance, we provide a guideline for installation for CUDA 11:
conda create -n exgra-med python=3.10.12
conda activate exgra-med
pip install --upgrade pip
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install openai==0.27.8
pip install git+https://github.com/huggingface/transformers@cae78c46
pip install -e .
pip install einops ninja open-clip-torch shortuuid nltk
pip install --upgrade pillowAlso, based on your CUDA driver, please check the proper version of Flash Attention 2 at this link, and then install flash-attn package:
pip install flash-attn==2.5.7 --no-build-isolationBefore running the pre-training stages, please install the following graph-related packages.
pip install pyg-lib==0.3.1 \ torch-scatter==2.1.2+pt21cu118 \ torch-sparse==0.6.18+pt21cu118 \ torch-cluster==1.6.3+pt21cu118 \ torch-spline-conv==1.2.2+pt21cu118 \ -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install torch-geometric==2.4.0
pip install --no-build-isolation git+https://github.com/mrolinek/lpmp.git@9fd6211c77a14beeb68cbc3a98e1b318e614c493
# If you meet errors during LPMP installation, verify your CMake setup and consider reinstalling it.
pip install pycocotoolsassets/: Contains various assets used by the project (e.g., images, supplementary files).scripts/: Houses utility bash scripts.exgra_med/: The main source code directory for theexgra_medpackage/application.data_preprocessing/: Scripts and modules related to data preprocessing.llava/: Specific modules or components related tollava.eval/: Code for evaluatingllavamodels.instruct/: Code related to instructingllavamodels.model/: Containsllavamodel definitions or related utilities.notebook/: Jupyter notebooks for experimentation or demonstration related tollava.serve/: Code for servingllavamodels (e.g., API endpoints).train/: Training scripts and configurations forllava.
untar_files.py: A Python script possibly used for decompressing or extracting files.
LICENSE: The license under which the project is distributed.pyproject.toml: A file used for specifying project build system requirements and project metadata (part of PEP 517/518).README.md: This README file, providing an overview of the project.
๐ Downstream Stage:
We provide pre-built .json configuration files for all datasets used in VQA training and evaluation in downstream tasks. These files specify paths, splits, and preprocessing parameters necessary for seamless execution. Firstly, create each dataset folder in folder data/, then put the corresponding dataset .json files into folders. Next, please see websites for datasets VQA-RAD, SLAKE 1.0, and PATH-VQA to download images/ folders and upload them into corresponding dataset folders in data/ folder.
| Dataset | Task | Config File Description | Download Link |
|---|---|---|---|
| VQA-RAD | VQA | Train/val splits, QA pairs | link |
| SLAKE | VQA | Train/val splits, QA pairs | link |
| PATH-VQA | VQA | Train/val splits, QA pairs | link |
๐ Pre-Training Stage:
To prepare dataset for pre-training stage using both Exgra-Med and original LLaVA-Med algorithms, downloading .json files for stage 1 (alignment) and stage 2 (instruction) in this link. Next, create new folder pretraining_data/ in data/ folder and upload downloaded jsons combined with images/ folder into pretraining_data/ folder. Please note that update dataset file paths inside .sh training files in scripts/ folder if needed to match your local dataset locations.
The script extended_caption_generation.py reads an input JSON of instructions/conversations, sends each question+answer pair to an LLM with a provided system prompt, and replaces the answer with the LLM-provided revision. It supports resuming from an existing extended output file.
Input JSON should contain items with a conversations (or misspelled conversatons) key whose value is a list of role objects. The script pairs even-indexed entries (questions) with the following odd-indexed entries (answers) and updates the answer value with the LLM revision.
Output file: if not resuming, a timestamped file is created next to the input file with suffix _extended_<model>_<timestamp>.json. When resuming, pass --resume_from to continue.
Create a .env file (or set environment variables) with the OpenRouter/OpenAI endpoint and API key used by the OpenAI client. Example .env:
OPENROUTER_ENDPOINT=https://openrouter.ai/api/v1
OPENROUTER_API_KEY=sk-...
Basic usage:
python extended_caption_generation.py
--original_instruction_fpath path/to/original_instructions.json
--system_prompt_fpath path/to/system_prompt.txtOptions:
--original_instruction_fpath(required): Path to input JSON with instructions/conversations.--system_prompt_fpath(required): Path to a text file containing the system prompt to send to the LLM.--resume_from(optional): Path to an existing extended JSON to resume from (skips already-processed ids).--model_name(optional): LLM model identifier (default:openai/gpt-4o-mini). List of reported models: openai/gpt-4o, google/gemini-2.5-flash, qwen/qwen3-8b.
Example with model and resume:
python extended_caption_generation.py
--original_instruction_fpath data/original.json
--system_prompt_fpath prompts/system_prompt.txt
--model_name openai/gpt-4o-mini
--resume_from data/original_extended_gpt-4o-mini_20250101_120000.jsonWe provide ready-to-use scripts to pre-train EXGRA-MED on 2 stages named stage1.sh and stage2.sh in scripts/ folder after downloading pre-training dataset. Make sure to update the --model_name_or_path, --data_path, and --image_folder in each .sh file to point to the correct location of the downloaded model and dataset.
# Example: Pre-train EXGRA-Med on stage 1
bash scripts/stage1.sh
# Example: Pre-train EXGRA-Med on stage 2
bash scripts/stage2.shWe provide ready-to-use scripts to fine-tune EXGRA-MED and EXGRA-MED + DCI on three popular medical VQA benchmarks: VQA-RAD, SLAKE, and PATH-VQA.
Each script uses one of our pretrained checkpoints as the starting point. ๐ Before running, make sure to update the --model_name_or_path in each .sh file to point to the correct location of the downloaded model.
# Example: Fine-tune on VQA-RAD
bash scripts/llava1-5_stage2_data_rad.sh # without DCI
bash scripts/llava1-5_stage2_data_rad_dci.sh # with DCI
# Fine-tune on SLAKE
bash scripts/llava1-5_stage2_slake.sh # without DCI
bash scripts/llava1-5_stage2_slake_dci.sh # with DCI
# Fine-tune on PATH-VQA
bash scripts/llava1-5_stage2_pvqa.sh # without DCI
bash scripts/llava1-5_stage2_pvqa_dci.sh # with DCIYou can run evaluation for each of the three key tasks:
# supports VQA-RAD, SLAKE, PATH-VQA
# change the following
# --model-name: Path to load the model from finetuning stage
# --answers-file: file to store the result (i.e the answers to the medical question)
python exgra_med/llava/eval/run_med_datasets_eval_batch.py \
--num-chunks 2 \
--model-name \<output_vqa_rad_checkpoint\> \
--mm_dense_connector_type none \
--num_l 6 \
--question-file ./data_RAD/test_w_options_new.json \
--image-folder ./data_RAD/images \
--answers-file \<answers_file\>
#change the following
#--pred: same as --answers-file above
# the metrics (recall and accuracy) are saved as a text file in the same place, with the same name as --pred.
#E.g: if --pred is ans-opt-new-3.jsonl, then metrics are saved in ans-opt-new-3.txt
python exgra_med/llava/eval/run_eval.py \
--gt ./data_RAD/test_w_options_new.json \
--pred \<answers_file\> \
--candidate ./data_RAD/candidate.json๐ง To be updated!
bash scripts/eval_chatbot.sh
By reformulating image classification as visual question answering, we can generate predictions by solving the VQA task with multiple-choice questions. First, download OmniMedVQA benchmark from OpenDataLab or huggingface and unzip it. Follow instructions in the file zero_shot_classification.sh to change variables FILENAME, OUTPUTNAME, WORKDIR and MODEL_CKPT and CONNECTOR_TYPE. Then run:
bash scripts/zero_shot_classification.sh
To replicate our findings on LLAVA-MEDโs data inefficiency and the strength of EXGRA-MED with 10% and 40% data (Tables 1 & 2 in the paper):
# Fine-tune EXGRA-MED with 10%/40% data on VQA task
bash scripts/train_exgra_10percent.sh ## Fine-tune checkpoint LLaVa-Med with 10%/40% data on VQA task
bash scripts/train_llava_10percent.shIf you find this work useful, please cite our paper:
@article{nguyen2025exgra,
title={EXGRA-MED: Extended Context Graph Alignment for Medical Vision- Language Models},
author={Duy M. H. Nguyen, Nghiem T. Diep, Trung Q. Nguyen, Hoang-Bao Le, Tai Nguyen, Tien Nguyen, TrungTin Nguyen, Nhat Ho, Pengtao Xie, Roger Wattenhofer, James Zou, Daniel Sonntag, Mathias Niepert},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}
The data, code, and model checkpoints are intended and licensed for research use only. They are also subject to additional restrictions dictated by the Terms of Use: LLaMA, Vicuna and GPT-4 respectively. The data is made available under CC BY NC 4.0. The data, code, and model checkpoints may be used for non-commercial purposes and any models trained using the dataset should be used only for research purposes. It is expressly prohibited for models trained on this data to be used in clinical care or for any clinical decision making purposes.

