---
layout: page
title: Evaluation
permalink: /eval/
redirect_from:
---
This project adopts a server-client architecture. We require a running OpenAI-compatible LLM/VLM server (e.g., vLLM Serving, the OpenAI API) to provide LLM/VLM inference services. This repo provides the evaluation framework, i.e., the client side of the project.
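Any server that speaks the standard OpenAI-compatible REST API will do. As a quick sanity check, a request like the following should return a completion (the URL, port, and model name here are placeholders; adjust them to your server):

```bash
curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-VL-2B-Instruct",
          "messages": [{"role": "user", "content": "Say hello."}]
        }'
```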
To set up the evaluation framework, you can use uv (recommended) or pip:

```bash
uv venv
uv sync
uv run playwright install chromium
```

Or, using pip:

```bash
pip install -e .
playwright install chromium
```
More on Playwright installation: this project depends on DeOCR, which in turn depends on Playwright to render text into images with a browser. Below is a copy of DeOCR's installation instructions; please follow the upstream instructions from DeOCR whenever possible.
```bash
pip install deocr[playwright,pymupdf]
# activate your python environment, then install playwright deps
playwright install chromium
```
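To confirm the browser install actually works, a quick headless-launch check like this sketch of ours can help (this is not part of DeOCR's instructions; drop the `uv run` prefix in a plain pip environment):

```bash
# launch and close headless chromium once; fails fast if system deps are missing
uv run python -c "
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    browser.close()
print('chromium OK')
"
```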
If you have trouble installing Playwright, or run into host-switching problems (e.g., on Slurm clusters), we suggest a hacky fix like this:

```bash
# put a libasound.so.2 file (a fake one is also fine) in $HOME/.local/lib,
# then export the lib paths for playwright to find it:
mkdir -p "$HOME/.local/lib"
touch "$HOME/.local/lib/libasound.so.2"  # fake stand-in; use a real copy if you have one
export LIBRARY_PATH=$LIBRARY_PATH:$HOME/.local/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.local/lib
```

We provide ready-to-use shell/Slurm scripts for parallel evaluation in the slurm/ folder that are equivalent to the following (a minimal Slurm sketch also appears after the collect step below).
VTCBench evaluation:
```bash
uv run examples/run.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data config/data/nolima.json \
    --data.context_length 1000 \
    --render config/render/default.yml \
    # --run.num_tasks 1  # for smoke test
```

VTCBench-Wild evaluation:
```bash
# no rendering or context-length params, because they come from the -wild dataset
uv run examples/run_wild.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data.path MLLM/VTCBench \
    --data.split Retrieval \
    # --run.num_tasks 1  # for smoke test
```

Collect results by running `uv run examples/collect.py results`, or `uv run examples/collect.py /path/to/results/`.
This will print a table like the one below:

```
                      contains_all  ROUGE-L  json_id
render_css model_id
           Qwen3-8B          99.38    74.35      800
```
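For parallel evaluation, the scripts in the slurm/ folder are the reference. Purely as an illustration, a minimal Slurm array job sharding over context lengths could look like the sketch below; the job name, array size, and context-length values are our own placeholders, and the actual scripts in slurm/ may differ:

```bash
#!/bin/bash
#SBATCH --job-name=vtcbench
#SBATCH --array=0-3
# one context length per array task; values are arbitrary examples
CONTEXT_LENGTHS=(1000 2000 4000 8000)
uv run examples/run.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data config/data/nolima.json \
    --data.context_length "${CONTEXT_LENGTHS[$SLURM_ARRAY_TASK_ID]}" \
    --render config/render/default.yml
```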
To set up a vLLM serving endpoint, please refer to the vLLM Serving Documentation.
A simple example to get you started, using the deps from pyproject.toml:

```bash
# set up a vllm environment separately, parallel to this repo:
# mkdir ../vllm-0.11
uv venv
uv add vllm==0.11.0  # optionally flash-attn: https://github.com/Dao-AILab/flash-attention
# serve your model
vllm serve Qwen/Qwen3-VL-2B-Instruct --port 8001
# to test your endpoint
curl http://localhost:8001/v1/models
```

The following are our dependency recommendations for known models, to avoid potential issues; upgrade or downgrade with caution.
| Model Name | Dependencies |
|---|---|
| Qwen3-VL Series | `vllm==0.11.0`, `transformers==4.57.1` |
| moonshotai/Kimi-VL-A3B-Instruct | `vllm==0.9.2`, `transformers<4.54` |
| InternVL3.5 Series | `vllm==0.10.1.1`, `transformers==4.57.1` |
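For example, to serve Kimi-VL you would pin its environment per the table above. A sketch with uv (the port choice just mirrors the earlier example):

```bash
# separate, pinned environment for the Kimi-VL server
uv venv
uv pip install "vllm==0.9.2" "transformers<4.54"
# some models with custom code may additionally need --trust-remote-code
vllm serve moonshotai/Kimi-VL-A3B-Instruct --port 8001
```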