---
layout: page
title: Evaluation
permalink: /eval/
redirect_from:
---
This project adopts a server-client architecture. We require a running OpenAI-compatible LLM/VLM server (e.g., vLLM Serving, the OpenAI API) to provide LLM/VLM inference services. This repo provides the evaluation framework, i.e., the client side of the project.
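Any server that speaks the standard OpenAI-compatible REST API will do. As a quick sanity check, a request like the following should return a completion (the URL, port, and model name here are placeholders; adjust them to your server):

```bash
curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-VL-2B-Instruct",
          "messages": [{"role": "user", "content": "Say hello."}]
        }'
```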
To set up the evaluation framework, you can use uv (recommended) or pip:

```bash
uv venv
uv sync
uv run playwright install chromium
```

Or, using pip:

```bash
pip install -e .
playwright install chromium
```
More on Playwright installation: this project depends on DeOCR, which in turn depends on Playwright to render text into images with a browser. Below is a copy of DeOCR's installation instructions; please follow the upstream instructions from DeOCR whenever possible.
```bash
pip install deocr[playwright,pymupdf]
# activate your python environment, then install playwright deps
playwright install chromium
```
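To confirm the browser install actually works, a quick headless-launch check like this sketch of ours can help (this is not part of DeOCR's instructions; drop the `uv run` prefix in a plain pip environment):

```bash
# launch and close headless chromium once; fails fast if system deps are missing
uv run python -c "
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    browser.close()
print('chromium OK')
"
```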
If you have trouble installing Playwright, or run into host-switching problems (e.g., on Slurm clusters), we suggest a hacky fix like this:

```bash
# put a libasound.so.2 file (a fake one is also fine) in $HOME/.local/lib,
# then export the lib paths for playwright to find it:
mkdir -p "$HOME/.local/lib"
touch "$HOME/.local/lib/libasound.so.2"  # fake stand-in; use a real copy if you have one
export LIBRARY_PATH=$LIBRARY_PATH:$HOME/.local/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.local/lib
```

We provide ready-to-use shell/Slurm scripts for parallel evaluation in the slurm/ folder that are equivalent to the following (a minimal Slurm sketch also appears after the collect step below).
VTCBench evaluation:
```bash
uv run examples/run.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data config/data/nolima.json \
    --data.context_length 1000 \
    --render config/render/default.yml \
    # --run.num_tasks 1  # for smoke test
```

VTCBench-Wild evaluation:
```bash
# no rendering or context-length params, because they come from the -wild dataset
uv run examples/run_wild.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data.path MLLM/VTCBench \
    --data.split Retrieval \
    # --run.num_tasks 1  # for smoke test
```

Collect results by running `uv run examples/collect.py results`, or `uv run examples/collect.py /path/to/results/`.
This will print a table like the one below:

```
                      contains_all  ROUGE-L  json_id
render_css model_id
           Qwen3-8B          99.38    74.35      800
```
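For parallel evaluation, the scripts in the slurm/ folder are the reference. Purely as an illustration, a minimal Slurm array job sharding over context lengths could look like the sketch below; the job name, array size, and context-length values are our own placeholders, and the actual scripts in slurm/ may differ:

```bash
#!/bin/bash
#SBATCH --job-name=vtcbench
#SBATCH --array=0-3
# one context length per array task; values are arbitrary examples
CONTEXT_LENGTHS=(1000 2000 4000 8000)
uv run examples/run.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data config/data/nolima.json \
    --data.context_length "${CONTEXT_LENGTHS[$SLURM_ARRAY_TASK_ID]}" \
    --render config/render/default.yml
```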
To set up a vLLM serving endpoint, please refer to the vLLM Serving Documentation.
A simple example to get you started, using the deps from pyproject.toml:

```bash
# set up a vllm environment separately, parallel to this repo:
# mkdir ../vllm-0.11
uv venv
uv add vllm==0.11.0  # optionally flash-attn: https://github.com/Dao-AILab/flash-attention
# serve your model
vllm serve Qwen/Qwen3-VL-2B-Instruct --port 8001
# to test your endpoint
curl http://localhost:8001/v1/models
```

The following are our dependency recommendations for known models, to avoid potential issues; upgrade or downgrade with caution.
| Model Name | Dependencies |
|---|---|
| Qwen3-VL Series | `vllm==0.11.0`, `transformers==4.57.1` |
| moonshotai/Kimi-VL-A3B-Instruct | `vllm==0.9.2`, `transformers<4.54` |
| InternVL3.5 Series | `vllm==0.10.1.1`, `transformers==4.57.1` |
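For example, to serve Kimi-VL you would pin its environment per the table above. A sketch with uv (the port choice just mirrors the earlier example):

```bash
# separate, pinned environment for the Kimi-VL server
uv venv
uv pip install "vllm==0.9.2" "transformers<4.54"
# some models with custom code may additionally need --trust-remote-code
vllm serve moonshotai/Kimi-VL-A3B-Instruct --port 8001
```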