Using the cold start dataset provided by Ares for SFT training, the MathVision evaluation score is only 29.

Thank you for your contribution; this is an outstanding piece of work. I encountered a slight issue during the cold start reproduction phase and would like to seek assistance.
My training configuration is as follows:
```yaml
### model
model_name_or_path: ./Qwen/Qwen2.5-VL-7B-Instruct
image_max_pixels: 2007040
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: ./LLaMA-Factory/examples/deepspeed/ds_z3_config.json

### dataset
dataset_dir: ./huggingface.co/datasets/ares_sft
dataset: filter_data_final
template: qwen2_vl
cutoff_len: 32768
max_samples: null
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: ./checkpoints/llama_factory/ares_coldstart_big_image_filtered
logging_steps: 10
save_steps: 10
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2.0e-5
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
gradient_checkpointing: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### special_tokens
add_tokens: "<think>,</think>,<answer>,</answer>"
skip_special_tokens: false
resize_vocab: true
```
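
For comparison against the paper's setup, the effective global batch size implied by this configuration can be worked out as in the sketch below. The GPU count is not stated in this issue, so `num_gpus=8` is purely a placeholder, and `effective_batch_size` is a hypothetical helper rather than anything from LLaMA-Factory.

```python
def effective_batch_size(per_device_batch_size: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Samples consumed per optimizer step under data-parallel training."""
    return per_device_batch_size * grad_accum_steps * num_gpus


if __name__ == "__main__":
    # per_device_train_batch_size and gradient_accumulation_steps come from
    # the config above; num_gpus=8 is an assumed placeholder value.
    print(effective_batch_size(per_device_batch_size=1,
                               grad_accum_steps=8,
                               num_gpus=8))
```

If the real GPU count differs, the learning rate of 2.0e-5 is paired with a different global batch size than the placeholder suggests, which may be worth checking against the original recipe.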

My training logs are as follows:

<img width="2382" height="817" alt="Image" src="https://github.com/user-attachments/assets/39c64bca-e6f4-49ed-b080-e62a8baae615" />

When I evaluated on the MathVision dataset, I used GPT-4o-mini for assessment and enabled the prefetch function. **However, the accuracy was only 0.29,** which is significantly lower than the results reported in the original paper. Could there be an issue in my reproduction process?
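
One sanity check that might help narrow this down is measuring how many model outputs actually contain a parseable `<answer>...</answer>` span before they reach the judge. The sketch below assumes the fine-tuned model wraps its final answer in the tags listed under `add_tokens`. The helper names `extract_answer` and `answer_tag_rate` are hypothetical and not part of any evaluation toolkit.

```python
import re

# Assumption: final answers are wrapped in <answer>...</answer>,
# matching the tokens added in the training config.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(text: str):
    """Return the stripped content of the first <answer> span, or None."""
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None


def answer_tag_rate(outputs):
    """Fraction of outputs that contain a parseable <answer> span."""
    if not outputs:
        return 0.0
    found = sum(1 for out in outputs if extract_answer(out) is not None)
    return found / len(outputs)


if __name__ == "__main__":
    samples = [
        "<think>2 + 2 = 4</think><answer> 4 </answer>",
        "The answer is 4.",  # no tags, so nothing to extract
    ]
    print(answer_tag_rate(samples))
```

A low tag rate would suggest that the answers are being lost during decoding or extraction rather than that the model is reasoning poorly.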

I would be very grateful if you could help me resolve this problem.
