Thank you for your contribution; this is an outstanding piece of work. I encountered a slight issue during the cold start reproduction phase and would like to seek assistance.
My training configuration is as follows:
`### model
model_name_or_path: ./Qwen/Qwen2.5-VL-7B-Instruct
image_max_pixels: 2007040
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: ./LLaMA-Factory/examples/deepspeed/ds_z3_config.json

### dataset
dataset_dir: ./huggingface.co/datasets/ares_sft
dataset: filter_data_final
template: qwen2_vl
cutoff_len: 32768
max_samples: null
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: ./checkpoints/llama_factory/ares_coldstart_big_image_filtered
logging_steps: 10
save_steps: 10
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2.0e-5
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
gradient_checkpointing: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### special_tokens
add_tokens: ",,,"
skip_special_tokens: false
resize_vocab: true`
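As a sanity check on what these settings imply, here is a rough sketch of the arithmetic. It assumes each visual token in Qwen2.5-VL covers a 28x28 pixel area (14-pixel patches merged 2x2). The GPU count is not part of the config above, so `num_gpus` is a placeholder and the value 8 is only an illustration.

```python
# Assumption: one visual token corresponds to a 28x28 pixel area
# (14-pixel patches with 2x2 spatial merging).
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def max_visual_tokens(max_pixels: int) -> int:
    """Upper bound on visual tokens for a given pixel budget."""
    return max_pixels // PIXELS_PER_VISUAL_TOKEN


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Global batch size per optimizer step under data parallelism."""
    return per_device * grad_accum * num_gpus


# Values taken from the config above.
image_tokens = max_visual_tokens(2007040)
video_tokens_per_frame = max_visual_tokens(16384)

# num_gpus=8 is a hypothetical value, not stated in the config.
global_batch = effective_batch_size(per_device=1, grad_accum=8, num_gpus=8)

print(f"max image tokens: {image_tokens}")
print(f"max video tokens per frame: {video_tokens_per_frame}")
print(f"effective batch size (assuming 8 GPUs): {global_batch}")
```

If the per-image token budget or the global batch size differs from the setup used in the paper, that would be one place to look for the accuracy gap.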
My training logs are as follows:
When I evaluated on the MathVision dataset, using GPT-4o-mini as the judge with the prefetch function enabled, the accuracy was only 0.29, well below the result reported in the original paper. Could there be an issue somewhere in my reproduction process?
I would be very grateful if you could help me resolve this problem.