The LLaVA version we use in HA-DPO is LLaVA-v1.5-7b. Before training, prepare the following:
- Fine-tuned LLaVA model. Download `liuhaotian/llava-v1.5-7b`:

```python
# download command
from huggingface_hub import snapshot_download
snapshot_download(repo_id="liuhaotian/llava-v1.5-7b")
```
If you download the language model weights to a user-specified path using git lfs rather than the download API provided by Hugging Face (such as `from_pretrained` or `snapshot_download`), replace all occurrences of `liuhaotian/llava-v1.5-7b` with the path of your downloaded model in training and evaluation.
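If you want the download API itself to place the weights at a fixed local path (so the training and evaluation commands can point at that path), a minimal sketch using `snapshot_download`'s `local_dir` option is shown below; the target directory is an example, not a required location:

```python
# Sketch: download the weights to a user-specified directory instead of the HF cache.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="liuhaotian/llava-v1.5-7b",
    local_dir="checkpoints/llava-v1.5-7b",  # example path; later pass this path in place of the repo id
)
```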
- Follow the instructions in data preparation to prepare the LLaVA-1.5 data.
We use LoRA adapters to fine-tune the language model of LLaVA-1.5. Following the training settings in LLaVA, all linear layers in the language model are set as trainable. 8 A100 GPUs are used during fine-tuning.
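For illustration only, the sketch below shows one common way to collect all linear-layer names as LoRA target modules with the `peft` library. The actual training script configures LoRA through the `--lora_enable`, `--lora_r`, and `--lora_alpha` flags in the command below; the helper function, the skipped module names, and the dropout value here are assumptions, not the repository's code.

```python
# Illustrative sketch (not the repository's code): gather every nn.Linear in the
# language model and use its leaf name as a LoRA target module, mirroring the
# "all linear layers trainable" setting described above.
import torch.nn as nn
from peft import LoraConfig

def find_linear_module_names(model, skip=("mm_projector", "vision_tower", "lm_head")):
    """Return leaf names (e.g. 'q_proj', 'down_proj') of all nn.Linear modules,
    skipping components assumed to stay outside LoRA."""
    names = set()
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and not any(s in name for s in skip):
            names.add(name.split(".")[-1])
    return sorted(names)

def build_lora_config(target_modules):
    return LoraConfig(
        r=128,               # matches --lora_r
        lora_alpha=256,      # matches --lora_alpha
        target_modules=target_modules,
        lora_dropout=0.05,   # assumed value; not shown in the training command
        task_type="CAUSAL_LM",
    )

# Usage (assuming `model` is the loaded LLaVA-1.5 model):
# lora_config = build_lora_config(find_linear_module_names(model))
```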
Training command
```shell
deepspeed ha_dpo/models/llava-v1_5/train_dpo.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 0 \
    --deepspeed ha_dpo/models/llava-v1_5/scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.5-7b \
    --version v1 \
    --vg_path ha_dpo/data/VG \
    --desc_data_path ha_dpo/data/hadpo/llava-v1.5/desc_data.json \
    --pope_data_path ha_dpo/data/hadpo/llava-v1.5/pope_data.json \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ha_dpo/models/llava-v1_5/checkpoints/{model_name} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-6 \
    --weight_decay 0. \
    --warmup_steps 0 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --run_name "llava-v1.5" \
    --beta 0.1
```
Default parameters are as follows:
epochs | learning rate | lora_r | lora_alpha | beta |
---|---|---|---|---|
1 | 2e-6 | 128 | 256 | 0.1 |
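Here `beta` is the temperature of the DPO objective: it scales the implicit reward, i.e. how strongly the policy is pushed apart from the frozen reference model on preferred versus rejected responses. Below is a minimal sketch of the standard DPO loss that `--beta` controls; it is not the repository's implementation.

```python
# Minimal sketch of the standard DPO objective (not the repository's code).
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # beta scales the implicit reward r(y) = beta * (log pi(y|x) - log pi_ref(y|x)).
    # Loss: -log sigmoid(beta * (delta_policy - delta_reference)), where
    # delta = log p(chosen) - log p(rejected), summed over response tokens.
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()
```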
Run the following command to evaluate on SHR:
```shell
python ha_dpo/models/llava-v1_5/shr_eval.py \
    --api-key {openai_apikey} \
    --vg-path ha_dpo/data/VG \
    --shr-path ha_dpo/data/shr \
    --model-base liuhaotian/llava-v1.5-7b \
    --model-path ha_dpo/models/llava-v1_5/checkpoints/{model_name}
```
- `--api-key`: SHR evaluation relies on GPT-4. Provide your OpenAI API key, which begins with `sk`.
- `--model-path`: path to the trained adapter weights.
After evaluation is finished, results are saved in `ha_dpo/models/llava-v1_5/shr_eval_results/{localtime}/`:

- `judgement.json`: detailed judgements in the SHR evaluation.
- `metrics.json`: detailed metrics in the SHR evaluation. `mean_hal_ratio` indicates the ratio of hallucinated sentences, which is the main SHR result.
To reproduce the results, use the trained adapter weights and set `--model-path juliozhao/hadpo-llava-1.5`.
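As a convenience, a minimal sketch for reading the main number from `metrics.json` (assuming, as described above, that it is a JSON file containing a `mean_hal_ratio` field):

```python
import json

# Replace {localtime} with the actual timestamped folder created by the evaluation run.
with open("ha_dpo/models/llava-v1_5/shr_eval_results/{localtime}/metrics.json") as f:
    metrics = json.load(f)

print("mean_hal_ratio:", metrics["mean_hal_ratio"])  # main SHR result
```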
SHR results
Model | HA-DPO | SHR |
---|---|---|
LLaVA-1.5 | ✖️ | 36.7 |
LLaVA-1.5 | ✔️ | 34.0 |
Step 1. First, run inference to generate answers on the POPE validation sets:
```shell
torchrun --nproc_per_node {NGPUS} --master_port $RANDOM ha_dpo/models/llava-v1_5/pope_eval.py \
    --coco_path ha_dpo/data/coco2014 \
    --pope_path ha_dpo/data/POPE \
    --model-path ha_dpo/models/llava-v1_5/checkpoints/{model_name} \
    --model-base liuhaotian/llava-v1.5-7b \
    --set {random/popular/adv}
```
- `--set`: validation set in POPE; choose one of `random/popular/adv`. After inference, the answer file will be generated under the LLaVA folder.
- `--model-path`: path to the trained adapter weights.
To reproduce the results, use the trained adapter weights and set `--model-path juliozhao/hadpo-llava-1.5`.
Step 2. Set the path of the answer file and the label file in `ha_dpo/data/POPE/evaluate.py`: set `ans_file` to the path of the answer file generated in step 1, and set `label_file` to the path of the corresponding label file under `ha_dpo/data/POPE/output/coco`.
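For example, the two assignments inside `ha_dpo/data/POPE/evaluate.py` would look roughly like the following; the exact variable layout in the script may differ, and both file names below are illustrative placeholders:

```python
# Inside ha_dpo/data/POPE/evaluate.py (illustrative values only):
ans_file = "answers_llava-v1.5_pope_random.jsonl"                  # answer file produced by pope_eval.py in step 1
label_file = "ha_dpo/data/POPE/output/coco/coco_pope_random.json"  # label file for the chosen POPE set
```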
Step 3. Evaluate. Run `python ha_dpo/data/POPE/evaluate.py` to get the results.
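For reference, the metrics reported below (accuracy, precision, recall, F1 score, and yes ratio) follow the standard POPE definitions over binary yes/no answers. A self-contained sketch of how they are computed (not the repository's script):

```python
def pope_metrics(preds, labels):
    """preds/labels: equal-length lists of 'yes'/'no' strings."""
    pairs = list(zip(preds, labels))
    tp = sum(p == "yes" and gt == "yes" for p, gt in pairs)  # true positives
    fp = sum(p == "yes" and gt == "no" for p, gt in pairs)   # false positives
    tn = sum(p == "no" and gt == "no" for p, gt in pairs)    # true negatives
    fn = sum(p == "no" and gt == "yes" for p, gt in pairs)   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "yes_ratio": (tp + fp) / len(pairs),
    }
```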
POPE results
POPE Random
Model | HA-DPO | Accuracy | Precision | Recall | F1 Score | Yes Ratio (%) |
---|---|---|---|---|---|---|
LLaVA-1.5 | ✖️ | 89.60 | 88.77 | 90.66 | 89.70 | 51.06 |
LLaVA-1.5 | ✔️ | 90.53 | 92.99 | 87.66 | 90.25 | 47.13 |
POPE Popular
Model | HA-DPO | Accuracy | Precision | Recall | F1 Score | Yes Ratio (%) |
---|---|---|---|---|---|---|
LLaVA-1.5 | ✖️ | 86.20 | 83.23 | 90.66 | 86.79 | 54.46 |
LLaVA-1.5 | ✔️ | 87.90 | 88.07 | 87.66 | 87.81 | 49.76 |
POPE Adversarial
Model | HA-DPO | Accuracy | Precision | Recall | F1 Score | Yes Ratio (%) |
---|---|---|---|---|---|---|
LLaVA-1.5 | ✖️ | 79.76 | 74.43 | 90.66 | 81.75 | 60.90 |
LLaVA-1.5 | ✔️ | 81.46 | 77.99 | 87.66 | 82.54 | 56.20 |
⚠️ NOTICE:
- The optimal parameters can differ depending on your machine environment; you can adjust them according to the behavior of the LVLM.
- For baseline LLaVA-1.5 results, do not set `--model-base`, and set `--model-path` to `liuhaotian/llava-v1.5-7b` in the evaluation command.