The InstructBLIP version we use in HA-DPO is InstructBLIP-13B (based on vicuna-13b-v1.1). Before training, prepare the following:
- Language Model. Download vicuna-13b-v1.1:

  ```python
  # download command
  from huggingface_hub import snapshot_download
  snapshot_download(repo_id="lmsys/vicuna-13b-v1.1")
  ```
  If you download the language model weights to a user-specified path using git lfs rather than the Hugging Face download API (such as `from_pretrained` or `snapshot_download`), note the following:

  - Specify the model path of `llm_model` in `ha_dpo/models/instructblip/vigc/projects/ha-dpo/instruct_vicuna13b.yaml`.
  - When passing the path of the language model weights in training or evaluation commands, use the path to your downloaded model instead of `lmsys/vicuna-13b-v1.1`.
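  If you prefer the Hugging Face API but still want the weights in a user-specified path, `snapshot_download` also accepts a `local_dir` argument. A minimal sketch (the destination directory below is only an example):

  ```python
  from huggingface_hub import snapshot_download

  # Download vicuna-13b-v1.1 into a user-chosen directory instead of the HF cache.
  # Afterwards, point llm_model in instruct_vicuna13b.yaml (and the training or
  # evaluation commands) at this directory.
  local_path = snapshot_download(
      repo_id="lmsys/vicuna-13b-v1.1",
      local_dir="/path/to/vicuna-13b-v1.1",  # example path, adjust to your setup
  )
  print(local_path)
  ```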
- Pre-trained InstructBLIP-13B weight. Download the pretrained InstructBLIP-13B checkpoint instruct_blip_vicuna13b_trimmed.pth and put it under `ha_dpo/models/instructblip`. (These model weights are used under the LICENSE.)
- Follow the instructions in data preparation to prepare data for InstructBLIP.
- Run `accelerate config`; by default we use:
  - gpus = 8
  - bf16 = True
We use LoRA adapters to fine-tune the language model of InstructBLIP. The trainable parameters are `["q_proj","k_proj","v_proj"]`, and 8 A100 GPUs are used during fine-tuning.
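For reference, this LoRA setup corresponds roughly to the following peft configuration (a sketch based on the default hyperparameters listed below, not the exact code in `train_dpo.py`; the dropout value is an assumption):

```python
from peft import LoraConfig, get_peft_model

# Sketch of the LoRA configuration implied by the defaults
# (r=64, lora_alpha=16, attention projections only).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,      # assumed value, not specified in this README
    task_type="CAUSAL_LM",
)

# language_model = ...  # the Vicuna-13B language model inside InstructBLIP
# language_model = get_peft_model(language_model, lora_config)
```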
Training command:
```shell
accelerate launch --main_process_port $RANDOM ha_dpo/models/instructblip/train_dpo.py \
    --lora_r 64 \
    --cfg_path ha_dpo/models/instructblip/vigc/projects/ha-dpo/instruct_vicuna13b.yaml \
    --pope_train_data_path ha_dpo/data/hadpo/instructblip/pope_data.json \
    --desc_train_data_path ha_dpo/data/hadpo/instructblip/desc_data.json \
    --vg_path ha_dpo/data/VG \
    --gradient_checkpointing False \
    --num_train_epoch 1 \
    --run_name "instructblip" \
    --gradient_accumulation_steps 4 \
    --learning_rate 4e-6 \
    --warmup_steps 0 \
    --per_device_train_batch_size 1 \
    --output_dir 'ha_dpo/models/instructblip/vigc/output/{model_name}' \
    --logging_steps 4
```
The default parameters are as follows:
epoch | learning rate | lora_r | lora_alpha | beta |
---|---|---|---|---|
1 | 4e-6 | 64 | 16 | 0.1 |
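For reference, `beta` is the temperature of the standard DPO objective that HA-DPO training builds on; it scales how strongly the policy $\pi_\theta$ is pushed toward the preferred (non-hallucinated) response $y_w$ and away from the rejected (hallucinated) response $y_l$ relative to the frozen reference model $\pi_{\mathrm{ref}}$:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$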
Merge LoRA adapters into language model:
```shell
python ha_dpo/models/instructblip/merge_peft_adapter.py \
    --adapter_model_name ha_dpo/models/instructblip/vigc/output/{model_name} \
    --base_model_name lmsys/vicuna-13b-v1.1 \
    --output_name {path_to_merged_llm}
```
- `--adapter_model_name`: path to the adapter weights saved during training.
- `--output_name`: path where the merged language model weights are saved.

To reproduce our results, use the trained adapter weights and set `--adapter_model_name juliozhao/hadpo-instructblip`.
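The merge step is roughly equivalent to the following peft-based sketch (paths mirror the placeholders in the command above; the repository's `merge_peft_adapter.py` may differ in details):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholders mirror the command above; replace them with real paths.
base_model_name = "lmsys/vicuna-13b-v1.1"
adapter_model_name = "ha_dpo/models/instructblip/vigc/output/{model_name}"
output_name = "{path_to_merged_llm}"

# Load the base language model and attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_model_name)

# Fold the LoRA weights into the base weights and save a plain HF checkpoint.
model = model.merge_and_unload()
model.save_pretrained(output_name)
AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_name)
```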
Run the following command to evaluate on SHR:
```shell
python ha_dpo/models/instructblip/shr_eval.py \
    --api-key {openai_apikey} \
    --cfg-path ha_dpo/models/instructblip/vigc/projects/ha-dpo/instruct_vicuna13b.yaml \
    --llm-model {path_to_merged_llm} \
    --vg-path ha_dpo/data/VG \
    --shr-path ha_dpo/data/shr
```
- `--api-key`: SHR evaluation relies on GPT-4. Provide your OpenAI API key, which begins with `sk-`.
- `--llm-model`: path to the merged language model weights.
After evaluation is finished, results are saved in `ha_dpo/models/instructblip/shr_eval_results/{localtime}/metrics.json`.

- `judgement.json`: detailed judgements in the SHR evaluation.
- `metrics.json`: detailed metrics in the SHR evaluation. `mean_hal_ratio` indicates the ratio of hallucinated sentences, which is the main SHR result.
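For example, the main SHR number can be read back from the metrics file with a couple of lines (a minimal sketch; `{localtime}` is the timestamp directory created by the evaluation run):

```python
import json

# Read the main SHR metric from the saved results.
with open("ha_dpo/models/instructblip/shr_eval_results/{localtime}/metrics.json") as f:
    metrics = json.load(f)

print("SHR (mean hallucination ratio):", metrics["mean_hal_ratio"])
```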
SHR results
Model | HA-DPO | SHR |
---|---|---|
InstructBLIP-13B | ✖️ | 51.2 |
InstructBLIP-13B | ✔️ | 49.1 |
Step 1. First, run inference to generate answers:
```shell
torchrun --nproc_per_node {NGPUs} ha_dpo/models/instructblip/pope_eval.py \
    --cfg-path ha_dpo/models/instructblip/vigc/projects/ha-dpo/instruct_vicuna13b.yaml \
    --llm-model {path_to_merged_llm} \
    --set {random/popular/adv} \
    --pope-path ha_dpo/data/POPE \
    --coco-path ha_dpo/data/coco2014
```
- `--set`: validation set in POPE; choose one of `random`/`popular`/`adv`. After inference, the answer file is generated under the InstructBLIP folder.
- `--llm-model`: path to the merged language model weights.
Step 2. Set the answer file and the label file in `ha_dpo/data/POPE/evaluate.py`: set `ans_file` to the path of the answer file generated in step 1, and set `label_file` to the path of the label files under `ha_dpo/data/POPE/output/coco`.
Step 3. Evaluate. Run `python ha_dpo/data/POPE/evaluate.py` to get the results.
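If you want to sanity-check the numbers in the tables below, the POPE metrics can be recomputed from the answer and label files along these lines (a sketch that assumes both files are JSON Lines with yes/no text in an `answer` / `label` field; the file names are examples, and `ha_dpo/data/POPE/evaluate.py` remains the reference implementation):

```python
import json

def load_yes_no(path, key):
    # Map each record's yes/no text to 1 (yes) or 0 (no).
    with open(path) as f:
        return [1 if "yes" in json.loads(line)[key].lower() else 0 for line in f]

# Example file names; use the answer file from step 1 and the matching label file.
preds = load_yes_no("instructblip_pope_random_answers.jsonl", "answer")
labels = load_yes_no("ha_dpo/data/POPE/output/coco/coco_pope_random.json", "label")

tp = sum(p and l for p, l in zip(preds, labels))
fp = sum(p and not l for p, l in zip(preds, labels))
fn = sum((not p) and l for p, l in zip(preds, labels))
tn = sum((not p) and (not l) for p, l in zip(preds, labels))

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
yes_ratio = sum(preds) / len(preds)
print(f"Acc {accuracy:.4f}  P {precision:.4f}  R {recall:.4f}  F1 {f1:.4f}  Yes {yes_ratio:.4f}")
```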
POPE results
POPE Random
Model | HA-DPO | Accuracy | Precision | Recall | F1 Score | Yes Ratio (%) |
---|---|---|---|---|---|---|
InstructBLIP-13B | ✖️ | 88.70 | 85.03 | 93.93 | 89.26 | 55.23 |
InstructBLIP-13B | ✔️ | 89.83 | 93.07 | 86.06 | 89.43 | 46.23 |
POPE Popular
Model | HA-DPO | Accuracy | Precision | Recall | F1 Score | Yes Ratio (%) |
---|---|---|---|---|---|---|
InstructBLIP-13B | ✖️ | 81.36 | 75.06 | 93.93 | 83.44 | 62.56 |
InstructBLIP-13B | ✔️ | 85.76 | 85.55 | 86.06 | 85.80 | 50.03 |
POPE Adversarial
Model | HA-DPO | Accuracy | Precision | Recall | F1 Score | Yes Ratio (%) |
---|---|---|---|---|---|---|
InstructBLIP-13B | ✖️ | 74.50 | 67.64 | 93.93 | 78.64 | 69.43 |
InstructBLIP-13B | ✔️ | 80.70 | 77.72 | 86.06 | 81.68 | 55.36 |
⚠️ NOTICE:
- The optimal parameters can differ depending on your machine environment; adjust them according to the behavior of the LVLM.
- For baseline model results, do not set `--llm-model` in the evaluation command.