[Project Page] [Paper] [Model🤗] [Dataset🤗]
- 2024.11.15 The code has been released.🔥🔥🔥
- 2024.11.15 The large-scale RS image-text dataset, VersaD, featuring rich and diverse captions, has just been released, now with a massive 1.4 million image captions. 🔥🔥🔥
- 2024.11.3 A new version of the VHM paper has been uploaded to arXiv.
- 2024.3.28 The H2RSVLM paper is available on arXiv.
- Release inference code and checkpoints of VHM.
- Release all data in this work.
- Release the training code of VHM.
Use the following commands for installation.
git clone git@github.com:opendatalab/VHM.git
cd VHM
conda create -n vhm
conda activate vhm
pip install -r requirment.txt
Follow the instructions in Data.md to manage the datasets. If you need to train our model from scratch, complete the data download and preparation first.
VHM consists of a visual encoder, a projector layer, and a large language model (LLM). The visual encoder uses a pretrained CLIP-14-336px, the projector layer is composed of two MLP layers, and the LLM is based on the pretrained Vicuna-7B. The model is trained in two stages, as shown in the diagram below.
We provide the weights from both the pretraining stage and the SFT stage.
| Name | Description |
|---|---|
| VHM_sft | The LLM and MLP weights obtained from the SFT stage. |
| VHM_pretrain | The LLM and MLP weights obtained from the pretraining stage. |
| CLIP_pretrain | The CLIP weights obtained from the pretraining stage. |
VHM training consists of two stages: (1) Pretraining stage: use our VersaD dataset with 1.4M image-text pairs to fine-tune the vision encoder, projector, and LLM to align the textual and visual modalities; (2) Supervised fine-tuning (SFT) stage: fine-tune the projector and LLM to teach the model to follow multimodal instructions.
First, download the MLP projector pretrained by LLaVA-1.5, because a rough modality alignment is beneficial before aligning the modalities with high-quality detailed captions.
Run `sh scripts/rs/slurm_pretrain.sh` to pretrain the model. Remember to specify the projector path in the script. In this stage, we fine-tune the second half of the vision encoder's blocks, the projector, and the LLM.
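As a rough illustration of the partial fine-tuning described above, the sketch below marks the second half of a vision encoder's blocks as trainable and leaves the first half frozen. The helper names and the block count of 24 (the depth of a CLIP ViT-L/14 vision encoder) are assumptions for illustration; they are not taken from the VHM training code.

```python
from typing import List


def second_half_blocks(num_blocks: int) -> List[int]:
    """Return the indices of the blocks in the second half of an encoder.

    For an odd block count, the middle block is placed in the trainable
    half. That tie-break is an arbitrary choice made for this sketch.
    """
    if num_blocks < 0:
        raise ValueError("num_blocks must be non-negative")
    return list(range(num_blocks // 2, num_blocks))


def trainable_flags(num_blocks: int) -> List[bool]:
    """Return one flag per block: True if trainable, False if frozen."""
    trainable = set(second_half_blocks(num_blocks))
    return [index in trainable for index in range(num_blocks)]


# Assumption: 24 transformer blocks, as in a CLIP ViT-L/14 vision encoder.
NUM_BLOCKS = 24
flags = trainable_flags(NUM_BLOCKS)
print("frozen blocks:   ", [i for i, f in enumerate(flags) if not f])
print("trainable blocks:", [i for i, f in enumerate(flags) if f])
```

In a real training script, flags like these would typically be applied by toggling gradient tracking on each block's parameters.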
In our setup we used 16 A100 (80G) GPUs, and the whole pretraining process took about 10 hours. To train on fewer GPUs, increase the number of gradient accumulation steps so that the effective batch size stays the same.
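The trade-off between GPU count and gradient accumulation can be sketched as follows. The effective batch size is the product of the number of GPUs, the per-GPU batch size, and the accumulation steps, so halving the GPUs means doubling the accumulation steps if the per-GPU batch size is unchanged. The baseline of one accumulation step is a placeholder, not a value read from the VHM scripts.

```python
def accumulation_steps_for(base_gpus: int, base_accum_steps: int, new_gpus: int) -> int:
    """Accumulation steps that keep the effective batch size constant.

    Assumes the per-GPU batch size is the same in both setups, so only
    the product (GPUs * accumulation steps) has to be preserved.
    """
    if base_gpus <= 0 or base_accum_steps <= 0 or new_gpus <= 0:
        raise ValueError("all arguments must be positive")
    total = base_gpus * base_accum_steps
    if total % new_gpus != 0:
        raise ValueError(
            f"{new_gpus} GPUs cannot reproduce the batch size exactly; "
            f"pick a GPU count that divides {total}"
        )
    return total // new_gpus


# Placeholder baseline: 16 GPUs with 1 accumulation step.
for gpus in (16, 8, 4, 2):
    steps = accumulation_steps_for(base_gpus=16, base_accum_steps=1, new_gpus=gpus)
    print(f"{gpus:>2} GPUs -> gradient accumulation steps = {steps}")
```

Training on fewer GPUs this way keeps the optimization setup comparable, but wall-clock time grows roughly in proportion to the accumulation steps.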
In `scripts/rs/slurm_pretrain.sh`, revise the following paths:
DATA_DIR=pretrain_base # directory of VersaD dataset
export LIST_FILE=${DATA_DIR}/list_pretrain.json # json file of VersaD data
export CKPT_PATH=weight_path # llava-1.5 MLP weight path
export SAVE_PATH=vhm-7b_pretrained # file save path
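Before submitting the job, it can help to confirm that these paths actually exist. The following is a minimal sketch of such a check. The helper `check_pretrain_paths` is hypothetical and not part of the repository, and it only assumes that the list file is valid JSON, without assuming anything about its schema.

```python
import json
from pathlib import Path
from typing import List


def check_pretrain_paths(data_dir: str, list_name: str, ckpt_path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    data = Path(data_dir)
    if not data.is_dir():
        problems.append(f"DATA_DIR is not a directory: {data}")

    # LIST_FILE is derived from DATA_DIR, mirroring ${DATA_DIR}/<list name>.
    list_file = data / list_name
    if not list_file.is_file():
        problems.append(f"LIST_FILE not found: {list_file}")
    else:
        try:
            with list_file.open(encoding="utf-8") as handle:
                json.load(handle)
        except json.JSONDecodeError as exc:
            problems.append(f"LIST_FILE is not valid JSON: {exc}")

    if not Path(ckpt_path).exists():
        problems.append(f"CKPT_PATH does not exist: {ckpt_path}")

    return problems


if __name__ == "__main__":
    # Placeholder values copied from the script variables; replace with real paths.
    for problem in check_pretrain_paths("pretrain_base", "list_pretrain.json", "weight_path"):
        print(problem)
```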
In this stage, we fine-tune the projector and LLM with our VHM_SFT dataset.
In our setup we used 8 A100 (80G) GPUs, and the whole SFT process took about 4 hours. To train on fewer GPUs, increase the number of gradient accumulation steps so that the effective batch size stays the same.
Run `sh scripts/rs/slurm_finetune.sh` to fine-tune the model. In the script, revise the following paths:
DATA_DIR=sft_base # directory of vhm-sft dataset
export LIST_FILE=${DATA_DIR}/list_sft.json # json file of sft data
CKPT=vhm-7b_pretrained # pretrain weight path
export SAVE_PATH=vhm-7b_sft # file save path
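The SFT stage starts from the weights saved by the pretraining stage, so the checkpoint path of the second script has to match the save path of the first. A minimal sketch of that handoff check is shown below. The `StageConfig` class and the `handoff_is_consistent` helper are hypothetical, and the values are placeholders modeled on the script variables rather than real paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Paths for one training stage (hypothetical container for this sketch)."""
    name: str
    data_dir: str
    list_file: str
    init_ckpt: str   # weights the stage starts from
    save_path: str   # where the stage writes its output


def handoff_is_consistent(previous: StageConfig, following: StageConfig) -> bool:
    """True if the following stage loads exactly what the previous stage saved."""
    return previous.save_path == following.init_ckpt


# Placeholder values modeled on the variables in the two scripts.
pretrain = StageConfig(
    name="pretrain",
    data_dir="pretrain_base",
    list_file="pretrain_base/list_pretrain.json",
    init_ckpt="weight_path",
    save_path="vhm-7b_pretrained",
)
sft = StageConfig(
    name="sft",
    data_dir="sft_base",
    list_file="sft_base/list_sft.json",
    init_ckpt="vhm-7b_pretrained",
    save_path="vhm-7b_sft",
)

if not handoff_is_consistent(pretrain, sft):
    raise SystemExit(
        f"SFT would load {sft.init_ckpt!r}, but pretraining saves to {pretrain.save_path!r}"
    )
print("stage handoff looks consistent")
```

A check like this catches a misspelled directory name before the SFT job is queued.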
To facilitate the use of remote sensing vision-language models, we have developed RSEvalKit, a specialized evaluation project for large remote sensing models. Use the following commands for installation.
git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval
conda activate rseval
pip install -r requirements.txt
All evaluation tasks in this paper are implemented in RSEvalKit and can be run with one click. First, download our model weights and the VHM_Eval data, then follow the instructions to complete the evaluation.
@misc{pang2024vhmversatilehonestvision,
title={VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis},
author={Chao Pang and Xingxing Weng and Jiang Wu and Jiayu Li and Yi Liu and Jiaxing Sun and Weijia Li and Shuai Wang and Litong Feng and Gui-Song Xia and Conghui He},
year={2024},
eprint={2403.20213},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2403.20213},
}
We gratefully acknowledge these wonderful works:
Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and Gemini. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.