[Feature] Support LLaVA (#196)
* v1

* add load_image

* update cfg image url

* del fig

* update

* temp

* update convert

* update chat_mm

* add exclude_frozen_parameters for deepspeed

* update chat

* update xtuner help msg

* fix bugs

* revert bf16 deepspeed

* fix bugs

* add visual_select_layer for chat

* improve pth_to_hf

* rename projecter_pth to pretrained_pth

* temp

* update requirements

* add cfgs

* update

* fix pre-commit

* optim chat

* optim chat

* Delete xtuner/model/unused.py

* move dispatch to a deeper folder

* add projector

* update

* del model/projector

* fix bugs

* add docs

* update

* update

* update

* update

* enhance resume for map_fn

* update import

* add llava_internlm_chat_7b_clip_vit_large_p14

* update dispatch

* update dispatch

* add link

* update max_length

* update max_length

* update hyp

* align

* move yi flash attn

* fix pre-commit

* update deepspeed requirements

* add mmbench script

* install openpyxl

* add entry_point for mmbench

* save args

* update mmbench

* update max_length

* add llama2 qlora

* update mmbench

* fix mmbench bugs

* use osp instead of os.path

* refactor pth_to_hf

* update chat and mmbench to support --llava

* align to chat

* update entry_point

* add vicuna template

* add vicuna_7b_v15

* fix pre-commit

* add vicuna_7b_v1.5 qlora

* skip_special_tokens for decode text

* remove do_sample

* add warmup

* fix pre-commit

* Update dataset_prepare.md

* Update dataset_prepare.md

* Add KEEP_STSTEM for template

* remove

* fix vicuna template

* clean cfgs

* add cfgs

* fix pre-commit

* add --language for mmbench

* fix bugs

* fix pretrain bug

* support visual_encoder lora

* fix bugs

* add paramwise_cfg

* remove print_peft_model_trainable_parameters

* fix bugs

* add paramwise_cfg for DeepSpeedOptimWrapper

* fix engine deepspeed paramwise_cfg bug

* fix encode_fn bug

* fix

* fix pad_image_to_square bugs

* Add space for system to avoid mismatch of 'USER' token

* revert to adding bos_token at each conv

* revert for paramwise_cfg

* better cfgs?

* fix import bug

* fix import bug

* pretrain align

* update prepare_inputs_labels_for_multimodal

* 1792

* support length_grouped_samplers

* 1792

* remove KEEP_SYSTEM

* remove system in cfg

* update 336 cfg

* add torch_dtype for mmbench and chat

* group 50

* quant for pretrain

* update cfgs

* refactor cfgs

* add length for concat dataset

* update requirements

* fix typo

* add template for internlm pretrain

* no zh

* remove 20b cfgs

* fix pre-commit

* revert invalid input

* rename

* Update README.md

* Update README_zh-CN.md

* fix pre-commit

* remove llava_zh from docs

* qlora 512

* rename llava map_fn

* update cfgs

* update model urls

* add docs link

* add llava docs

* update docs

* update urls

* add citation

* fix README

* move

* update

* vicuna pretrain with prompt

* rename

* add results

* fix pre-commit

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update README.md

* Update README_zh-CN.md

* Update README_zh.md

* Update README_zh.md

* Update README.md

* Update README_zh.md

* Update README.md

* Update README.md

* fix typo

* fix

* Update README.md

* Update README_zh-CN.md

* rename

* auto cn_string

* fix pre-commit

* rename

* remove language

* add VLMEvalKit

* rename VLLM to VLM

* add the download links of MMBench

* update

* update readme

* update

* update

* update merge

* fix cfg bug

* Update README.md

* Update README_zh.md

* update

* fix

* update requirements

* Update runtime.txt

* Update runtime.txt

* Update runtime.txt

* Update README.md

* Update README.md

* Update README_zh.md

* fix pre-commit

* fix

* update mmbench prompt

* fix bugs

* fix bugs

* update docs

* update

* update

* Update README.md
LZHgrla authored Dec 26, 2023
1 parent e7348af commit 6b962e6
Showing 57 changed files with 4,014 additions and 272 deletions.
15 changes: 14 additions & 1 deletion README.md
@@ -23,7 +23,8 @@ English | [简体中文](README_zh-CN.md)

## 🎉 News

- **\[2023/12\]** 🔥 Support [Mixtral 8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model! To get started, please check out the [docs](xtuner/configs/mixtral/README.md)!
- **\[2023/12\]** 🔥 Support multi-modal VLM pretraining and fine-tuning with [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA) architecture! Click [here](xtuner/configs/llava/README.md) for details!
- **\[2023/12\]** 🔥 Support [Mixtral 8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model! Click [here](xtuner/configs/mixtral/README.md) for details!
- **\[2023/11\]** Support [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b) model!
- **\[2023/10\]** Support [MSAgent-Bench](https://modelscope.cn/datasets/damo/MSAgent-Bench) dataset, and the fine-tuned LLMs can be applied by [Lagent](https://github.com/InternLM/lagent)!
- **\[2023/10\]** Optimize the data processing to accommodate `system` context. More information can be found on [Docs](docs/en/user_guides/dataset_format.md)!
@@ -267,6 +268,18 @@ We appreciate all contributions to XTuner. Please refer to [CONTRIBUTING.md](.gi
- [Llama 2](https://github.com/facebookresearch/llama)
- [QLoRA](https://github.com/artidoro/qlora)
- [LMDeploy](https://github.com/InternLM/lmdeploy)
- [LLaVA](https://github.com/haotian-liu/LLaVA)

## 🖊️ Citation

```bibtex
@misc{2023xtuner,
title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
author={XTuner Contributors},
howpublished = {\url{https://github.com/InternLM/xtuner}},
year={2023}
}
```

## License

13 changes: 13 additions & 0 deletions README_zh-CN.md
@@ -23,6 +23,7 @@

## 🎉 更新

- **\[2023/12\]** 🔥 支持多模态模型 VLM([LLaVA-v1.5](https://github.com/haotian-liu/LLaVA))预训练和指令微调!快速开始请查阅此[文档](xtuner/configs/llava/README_zh.md)
- **\[2023/12\]** 🔥 支持 [Mixtral 8x7b](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) 模型!快速开始请查阅此[文档](xtuner/configs/mixtral/README.md)
- **\[2023/11\]** 支持 [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b) 模型!
- **\[2023/10\]** 支持 [MSAgent-Bench](https://modelscope.cn/datasets/damo/MSAgent-Bench) 数据集,并且微调所得大语言模型可应用至 [Lagent](https://github.com/InternLM/lagent) 框架!
@@ -267,6 +268,18 @@ xtuner chat meta-llama/Llama-2-7b-hf --adapter xtuner/Llama-2-7b-qlora-moss-003-
- [Llama 2](https://github.com/facebookresearch/llama)
- [QLoRA](https://github.com/artidoro/qlora)
- [LMDeploy](https://github.com/InternLM/lmdeploy)
- [LLaVA](https://github.com/haotian-liu/LLaVA)

## 🖊️ 引用

```bibtex
@misc{2023xtuner,
title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
author={XTuner Contributors},
howpublished = {\url{https://github.com/InternLM/xtuner}},
year={2023}
}
```

## 开源许可证

76 changes: 76 additions & 0 deletions docs/en/user_guides/dataset_prepare.md
@@ -5,6 +5,7 @@
- [Arxiv Gentitle](#arxiv-gentitle)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)
- [LLaVA dataset](#llava-dataset)

## HuggingFace datasets

@@ -55,3 +56,78 @@ unzip moss-003-sft-with-tools-no-text2image.zip
The Chinese Lawyer dataset has two sub-datasets, which can be downloaded from https://github.com/LiuHC0428/LAW-GPT.

All lawyer configs assume the dataset paths to be `./data/CrimeKgAssitant清洗后_52k.json` and `./data/训练数据_带法律依据_92k.json`. You can move and rename your data, or modify the paths in these configs.

### LLaVA dataset

#### File structure

```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```

#### Pretrain

LLaVA-Pretrain

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```

#### Finetune

1. Text data

1. LLaVA-Instruct-150K

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
```

2. Image data

1. COCO (coco): [train2017](http://images.cocodataset.org/zips/train2017.zip)

2. GQA (gqa): [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)

3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)

1. ⚠️ Rename OCR-VQA's downloaded images so that every file ends with the `.jpg` extension!
```shell
#!/bin/bash
# Copy every non-.jpg image to a .jpg-suffixed file so all images can be found by their .jpg name
ocr_vqa_path="<your-directory-path>"
find "$ocr_vqa_path" -type f | while read -r file; do
    extension="${file##*.}"
    if [ "$extension" != "jpg" ]
    then
        cp -- "$file" "${file%.*}.jpg"
    fi
done
```
4. TextVQA (textvqa): [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
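
If it helps, below is a minimal, hedged shell sketch of how these archives could be downloaded and unpacked into the layout shown in the file structure above. It assumes `wget` and `unzip` are available, that each archive extracts into the sub-folder named in the comments (verify after extraction), and that the OCR-VQA images are fetched separately with the download script from step 3.

```bash
#!/bin/bash
# Sketch: download and unpack the image datasets into ./data/llava_images.
# OCR-VQA is not covered here -- use its dedicated download script (step 3 above).
set -e
mkdir -p ./data/llava_images/{coco,gqa,textvqa,vg}

wget -O coco_train2017.zip http://images.cocodataset.org/zips/train2017.zip
unzip -q coco_train2017.zip -d ./data/llava_images/coco               # expected: coco/train2017

wget -O gqa_images.zip https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip -q gqa_images.zip -d ./data/llava_images/gqa                    # expected: gqa/images

wget -O textvqa_train_val_images.zip https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip -q textvqa_train_val_images.zip -d ./data/llava_images/textvqa  # expected: textvqa/train_images

wget -O vg_part1.zip https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip
wget -O vg_part2.zip https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip
unzip -q vg_part1.zip -d ./data/llava_images/vg                       # expected: vg/VG_100K
unzip -q vg_part2.zip -d ./data/llava_images/vg                       # expected: vg/VG_100K_2
```
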
76 changes: 76 additions & 0 deletions docs/zh_cn/user_guides/dataset_prepare.md
@@ -5,6 +5,7 @@
- [Arxiv Gentitle 生成题目](#arxiv-gentitle-生成题目)
- [MOSS-003-SFT](#moss-003-sft)
- [Chinese Lawyer](#chinese-lawyer)
- [LLaVA dataset](#llava-dataset)

## HuggingFace 数据集

@@ -55,3 +56,78 @@ unzip moss-003-sft-with-tools-no-text2image.zip
Chinese Lawyer 数据集有两个子数据集,它们可以在 https://github.com/LiuHC0428/LAW-GPT 下载。

所有的 Chinese Lawyer 配置文件都假设数据集路径为 `./data/CrimeKgAssitant清洗后_52k.json` 和 `./data/训练数据_带法律依据_92k.json`。用户可以移动并重命名数据,或者在配置文件中重新设置数据路径。

### LLaVA dataset

#### 文件结构

```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```

#### 预训练 Pretrain

LLaVA-Pretrain

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```

#### 微调 Finetune

1. 文本数据

1. LLaVA-Instruct-150K

```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
```

2. 图片数据

1. COCO (coco): [train2017](http://images.cocodataset.org/zips/train2017.zip)

2. GQA (gqa): [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)

3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)

1. ⚠️ OCR-VQA 所下载的图片命名需要进行修改,以确保所有图片后缀为 `.jpg`

```shell
#!/bin/bash
ocr_vqa_path="<your-directory-path>"
find "$ocr_vqa_path" -type f | while read -r file; do
    extension="${file##*.}"
    if [ "$extension" != "jpg" ]
    then
        cp -- "$file" "${file%.*}.jpg"
    fi
done
```

4. TextVQA (textvqa): [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)

5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
1 change: 1 addition & 0 deletions requirements/runtime.txt
@@ -8,6 +8,7 @@ lagent>=0.1.2
# Minimum 0.10.1 to support exclude_frozen_parameters for DeepSpeedStrategy,
# see https://github.com/open-mmlab/mmengine/pull/1415, https://github.com/open-mmlab/mmengine/pull/1424
mmengine>=0.10.1
openpyxl
# Minimum 0.4.0 to support QLoRA, see https://github.com/huggingface/peft/pull/476
peft>=0.4.0
scipy
Expand Up @@ -100,7 +100,9 @@
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=DatasetInfoHook, tokenizer=tokenizer,
        is_intern_repo_dataset=True),
    dict(type=ThroughputHook)
]

92 changes: 92 additions & 0 deletions xtuner/configs/llava/README.md
@@ -0,0 +1,92 @@
# LLaVA Full Pipeline

## Data Preparation

Please refer to the [docs](../../../docs/en/user_guides/dataset_prepare.md#llava-dataset).

## Training

The training of LLaVA consists of two steps: alignment module (i.e., MLP) pretraining and instruction-following fine-tuning.

Note: this guide uses 8-GPU training of LLaVA-InternLM as an example. If GPU resources or memory are insufficient, you can appropriately reduce the batch size to lower memory consumption. By default, the pretrained projector is saved to, and re-loaded from, `./work_dirs/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/epoch_1.pth`.

1. Alignment module pretraining (saved by default in `./work_dirs/`)

```bash
NPROC_PER_NODE=8 xtuner train llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2
```

2. Instruction following fine-tuning (saved by default in `./work_dirs/`)

```bash
NPROC_PER_NODE=8 xtuner train llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
```

## Model Convert (and Merge)

After training, we will obtain a set of weights (*i.e.*, `epoch_1.pth`), which are not in the universal HuggingFace format. We first need to convert them.

```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune ./epoch_1.pth ./epoch_1_hf
```

At this point, we have obtained the relevant model (LLM or the corresponding LoRA).

Afterwards, if you want to merge LoRA into LLM or CLIP-ViT, please use the following command:

```bash
(LLM) xtuner convert merge $LLM $LLM_ADAPTER $SAVE_PATH
(CLIP) xtuner convert merge $CLIP $CLIP_ADAPTER $SAVE_PATH --is-clip
```
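
As a concrete (hedged) example, assuming the conversion above produced `./epoch_1_hf` and that it contains separate LLM and visual-encoder adapter folders — the sub-directory names below are hypothetical, so check the actual output first:

```bash
# NOTE: adapter sub-directory names are hypothetical; adjust to the actual contents of ./epoch_1_hf
xtuner convert merge internlm/internlm-chat-7b ./epoch_1_hf/llm_adapter ./merged_llm
xtuner convert merge openai/clip-vit-large-patch14-336 ./epoch_1_hf/visual_encoder_adapter \
    ./merged_visual_encoder --is-clip
```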

## Chat

You can download the released LLaVA-InternLM-7B model from 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b) or 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b), and perform image-text question answering with the following command!

```bash
xtuner chat internlm/internlm-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm-7b \
--prompt-template internlm_chat \
--image $IMAGE_PATH
```

Here, `--llava` is the converted weight from the above step (in our example, it is `./epoch_1_hf`).
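
For example, to chat with your own converted weights instead of the released model, point `--llava` at the local directory (a sketch assuming the conversion above produced `./epoch_1_hf`):

```bash
xtuner chat internlm/internlm-chat-7b \
  --visual-encoder openai/clip-vit-large-patch14-336 \
  --llava ./epoch_1_hf \
  --prompt-template internlm_chat \
  --image $IMAGE_PATH
```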

## Evaluation

XTuner's LLaVA models can be evaluated using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).

For convenience, XTuner also integrates the [MMBench](https://mmbench.opencompass.org.cn/home) evaluation.

Users can download the MMBench datasets with

```bash
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```

After that, the evaluations can be run with

```bash
xtuner mmbench internlm/internlm-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm-7b \
--prompt-template internlm_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```

Here, `$DATA_PATH` refers to one of the datasets downloaded as mentioned above, such as `MMBench_DEV_EN.tsv`.

After the evaluation is completed, the results of a dev set are printed directly; for a test set, you need to submit `mmbench_result.xlsx` to the official MMBench evaluation to obtain the final accuracy results!

| Model | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | MMVet | SEEDBench_IMG | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
| :------------------------- | :---------------: | :--------------: | :---------------: | :--------------: | :---------: | :--: | :---: | :-----------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B (XTuner) | 67.7 | 69.2 | 61.0 | 59.7 | 27.6 | 1702 | 66.4 | 32.3 | [Pretrain](./vicuna_7b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_7b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner) |
| LLaVA-v1.5-13B (XTuner) | 68.9 | 69.5 | 64.7 | 63.1 | 32.2 | 1771 | 68.1 | 35.5 | [Pretrain](./vicuna_13b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_13b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner) |
| LLaVA-InternLM-7B (XTuner) | 69.0 | 68.5 | 66.7 | 63.8 | 35.8 | 1671 | 65.8 | 33.8 | [Pretrain](./internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b) |