Here we provide an example of applying knowledge distillation to Chinese-CLIP fine-tuning, based on the ModelScope model library. With knowledge distillation, smaller Chinese-CLIP models (with faster inference) can learn from larger models (including larger Chinese-CLIP models or other image embedding models on ModelScope) to further improve image-to-image retrieval. The Teacher models used are all from ModelScope. Currently, all Chinese-CLIP models are supported on ModelScope.
- Nvidia GPUs with Turing, Ampere, Ada, or Hopper architecture (such as H100, A100, RTX 3090, T4, and RTX 2080). Please refer to this document for the corresponding GPUs of each Nvidia architecture.
- CUDA 11.4 and above.
- PyTorch 1.12 and above.
- ModelScope: install ModelScope by executing `pip install modelscope`.
- Other dependencies as required in requirements.txt.
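For reference, a quick check along these lines can confirm the environment meets the requirements above; the exact commands may differ depending on how you manage your environment.

```bash
# Print the installed PyTorch version, its CUDA build, and GPU availability
# (should report PyTorch >= 1.12 and CUDA >= 11.4).
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

# Install ModelScope and the remaining dependencies from requirements.txt.
pip install modelscope
pip install -r requirements.txt
```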
It is not complicated to apply knowledge distillation to the image side of Chinese-CLIP finetuning. Simply add the `--distillation` configuration item to the finetune sh script, then fill in the name of the Teacher model to be used in the configuration item `--teacher-model-name`. The currently supported Teacher models include the following four models from ModelScope:
| Teacher model | Model Info |
| --- | --- |
| damo/multi-modal_clip-vit-huge-patch14_zh | CLIP model - Chinese - general domain - huge |
| damo/multi-modal_clip-vit-large-patch14_zh | CLIP model - Chinese - general domain - large |
| damo/multi-modal_team-vit-large-patch14_multi-modal-similarity | TEAM image-text retrieval model - Chinese - large |
| damo/multi-modal_rleg-vit-large-patch14 | RLEG generative multimodal representation model - English - large |
Finally, fill in the weight of the distillation loss in the configuration item `--kd_loss_weight`; the default value is 0.5.
The configuration items are defined as follows:
- `distillation`: whether to enable knowledge distillation to fine-tune the image side of the model.
- `teacher-model-name`: the Teacher model to use. Currently the four Teacher models above are supported; for example, fill in `damo/multi-modal_team-vit-large-patch14_multi-modal-similarity`.
- `kd_loss_weight` (optional): distillation loss weight; the default value is 0.5.
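As a minimal sketch, the three options above might be appended to the training command inside your finetune sh script as shown below. The launcher line and the `${EXISTING_FINETUNE_OPTIONS}` placeholder are illustrative assumptions standing in for your current finetune configuration; only the last three options are the distillation settings described in this document.

```bash
# Illustrative excerpt of a finetune sh script with distillation enabled.
# Everything except the last three options is assumed from your existing setup.
torchrun --nproc_per_node=${GPUS_PER_NODE} cn_clip/training/main.py \
    ${EXISTING_FINETUNE_OPTIONS} \
    --distillation \
    --teacher-model-name damo/multi-modal_team-vit-large-patch14_multi-modal-similarity \
    --kd_loss_weight 0.5
```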
We provide a sample script `run_scripts/muge_finetune_vit-b-16_rbt-base_distillation.sh`, which uses the TEAM image-text retrieval model - Chinese - large as the Teacher model.
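Assuming the sample script follows the same invocation convention as the other finetune scripts under run_scripts/ (taking the data path as its first argument; this call is illustrative rather than quoted from the script), it can be launched like this:

```bash
# ${DATAPATH} points to the directory containing the preprocessed finetune data.
bash run_scripts/muge_finetune_vit-b-16_rbt-base_distillation.sh ${DATAPATH}
```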
Image retrieval Top10 results of our model (finetune + distillation) vs. the pre-trained model vs. the finetune-only model. The image in the upper left corner is used as the query, and the retrieval results are ordered from Top1 to Top10 on the right. The candidate set in this experiment contains 100,000 e-commerce images (including shoes, clothes, pants, etc.).
Advantages of our approach:
- Meets the basic requirements of the retrieval task: image similarity is achieved while category similarity is preserved.
- Good performance with fast speed: through distillation, the base model achieves retrieval quality similar to that of the large model. When deployed on CPU, retrieval inference time is kept within 100 ms.
A distillation solution has been launched on Alibaba Cloud PAI-DSW Gallery. The corresponding Jupyter Notebook is provided in PAI-DSW Gallery to help users build their own customized search models using their own data.