The input images should be placed in the "./images" directory, and the results will be written to the "./outputs" directory.
Follow https://github.com/salesforce/LAVIS, https://github.com/facebookresearch/segment-anything, https://github.com/JialianW/GRiT, and https://github.com/PaddlePaddle/PaddleOCR.git to prepare the environment.
Download the GRiT checkpoint (Dense Captioning on VG dataset) and place it under ./grit/model_weight.
Download the SAM checkpoint and place it under ./model_weight.
Generation Steps:
- Generate a global description for each image.
python blip2.py
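The blip2.py script is not shown here; the sketch below illustrates, assuming it uses LAVIS's BLIP-2 captioning model, how a global caption could be produced for every file in "./images" (the model choice, output file name, and loop are illustrative, not the script's actual implementation):

```python
import glob
import json
import os

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP-2 captioning model from LAVIS (the model/type choice is an assumption).
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)

captions = {}
for path in sorted(glob.glob("./images/*")):
    raw = Image.open(path).convert("RGB")
    image = vis_processors["eval"](raw).unsqueeze(0).to(device)
    # Generate one global caption per image.
    captions[os.path.basename(path)] = model.generate({"image": image})[0]

os.makedirs("./outputs", exist_ok=True)
with open("./outputs/blip2.json", "w") as f:  # hypothetical output file name
    json.dump(captions, f, indent=2)
```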
- Use the GRiT model to generate dense captions for each image.
python grit_generate.py
- Generate segmentation masks for each image with the SAM model and save them in the "./masks" directory.
python amg.py --checkpoint ./model_weight/<pth name> --model-type <model_type> --input ./images --output ./masks --convert-to-rle
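Because amg.py is run with --convert-to-rle, the masks for each image are written as a JSON list of COCO-style RLE annotations; below is a minimal sketch of reading one of those files back (the file name is a placeholder, and the keys follow the segment-anything amg.py output format):

```python
import json

from pycocotools import mask as mask_utils

# "example.json" stands in for whichever file amg.py wrote for one image.
with open("./masks/example.json") as f:
    annotations = json.load(f)

for ann in annotations:
    binary_mask = mask_utils.decode(ann["segmentation"])  # H x W uint8 array
    x, y, w, h = ann["bbox"]                               # XYWH box around the mask
    print(ann["area"], binary_mask.shape, (x, y, w, h))
```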
- Generate a corresponding description for each segmented region.
python sam_blip.py
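sam_blip.py is assumed to crop each SAM region (via the bbox of its mask annotation) and caption the crop with BLIP-2; the sketch below shows that idea, reusing the model and mask annotations from the two sketches above (the size filter and output schema are arbitrary choices, not the script's):

```python
# Reuses `model`, `vis_processors`, `device` from the BLIP-2 sketch and
# `annotations` from the mask-loading sketch; the image path is a placeholder.
from PIL import Image

raw = Image.open("./images/example.jpg").convert("RGB")

region_captions = []
for ann in annotations:
    x, y, w, h = (int(v) for v in ann["bbox"])
    if w < 16 or h < 16:  # skip very small regions (threshold is arbitrary)
        continue
    crop = raw.crop((x, y, x + w, y + h))
    image = vis_processors["eval"](crop).unsqueeze(0).to(device)
    region_captions.append(
        {"bbox": [x, y, w, h], "caption": model.generate({"image": image})[0]}
    )
```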
- Compute image-text similarity scores for the region descriptions.
python image_text_matching.py --ann_path ./outputs/sam_blip2.json --output_path ./outputs/sam_blip2_score.json
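image_text_matching.py is assumed to score each image-caption pair with a LAVIS image-text matching model; here is a minimal sketch of scoring a single pair with BLIP-2's ITM head (the model name, model type, and example caption are assumptions):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, text_processors = load_model_and_preprocess(
    name="blip2_image_text_matching", model_type="pretrain", is_eval=True, device=device
)

raw = Image.open("./images/example.jpg").convert("RGB")
img = vis_processors["eval"](raw).unsqueeze(0).to(device)
txt = text_processors["eval"]("a caption produced by the previous step")

# The ITM head returns logits for (no match, match); softmax gives a match probability.
itm_logits = model({"image": img, "text_input": txt}, match_head="itm")
itm_score = torch.nn.functional.softmax(itm_logits, dim=1)[:, 1].item()
print(itm_score)
```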
- Compute image-text similarity scores for the GRiT dense captions.
python image_text_matching.py --ann_path ./outputs/grit.json --output_path ./outputs/grit_score.json
- Use PP-OCR to detect text in the images.
python ocr_ppocr.py
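ocr_ppocr.py is assumed to be a thin wrapper around the PaddleOCR API; the sketch below detects and recognizes text in one image (the language setting is an assumption, and the result layout follows recent PaddleOCR releases):

```python
from paddleocr import PaddleOCR

# Initialize PP-OCR with angle classification; "en" is an assumed language setting.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

result = ocr.ocr("./images/example.jpg", cls=True)
for line in result[0]:
    box, (text, confidence) = line  # quadrilateral box, recognized text, score
    print(text, confidence)
```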
- Integrate the generated annotations into ann_all.json.
python add_all_json.py
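add_all_json.py presumably merges the per-step outputs by image; the sketch below shows one way to do that, with the two score files taken from the commands above and the remaining file names and per-file structure (a dict keyed by image name) assumed:

```python
import json

# Two file names come from the matching commands above; the others are hypothetical,
# and every file is assumed to map image names to annotations.
sources = {
    "region_captions": "./outputs/sam_blip2_score.json",
    "dense_captions": "./outputs/grit_score.json",
    "global_caption": "./outputs/blip2.json",  # hypothetical, see the blip2.py step
    "ocr": "./outputs/ocr.json",               # hypothetical, see the ocr_ppocr.py step
}

merged = {}
for field, path in sources.items():
    with open(path) as f:
        per_image = json.load(f)
    for image_name, ann in per_image.items():
        merged.setdefault(image_name, {})[field] = ann

with open("./outputs/ann_all.json", "w") as f:
    json.dump(merged, f, indent=2)
```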
- Use the ChatGPT API to generate the final detailed description and save it in ./outputs/ann_all.json.
python chatgpt.py
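chatgpt.py is assumed to prompt the ChatGPT API with the merged annotations of each image and ask for one coherent, detailed paragraph; a minimal sketch using the current openai Python client (the model choice, prompt wording, and output field are assumptions):

```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with open("./outputs/ann_all.json") as f:
    ann_all = json.load(f)

for image_name, ann in ann_all.items():
    prompt = (
        "Combine the following annotations of one image into a single detailed, "
        "fluent description:\n" + json.dumps(ann, indent=2)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    ann["detailed_description"] = response.choices[0].message.content  # assumed field name

with open("./outputs/ann_all.json", "w") as f:
    json.dump(ann_all, f, indent=2)
```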