The input images should be placed in the "./images" directory, and the results will be written to the "./outputs" directory.
Follow https://github.com/salesforce/LAVIS, https://github.com/facebookresearch/segment-anything, https://github.com/JialianW/GRiT, and https://github.com/PaddlePaddle/PaddleOCR.git to prepare the environment.
Download the GRiT checkpoint (Dense Captioning on VG dataset) and place it under ./grit/model_weight.
Download the SAM checkpoint and place it under ./model_weight.
Generation Steps:
- Generate a global description for each image.
python blip2.py
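The blip2.py script is not shown here; the sketch below illustrates, assuming it uses LAVIS's BLIP-2 captioning model, how a global caption could be produced for every file in "./images" (the model choice, output file name, and loop are illustrative, not the script's actual implementation):

```python
import glob
import json
import os

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP-2 captioning model from LAVIS (the model/type choice is an assumption).
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)

captions = {}
for path in sorted(glob.glob("./images/*")):
    raw = Image.open(path).convert("RGB")
    image = vis_processors["eval"](raw).unsqueeze(0).to(device)
    # Generate one global caption per image.
    captions[os.path.basename(path)] = model.generate({"image": image})[0]

os.makedirs("./outputs", exist_ok=True)
with open("./outputs/blip2.json", "w") as f:  # hypothetical output file name
    json.dump(captions, f, indent=2)
```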
- Use the GRiT model to generate dense captions for each image.
python grit_generate.py
- Generate segmentation masks for each image with the SAM model and save them in the "./masks" directory.
python amg.py --checkpoint ./model_weight/<pth name> --model-type <model_type> --input ./images --output ./masks --convert-to-rle
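Because amg.py is run with --convert-to-rle, the masks for each image are written as a JSON list of COCO-style RLE annotations; below is a minimal sketch of reading one of those files back (the file name is a placeholder, and the keys follow the segment-anything amg.py output format):

```python
import json

from pycocotools import mask as mask_utils

# "example.json" stands in for whichever file amg.py wrote for one image.
with open("./masks/example.json") as f:
    annotations = json.load(f)

for ann in annotations:
    binary_mask = mask_utils.decode(ann["segmentation"])  # H x W uint8 array
    x, y, w, h = ann["bbox"]                               # XYWH box around the mask
    print(ann["area"], binary_mask.shape, (x, y, w, h))
```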
- Generate a corresponding description for each segmented region.
python sam_blip.py
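sam_blip.py is assumed to crop each SAM region (via the bbox of its mask annotation) and caption the crop with BLIP-2; the sketch below shows that idea, reusing the model and mask annotations from the two sketches above (the size filter and output schema are arbitrary choices, not the script's):

```python
# Reuses `model`, `vis_processors`, `device` from the BLIP-2 sketch and
# `annotations` from the mask-loading sketch; the image path is a placeholder.
from PIL import Image

raw = Image.open("./images/example.jpg").convert("RGB")

region_captions = []
for ann in annotations:
    x, y, w, h = (int(v) for v in ann["bbox"])
    if w < 16 or h < 16:  # skip very small regions (threshold is arbitrary)
        continue
    crop = raw.crop((x, y, x + w, y + h))
    image = vis_processors["eval"](crop).unsqueeze(0).to(device)
    region_captions.append(
        {"bbox": [x, y, w, h], "caption": model.generate({"image": image})[0]}
    )
```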
- Compute image-text similarity scores for the region descriptions.
python image_text_matching.py --ann_path ./outputs/sam_blip2.json --output_path ./outputs/sam_blip2_score.json
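image_text_matching.py is assumed to score each image-caption pair with a LAVIS image-text matching model; here is a minimal sketch of scoring a single pair with BLIP-2's ITM head (the model name, model type, and example caption are assumptions):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, text_processors = load_model_and_preprocess(
    name="blip2_image_text_matching", model_type="pretrain", is_eval=True, device=device
)

raw = Image.open("./images/example.jpg").convert("RGB")
img = vis_processors["eval"](raw).unsqueeze(0).to(device)
txt = text_processors["eval"]("a caption produced by the previous step")

# The ITM head returns logits for (no match, match); softmax gives a match probability.
itm_logits = model({"image": img, "text_input": txt}, match_head="itm")
itm_score = torch.nn.functional.softmax(itm_logits, dim=1)[:, 1].item()
print(itm_score)
```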
- Compute image-text similarity scores for the GRiT dense captions.
python image_text_matching.py --ann_path ./outputs/grit.json --output_path ./outputs/grit_score.json
- Use PP-OCR to detect text in the images.
python ocr_ppocr.py
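ocr_ppocr.py is assumed to be a thin wrapper around the PaddleOCR API; the sketch below detects and recognizes text in one image (the language setting is an assumption, and the result layout follows recent PaddleOCR releases):

```python
from paddleocr import PaddleOCR

# Initialize PP-OCR with angle classification; "en" is an assumed language setting.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

result = ocr.ocr("./images/example.jpg", cls=True)
for line in result[0]:
    box, (text, confidence) = line  # quadrilateral box, recognized text, score
    print(text, confidence)
```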
- Integrate the generated annotations into ann_all.json.
python add_all_json.py
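add_all_json.py presumably merges the per-step outputs by image; the sketch below shows one way to do that, with the two score files taken from the commands above and the remaining file names and per-file structure (a dict keyed by image name) assumed:

```python
import json

# Two file names come from the matching commands above; the others are hypothetical,
# and every file is assumed to map image names to annotations.
sources = {
    "region_captions": "./outputs/sam_blip2_score.json",
    "dense_captions": "./outputs/grit_score.json",
    "global_caption": "./outputs/blip2.json",  # hypothetical, see the blip2.py step
    "ocr": "./outputs/ocr.json",               # hypothetical, see the ocr_ppocr.py step
}

merged = {}
for field, path in sources.items():
    with open(path) as f:
        per_image = json.load(f)
    for image_name, ann in per_image.items():
        merged.setdefault(image_name, {})[field] = ann

with open("./outputs/ann_all.json", "w") as f:
    json.dump(merged, f, indent=2)
```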
- Use the ChatGPT API to generate the final detailed description and save it in ./outputs/ann_all.json.
python chatgpt.py
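chatgpt.py is assumed to prompt the ChatGPT API with the merged annotations of each image and ask for one coherent, detailed paragraph; a minimal sketch using the current openai Python client (the model choice, prompt wording, and output field are assumptions):

```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

with open("./outputs/ann_all.json") as f:
    ann_all = json.load(f)

for image_name, ann in ann_all.items():
    prompt = (
        "Combine the following annotations of one image into a single detailed, "
        "fluent description:\n" + json.dumps(ann, indent=2)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    ann["detailed_description"] = response.choices[0].message.content  # assumed field name

with open("./outputs/ann_all.json", "w") as f:
    json.dump(ann_all, f, indent=2)
```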