Create dataset loader for Indo_MultiModal_LAION #308

SamuelCahyawijaya · 2022-10-02T16:07:15Z

Dataset	id_mm_laion
Description	Indo_MultiModal_LAION is a translated subset of the LAION-400M dataset with 70M image-text pairs, intended for vision-language pre-training in Indonesian. LAION-400M is a dataset of 400M English (image, text) pairs, filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The threshold of 0.3 was determined through human evaluation and appeared to be a good heuristic for estimating whether the text matches the image content. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021. More info on LAION-400M: https://laion.ai/blog/laion-400-open-dataset/.
License	From LAION-400M: We distribute the metadata dataset (the parquet files) under the most open Creative Commons CC-BY 4.0 license, which poses no particular restriction. The images are under their own copyright.
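
For readers unfamiliar with the filtering step mentioned in the description, here is a minimal sketch of a cosine-similarity filter with a 0.3 cutoff. It is not LAION's actual pipeline: the function names `cosine_similarity` and `filter_pairs` are hypothetical, and the 3-dimensional vectors are hand-made toy data standing in for real CLIP image and text embeddings.

```python
import math
from typing import List, Sequence, Tuple

# Cutoff quoted in the LAION-400M description above.
SIMILARITY_THRESHOLD = 0.3

# (image_embedding, text_embedding, caption); hypothetical record layout.
Pair = Tuple[Sequence[float], Sequence[float], str]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def filter_pairs(pairs: List[Pair],
                 threshold: float = SIMILARITY_THRESHOLD) -> List[Pair]:
    """Drop pairs whose image/text similarity is below the threshold."""
    return [p for p in pairs if cosine_similarity(p[0], p[1]) >= threshold]


# Toy data only: these are not real CLIP embeddings.
toy_pairs: List[Pair] = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], "kucing di atas sofa"),  # identical vectors
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], "iklan tidak terkait"),  # orthogonal vectors
]
kept = filter_pairs(toy_pairs)
print([caption for _, _, caption in kept])
```

In this toy run only the first pair survives, since orthogonal vectors have a similarity of 0, which is below the cutoff.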

acul3 · 2022-10-04T07:06:31Z

#self-assign

SamuelCahyawijaya added this to Nusantara Dataset Initiative Oct 2, 2022

muhsatrio added the hacktoberfest label Oct 3, 2022

github-actions bot assigned acul3 Oct 4, 2022
