-
Can you specify the exact CLIP model parameters used and how they are evaluated on each image? I am trying to search for the k closest images in your 400 million image dataset to my own dataset of images using OpenAI's ViT-B/32 CLIP image embedder, but I cannot get my self-computed image embeddings to match your precomputed embeddings here: http://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/img_emb/

For example, in part 0 the embedding at index 100 begins as:

```python
import numpy as np

x = np.load('img_emb_0.npy')
print(x[100, :])
# [-0.003084 -0.0285 -0.02127 0.01313 0.00976 -0.0008097
#  0.05478 0.04456 -0.005177 -0.05322 0.03958 -0.00241
#  0.09863 0.02107 -0.008995 0.02281 0.0628 0.02582 ...]
```

In the corresponding parquet file I find:

```python
import pandas as pd

df_m0 = pd.read_parquet('./metadata_0.parquet', engine='pyarrow')
print(df_m0.loc[100, 'url'])
# https://preppingsurvival.com/wp-content/uploads/2021/04/ed-food-canned-apricots-4-years-expired-how-long-does-canned-food-last-bdb2xLfDwT0sddefault-300x300.jpg
```

If I download this image and run the following:

```python
import clip
import numpy as np
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device='cpu')
image = preprocess(Image.open("./ed-food-canned-apricots-4-years-expired-how-long-does-canned-food-last-bdb2xLfDwT0sddefault-300x300.jpg")).unsqueeze(0)
with torch.no_grad():
    v = model.encode_image(image)
v = v.cpu().numpy()
v = v / np.linalg.norm(v)
```

then the first entries of v are:

```
[-0.01340919 -0.04178299 -0.01926079 0.02442356 -0.00797546 0.01472267
 0.05920784 0.01699294 -0.02090414 -0.03895969 0.05996456 0.01401705
 0.10676499 0.0010551 0.02233738 -0.00436341 0.02495931 0.01011568 ...]
```

These values do not match what is stored in the .npy file, so is there something I am doing wrong? On a side note, it would be helpful if you could give a few lines of sample code to load and query your precomputed faiss indices.
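For reference, here is a minimal way to quantify the mismatch, reusing `x` and `v` from the snippets above; it assumes both vectors can be cast to float32 and (re-)L2-normalized:

```python
import numpy as np

# Cosine similarity between the stored embedding and the self-computed one.
# Re-normalizing both sides keeps the check valid even if one of them
# is not already unit length.
stored = x[100, :].astype(np.float32)
stored /= np.linalg.norm(stored)
mine = v.reshape(-1).astype(np.float32)
mine /= np.linalg.norm(mine)
print(float(stored @ mine))  # 1.0 would indicate an exact match
```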
-
Hi, the difference you see is probably due to the resizing applied in img2dataset: we used the resizing mode called "border", described at https://github.com/rom1504/img2dataset#api, and you can see the code here: https://github.com/rom1504/img2dataset/blob/main/img2dataset/resizer.py#L98

Here is how to query the index locally: https://github.com/rom1504/clip-retrieval/blob/main/notebook/simple_filter.ipynb

You can also query our backend directly with https://colab.research.google.com/drive/1d234Gp_7xGI5pAQ0dE71LT4rklZS_OsK#scrollTo=Xp3EBMHsMf6n
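Roughly, the border mode amounts to something like the sketch below, applied before CLIP's own preprocess ever sees the image; the target size, padding color and interpolation here are assumptions, so treat the linked resizer.py as the source of truth:

```python
from PIL import Image, ImageOps

def border_resize(img, size=256):
    """Rough approximation of img2dataset's 'border' mode: shrink so the
    longer side fits `size`, then pad to a square with a border.
    Target size, padding color and interpolation are assumptions."""
    img = ImageOps.exif_transpose(img).convert("RGB")
    img.thumbnail((size, size), Image.BICUBIC)          # keep aspect ratio
    padded = Image.new("RGB", (size, size), (0, 0, 0))  # pad to square
    padded.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return padded

# Then embed the padded image exactly as before:
# image = preprocess(border_resize(Image.open("downloaded.jpg"))).unsqueeze(0)
```

And as a minimal sketch of querying a downloaded index locally (the index and metadata file names are placeholders, and it assumes the metadata rows are in the same order as the index ids; the simple_filter notebook above is the reference):

```python
import faiss
import numpy as np
import pandas as pd

# Paths are placeholders; point them at the files you downloaded.
index = faiss.read_index("image.index")
meta = pd.read_parquet("metadata_0.parquet")

# `v` is an L2-normalized CLIP embedding of shape (1, dim), as computed above.
query = np.ascontiguousarray(v, dtype=np.float32)
distances, ids = index.search(query, 5)  # 5 nearest neighbours

for dist, i in zip(distances[0], ids[0]):
    print(dist, meta.loc[int(i), 'url'])
```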
-
One more question I had pertains to the clip-retrieval inference script in webdataset mode. It seems that when a cache_path is set, the files created there by the webdataset object are never deleted once they have been processed. It would be nice if there were some kind of cleanup step that could tell when all images from a given .tar file have been processed and then delete the corresponding file from the cache folder. This only becomes an issue when evaluating a very large dataset all at once, e.g. 100+ million images, depending on the size of the hard drive. But maybe this is already handled by some other function arguments that I'm missing?
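To make it concrete, something along these lines is what I have in mind; the shard naming and the `processed_shards` set are just assumptions about how such a cleanup could be wired up, not existing clip-retrieval behaviour:

```python
import os

def cleanup_cache(cache_path, processed_shards):
    """Delete cached .tar shards that have been fully processed.

    `processed_shards` is a set of shard file names that whatever drives the
    inference loop would have to maintain; this is only an illustration of
    the desired behaviour, not an existing API.
    """
    for name in os.listdir(cache_path):
        if name.endswith(".tar") and name in processed_shards:
            os.remove(os.path.join(cache_path, name))
```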