[FEATURE] Warning about Mismatch Between similarity function of Embedding Model and Index `space_type` #2356

YeonghyeonKO · 2024-12-26T09:46:49Z

Is your feature request related to a problem?

A problem can arise when embedding vectors (e.g. from msmarco-distilbert-base-tas-b; say its similarity function is cosine similarity) are indexed into a knn_vector field that is mapped with a different space_type (e.g. L2).
The similarity function the embedding model was trained with and the distance used to build and search the HNSW graph can then differ, leading to inaccurate search scores.
Because OpenSearch stores a separate HNSW graph for each segment, built by Faiss/NMSLIB/Lucene using the configured space_type, search results from the graph can vary depending on the space_type.
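
To make the concern concrete, here is a minimal sketch in plain Python (no OpenSearch involved) showing that the same query can rank the same two vectors differently under L2 distance and cosine similarity. The vectors and document names are made up for illustration. Note that the difference shown here depends on the vectors not being unit-normalized; for unit-length vectors the two orderings agree.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Illustrative (made-up) vectors, not real model output.
query = [1.0, 0.0]
docs = {
    "doc_a": [10.0, 0.0],  # same direction as the query, but a large norm
    "doc_b": [0.9, 0.9],   # different direction, but close in Euclidean space
}

# Nearest first under each metric.
rank_by_l2 = sorted(docs, key=lambda d: l2_distance(query, docs[d]))
rank_by_cosine = sorted(
    docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True
)

print("L2 order:    ", rank_by_l2)
print("cosine order:", rank_by_cosine)
```

In this toy case L2 prefers doc_b while cosine prefers doc_a, which is the kind of divergence described above.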

What solution would you like?

Are there any benefits to using a space_type that differs from the embedding model's similarity function?
I suggest displaying a warning message in the above scenario to alert users to potential inaccuracies.
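
As a rough sketch of what such a check could look like, the snippet below compares a model's declared similarity function with the index's space_type and emits a warning on a mismatch. Everything here is hypothetical: `check_space_type`, the `COMPATIBLE_SPACE_TYPES` table, and the similarity labels (`"cosine"`, `"dot_product"`, `"euclidean"`) are names invented for illustration and are not part of any existing OpenSearch API. Only the space_type strings `l2`, `cosinesimil`, and `innerproduct` are taken from the k-NN plugin.

```python
import warnings

# Hypothetical mapping: model similarity function -> matching space_type values.
COMPATIBLE_SPACE_TYPES = {
    "cosine": {"cosinesimil"},
    "dot_product": {"innerproduct"},
    "euclidean": {"l2"},
}


def check_space_type(model_similarity, space_type):
    """Return True if the pair looks consistent; warn and return False otherwise.

    An unknown model_similarity is treated as "cannot tell" and passes silently.
    """
    compatible = COMPATIBLE_SPACE_TYPES.get(model_similarity)
    if compatible is None or space_type in compatible:
        return True
    warnings.warn(
        f"Model similarity '{model_similarity}' does not match index "
        f"space_type '{space_type}'; search scores may be inaccurate.",
        UserWarning,
        stacklevel=2,
    )
    return False


# Example: a cosine-trained model indexed into an l2 field triggers the warning.
check_space_type("cosine", "l2")
```

Where the model's similarity function would come from (model metadata, user input, or a registry) is left open in this sketch.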

YeonghyeonKO added enhancement untriaged labels Dec 26, 2024
