Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rows returned from similarity_search with filter less than expected #93

Open
CrazyShipOne opened this issue Jul 25, 2024 · 1 comment
Open

Comments

@CrazyShipOne
Copy link

Hi, thanks in advance for this amazing project!

The problem I'm facing is rows returned from function similarity_search always less than expected.
For example, I have 100 rows match filter's condition in table langchain_pg_embedding, by executing

vector_store = PGVector(
            connection=db_str,
            embeddings = embedding_model,
            distance_strategy=DistanceStrategy.EUCLIDEAN
        )
params = {'query':'query_context', 'k':100, 'filter':{'user_name':'user_1'}}
docs = self._vector_store.similarity_search(**params)

I expected length of docs should be 100. However, per different query_context, length of docs are various but less than 100 all the time, like 43 or 67.

So I enabled log for sqlchemy.engine, and got SQL it generated here:

2024-07-25 20:21:47,269 - sqlalchemy.engine.Engine - INFO - SELECT langchain_pg_embedding.id AS langchain_pg_embedding_id, langchain_pg_embedding.collection_id AS langchain_pg_embedding_collection_id, langchain_pg_embedding.embedding AS langchain_pg_embedding_embedding, langchain_pg_embedding.document AS langchain_pg_embedding_document, langchain_pg_embedding.cmetadata AS langchain_pg_embedding_cmetadata, langchain_pg_embedding.embedding <-> %(embedding_1)s AS distance
FROM langchain_pg_embedding JOIN langchain_pg_collection ON langchain_pg_embedding.collection_id = langchain_pg_collection.uuid
WHERE langchain_pg_embedding.collection_id = %(collection_id_1)s::UUID AND jsonb_path_match(langchain_pg_embedding.cmetadata, CAST(%(param_1)s AS JSONPATH), CAST(%(param_2)s::JSONB AS JSONB)) ORDER BY distance ASC
 LIMIT %(param_3)s::INTEGER
2024-07-25 20:21:47,269 - sqlalchemy.engine.Engine - INFO - [generated in 0.00126s] {'embedding_1': '[0.002203210722655058,-0.0034537874162197113,-0.04540061205625534,0.016271447762846947,0.03849130868911743,0.035241544246673584,0.013832494616508484, ... (21402 characters truncated) ... .03421637415885925,-0.03275654464960098,-0.0023437547497451305,0.0021014492958784103,-0.01060000341385603,0.008099209517240524,0.0024598073214292526]', 'collection_id_1': UUID('24b25f49-4e90-4073-9b4b-f5fd91b83c61'), 'param_1': '$.user_name == $value', 'param_2': Jsonb({'value': 'user_1'}), 'param_3': 100}

After replace parameters into sql and execute it in pgAdmin, 100 rows are returned.

Then I dig into code, tried to comment filter in sqlalchemy query
vectorstores.py

results: List[Any] = (
                session.query(
                    self.EmbeddingStore,
                    self.distance_strategy(embedding).label("distance"),
                )
                #This line is commented .filter(*filter_by) 
                .order_by(sqlalchemy.asc("distance"))
                .join(
                    self.CollectionStore,
                    self.EmbeddingStore.collection_id == self.CollectionStore.uuid,
                )
                .limit(k)
                .all()
            )

All 100 rows are returned.

I'm not familiar with sqlalchemy so no clues at all. Could you please take a look on what's happening here? Am I using the search function in wrong way?

Thanks!

@eyurtsev
Copy link
Collaborator

Could you share which query you executed in SQL?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants