Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion about the redis vector DB index algorithm, changed from HNSW to FLAT #840

Open
gavinlichn opened this issue Oct 31, 2024 · 5 comments
Assignees
Labels
DEV features

Comments

@gavinlichn
Copy link

Aware that dataprep/redis/langchain vector DB index algorithm is FLAT. But remembered we use HNSW before.

Investigating the code, it caused by the removing of index_schema, changed with PR #347
If index_schema removed, redis fall back to default index algorith(FLAT)

Considering that may impact the performance.

Can you help to clarify the background of this change please? @Spycsh

@gavinlichn gavinlichn changed the title Discussion of the redis vector DB index algorithm, changed from HNSW to FLAT Discussion about the redis vector DB index algorithm, changed from HNSW to FLAT Oct 31, 2024
@Spycsh
Copy link
Member

Spycsh commented Oct 31, 2024

Hi @gavinlichn , The PR is to remove the hard length limitation and make the vecdb initialization more simple. I do not think explicitly initializing redis with the schema 768 or 1024 is a concise way. With that PR, if users use BGE base, the redis can automatically accept embedding with length 768, otherwise if users use BGE large, the redis can automatically accept embedding with length 1024. Users do not need to know/change the schema length if they use another embedding model.

However as you said, the schema also contains a non-default index algorithm. Could you please give us some data or reason about how is HNSW faster/better than the default ones? Or can we pass a parameter to Redis to change the default indexing algorithm, which I think is more simple?

@gavinlichn
Copy link
Author

As the intent of PR is to remove the length limitation. prefer to keep the change clean, and not touch more logic.
How about remove the dimension only, and keep other parameters in the previous schema?

For the algorithm comparison, we can refer to Redis' document:

  • Choose the HNSW index type when you have larger datasets (> 1M documents) or when search performance and scalability are more important than perfect search accuracy
  • Choose the FLAT index when you have small datasets (< 1M vectors) or when perfect search accuracy is more important than search latency.

https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/#hnsw-index

@Spycsh
Copy link
Member

Spycsh commented Nov 1, 2024

I agree that let users pass a customized schema is reasonable. I think keep the default behaviors (FLAT, accuracy first) now and allow advanced users to set a schema (maybe set the None as default i.e. REDIS_SCHEMA = os.getenv("REDIS_SCHEMA", None) and users can set REDIS_SCHEMA to the path to their own redis yml). What do you think? Do you want to make a PR for this?

@yinghu5 yinghu5 added the DEV features label Nov 6, 2024
@feng-intel feng-intel assigned feng-intel and Spycsh and unassigned feng-intel Nov 6, 2024
@feng-intel
Copy link
Collaborator

@gavinlichn , for the above Spycsh's solution.

@gavinlichn
Copy link
Author

@gavinlichn , for the above Spycsh's solution.

To include schema by environment variable is reasonable, original designed is similar.
I agree with this solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DEV features
Projects
None yet
Development

No branches or pull requests

4 participants