Tracking issue: ANN (Approximate Nearest Neighbors) Index #25
Would usearch work? https://github.com/unum-cloud/usearch They even have some SQLite stuff already: https://github.com/unum-cloud/usearch/blob/main/sqlite/README.md
usearch is great! But they don't offer many "hooks" into their storage engine, which would be required for sqlite-vec. We'd want to store the index inside SQLite tables and balance query time with random-lookup time. Also, the usearch SQLite functions are just scalar functions; nothing accesses the HNSW index. And I want to keep sqlite-vec as lightweight as possible: there are no outside dependencies and it's a single file.
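To make the storage constraint concrete, here is a minimal sketch of the access pattern in question, using plain Python. The table name `vec_chunks` and the layout are hypothetical, not sqlite-vec's actual schema; the point is that vectors live inside ordinary SQLite tables as float32 BLOBs, and an index stored this way has to be fast at random lookups by rowid:

```python
import sqlite3
import numpy as np

# Hypothetical shadow-table-style layout: one row per vector,
# stored as a raw float32 BLOB that fits in SQLite pages.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE vec_chunks(id INTEGER PRIMARY KEY, vector BLOB)")

rng = np.random.default_rng(0)
vectors = rng.standard_normal((100, 8)).astype(np.float32)
db.executemany(
    "INSERT INTO vec_chunks(id, vector) VALUES (?, ?)",
    [(i, v.tobytes()) for i, v in enumerate(vectors)],
)

# Random lookup by id: the operation a graph-based ANN index
# stored inside SQLite would hammer during a query.
blob = db.execute("SELECT vector FROM vec_chunks WHERE id = ?", (42,)).fetchone()[0]
restored = np.frombuffer(blob, dtype=np.float32)
print(np.array_equal(restored, vectors[42]))  # True
```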
+1 for LM-DiskANN
yoo bros, lets do this index okay :))
@irowberryFS Interestingly, libSQL uses LM-DiskANN for the vector index: |
For those of you who want this, I'd love to hear your thoughts on these questions:
Total 20–50 billion (Max vectors * Dimensions). However, I also split the large index into hundreds of smaller indexes containing a total of up to 50 million (Max vectors * Dimensions) each and run searches on those when possible. I haven’t bothered trying the brute-force approach. Benefits for me using
I have tried it. The larger the vectors I start with, the better the results after quantization. However, I like to avoid quantization when possible, especially on smaller indexes. Unsurprisingly, there is a noticeable drop in quality.
Must be under 15 minutes—hard limit. Must be under 30 seconds with a small index. I am currently using HNSW. There are libraries that work for both the Node.js C++ API and WASM in the web. Painful to implement and integrate, but it works. I am already using SQLite WASM and HNSW in the browser separately. As nice as it would be to have a simpler implementation, HNSW works for me and is my preferred ANN index. Being able to extract the HNSW index out of SQLite into its own file (without extracting individual vectors and rebuilding the index) is also important to me. I am able to convert the HNSW file to and from the Node.js C++ API and WASM libraries, as they are, unfortunately, saved and formatted differently.
+1 for LM-DiskANN
**Short answer**

Brute force scales up to 8000 500-word pages or 800 10-page articles on commodity hardware.

**Longer answer**

Assuming retrieval should take no longer than 100 ms, the brute-force approach can handle about 250k embeddings on commodity hardware:

```python
import numpy as np

N = 250_000  # Number of embeddings
d = 1024     # The 'median' embedding dimension [1]
embeddings = np.random.randn(N, d).astype(np.float32)
query = np.random.randn(d).astype(np.float32)

%timeit embeddings @ query  # Core of the brute force approach
# Output on a Google Colab instance:
# 85 ms ± 9.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

And assuming you store one embedding per sentence (which would be standard in multi-vector retrieval), that's the equivalent of about 8000 500-word pages, or 800 10-page articles, or 40 200-page books.
I see binary embeddings as an optimisation, not a substitute for floating point embeddings. This optimisation could help to scale brute force with another factor of 2 without sacrificing retrieval quality [2], but definitely not an order of magnitude. For that, we'd need an ANN index.
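To illustrate that optimisation, here is a rough two-pass sketch (illustrative code, not sqlite-vec or any library's API): binarise each vector to its sign bits, shortlist candidates by cheap Hamming distance, then rerank only the shortlist with full float32 dot products. An exact duplicate of the query is planted at row 0 so the expected winner is known:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 256
embeddings = rng.standard_normal((N, d)).astype(np.float32)
query = embeddings[0].copy()  # plant an exact match so the answer is known

# Binarize: keep only the sign of each dimension, packed 8 bits per byte.
packed = np.packbits(embeddings > 0, axis=1)  # shape (N, d // 8)
packed_query = np.packbits(query > 0)         # shape (d // 8,)

# Coarse pass: Hamming distance = popcount of XOR; smaller = more similar.
hamming = np.unpackbits(packed ^ packed_query, axis=1).sum(axis=1)

# Fine pass: rerank only the 100 best binary candidates at full precision.
candidates = np.argpartition(hamming, 100)[:100]
best = candidates[np.argmax(embeddings[candidates] @ query)]
print(best)  # 0: the planted exact match survives the coarse pass and wins
```

The binary pass reads 32x less data per vector than float32, which is where the speedup comes from; the float rerank is what preserves retrieval quality.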
I'd be happy to sacrifice some insert speed in favour of sub-100ms, near-100% recall retrieval for 1–10M+ embedding datasets.

[1] https://huggingface.co/spaces/mteb/leaderboard
`sqlite-vec` as of `v0.1.0` will be brute-force search only, which slows down on large datasets (>1M vectors with large dimensions). I want to include some form of approximate nearest neighbors search before `v1`, which trades accuracy/resource usage for speed.

This issue is a general "tracking issue" for how ANN will be implemented in `sqlite-vec`. The open questions I have:

**Which ANN index should we use?**
We want something that fits well with SQLite - meaning storing data in shadow tables, data that fits in pages, low memory usage, etc.
The main options I see:
Unsure which one will turn out best; will need to research more. It's possible we add support for all of these options.
**How should one "declare" an index?**
SQLite doesn't have custom indexes, so I think the best way would be to include index info in the `CREATE VIRTUAL TABLE` constructor. Like:

```sql
create virtual table vec_movies(
  synopsis_embeddings float[768] INDEXED BY diskann(...)
);
```
or:

The exact syntax heavily depends on which ANN index we pick. Also, how would training work?
**How would they work with metadata filtering?**

**How do we allow brute-force + ANN on the same table?**

**How do we pick between KNN/ANN in a SQL query?**
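Whatever syntax is chosen, the KNN/ANN choice is a recall-for-speed trade, and the existing brute-force path doubles as ground truth. A minimal sketch (hypothetical helper names, not sqlite-vec's API) of the exact-KNN baseline and the recall metric any future ANN index could be measured against:

```python
import numpy as np

def exact_knn(embeddings: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Brute-force top-k by inner product: exact, O(N*d) per query."""
    scores = embeddings @ query
    return np.argsort(-scores)[:k]  # row ids, best first

def recall_at_k(ann_ids, exact_ids) -> float:
    """Fraction of the true top-k that an ANN index recovered."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1_000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

truth = exact_knn(embeddings, query, k=10)
print(recall_at_k(truth, truth))  # 1.0: exact search recalls everything
```

Exposing both paths on one table would then let callers (or a benchmark) compare the ANN result set against `exact_knn` on the same data.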