Implementation notes
Here are a few notes on implementation details for which we sometimes get questions. We describe the tradeoffs and maybe a few unexpected design choices or results.
A typical operation in IndexFlatL2 is to exhaustively compare a set of nq query vectors against a set of nb database vectors in dimension d (and then select the top-k smallest distances for each query).
Faiss has two implementations of this operation:
- a direct implementation that loops over nq, nb and the dimension of the vectors.
- an implementation that uses the decomposition $||x - y||^2 = ||x||^2 + ||y||^2 - 2 \left< x, y \right>$. This is faster because the most expensive operation, which costs O(nq * nb * d), can be handed over to BLAS, which normally performs it efficiently.
We use implementation 1 when nq < 20, and implementation 2 otherwise. The threshold 20 can be adjusted via the global variable faiss::distance_compute_blas_threshold (accessible in Python as faiss.cvar.distance_compute_blas_threshold).
Note that implementation 2 may be less stable numerically than implementation 1 for vectors of very different magnitudes, see the discussion in issue #297.
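As an illustration (not Faiss code), here is a minimal NumPy sketch of the decomposition used by implementation 2; the function name and the synthetic data are our own:

```python
import numpy as np

def l2_sqr_decomposed(xq, xb):
    """Pairwise squared L2 distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>."""
    q_norms = (xq ** 2).sum(axis=1)   # (nq,) squared norms of the queries
    b_norms = (xb ** 2).sum(axis=1)   # (nb,) squared norms of the database vectors
    dots = xq @ xb.T                  # (nq, nb), the O(nq * nb * d) part, one BLAS call
    return q_norms[:, None] + b_norms[None, :] - 2 * dots

xq = np.random.rand(5, 64).astype('float32')      # synthetic queries
xb = np.random.rand(1000, 64).astype('float32')   # synthetic database
D = l2_sqr_decomposed(xq, xb)                      # (5, 1000) squared distances
```

If the numerical behaviour matters, raising faiss.cvar.distance_compute_blas_threshold makes Faiss use the direct implementation for larger query batches.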
k-means is implemented in the Clustering object
After initialization, the k-means iterates two operations:
- assign the training points to the centroids
- recompute the centroids as the center of mass of the points assigned to them.
In terms of performance, the first operation is the most costly (by far). Incidentally, it can be performed by any index, since it is a nearest-neighbor search of the training vectors among the centroids. Therefore the index is a parameter of the Clustering train method. It can be replaced with a GPU index (example) or an HNSW index (example).
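As an illustration, a sketch of driving the Clustering object with an HNSW assignment index from Python (the dimensions, dataset and parameter values are placeholders):

```python
import faiss
import numpy as np

d, k = 64, 1024
x = np.random.rand(100_000, d).astype('float32')   # training points

clus = faiss.Clustering(d, k)
clus.niter = 20

# any index over dimension d can serve as the assignment index;
# an HNSW index makes the assignment step approximate but faster
assign_index = faiss.IndexHNSWFlat(d, 32)
clus.train(x, assign_index)

# the trained centroids are stored as a flat float vector
centroids = faiss.vector_to_array(clus.centroids).reshape(k, d)
```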
PQ search boils down to doing look-ups in distance tables (as explained in the original paper “Product quantization for nearest neighbor search”).
For the IVFPQ, an additional level of pre-computation can be done to speed up the construction of these tables. This is explained in "Billion-scale similarity search with GPUs", section 5.2. It applies only to L2 distance computations with residual encoding (which is the default). There is a tradeoff between the memory used to store the additional tables and speed:
- on CPU, it is governed by IndexIVFPQ.use_precomputed_table. It takes 4 possible values: -1=disable; 0=decide heuristically (the default: use the tables only if they are smaller than IndexIVFPQ::precomputed_tables_max_bytes, set to 2 GB by default); 1=enable (size 256 * nlist * M); 2=a specific version for the MultiIndexQuantizer that is much more compact. Calling precompute_table() takes use_precomputed_table into account and updates the data for the next search (see the sketch after this list).
on GPU, precomputed tables are enabled if the
GpuIndexIVFConfig::usePrecomputedTables
is set to true at construction. The setting can be changed at runtime withGpuIndexIVF::setPrecomputedCodes(bool enable)
.
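A minimal sketch of enabling the tables on CPU from Python, using the fields described above (the data sizes and parameters are placeholders, not recommendations):

```python
import faiss
import numpy as np

d = 64
xt = np.random.rand(50_000, d).astype('float32')    # training vectors
xb = np.random.rand(200_000, d).astype('float32')   # database vectors
xq = np.random.rand(100, d).astype('float32')       # query vectors

index = faiss.index_factory(d, "IVF256,PQ16")       # IndexIVFPQ, L2 + residual encoding by default
index.train(xt)

# force the precomputed tables on (-1=disable, 0=heuristic, 1=enable)
index.use_precomputed_table = 1
index.precompute_table()

index.add(xb)
index.nprobe = 16
D, I = index.search(xq, 10)
```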
A trained ProductQuantizer struct maintains its list of centroids in a 1D array field called ::centroids, laid out as (M, ksub, dsub). This allows accessing the coordinates of the centroids directly.
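For example, the centroid coordinates can be viewed from Python as a numpy array (a sketch; the sizes are placeholders):

```python
import faiss
import numpy as np

d, M, nbits = 64, 8, 8
pq = faiss.ProductQuantizer(d, M, nbits)
pq.train(np.random.rand(10_000, d).astype('float32'))

# layout (M, ksub, dsub): sub-quantizer, centroid id, coordinate
centroids = faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub)
```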
Faiss 1.7.3 introduces two new fields, which make calls to ProductQuantizer::compute_code() faster:
- ::transposed_centroids, which stores the coordinates of the centroids with layout (dsub, M, ksub).
- ::centroids_sq_lengths, which stores the squared lengths of the centroids with layout (M, ksub).
These fields are not supposed to be accessed directly and are treated as temporary ones.
Instead:
- Prepare a ProductQuantizer.
- Edit the coordinates of the centroids by training the product quantizer or, if needed, by accessing the ::centroids field.
- Use the new function ProductQuantizer::sync_transposed_centroids(), which populates these two fields from the values in the ::centroids field. As a result, a different, more computationally efficient kernel will be used to evaluate ProductQuantizer::compute_code() calls. Currently, the kernel is available for dsub = 1, 2, 4 and 8.
- Perform ProductQuantizer::compute_code() or ProductQuantizer::compute_codes().
- Use ProductQuantizer::clear_transposed_centroids() to clear this buffer if it is not needed any more.
The code snippet for ProductQuantizer is the following:
```python
# prepare pq
pq = faiss.ProductQuantizer(...)
pq.train(...)

# enable the transposed centroids table to speed up compute_codes()
pq.sync_transposed_centroids()

# compute
codes_transposed = pq.compute_codes(...)

# disable the transposed centroids table once it is no longer needed
pq.clear_transposed_centroids()
```
Please note that pq.compute_codes() might produce slightly different results when the transposed centroids tables are used (compared to when they are not) because of the finite precision of floating-point arithmetic.
The PCA matrix is computed by:
- if there are more training points than dimensions: compute the covariance matrix, then use the LAPACK dsyev to extract the eigenvalues and eigenvectors.
- if there are more dimensions than training points: compute the Gram matrix (pairwise dot products of the training vectors), and extract the eigenvalues and eigenvectors from that.
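For reference, a minimal Python sketch of training and applying such a PCA transform (the sizes and data are placeholders); the eigen-decomposition described above happens inside train():

```python
import faiss
import numpy as np

d_in, d_out = 128, 16
x = np.random.rand(10_000, d_in).astype('float32')

mat = faiss.PCAMatrix(d_in, d_out)
mat.train(x)                  # computes the PCA matrix as described above
x_reduced = mat.apply_py(x)   # (10_000, d_out) projected vectors
```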
Non-exhaustive search means that only a subset of the database vectors is compared with the query. This is the case with the IndexIVF variants and the IndexHNSW variants.
The number of distances that are effectively computed is collected in global variables. For IndexIVF this is indexIVF_stats.ndis, for HNSW it is hnsw_stats.ndis.
These statistics are guaranteed to be accurate only if the search function is not called from multiple threads.
To access the variables in Python, just prefix them with faiss.cvar, e.g. faiss.cvar.indexIVF_stats.ndis will contain the number of distances computed so far.
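For example, a small sketch of reading and resetting the counters from Python (the index and data are synthetic placeholders):

```python
import faiss
import numpy as np

d = 32
xb = np.random.rand(10_000, d).astype('float32')
xq = np.random.rand(5, d).astype('float32')

index = faiss.index_factory(d, "IVF64,Flat")
index.train(xb)
index.add(xb)
index.nprobe = 4

stats = faiss.cvar.indexIVF_stats
stats.reset()                 # zero the counters before measuring

D, I = index.search(xq, 10)
print("distances computed:", stats.ndis)
```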
See also Fast accumulation of PQ and AQ codes (FastScan)
Most product quantizer (PQ) decompositions use 8 bits per sub-vector, which is convenient because it is byte-aligned. At search time, the look-up tables are stored in RAM (hopefully in cache) and can be used for distance computations, see equation (13) in Product quantization for nearest neighbor search.
However, modern CPUs are more efficient at arithmetic operations than at memory look-ups. One way to avoid the look-ups is to fall back to less powerful vector encodings (like binary or scalar decompositions).
Another way is to store the search-time look-up tables in registers. This idea was introduced in "Cache locality is not enough: high-performance nearest neighbor search with product quantization fast scan", André et al, VLDB 15, hence the name "fast-scan" of the current implementation.
Fitting look-up tables in registers is possible only under the following conditions:
- there must be many vector registers and there must be a look-up instruction (aka shuffle). This is supported on architectures like Intel AVX (16 256-bit registers) or ARM Neon (32 128-bit registers).
- the LUT tables must be short: 8-bit PQ produces 256 LUT entries, so it is reduced to 4 bits (16 entries per sub-quantizer).
- the LUT table entries must be small: we cannot afford to store floating-point entries, so they are quantized to 8 bits per entry.
The 4-bit PQ fast-scan implementation is heavily inspired by Google's SCANN.
It currently supports only AVX2. Compiling without AVX2 works but is very inefficient because the vector registers are emulated in memory.
The IndexPQFastScan and IndexIVFPQFastScan objects implement PQ4 fast scan.
The main difference with the regular IndexPQ and IndexIVFPQ is that the codes are laid out by batches of 32 (64 and 96 are supported as well, but there are few operating points where they are competitive).
Since the 4-bit PQ has a relatively low accuracy (PQ32x4 is much less accurate than PQ16x8 although they use the same amount of memory), it is useful to perform a re-ranking stage. For example, if we need k=10 results, we instead query k * k_factor_rf = 100 elements and rerank the top-100 results per query with exact (or more accurate) distance computations.
This 2-stage encoding is used in several works, including SCANN, the 3-level PQ search in "Searching in one billion vectors: Re-rank with source coding", Jegou et al., ICASSP'11, and DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node, Subramanya et al., NeurIPS'19.
In Faiss, this is supported via the IndexRefine object.
It manages two indexes, the second one refining the results from the first.
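As a sketch (the factory string and sizes are illustrative, not recommendations), a 4-bit fast-scan index with a flat refinement stage can be built through the index factory:

```python
import faiss
import numpy as np

d = 64
xt = np.random.rand(20_000, d).astype('float32')    # training vectors
xb = np.random.rand(100_000, d).astype('float32')   # database vectors
xq = np.random.rand(10, d).astype('float32')        # query vectors

# 4-bit fast-scan PQ with 32 sub-quantizers, refined by exact distances ("RFlat")
index = faiss.index_factory(d, "PQ32x4fs,RFlat")
index.train(xt)
index.add(xb)

# retrieve k * k_factor candidates from the fast-scan index, then rerank the top ones
index.k_factor = 10
D, I = index.search(xq, 10)
```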
For some Faiss use cases, having efficient implementations for very basic functions is critical.
faiss/utils/sorting.h exposes efficient sorting functions:
- fvec_argsort_parallel is a parallel argsort implementation.
- bucket_sort is a bucket sort function, i.e. it shuffles the input array (indirectly, with a permutation) so that the array entries equal to 0 come first, then 1, ..., then nbucket-1. It is exposed in Python as faiss.bucket_sort (see the sketch below).
- matrix_bucket_sort_inplace is a bit more complicated: it sorts the elements in place and in addition records from which row each element comes. In Python: faiss.matrix_bucket_sort.
- hashtable_int64_to_int64_init, ..._add, ..._lookup: a parallel hashtable implementation. It allows adding and looking up batches of (key, value) pairs (both int64) in a hashtable stored in an external table. Exposed in Python as MapInt64ToInt64.
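For instance, a sketch of the Python bucket sort wrapper; the (lims, perm) return convention, the inferred number of buckets and the int64 dtype are assumptions based on recent Faiss versions:

```python
import faiss
import numpy as np

# assumption: tab holds bucket ids in [0, nbucket) as int64
tab = np.random.randint(0, 8, size=1000).astype('int64')

# assumption: the wrapper returns (lims, perm), where lims has nbucket + 1 entries
# and tab[perm[lims[b]:lims[b + 1]]] == b for every bucket b
lims, perm = faiss.bucket_sort(tab)
```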
The product quantizer and additive quantizers produce a series of sub-quantizer indices $q_1, q_2, \dots$, each coded on a fixed number of bits, that are packed into a string of bytes.
When the number of bits per index is a multiple of 8, the packing is byte-aligned and straightforward.
When it is not, the indices are packed contiguously. For example, with 6-bit indices:
- the 6 low bits of byte 1 contain $q_1$,
- the 2 high bits of byte 1 contain the 2 low bits of $q_2$,
- the 4 low bits of byte 2 contain the 4 high bits of $q_2$,

etc.
This memory representation is sometimes used as a single integer.
This is useful in particular when the whole bit string is used as a coarse quantizer index, e.g. in the MultiIndexQuantizer or AdditiveCoarseQuantizer.
What's convenient is that on a little endian machine (Intel and ARM machines are little endian), this means that the bits are populated from lowest to highest.
In the previous example, if q is the integer read from the first bytes of the code:
- $q_1$ is stored in the 6 lowest bits of q: $q_1$ = q & 0x3f,
- $q_2$ is stored in bits 6..11 of q: $q_2$ = (q >> 6) & 0x3f,

etc.
A general way of reading these bits is via the BitstringReader object, see this code.
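To make the convention concrete, here is a small pure-Python sketch (not Faiss code) that packs two 6-bit indices and reads them back with the masks above:

```python
def pack_6bit(indices):
    """Pack a list of 6-bit indices into bytes, lowest bits first."""
    bits = 0
    nbits = 0
    out = bytearray()
    for q in indices:
        bits |= q << nbits           # append the 6 bits above the ones already stored
        nbits += 6
        while nbits >= 8:
            out.append(bits & 0xff)  # flush one full byte
            bits >>= 8
            nbits -= 8
    if nbits > 0:
        out.append(bits & 0xff)      # flush the remaining bits
    return bytes(out)

code = pack_6bit([0b101010, 0b110011])   # q1, q2
q = int.from_bytes(code, "little")       # read the code as a little-endian integer
assert q & 0x3f == 0b101010              # q1: the 6 lowest bits
assert (q >> 6) & 0x3f == 0b110011       # q2: bits 6..11
```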