Faiss building blocks: clustering, PCA, quantization
Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.
Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D tensor x is done as follows:
import faiss

# x: 2-D float32 array of training vectors, one per row
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
The resulting centroids are in kmeans.centroids. The values of the objective function (total square error in the case of k-means) along the iterations are stored in the variable kmeans.obj, and more extensive statistics in kmeans.iteration_stats.
To run it on GPU, add the option gpu=True to the Kmeans constructor. This will use all available GPUs on the machine.
The Kmeans object is mainly a layer on top of the C++ Clustering object, and all fields of that object can be set via the constructor. The fields include:
- nredo: run the clustering this number of times, and keep the best centroids (selected according to the clustering objective)
- verbose: make clustering more verbose
- spherical: perform spherical k-means -- the centroids are L2-normalized after each iteration
- int_centroids: round centroid coordinates to integers
- update_index: re-train the index after each iteration
- min_points_per_centroid / max_points_per_centroid: below the minimum you get a warning; above the maximum the training set is subsampled
- seed: seed for the random number generator
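For example, a short sketch combining some of these fields through the constructor (the values here are arbitrary):

# run the clustering 5 times and keep the best result;
# L2-normalize the centroids after each iteration (spherical k-means)
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=True,
                      nredo=5, spherical=True, seed=123)
kmeans.train(x)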
To compute the mapping from a set of vectors x to the cluster centroids after kmeans has finished training, use:
D, I = kmeans.index.search(x, 1)
This returns the nearest centroid for each row vector of x in I; D contains the corresponding squared L2 distances.
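For example, the assignment in I can be turned into cluster sizes (a small numpy sketch reusing the names above):

import numpy as np

# number of vectors of x assigned to each centroid
cluster_sizes = np.bincount(I.ravel(), minlength=ncentroids)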
For the reverse operation, e.g. finding the 15 nearest points in x to the computed centroids, a new index must be used:
index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)
I, of size (ncentroids, 15), contains the IDs of the 15 nearest points in x for each centroid.
Clustering on one or several GPUs is requested via the gpu=True (use all GPUs) or gpu=3 (use 3 GPUs) constructor option of the Kmeans object.
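For example (this assumes a GPU-enabled build of Faiss):

# same training call as before; the assignment step runs on all available GPUs
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=True, gpu=True)
kmeans.train(x)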
Faiss also provides PCA dimensionality reduction via the PCAMatrix transform. As an example, let's reduce 40D vectors to 10D.
import faiss
import numpy as np

# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
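The transform can also be approximately inverted. A short sketch, reusing the names above and assuming reverse_transform is exposed for linear transforms such as PCAMatrix; the reconstruction is lossy because 30 components were dropped:

# map the reduced vectors back to the original 40D space (lossy)
back = mat.reverse_transform(tr)
relative_error = ((mt - back) ** 2).sum() / (mt ** 2).sum()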
The quantizer objects inherit from Quantizer, which offers three common methods (see impl/Quantizer.h):

- train: train the quantizer on a matrix of vectors
- compute_codes and decode: encoder and decoder. The encoder is usually lossy and returns a uint8 matrix of codes for each input vector.
- get_DistanceComputer: a method returning a DistanceComputer object (see below).
The state of the Quantizer object is the result of the training (codebook tables or normalization coefficients). The field code_size indicates the number of bytes per code generated by the quantizer.
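For example, a quick check (using the ProductQuantizer constructor demonstrated further below):

import faiss

pq = faiss.ProductQuantizer(32, 4, 8)  # 32D vectors, 4 sub-quantizers, 8 bits each
print(pq.code_size)                    # 4 bytes per code: 4 sub-quantizers x 8 bits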
The supported quantizer types are:

- ScalarQuantizer: quantizes each vector component separately in a linear range.
- ProductQuantizer: performs vector quantization on sub-vectors.
- AdditiveQuantizer: encodes a vector as a sum of codebook entries, see Additive Quantizers for details. Additive quantizers can be trained in several ways, hence the sub-classes ResidualQuantizer, LocalSearchQuantizer, ProductAdditiveQuantizer.
Interestingly, each quantizer is a superset of the previous one.
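As a sketch of how a sub-class is used, assuming the Python wrappers expose the common Quantizer interface for additive quantizers as they do for PQ and SQ (the constructor arguments here are dimension, number of codebooks, and bits per codebook):

import faiss
import numpy as np

d = 32
xt = np.random.rand(10000, d).astype('float32')

rq = faiss.ResidualQuantizer(d, 4, 8)  # 4 codebooks of 2^8 entries each
rq.train(xt)
codes = rq.compute_codes(xt)           # lossy encoding, rq.code_size bytes per vector
x2 = rq.decode(codes)                  # approximate reconstruction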
Each quantizer has a corresponding index type that also stores a set of quantized vectors.
| Quantizer class | Flat index class | IVF index class |
|---|---|---|
| ScalarQuantizer | IndexScalarQuantizer | IndexIVFScalarQuantizer |
| ProductQuantizer | IndexPQ | IndexIVFPQ |
| AdditiveQuantizer | IndexAdditiveQuantizer | IndexIVFAdditiveQuantizer |
| ResidualQuantizer | IndexResidualQuantizer | IndexIVFResidualQuantizer |
| LocalSearchQuantizer | IndexLocalSearchQuantizer | IndexIVFLocalSearchQuantizer |
In addition, all indexes except the ScalarQuantizer ones exist in a FastScan version; see Fast accumulation of PQ and AQ codes.
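For instance, a sketch reusing d, xt and xb from the IndexPQ example above (FastScan variants require 4-bit sub-quantizers, hence nbits=4):

# same structure as IndexPQ, but with 4-bit codes laid out for SIMD scanning
index = faiss.IndexPQFastScan(d, 4, 4)
index.train(xt)
index.add(xb)
D, I = index.search(xb[:5], 10)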
Some quantizers return a DistanceComputer object. Its purpose is to compute vector-to-code distances efficiently when one vector is compared with many codes. In that case, it is often possible to precompute a set of tables to compute the distances directly in the compressed domain. Therefore, the DistanceComputer offers:
- a set_query method that sets the current uncompressed vector to compare with
- a distance_to_code method that computes the actual distance to a given code.
The ProductQuantizer object can be used to encode vectors to codes and decode them back:
import faiss
import numpy as np

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
A scalar quantizer works similarly:
import faiss
import numpy as np

d = 32  # data dimension

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()