Semsis is a library for semantic similarity search. It is designed with the following goals in mind:
- Simplicity: The library avoids unnecessary complexity and implements only what is needed for semantic search.
- Maintainability: Unit tests, docstrings, and type hints are provided throughout.
- Extensibility: Additional functionality can be implemented easily as needed.
- Efficiency: Billion-scale indexes can be constructed efficiently. See docs/technical_notes.rst for details.
Requirements:
- faiss (see INSTALL.md)
- The other requirements are defined in pyproject.toml and can be installed via `pip install ./`.
Installation:
```bash
git clone https://github.com/de9uch1/semsis.git
cd semsis/
pip install ./
```
You can find an end-to-end example of text search in end2end_test.py.
Note that this example is not optimized for billion-scale index construction; for an efficient implementation, see semsis/cli/README.rst.
- Encode the sentences and store them in a key-value store.
```python
import math

import numpy as np

from semsis.encoder import SentenceEncoder
from semsis.kvstore import KVStore
from semsis.retriever import RetrieverFaissCPU

TEXT = [
    "They listen to jazz and he likes jazz piano like Bud Powell.",
    "I really like fruits, especially I love grapes.",
    "I am interested in the k-nearest-neighbor search.",
    "The numpy.squeeze() function is used to remove single-dimensional entries from the shape of an array.",
    "This content is restricted.",
]
QUERIES = [
    "I've implemented some k-nearest-neighbor search algorithms.",
    "I often listen to jazz and I have many CDs which Bud Powell played.",
    "I am interested in the k-nearest-neighbor search.",
]
KVSTORE_PATH = "./kv.bin"
INDEX_PATH = "./index.bin"
INDEX_CONFIG_PATH = "./cfg.yaml"
MODEL = "bert-base-uncased"
REPRESENTATION = "avg"
BATCH_SIZE = 2

encoder = SentenceEncoder.build(MODEL, REPRESENTATION)
dim = encoder.get_embed_dim()
num_sentences = len(TEXT)
with KVStore.open(KVSTORE_PATH, mode="w") as kvstore:
    # Initialize the key-value store with the embedding dimension.
    kvstore.new(dim)
    # Encode the sentences in mini-batches and append their vectors.
    for i in range(math.ceil(num_sentences / BATCH_SIZE)):
        b, e = i * BATCH_SIZE, min((i + 1) * BATCH_SIZE, num_sentences)
        sentence_vectors = encoder.encode(TEXT[b:e]).numpy()
        kvstore.add(sentence_vectors)
```
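As a quick sanity check, the key-value store can be reopened in read mode and inspected. This is only a minimal sketch: it reuses the `KVStore.open` call and the `kvstore.key` / `kvstore.value` accessors shown in the next step, and it assumes one stored row per encoded sentence.

```python
# Minimal sanity check (assumption: one stored row per encoded sentence).
with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    print(kvstore.key[:].shape)  # expected: (len(TEXT), dim)
    print(kvstore.value[:])      # values associated with each stored vector
```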
- Next, read the key-value store and build the kNN index.
```python
with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    # Train the index on the stored vectors and add them with their IDs.
    retriever = RetrieverFaissCPU.build(RetrieverFaissCPU.Config(dim))
    retriever.train(kvstore.key[:])
    retriever.add(kvstore.key[:], kvstore.value[:])
retriever.save(INDEX_PATH, INDEX_CONFIG_PATH)
```
- Finally, load the index and query it.
```python
retriever = RetrieverFaissCPU.load(INDEX_PATH, INDEX_CONFIG_PATH)
query_vectors = encoder.encode(QUERIES).numpy()
distances, indices = retriever.search(query_vectors, k=1)
assert indices.squeeze(1).tolist() == [2, 0, 2]
assert np.isclose(distances[2, 0], 0.0)
```
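For reference, here is a minimal sketch of mapping the returned indices back to the original sentences. It uses only plain Python and the `distances`/`indices` arrays returned above; no additional semsis API is assumed.

```python
# Print the nearest stored sentence for each query.
for query, idx, dist in zip(QUERIES, indices[:, 0], distances[:, 0]):
    print(f"{query!r} -> {TEXT[int(idx)]!r} (distance={float(dist):.4f})")
```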
The command line scripts are carefully designed to run efficiently for billion-scale search. See semsis/cli/README.rst for details.
This library is published under the MIT license.