Semsis is a library for semantic similarity search. It is designed with the following goals in mind:
- Simplicity: The library avoids unnecessary complexity and implements only what is needed for semantic search.
- Maintainability: Unit tests, docstrings, and type hints are provided throughout.
- Extensibility: Additional functionality can be implemented easily as needed.
- Efficiency: Billion-scale indexes can be constructed efficiently. See docs/technical_notes.rst for details.
Requirements:
- faiss (see INSTALL.md)
- The other requirements are defined in pyproject.toml and can be installed via `pip install ./`.
Installation:
```bash
git clone https://github.com/de9uch1/semsis.git
cd semsis/
pip install ./
```
You can find an end-to-end example of text search in end2end_test.py.
Note that this example is not optimized for billion-scale index construction; for an efficient implementation, see semsis/cli/README.rst.
- Encode the sentences and store them in a key-value store.
```python
import math

import numpy as np

from semsis.encoder import SentenceEncoder
from semsis.kvstore import KVStore
from semsis.retriever import RetrieverFaissCPU

TEXT = [
    "They listen to jazz and he likes jazz piano like Bud Powell.",
    "I really like fruits, especially I love grapes.",
    "I am interested in the k-nearest-neighbor search.",
    "The numpy.squeeze() function is used to remove single-dimensional entries from the shape of an array.",
    "This content is restricted.",
]
QUERIES = [
    "I've implemented some k-nearest-neighbor search algorithms.",
    "I often listen to jazz and I have many CDs which Bud Powell played.",
    "I am interested in the k-nearest-neighbor search.",
]
KVSTORE_PATH = "./kv.bin"
INDEX_PATH = "./index.bin"
INDEX_CONFIG_PATH = "./cfg.yaml"
MODEL = "bert-base-uncased"
REPRESENTATION = "avg"
BATCH_SIZE = 2

encoder = SentenceEncoder.build(MODEL, REPRESENTATION)
dim = encoder.get_embed_dim()
num_sentences = len(TEXT)
with KVStore.open(KVSTORE_PATH, mode="w") as kvstore:
    # Initialize the key-value store with the embedding dimension.
    kvstore.new(dim)
    # Encode the sentences in mini-batches and append their vectors.
    for i in range(math.ceil(num_sentences / BATCH_SIZE)):
        b, e = i * BATCH_SIZE, min((i + 1) * BATCH_SIZE, num_sentences)
        sentence_vectors = encoder.encode(TEXT[b:e]).numpy()
        kvstore.add(sentence_vectors)
```
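As a quick sanity check, the key-value store can be reopened in read mode and inspected. This is only a minimal sketch: it reuses the `KVStore.open` call and the `kvstore.key` / `kvstore.value` accessors shown in the next step, and it assumes one stored row per encoded sentence.

```python
# Minimal sanity check (assumption: one stored row per encoded sentence).
with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    print(kvstore.key[:].shape)  # expected: (len(TEXT), dim)
    print(kvstore.value[:])      # values associated with each stored vector
```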
- Next, read the key-value store and build the kNN index.
```python
with KVStore.open(KVSTORE_PATH, mode="r") as kvstore:
    # Train the index on the stored vectors and add them with their IDs.
    retriever = RetrieverFaissCPU.build(RetrieverFaissCPU.Config(dim))
    retriever.train(kvstore.key[:])
    retriever.add(kvstore.key[:], kvstore.value[:])
retriever.save(INDEX_PATH, INDEX_CONFIG_PATH)
```
- Finally, load the index and query it.
```python
retriever = RetrieverFaissCPU.load(INDEX_PATH, INDEX_CONFIG_PATH)
query_vectors = encoder.encode(QUERIES).numpy()
distances, indices = retriever.search(query_vectors, k=1)
assert indices.squeeze(1).tolist() == [2, 0, 2]
assert np.isclose(distances[2, 0], 0.0)
```
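For reference, here is a minimal sketch of mapping the returned indices back to the original sentences. It uses only plain Python and the `distances`/`indices` arrays returned above; no additional semsis API is assumed.

```python
# Print the nearest stored sentence for each query.
for query, idx, dist in zip(QUERIES, indices[:, 0], distances[:, 0]):
    print(f"{query!r} -> {TEXT[int(idx)]!r} (distance={float(dist):.4f})")
```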
The command line scripts are carefully designed to run efficiently for billion-scale search. See semsis/cli/README.rst for details.
This library is published under the MIT license.