From 2bd3bf83fd398a7bdf97b65f7a0e6a6d32a90177 Mon Sep 17 00:00:00 2001 From: davidmezzetti <561939+davidmezzetti@users.noreply.github.com> Date: Wed, 9 Mar 2022 18:19:08 -0500 Subject: [PATCH] Update documentation --- docs/embeddings/configuration.md | 94 ++++++++++++++++++-------------- docs/embeddings/query.md | 25 +++++++++ 2 files changed, 78 insertions(+), 41 deletions(-) diff --git a/docs/embeddings/configuration.md b/docs/embeddings/configuration.md index 6bdf49c02..12fe0031e 100644 --- a/docs/embeddings/configuration.md +++ b/docs/embeddings/configuration.md @@ -3,12 +3,22 @@ ## Embeddings The following describes available embeddings configuration. These parameters are set via the [Embeddings constructor](../methods#txtai.embeddings.base.Embeddings.__init__). +### path +```yaml +path: string +``` + +Sets the path for a vectors model. When using a transformers/sentence-transformers model, this can be any model on the +[Hugging Face Model Hub](https://huggingface.co/models) or a local file path. Otherwise, it must be a local file path to a word embeddings model. + ### method ```yaml method: transformers|sentence-transformers|words|external ``` -Sentence embeddings method to use. Options listed below. +Sentence embeddings method to use. If the method is not provided, it is inferred using the `path`. + +`sentence-transformers` and `words` require the [similarity](../../install/#similarity) extras package to be installed. #### transformers @@ -17,32 +27,50 @@ Builds sentence embeddings using a transformers model. While this can be any tra #### sentence-transformers -Same as transformers but loads models with the sentence-transformers library. +Same as transformers but loads models with the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library. #### words -Builds sentence embeddings using a word embeddings model. +Builds sentence embeddings using a word embeddings model.
Transformers models are the preferred vector backend in most cases. Word embeddings models may be deprecated in the future. -#### external +##### storevectors +```yaml +storevectors: boolean +``` + +Enables copying of a vectors model set in path into the embeddings models output directory on save. This option enables a fully encapsulated index with no external file dependencies. -Sentence embeddings are loaded via an external model or API. Requires setting the `transform` parameter to a function that translates data into vectors. +##### scoring +```yaml +scoring: bm25|tfidf|sif +``` -The method is inferred using the _path_, if not provided. sentence-transformers and words require the [similarity](../../install/#similarity) extras package to be installed. +A scoring model builds weighted averages of word vectors for a given sentence. Supports BM25, TF-IDF and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built. -### path +##### pca ```yaml -path: string +pca: int ``` -Sets the path for a vectors model. When using a transformers/sentence-transformers model, this can be any model on the -[Hugging Face Model Hub](https://huggingface.co/models) or a local file path. Otherwise, it must be a local file path to a word embeddings model. +Removes _n_ principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. After pooling of vectors creates a single sentence embedding, this method is applied. + +#### external + +Sentence embeddings are loaded via an external model or API. Requires setting the [transform](#transform) parameter to a function that translates data into vectors. + +##### transform +```yaml +transform: function +``` + +When method is `external`, this function transforms input content into embeddings. 
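As an illustrative sketch of the `external` method (not txtai's internals), the transform can be any callable that maps a batch of inputs to one fixed-length vector each; the 300-dimension size and hash-based projection below are arbitrary stand-ins for a real external model or API call:

```python
import hashlib

import numpy as np

def transform(inputs):
    """Hypothetical external transform: maps a list of inputs to vectors.

    A deterministic hash-seeded projection stands in for a real external
    model or API call; the only contract assumed here is one fixed-length
    vector per input.
    """
    vectors = []
    for text in inputs:
        # Seed a generator from the content hash so output is reproducible
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
        vectors.append(np.random.default_rng(seed).standard_normal(300))
    return np.array(vectors, dtype=np.float32)

# The function would then be passed via configuration, for example:
# embeddings = Embeddings({"method": "external", "transform": transform})
```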
### backend ```yaml backend: faiss|hnsw|annoy ``` -Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to Faiss. Additional backends require the [similarity](../../install/#similarity) extras package to be installed. Backend-specific settings are set with a corresponding configuration object having the same name as the backend (i.e. annoy, faiss, or hnsw). None of these are required and are set to defaults if omitted. @@ -91,6 +119,19 @@ content: string|boolean Enables content storage. When true, the default content storage engine will be used. Otherwise, the string must specify the supported content storage engine to use. +### functions +```yaml +functions: list +``` + +List of user-defined SQL functions, only used when [content](#content) is enabled. Each list element must be one of the following: + +- function +- callable object +- dict with fields for name, argcount and function + +[An example can be found here](../query#custom-sql-functions). + ### quantize ```yaml quantize: boolean ``` @@ -99,9 +140,7 @@ quantize: boolean Enables quantization of generated sentence embeddings. If the index backend supports it, sentence embeddings will be stored with 8-bit precision vs 32-bit. Only Faiss currently supports quantization. -### Additional configuration for Transformers models - -#### tokenize +### tokenize ```yaml tokenize: boolean ``` Enables string tokenization (defaults to false). This method applies tokenization rules that only work with English language text and may increase the quality of English language sentence embeddings in some situations. -### Additional configuration for Word embedding models - -Word embeddings provide a good tradeoff of performance to functionality for a similarity search system.
With that being said, Transformers models are making great progress in scaling performance down to smaller models and are the preferred vector backend in txtai for most cases. - -Word embeddings models require the [similarity](../../install/#similarity) extras package to be installed. - -#### storevectors -```yaml -storevectors: boolean -``` - -Enables copying of a vectors model set in path into the embeddings models output directory on save. This option enables a fully encapsulated index with no external file dependencies. - -#### scoring -```yaml -scoring: bm25|tfidf|sif -``` - -A scoring model builds weighted averages of word vectors for a given sentence. Supports BM25, TF-IDF and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built. - -#### pca -```yaml -pca: int -``` - -Removes _n_ principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. After pooling of vectors creates a single sentence embedding, this method is applied. - ## Cloud This section describes parameters used to sync compressed indexes with cloud storage. These parameters are only enabled if an embeddings index is stored as compressed. They are set via the [embeddings.load](../methods/#txtai.embeddings.base.Embeddings.load) and [embeddings.save](../methods/#txtai.embeddings.base.Embeddings.save) methods. diff --git a/docs/embeddings/query.md b/docs/embeddings/query.md index a9dd7f719..df06e226e 100644 --- a/docs/embeddings/query.md +++ b/docs/embeddings/query.md @@ -110,6 +110,31 @@ query = "select object from txtai where similar('machine learning') limit 1" result = embeddings.search(query)[0]["object"] ``` +## Custom SQL functions + +Custom, user-defined SQL functions extend selection, filtering and ordering clauses with additional logic. For example, the following snippet defines a function that translates text using a translation pipeline. 
+ +```python +# Translation pipeline +translate = Translation() + +# Create embeddings index +embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", + "content": True, + "functions": [translate]}) + +# Run a search using a custom SQL function +embeddings.search(""" +select + text, + translation(text, 'de', null) 'text (DE)', + translation(text, 'es', null) 'text (ES)', + translation(text, 'fr', null) 'text (FR)' +from txtai where similar('feel good story') +limit 1 +""") +``` + ## Combined index architecture When content storage is enabled, txtai becomes a dual storage engine. Content is stored in an underlying database (currently supports SQLite) along with an Approximate Nearest Neighbor (ANN) index. These components combine to deliver similarity search alongside traditional structured search.
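The dual storage design can be sketched with a toy example, assuming a SQLite content table joined to a stand-in vector index by id. This is an illustration of the architecture only, not txtai's actual schema or ANN implementation:

```python
import math
import sqlite3

# Content rows live in SQLite; vectors live in a separate ANN-style index.
# Ids are the join key between the two stores.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, text TEXT)")

vectors = {}  # id -> vector, stand-in for a Faiss/HNSW/Annoy index

def add(uid, text, vector):
    # Write content to the database and the vector to the "ANN index"
    db.execute("INSERT INTO sections VALUES (?, ?)", (uid, text))
    vectors[uid] = vector

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(query, limit=1):
    # ANN lookup ranks ids by vector similarity...
    ids = sorted(vectors, key=lambda u: cosine(query, vectors[u]), reverse=True)[:limit]
    # ...then the content store resolves those ids to stored rows
    marks = ",".join("?" * len(ids))
    rows = db.execute(f"SELECT id, text FROM sections WHERE id IN ({marks})", ids)
    return rows.fetchall()

add(0, "feel good story", [1.0, 0.0])
add(1, "breaking news", [0.0, 1.0])
print(search([0.9, 0.1]))  # id 0 is the closest match
```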