Update documentation
davidmezzetti committed Mar 9, 2022
1 parent b0eca57 commit 2bd3bf8
Showing 2 changed files with 78 additions and 41 deletions.
94 changes: 53 additions & 41 deletions docs/embeddings/configuration.md
@@ -3,12 +3,22 @@
## Embeddings
The following describes the available embeddings configuration. These parameters are set via the [Embeddings constructor](../methods#txtai.embeddings.base.Embeddings.__init__).

### path
```yaml
path: string
```
Sets the path for a vectors model. When using a transformers/sentence-transformers model, this can be any model on the
[Hugging Face Model Hub](https://huggingface.co/models) or a local file path. Otherwise, it must be a local file path to a word embeddings model.
### method
```yaml
method: transformers|sentence-transformers|words|external
```
Sentence embeddings method to use. Options listed below.
Sentence embeddings method to use. If the method is not provided, it is inferred using the `path`.

`sentence-transformers` and `words` require the [similarity](../../install/#similarity) extras package to be installed.
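
As a minimal sketch of how `path` and `method` combine, the dictionaries below show an explicit method, an omitted method that is left to be inferred from the path, and a word embeddings model. The model name is taken from the query example in this commit; the local file path is a hypothetical placeholder, and the commented-out constructor call assumes txtai is installed.

```python
# Explicit method: load the model with the sentence-transformers library
explicit = {
    "path": "sentence-transformers/nli-mpnet-base-v2",
    "method": "sentence-transformers",
}

# Method omitted: it is inferred from the path
inferred = {"path": "sentence-transformers/nli-mpnet-base-v2"}

# Word embeddings model: the path must be a local file (hypothetical path)
words = {"path": "/path/to/word-vectors.model", "method": "words"}

# Assumed usage, requires txtai:
# from txtai.embeddings import Embeddings
# embeddings = Embeddings(inferred)
```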

#### transformers

@@ -17,32 +27,50 @@ Builds sentence embeddings using a transformers model. While this can be any tra

#### sentence-transformers

Same as transformers but loads models with the sentence-transformers library.
Same as transformers but loads models with the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library.

#### words

Builds sentence embeddings using a word embeddings model.
Builds sentence embeddings using a word embeddings model. Transformers models are the preferred vector backend in most cases. Word embeddings models may be deprecated in the future.

#### external
##### storevectors
```yaml
storevectors: boolean
```

Enables copying of a vectors model set in path into the embeddings models output directory on save. This option enables a fully encapsulated index with no external file dependencies.

Sentence embeddings are loaded via an external model or API. Requires setting the `transform` parameter to a function that translates data into vectors.
##### scoring
```yaml
scoring: bm25|tfidf|sif
```

The method is inferred using the _path_, if not provided. sentence-transformers and words require the [similarity](../../install/#similarity) extras package to be installed.
A scoring model builds weighted averages of word vectors for a given sentence. Supports BM25, TF-IDF and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
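
The difference between mean and weighted sentence embeddings can be illustrated with a short, dependency-free sketch. The word vectors and weights below are made up for illustration, and `weighted_average` is a hypothetical helper, not txtai's implementation; in practice the weights would come from a BM25, TF-IDF or SIF scoring model.

```python
def weighted_average(vectors, weights=None):
    """Average word vectors; with no weights this is a plain mean."""
    if weights is None:
        weights = [1.0] * len(vectors)

    total = sum(weights)
    dimensions = len(vectors[0])
    return [
        sum(weight * vector[i] for weight, vector in zip(weights, vectors)) / total
        for i in range(dimensions)
    ]


# Toy 2-dimensional word vectors and made-up term weights
word_vectors = {"feel": [1.0, 0.0], "good": [0.0, 1.0], "story": [1.0, 1.0]}
term_weights = {"feel": 1.0, "good": 1.0, "story": 2.0}
tokens = ["feel", "good", "story"]

vectors = [word_vectors[token] for token in tokens]

# No scoring method: mean sentence embedding
mean_embedding = weighted_average(vectors)

# Scoring method: rarer or more informative terms contribute more
weighted_embedding = weighted_average(vectors, [term_weights[token] for token in tokens])
```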

### path
##### pca
```yaml
path: string
pca: int
```

Sets the path for a vectors model. When using a transformers/sentence-transformers model, this can be any model on the
[Hugging Face Model Hub](https://huggingface.co/models) or a local file path. Otherwise, it must be a local file path to a word embeddings model.
Removes _n_ principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. After pooling of vectors creates a single sentence embedding, this method is applied.
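
To make the removal step concrete, here is a rough sketch of subtracting a vector's projection onto each principal component. It assumes the components have already been computed (for example by a TruncatedSVD model) and are orthonormal; `remove_components` is a hypothetical helper for illustration, not txtai's code.

```python
def remove_components(vector, components):
    """Subtract the projection of vector onto each (orthonormal) component."""
    result = list(vector)
    for component in components:
        # Length of the projection of the original vector onto this component
        projection = sum(v * c for v, c in zip(vector, component))
        result = [r - projection * c for r, c in zip(result, component)]

    return result


# Toy example: remove the first axis as the single principal component
cleaned = remove_components([3.0, 4.0], [[1.0, 0.0]])
```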

#### external

Sentence embeddings are loaded via an external model or API. Requires setting the [transform](#transform) parameter to a function that translates data into vectors.

##### transform
```yaml
transform: function
```

When method is `external`, this function transforms input content into embeddings.
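
A toy `transform` function might look like the sketch below. It stands in for a call to an external model or API by hashing each input into a small fixed-size vector. The hashing scheme and vector size are assumptions made purely for illustration, and a real function would likely return a NumPy float array rather than plain lists.

```python
import hashlib


def transform(inputs):
    """Toy stand-in for an external model: maps each input to an 8-dimensional vector."""
    vectors = []
    for text in inputs:
        digest = hashlib.sha256(str(text).encode("utf-8")).digest()
        # Scale the first 8 bytes of the digest into the range [0, 1]
        vectors.append([byte / 255.0 for byte in digest[:8]])

    return vectors


# Configuration using the external method with the function above
config = {"method": "external", "transform": transform}

# Assumed usage, requires txtai:
# from txtai.embeddings import Embeddings
# embeddings = Embeddings(config)
```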

### backend
```yaml
backend: faiss|hnsw|annoy
```

Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. Defaults to Faiss. Additional backends require the
Approximate Nearest Neighbor (ANN) index backend for storing generated sentence embeddings. `Defaults to Faiss`. Additional backends require the
[similarity](../../install/#similarity) extras package to be installed.

Backend-specific settings are set with a corresponding configuration object having the same name as the backend (i.e. annoy, faiss, or hnsw). None of these are required and are set to defaults if omitted.
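
As an illustration, a configuration selecting the `hnsw` backend with its own settings object could look like the sketch below. The setting names `efconstruction` and `m` and their values are assumptions here, since the backend-specific settings are not shown in this diff; check the backend sections of the documentation for the supported names.

```python
config = {
    "path": "sentence-transformers/nli-mpnet-base-v2",
    # Select the ANN backend
    "backend": "hnsw",
    # Settings object named after the backend (setting names are assumptions)
    "hnsw": {"efconstruction": 200, "m": 16},
}

# The settings object is optional; defaults apply when it is omitted
defaults = {"path": "sentence-transformers/nli-mpnet-base-v2", "backend": "hnsw"}
```
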
@@ -91,6 +119,19 @@ content: string|boolean

Enables content storage. When true, the default content storage engine will be used. Otherwise, the string must specify the supported content storage engine to use.

### functions
```yaml
functions: list
```

List of user-defined SQL functions, only used when [content](#content) is enabled. Each list element must be one of the following:

- function
- callable object
- dict with fields for name, argcount and function

[An example can be found here](../query#custom-sql-functions).
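
The three accepted element forms can be sketched as follows. The function and class names are hypothetical, and the commented-out constructor call assumes txtai is installed with content storage enabled.

```python
def clength(text):
    """Plain function: returns the character length of a text field."""
    return len(text) if text else 0


class Shout:
    """Callable object: upper-cases a text field."""

    def __call__(self, text):
        return text.upper() if text else text


functions = [
    # function
    clength,
    # callable object
    Shout(),
    # dict with fields for name, argcount and function
    {"name": "textlen", "argcount": 1, "function": clength},
]

# Assumed usage, requires txtai:
# from txtai.embeddings import Embeddings
# embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
#                          "content": True,
#                          "functions": functions})
```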

### quantize
```yaml
quantize: boolean
```

@@ -99,43 +140,14 @@ quantize: boolean
Enables quantization of generated sentence embeddings. If the index backend supports it, sentence embeddings will be stored with 8-bit precision instead of 32-bit.
Only Faiss currently supports quantization.

### Additional configuration for Transformers models

#### tokenize
### tokenize
```yaml
tokenize: boolean
```

Enables string tokenization (defaults to false). This method applies tokenization rules that only work with English language text and may increase the quality of
English language sentence embeddings in some situations.

### Additional configuration for Word embedding models

Word embeddings provide a good tradeoff of performance to functionality for a similarity search system. With that being said, Transformers models are making great progress in scaling performance down to smaller models and are the preferred vector backend in txtai for most cases.

Word embeddings models require the [similarity](../../install/#similarity) extras package to be installed.

#### storevectors
```yaml
storevectors: boolean
```

Enables copying of a vectors model set in path into the embeddings models output directory on save. This option enables a fully encapsulated index with no external file dependencies.

#### scoring
```yaml
scoring: bm25|tfidf|sif
```

A scoring model builds weighted averages of word vectors for a given sentence. Supports BM25, TF-IDF and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.

#### pca
```yaml
pca: int
```

Removes _n_ principal components from generated sentence embeddings. When enabled, a TruncatedSVD model is built to help with dimensionality reduction. After pooling of vectors creates a single sentence embedding, this method is applied.

## Cloud

This section describes parameters used to sync compressed indexes with cloud storage. These parameters are only enabled if an embeddings index is stored as compressed. They are set via the [embeddings.load](../methods/#txtai.embeddings.base.Embeddings.load) and [embeddings.save](../methods/#txtai.embeddings.base.Embeddings.save) methods.
25 changes: 25 additions & 0 deletions docs/embeddings/query.md
@@ -110,6 +110,31 @@ query = "select object from txtai where similar('machine learning') limit 1"
result = embeddings.search(query)[0]["object"]
```

## Custom SQL functions

Custom, user-defined SQL functions extend selection, filtering and ordering clauses with additional logic. For example, the following snippet defines a function that translates text using a translation pipeline.

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Translation

# Translation pipeline
translate = Translation()

# Create embeddings index
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
                         "content": True,
                         "functions": [translate]})

# Run a search using a custom SQL function
embeddings.search("""
select
  text,
  translation(text, 'de', null) 'text (DE)',
  translation(text, 'es', null) 'text (ES)',
  translation(text, 'fr', null) 'text (FR)'
from txtai where similar('feel good story')
limit 1
""")
```

## Combined index architecture

When content storage is enabled, txtai becomes a dual storage engine. Content is stored in an underlying database (currently supports SQLite) along with an Approximate Nearest Neighbor (ANN) index. These components combine to deliver similarity search alongside traditional structured search.