diff --git a/README.md b/README.md index e2cc5d419..c13b6f631 100644 --- a/README.md +++ b/README.md @@ -119,6 +119,7 @@ Build semantic/similarity/vector/neural search applications. | [Anatomy of a txtai index](https://github.com/neuml/txtai/blob/master/examples/29_Anatomy_of_a_txtai_index.ipynb) | Deep dive into the file formats behind a txtai embeddings index | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/29_Anatomy_of_a_txtai_index.ipynb) | | [Custom Embeddings SQL functions](https://github.com/neuml/txtai/blob/master/examples/30_Embeddings_SQL_custom_functions.ipynb) | Add user-defined functions to Embeddings SQL | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/30_Embeddings_SQL_custom_functions.ipynb) | | [Model explainability](https://github.com/neuml/txtai/blob/master/examples/32_Model_explainability.ipynb) | Explainability for semantic search | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/32_Model_explainability.ipynb) | +| [Query translation](https://github.com/neuml/txtai/blob/master/examples/33_Query_translation.ipynb) | Domain-specific natural language queries with query translation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/33_Query_translation.ipynb) | ### Pipelines diff --git a/docs/examples.md b/docs/examples.md index 100cab577..1634ea243 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -21,6 +21,7 @@ Build semantic/similarity/vector/neural search applications. | [Anatomy of a txtai index](https://github.com/neuml/txtai/blob/master/examples/29_Anatomy_of_a_txtai_index.ipynb) | Deep dive into the file formats behind a txtai embeddings index | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/29_Anatomy_of_a_txtai_index.ipynb) | | [Custom Embeddings SQL functions](https://github.com/neuml/txtai/blob/master/examples/30_Embeddings_SQL_custom_functions.ipynb) | Add user-defined functions to Embeddings SQL | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/30_Embeddings_SQL_custom_functions.ipynb) | | [Model explainability](https://github.com/neuml/txtai/blob/master/examples/32_Model_explainability.ipynb) | Explainability for semantic search | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/32_Model_explainability.ipynb) | +| [Query translation](https://github.com/neuml/txtai/blob/master/examples/33_Query_translation.ipynb) | Domain-specific natural language queries with query translation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/33_Query_translation.ipynb) | ## Pipelines diff --git a/docs/pipeline/text/sequences.md b/docs/pipeline/text/sequences.md index ad7f8cf7f..8e6d953f1 100644 --- a/docs/pipeline/text/sequences.md +++ b/docs/pipeline/text/sequences.md @@ -17,6 +17,12 @@ sequences = Sequences() sequences("Hello, how are you?", "translate English to French: ") ``` +See the link below for a more detailed example. + +| Notebook | Description | | +|:----------|:-------------|------:| +| [Query translation](https://github.com/neuml/txtai/blob/master/examples/33_Query_translation.ipynb) | Domain-specific natural language queries with query translation | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/33_Query_translation.ipynb) | + ## Configuration-driven example Pipelines are run with Python or configuration. Pipelines can be instantiated in [configuration](../../../api/configuration/#pipeline) using the lower case name of the pipeline. Configuration-driven pipelines are run with [workflows](../../../workflow/#configuration-driven-example) or the [API](../../../api#local-instance). diff --git a/examples/33_Query_translation.ipynb b/examples/33_Query_translation.ipynb new file mode 100644 index 000000000..4ef925753 --- /dev/null +++ b/examples/33_Query_translation.ipynb @@ -0,0 +1,515 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "kernelspec": { + "name": "python3", + "display_name": "Python 3", + "language": "python" + }, + "language_info": { + "name": "python", + "version": "3.7.6", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "pygments_lexer": "ipython3", + "nbconvert_exporter": "python", + "file_extension": ".py" + }, + "colab": { + "name": "33 - Query translation", + "provenance": [], + "collapsed_sections": [] + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "POWZoSJR6XzK" + }, + "source": [ + "# Query translation\n", + "\n", + "_This notebook is part of a tutorial series on [txtai](https://github.com/neuml/txtai), an AI-powered semantic search platform._\n", + "\n", + "[txtai](https://github.com/neuml/txtai) executes machine-learning workflows to transform data and build AI-powered semantic search applications.\n", + "\n", + "txtai supports two main types of queries: natural language statements and SQL statements. Natural language queries handles a search engine like query. SQL statements enable more complex filtering, sorting and column selection. Query translation bridges the gap between the two and enables filtering for natural language queries.\n", + "\n", + "For example, the query:\n", + "\n", + "```\n", + "Tell me a feel good story since yesterday\n", + "```\n", + "\n", + "becomes\n", + "\n", + "```sql\n", + "select * from txtai where similar(\"Tell me a feel good story\") and\n", + "entry >= date('now', '-1 day')\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qa_PPKVX6XzN" + }, + "source": [ + "# Install dependencies\n", + "\n", + "Install `txtai` and all dependencies." + ] + }, + { + "cell_type": "code", + "metadata": { + "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", + "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", + "trusted": true, + "_kg_hide-output": true, + "id": "24q-1n5i6XzQ" + }, + "source": [ + "%%capture\n", + "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]" + ], + "execution_count": 29, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Create index\n", + "Let's first recap how to create an index. We'll use the classic txtai example.\n" + ], + "metadata": { + "id": "0p3WCDniUths" + } + }, + { + "cell_type": "code", + "metadata": { + "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a", + "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", + "trusted": true, + "id": "2j_CFGDR6Xzp", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e65238c5-d67f-4fe4-9cbf-c88b0ff3a8bb" + }, + "source": [ + "from txtai.embeddings import Embeddings\n", + "\n", + "data = [\"US tops 5 million confirmed virus cases\",\n", + " \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n", + " \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n", + " \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n", + " \"Maine man wins $1M from $25 lottery ticket\",\n", + " \"Make huge profits without work, earn up to $100,000 a day\"]\n", + "\n", + "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n", + "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True})\n", + "\n", + "# Create an index for the list of text\n", + "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n", + "\n", + "# Run a search\n", + "embeddings.search(\"feel good story\", 1)" + ], + "execution_count": 30, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'id': '4',\n", + " 'score': 0.08329011499881744,\n", + " 'text': 'Maine man wins $1M from $25 lottery ticket'}]" + ] + }, + "metadata": {}, + "execution_count": 30 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Query translation models\n", + "\n", + "Next we'll explore how query translation models work with examples. " + ], + "metadata": { + "id": "QTee7YMNDD4R" + } + }, + { + "cell_type": "code", + "source": [ + "from txtai.pipeline import Sequences\n", + "\n", + "sequences = Sequences(\"NeuML/t5-small-txtsql\")\n", + "\n", + "queries = [\n", + " \"feel good story\",\n", + " \"feel good story since yesterday\",\n", + " \"feel good story with lottery in text\",\n", + " \"how many feel good story\",\n", + " \"feel good story translated to fr\",\n", + " \"feel good story summarized\"\n", + "]\n", + "\n", + "# Prefix to pass to T5 model\n", + "prefix = \"translate English to SQL: \"\n", + "\n", + "for query in queries:\n", + " print(f\"Input: {query}\")\n", + " print(f\"SQL: {sequences(query, prefix)}\")\n", + " print()\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "04iOgP_ojSfK", + "outputId": "41e73130-75e6-4ae9-b690-685972a13565" + }, + "execution_count": 31, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Input: feel good story\n", + "SQL: select id, text, score from txtai where similar('feel good story')\n", + "\n", + "Input: feel good story since yesterday\n", + "SQL: select id, text, score from txtai where similar('feel good story') and entry >= date('now', '-1 day')\n", + "\n", + "Input: feel good story with lottery in text\n", + "SQL: select id, text, score from txtai where similar('feel good story') and text like '% lottery%'\n", + "\n", + "Input: how many feel good story\n", + "SQL: select count(*) from txtai where similar('feel good story')\n", + "\n", + "Input: feel good story translated to fr\n", + "SQL: select id, translate(text, 'fr') text, score from txtai where similar('feel good story')\n", + "\n", + "Input: feel good story summarized\n", + "SQL: select id, summary(text) text, score from txtai where similar('feel good story')\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Looking at the query translations above gives an idea on how this model works.\n", + "\n", + "[t5-small-txtsql](https://huggingface.co/NeuML/t5-small-txtsql) is the default model. Custom domain query syntax languages can be created using this same methodology, including for other languages. Natural language can be translated to functions, query clauses, column selection and more!" + ], + "metadata": { + "id": "rAnEMaiWlOXm" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Natural language filtering\n", + "\n", + "Now it's time for this in action! Let's first initialize the embeddings index with the appropriate settings." + ], + "metadata": { + "id": "P9hOcgNfjSyL" + } + }, + { + "cell_type": "code", + "source": [ + "from txtai.pipeline import Translation\n", + "\n", + "def translate(text, lang):\n", + " return translation(text, lang)\n", + "\n", + "translation = Translation()\n", + "\n", + "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n", + "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\",\n", + " \"content\": True,\n", + " \"query\": {\"path\": \"NeuML/t5-small-txtsql\"},\n", + " \"functions\": [translate]})\n", + "\n", + "# Create an index for the list of text\n", + "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n", + "\n", + "query = \"select id, score, translate(text, 'de') 'text' from txtai where similar('feel good story')\"\n", + "\n", + "# Run a search using a custom SQL function\n", + "embeddings.search(query)[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "L2DDJrd0RAaN", + "outputId": "c60bfc29-187b-4926-8656-04063b2ca85c" + }, + "execution_count": 32, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'id': '4',\n", + " 'score': 0.08329011499881744,\n", + " 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}" + ] + }, + "metadata": {}, + "execution_count": 32 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Note how the query model was provided as a embeddings index configuration parameter. Custom SQL functions were also added in. Let's now see if the same SQL statement can be run with a natural language query." + ], + "metadata": { + "id": "MNd7QmFmnh-f" + } + }, + { + "cell_type": "code", + "source": [ + "embeddings.search(\"feel good story translated to de\")[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mnGSBNMonur2", + "outputId": "dc14d898-b46f-42e6-88e0-434863cc15d6" + }, + "execution_count": 33, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'id': '4',\n", + " 'score': 0.08329011499881744,\n", + " 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}" + ] + }, + "metadata": {}, + "execution_count": 33 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Same result. Let's try a few more." + ], + "metadata": { + "id": "OnmVUK2DoZkh" + } + }, + { + "cell_type": "code", + "source": [ + "embeddings.search(\"feel good story since yesterday\")[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Js2b_M61oitg", + "outputId": "fab080cb-d082-4cf3-8314-12d23acc3c02" + }, + "execution_count": 34, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'id': '4',\n", + " 'score': 0.08329011499881744,\n", + " 'text': 'Maine man wins $1M from $25 lottery ticket'}" + ] + }, + "metadata": {}, + "execution_count": 34 + } + ] + }, + { + "cell_type": "code", + "source": [ + "embeddings.search(\"feel good story with lottery in text\")[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pcxPgriwomWv", + "outputId": "7e71dd09-7c4f-4147-ba9e-47a45c1cf90f" + }, + "execution_count": 35, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'id': '4',\n", + " 'score': 0.08329011499881744,\n", + " 'text': 'Maine man wins $1M from $25 lottery ticket'}" + ] + }, + "metadata": {}, + "execution_count": 35 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "For good measure, a couple queries with filters that return no results." + ], + "metadata": { + "id": "1ob4Q_CLpGBm" + } + }, + { + "cell_type": "code", + "source": [ + "embeddings.search(\"feel good story with missing in text\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "LMaf7JJzonAC", + "outputId": "71080bd1-2d9e-41ba-9501-2cdedaf26d6d" + }, + "execution_count": 36, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "execution_count": 36 + } + ] + }, + { + "cell_type": "code", + "source": [ + "embeddings.search(\"feel good story with field equal 14\")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ADFNLhvto_0J", + "outputId": "12e08ea8-7203-4fb0-cb84-1133b5db9dc0" + }, + "execution_count": 37, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "execution_count": 37 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Query translation with applications\n", + "\n", + "Of course this is all available with YAML-configured applications." + ], + "metadata": { + "id": "mTT8nopiRdVH" + } + }, + { + "cell_type": "code", + "source": [ + "config = \"\"\"\n", + "translation:\n", + "\n", + "writable: true\n", + "embeddings:\n", + " path: sentence-transformers/nli-mpnet-base-v2\n", + " content: true\n", + " query:\n", + " path: NeuML/t5-small-txtsql\n", + " functions:\n", + " - {name: translate, argcount: 2, function: translation}\n", + "\"\"\"\n", + "\n", + "from txtai.app import Application\n", + "\n", + "# Build application and index data\n", + "app = Application(config)\n", + "app.add([{\"id\": x, \"text\": row} for x, row in enumerate(data)])\n", + "app.index()\n", + "\n", + "# Run search query\n", + "app.search(\"feel good story translated to de\")[0]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FZ_7G6M4RUbz", + "outputId": "b00ce1c2-8df2-4289-8d03-991174a0e74f" + }, + "execution_count": 38, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'id': '4',\n", + " 'score': 0.08329011499881744,\n", + " 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}" + ] + }, + "metadata": {}, + "execution_count": 38 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aDIF3tYt6X0O" + }, + "source": [ + "# Wrapping up\n", + "\n", + "This notebook introduced natural language filtering with query translation models. This powerful feature adds filtering and pipelines to natural language statements. Custom domain-specific query languages can be created to enable rich queries natively in txtai." + ] + } + ] +} \ No newline at end of file