Add metadata filter handling for builtin source storages #456

pmeier · 2024-07-22T12:28:15Z

Closes #423.

pmeier · 2024-07-22T13:58:23Z

Chroma and LanceDB are defunct for now. Here is a demo using ragna.source_storages.RagnaDemoSourceStorage:

full script

from __future__ import annotations

import uuid
from typing import Any, Iterator

from ragna import Rag, source_storages, core
from ragna.core import DocumentUploadParameters, Source
from ragna.deploy import Config


class DummyDocument(core.Document):
    @classmethod
    async def get_upload_info(
        cls, *, config: Config, user: str, id: uuid.UUID, name: str
    ) -> tuple[dict[str, Any], DocumentUploadParameters]:
        raise NotImplementedError

    @classmethod
    def new(cls, name: str, **metadata) -> DummyDocument:
        return cls(
            name=name,
            metadata=metadata,
            handler=core.PlainTextDocumentHandler(),
        )

    def is_readable(self) -> bool:
        return True

    def read(self) -> bytes:
        return b""


class DummyAssistant(core.Assistant):
    def answer(self, prompt: str, sources: list[Source]) -> Iterator[str]:
        yield "\n".join(f"- {source.document_name}" for source in sources)


async def main(metadatas, metadata_filters):
    documents = [
        DummyDocument.new(f"document{idx}.txt", idx=idx, **metadata)
        for idx, metadata in enumerate(metadatas)
    ]
    for document in documents:
        print(f"- {document.name}: {document.metadata}")

    source_storage = source_storages.RagnaDemoSourceStorage()
    source_storage.store(documents)

    for metadata_filter in metadata_filters:
        print("-" * 80)
        print(metadata_filter)
        print()

        chat = Rag().chat(
            metadata_filter, source_storage=source_storage, assistant=DummyAssistant
        )
        answer = await chat.answer("?")
        print(answer)


metadatas = [
    {
        "priority": "low",
        "department": "legal",
    },
    {
        "priority": "medium",
        "department": "legal",
    },
    {
        "priority": "low",
        "department": "marketing",
    },
    {
        "priority": "high",
        "department": "marketing",
    },
    {
        "priority": "medium",
        "department": "marketing",
    },
]

metadata_filters = [
    core.MetadataFilter.eq("document_name", "document2.txt"),
    core.MetadataFilter.ge("idx", 3),
    core.MetadataFilter.eq("department", "legal"),
    core.MetadataFilter.in_("priority", ["medium", "high"]),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.eq("priority", "low"),
            core.MetadataFilter.eq("department", "legal"),
        ]
    ),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.or_(
                [
                    core.MetadataFilter.eq("priority", "medium"),
                    core.MetadataFilter.eq("department", "marketing"),
                ]
            ),
            core.MetadataFilter.lt("idx", 4),
        ]
    ),
]

import asyncio

asyncio.run(main(metadatas, metadata_filters))

[...]
metadata_filters = [
    core.MetadataFilter.eq("document_name", "document2.txt"),
    core.MetadataFilter.ge("idx", 3),
    core.MetadataFilter.eq("department", "legal"),
    core.MetadataFilter.in_("priority", ["medium", "high"]),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.eq("priority", "low"),
            core.MetadataFilter.eq("department", "legal"),
        ]
    ),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.or_(
                [
                    core.MetadataFilter.eq("priority", "medium"),
                    core.MetadataFilter.eq("department", "marketing"),
                ]
            ),
            core.MetadataFilter.lt("idx", 4),
        ]
    ),
]
[...]

- document0.txt: {'idx': 0, 'priority': 'low', 'department': 'legal'}
- document1.txt: {'idx': 1, 'priority': 'medium', 'department': 'legal'}
- document2.txt: {'idx': 2, 'priority': 'low', 'department': 'marketing'}
- document3.txt: {'idx': 3, 'priority': 'high', 'department': 'marketing'}
- document4.txt: {'idx': 4, 'priority': 'medium', 'department': 'marketing'}
--------------------------------------------------------------------------------
EQ('document_name', 'document2.txt')

- document2.txt
--------------------------------------------------------------------------------
GE('idx', 3)

- document3.txt
- document4.txt
--------------------------------------------------------------------------------
EQ('department', 'legal')

- document0.txt
- document1.txt
--------------------------------------------------------------------------------
IN('priority', ['medium', 'high'])

- document1.txt
- document3.txt
- document4.txt
--------------------------------------------------------------------------------
AND(
  EQ('priority', 'low'),
  EQ('department', 'legal'),
)

- document0.txt
--------------------------------------------------------------------------------
AND(
  OR(
    EQ('priority', 'medium'),
    EQ('department', 'marketing'),
  ),
  LT('idx', 4),
)

- document1.txt
- document2.txt
- document3.txt
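
To make the filter semantics in the output above concrete, here is a minimal sketch of how a nested filter could be evaluated against plain metadata dicts. It does not use ragna: the tuple-based filter representation, the `matches` helper, and the rule that a missing key never matches are all assumptions made for illustration, not the behavior of `core.MetadataFilter` or `RagnaDemoSourceStorage`.

```python
from __future__ import annotations

import operator
from typing import Any, Callable

# Hypothetical stand-in for a filter tree (NOT ragna's MetadataFilter):
#   ("AND" | "OR", [child, ...])  or  (comparison name, key, value)
_COMPARISONS: dict[str, Callable[[Any, Any], bool]] = {
    "EQ": operator.eq,
    "GE": operator.ge,
    "LT": operator.lt,
    "IN": lambda actual, allowed: actual in allowed,
}


def matches(filter_: tuple, row: dict[str, Any]) -> bool:
    """Return True if the metadata row satisfies the filter tree."""
    op = filter_[0]
    if op == "AND":
        return all(matches(child, row) for child in filter_[1])
    if op == "OR":
        return any(matches(child, row) for child in filter_[1])
    _, key, value = filter_
    # Assumption: a row that lacks the key never matches.
    if key not in row:
        return False
    return _COMPARISONS[op](row[key], value)


# Same five documents as in the demo above.
rows = [
    {"document_name": f"document{idx}.txt", "idx": idx, "priority": p, "department": d}
    for idx, (p, d) in enumerate(
        [
            ("low", "legal"),
            ("medium", "legal"),
            ("low", "marketing"),
            ("high", "marketing"),
            ("medium", "marketing"),
        ]
    )
]

# Mirrors the last filter of the demo: AND(OR(EQ, EQ), LT).
nested = (
    "AND",
    [
        ("OR", [("EQ", "priority", "medium"), ("EQ", "department", "marketing")]),
        ("LT", "idx", 4),
    ],
)
selected = [row["document_name"] for row in rows if matches(nested, row)]
print(selected)
```

Applied to the last demo filter, this selects document1.txt, document2.txt and document3.txt, the same documents as in the output above.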

pmeier · 2024-07-22T20:32:13Z

ragna/core/_rag.py

-    def _parse_documents(self, documents: Iterable[Any]) -> list[Document]:
-        documents_ = []
-        for document in documents:
+    def _parse_input(


@nenb @blakerosenthal Loosely leaning on #256 (comment), input is overloaded with 3 values:

1. If it's explicitly None, i.e. there is no default value and the user actually calls Chat(input=None), we don't want any metadata filtering and thus also set metadata_filter=None. When we pass this down to the source storage, filtering should be disabled and thus the whole index should be used. This is not implemented yet in this PR.

2. If it's a MetadataFilter, we use it.

3. If it's anything else, we assume it is a list of documents or paths to create documents from. For a better UX, we might also allow a single path or document, but this is not critical.

For 1. and 2. we assume the chat is already prepared and only require preparation for 3.
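
The three-way dispatch described above could look roughly like the following sketch. It is not the actual `_parse_input` from ragna/core/_rag.py: the `MetadataFilter` class here is a stand-in, and `ParsedInput`, `parse_input`, and the `prepared` flag are hypothetical names introduced only to show the branching.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass(frozen=True)
class MetadataFilter:
    """Hypothetical stand-in for ragna.core.MetadataFilter."""

    expression: str


@dataclass
class ParsedInput:
    """Hypothetical result container; not part of ragna."""

    documents: Optional[list[Any]]
    metadata_filter: Optional[MetadataFilter]
    prepared: bool


def parse_input(input: Any) -> ParsedInput:
    # Case 1: explicit None disables filtering; the whole index is used.
    if input is None:
        return ParsedInput(documents=None, metadata_filter=None, prepared=True)
    # Case 2: a ready-made filter is used as is.
    if isinstance(input, MetadataFilter):
        return ParsedInput(documents=None, metadata_filter=input, prepared=True)
    # Case 3: anything else is treated as documents or paths.
    # Optional convenience (assumption): wrap a single path in a list.
    if isinstance(input, (str, Path)):
        input = [input]
    return ParsedInput(documents=list(input), metadata_filter=None, prepared=False)
```

Only the third case leaves `prepared` as False, matching the statement that preparation is required only when documents are passed.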

@pmeier @nenb What remains on the source storage work in particular? It looks like the chat object performs the logic of constructing a MetadataFilter object to pass along to source storage, and the source storage needs to handle the metadata_filter=None case and return the entire index. Is there anything else?

Not sure if that is implied, but we also need handling of the MetadataFilter in Chroma and LanceDB. Other than that, I think these are the open tasks right now.
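
As a rough idea of what such handling could involve for Chroma, the sketch below translates a filter tree into a dict that uses Chroma's `where` operators (`$and`, `$or`, `$eq`, `$gte`, and so on). The tuple-based input format and the `to_chroma_where` helper are hypothetical and do not reflect how this PR or its follow-ups implement the translation. Returning None for a None filter is one possible way to express "no filtering, use the whole index".

```python
from __future__ import annotations

from typing import Any, Optional

# Mapping from hypothetical operator names to Chroma `where` operators.
_CHROMA_OPERATORS = {
    "EQ": "$eq",
    "NE": "$ne",
    "LT": "$lt",
    "LE": "$lte",
    "GT": "$gt",
    "GE": "$gte",
    "IN": "$in",
    "NOT_IN": "$nin",
}


def to_chroma_where(filter_: Optional[tuple]) -> Optional[dict[str, Any]]:
    """Translate a filter tree into a Chroma-style `where` dict.

    The input format is an assumption for this sketch:
      ("AND" | "OR", [child, ...])  or  (operator name, key, value)
    """
    # No filter means no `where` clause, i.e. the whole index is searched.
    if filter_ is None:
        return None
    op = filter_[0]
    if op in ("AND", "OR"):
        children = [to_chroma_where(child) for child in filter_[1]]
        # A group with a single operand is equivalent to that operand.
        if len(children) == 1:
            return children[0]
        return {f"${op.lower()}": children}
    _, key, value = filter_
    return {key: {_CHROMA_OPERATORS[op]: value}}


where = to_chroma_where(
    ("AND", [("EQ", "priority", "low"), ("EQ", "department", "legal")])
)
print(where)
```

The resulting dict could then be passed as the `where` argument of a Chroma collection query; LanceDB would need a different target format.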

@pmeier @blakerosenthal I had some uncertainty in the task as well.

I've opened #460. We can discuss what needs to be done there. And @blakerosenthal you can push the lancedb changes to the same branch.

pmeier · 2024-07-26T09:57:56Z

Merging this and using #460 and #461 as follow-ups.

pmeier added 2 commits July 22, 2024 10:59

add translation logic

4c56c3f

dirty

bae4fc4

pmeier added the dev: corpus label Jul 22, 2024

fix demo assistant

6dd7ede

pmeier commented Jul 22, 2024

Merge branch 'corpus-dev' into metadata-translate

4faa1bb

pmeier marked this pull request as ready for review July 26, 2024 09:58

pmeier merged commit 3d07930 into corpus-dev Jul 26, 2024
14 of 21 checks passed

pmeier deleted the metadata-translate branch July 26, 2024 09:58

pmeier restored the metadata-translate branch July 26, 2024 09:58

pmeier deleted the metadata-translate branch July 26, 2024 10:13

This was referenced Aug 6, 2024

Make Chat.prepare idempotent #480

Closed

Improve UX for chat interaction #481

Open

pmeier commented Jul 22, 2024

pmeier commented Jul 22, 2024

pmeier Jul 22, 2024

blakerosenthal Jul 24, 2024

pmeier Jul 25, 2024

nenb Jul 25, 2024

pmeier commented Jul 26, 2024
