Add metadata filter handling for builtin source storages #456

Merged
pmeier merged 4 commits into corpus-dev from metadata-translate
Jul 26, 2024

Conversation

pmeier
Member

@pmeier pmeier commented Jul 22, 2024

Closes #423.

@pmeier
Member Author

pmeier commented Jul 22, 2024

Chroma and LanceDB are defunct for now. Here is a demo using ragna.source_storages.RagnaDemoSourceStorage:

full script

from __future__ import annotations

import uuid
from typing import Any, Iterator

from ragna import Rag, source_storages, core
from ragna.core import DocumentUploadParameters, Source
from ragna.deploy import Config


class DummyDocument(core.Document):
    @classmethod
    async def get_upload_info(
        cls, *, config: Config, user: str, id: uuid.UUID, name: str
    ) -> tuple[dict[str, Any], DocumentUploadParameters]:
        # Uploads are not needed for this demo.
        raise NotImplementedError

    @classmethod
    def new(cls, name: str, **metadata) -> DummyDocument:
        return cls(
            name=name,
            metadata=metadata,
            handler=core.PlainTextDocumentHandler(),
        )

    def is_readable(self) -> bool:
        return True

    def read(self) -> bytes:
        return b""


class DummyAssistant(core.Assistant):
    def answer(self, prompt: str, sources: list[Source]) -> Iterator[str]:
        yield "\n".join(f"- {source.document_name}" for source in sources)


async def main(metadatas, metadata_filters):
    documents = [
        DummyDocument.new(f"document{idx}.txt", idx=idx, **metadata)
        for idx, metadata in enumerate(metadatas)
    ]
    for document in documents:
        print(f"- {document.name}: {document.metadata}")

    source_storage = source_storages.RagnaDemoSourceStorage()
    source_storage.store(documents)

    for metadata_filter in metadata_filters:
        print("-" * 80)
        print(metadata_filter)
        print()

        chat = Rag().chat(
            metadata_filter, source_storage=source_storage, assistant=DummyAssistant
        )
        answer = await chat.answer("?")
        print(answer)


metadatas = [
    {
        "priority": "low",
        "department": "legal",
    },
    {
        "priority": "medium",
        "department": "legal",
    },
    {
        "priority": "low",
        "department": "marketing",
    },
    {
        "priority": "high",
        "department": "marketing",
    },
    {
        "priority": "medium",
        "department": "marketing",
    },
]

metadata_filters = [
    core.MetadataFilter.eq("document_name", "document2.txt"),
    core.MetadataFilter.ge("idx", 3),
    core.MetadataFilter.eq("department", "legal"),
    core.MetadataFilter.in_("priority", ["medium", "high"]),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.eq("priority", "low"),
            core.MetadataFilter.eq("department", "legal"),
        ]
    ),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.or_(
                [
                    core.MetadataFilter.eq("priority", "medium"),
                    core.MetadataFilter.eq("department", "marketing"),
                ]
            ),
            core.MetadataFilter.lt("idx", 4),
        ]
    ),
]

import asyncio

asyncio.run(main(metadatas, metadata_filters))

- document0.txt: {'idx': 0, 'priority': 'low', 'department': 'legal'}
- document1.txt: {'idx': 1, 'priority': 'medium', 'department': 'legal'}
- document2.txt: {'idx': 2, 'priority': 'low', 'department': 'marketing'}
- document3.txt: {'idx': 3, 'priority': 'high', 'department': 'marketing'}
- document4.txt: {'idx': 4, 'priority': 'medium', 'department': 'marketing'}
--------------------------------------------------------------------------------
EQ('document_name', 'document2.txt')

- document2.txt
--------------------------------------------------------------------------------
GE('idx', 3)

- document3.txt
- document4.txt
--------------------------------------------------------------------------------
EQ('department', 'legal')

- document0.txt
- document1.txt
--------------------------------------------------------------------------------
IN('priority', ['medium', 'high'])

- document1.txt
- document3.txt
- document4.txt
--------------------------------------------------------------------------------
AND(
  EQ('priority', 'low'),
  EQ('department', 'legal'),
)

- document0.txt
--------------------------------------------------------------------------------
AND(
  OR(
    EQ('priority', 'medium'),
    EQ('department', 'marketing'),
  ),
  LT('idx', 4),
)

- document1.txt
- document2.txt
- document3.txt
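
The filter semantics shown in the output above can be sketched in plain Python. This is not ragna's implementation: the tuple-based filter representation, the `COMPARISONS` table, and the `matches` helper are all hypothetical stand-ins for `core.MetadataFilter`, assuming that a document missing the filtered key simply does not match.

```python
from __future__ import annotations

import operator
from typing import Any, Callable

# Hypothetical stand-in for core.MetadataFilter: a filter is either
# ("AND" | "OR", [child filters]) or (comparison name, key, value).
COMPARISONS: dict[str, Callable[[Any, Any], bool]] = {
    "EQ": operator.eq,
    "GE": operator.ge,
    "LT": operator.lt,
    "IN": lambda actual, allowed: actual in allowed,
}


def matches(metadata_filter: tuple, metadata: dict[str, Any]) -> bool:
    """Return True if the metadata satisfies the (possibly nested) filter."""
    op = metadata_filter[0]
    if op == "AND":
        return all(matches(child, metadata) for child in metadata_filter[1])
    if op == "OR":
        return any(matches(child, metadata) for child in metadata_filter[1])
    _, key, value = metadata_filter
    # Assumption: a document without the key never matches.
    return key in metadata and COMPARISONS[op](metadata[key], value)


# Same metadata as in the demo script.
metadatas = [
    {"idx": 0, "priority": "low", "department": "legal"},
    {"idx": 1, "priority": "medium", "department": "legal"},
    {"idx": 2, "priority": "low", "department": "marketing"},
    {"idx": 3, "priority": "high", "department": "marketing"},
    {"idx": 4, "priority": "medium", "department": "marketing"},
]

# Mirrors the last filter of the demo: AND(OR(EQ, EQ), LT).
nested = (
    "AND",
    [
        ("OR", [("EQ", "priority", "medium"), ("EQ", "department", "marketing")]),
        ("LT", "idx", 4),
    ],
)

selected = [metadata["idx"] for metadata in metadatas if matches(nested, metadata)]
print(selected)  # [1, 2, 3], i.e. document1.txt, document2.txt, document3.txt
```

The recursion mirrors the nesting of the filter: the logical operators only combine the results of their children, and the leaves do the comparison against a single metadata key.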

    def _parse_documents(self, documents: Iterable[Any]) -> list[Document]:
        documents_ = []
        for document in documents:
    def _parse_input(
Member Author

@nenb @blakerosenthal Loosely following #256 (comment), input is overloaded to accept three kinds of values:

  1. If it's explicitly None, i.e. there is no default value and the caller actually passes Chat(input=None), we don't want any metadata filtering and thus also set metadata_filter=None. When we pass this down to the source storage, filtering should be disabled and thus the whole index should be used. This is not implemented yet in this PR.
  2. If it's a MetadataFilter, we use it.
  3. If it's anything else, we assume it is a list of documents or paths to create documents from. For a better UX, we might also allow a single path or document, but this is not critical.

For 1. and 2. we assume the chat is already prepared; only 3. requires preparation.
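
The three cases above could be dispatched roughly as follows. This is a minimal sketch, not the code from this PR: `MetadataFilter` is a hypothetical stand-in class, and `ParsedInput` and `parse_input` are illustrative names. Accepting a single path is the optional UX nicety mentioned in case 3.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class MetadataFilter:
    """Hypothetical stand-in for ragna.core.MetadataFilter."""

    description: str


@dataclass
class ParsedInput:
    """Illustrative container for the result of parsing the chat input."""

    metadata_filter: Optional[MetadataFilter]
    documents: Optional[list[Any]]
    prepared: bool


def parse_input(input: Any) -> ParsedInput:
    # Case 1: explicit None disables filtering; the whole index is used.
    if input is None:
        return ParsedInput(metadata_filter=None, documents=None, prepared=True)
    # Case 2: a ready-made filter is used as is.
    if isinstance(input, MetadataFilter):
        return ParsedInput(metadata_filter=input, documents=None, prepared=True)
    # Case 3: anything else is treated as documents or paths.
    # Assumption: a single path is wrapped into a list for convenience.
    if isinstance(input, (str, Path)):
        input = [input]
    # A filter selecting exactly these documents would be built during
    # preparation, which is not shown here.
    return ParsedInput(metadata_filter=None, documents=list(input), prepared=False)


print(parse_input(None).prepared)
print(parse_input(["a.txt", Path("b.txt")]).prepared)
```

Only the third case leaves the chat unprepared, which matches the note that preparation is required for documents but not for None or an existing filter.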

Contributor

@pmeier @nenb What remains on the source storage work in particular? It looks like the chat object performs the logic of constructing a MetadataFilter object to pass along to source storage, and the source storage needs to handle the metadata_filter=None case and return the entire index. Is there anything else?

Member Author

Not sure if that is implied, but we need handling of the MetadataFilter in Chroma and LanceDB. Beyond that, I think these are the open tasks right now.
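
For illustration, the metadata_filter=None case discussed in this thread might look like the sketch below on the source storage side. The `retrieve` function, the `Predicate` alias, and the in-memory list used as the index are hypothetical; a real Chroma or LanceDB storage would instead translate the filter into the backend's own query syntax.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

# Hypothetical: a filter is modeled as a plain predicate over a metadata dict.
Predicate = Callable[[dict], bool]


def retrieve(
    index: list[dict[str, Any]], metadata_filter: Optional[Predicate]
) -> list[dict[str, Any]]:
    # None means filtering is disabled, so the entire index is returned.
    if metadata_filter is None:
        return list(index)
    return [entry for entry in index if metadata_filter(entry)]


index = [
    {"document_name": "document0.txt", "department": "legal"},
    {"document_name": "document2.txt", "department": "marketing"},
]

everything = retrieve(index, None)
legal_only = retrieve(index, lambda entry: entry["department"] == "legal")
print(len(everything), len(legal_only))  # 2 1
```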

Contributor

@pmeier @blakerosenthal I had some uncertainty about the task as well.

I've opened #460. We can discuss what needs to be done there. And @blakerosenthal, you can push the LanceDB changes to the same branch.

@pmeier
Member Author

pmeier commented Jul 26, 2024

Merging this and using #460 and #461 as follow-ups.

@pmeier pmeier marked this pull request as ready for review July 26, 2024 09:58
@pmeier pmeier merged commit 3d07930 into corpus-dev Jul 26, 2024
14 of 21 checks passed
@pmeier pmeier deleted the metadata-translate branch July 26, 2024 09:58
@pmeier pmeier restored the metadata-translate branch July 26, 2024 09:58
@pmeier pmeier deleted the metadata-translate branch July 26, 2024 10:13