Add metadata filter handling for builtin source storages #456
Chroma and LanceDB are defunct for now. Here is a full demo script:
from __future__ import annotations

import uuid
from typing import Any, Iterator

from ragna import Rag, source_storages, core
from ragna.core import DocumentUploadParameters, Source
from ragna.deploy import Config


class DummyDocument(core.Document):
    # Takes `cls`, so it has to be a classmethod
    @classmethod
    async def get_upload_info(
        cls, *, config: Config, user: str, id: uuid.UUID, name: str
    ) -> tuple[dict[str, Any], DocumentUploadParameters]:
        raise NotImplementedError

    @classmethod
    def new(cls, name: str, **metadata) -> DummyDocument:
        return cls(
            name=name,
            metadata=metadata,
            handler=core.PlainTextDocumentHandler(),
        )

    def is_readable(self) -> bool:
        return True

    def read(self) -> bytes:
        return b""


class DummyAssistant(core.Assistant):
    # Echoes the names of the sources it received, one per line
    def answer(self, prompt: str, sources: list[Source]) -> Iterator[str]:
        yield "\n".join(f"- {source.document_name}" for source in sources)


async def main(metadatas, metadata_filters):
    documents = [
        DummyDocument.new(f"document{idx}.txt", idx=idx, **metadata)
        for idx, metadata in enumerate(metadatas)
    ]
    for document in documents:
        print(f"- {document.name}: {document.metadata}")

    source_storage = source_storages.RagnaDemoSourceStorage()
    source_storage.store(documents)

    # One chat per filter; the answer lists the sources that passed the filter
    for metadata_filter in metadata_filters:
        print("-" * 80)
        print(metadata_filter)
        print()

        chat = Rag().chat(
            metadata_filter, source_storage=source_storage, assistant=DummyAssistant
        )
        answer = await chat.answer("?")
        print(answer)


metadatas = [
    {
        "priority": "low",
        "department": "legal",
    },
    {
        "priority": "medium",
        "department": "legal",
    },
    {
        "priority": "low",
        "department": "marketing",
    },
    {
        "priority": "high",
        "department": "marketing",
    },
    {
        "priority": "medium",
        "department": "marketing",
    },
]

metadata_filters = [
    core.MetadataFilter.eq("document_name", "document2.txt"),
    core.MetadataFilter.ge("idx", 3),
    core.MetadataFilter.eq("department", "legal"),
    core.MetadataFilter.in_("priority", ["medium", "high"]),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.eq("priority", "low"),
            core.MetadataFilter.eq("department", "legal"),
        ]
    ),
    core.MetadataFilter.and_(
        [
            core.MetadataFilter.or_(
                [
                    core.MetadataFilter.eq("priority", "medium"),
                    core.MetadataFilter.eq("department", "marketing"),
                ]
            ),
            core.MetadataFilter.lt("idx", 4),
        ]
    ),
]

import asyncio

asyncio.run(main(metadatas, metadata_filters))
[...]
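For anyone who wants to sanity-check which documents each filter in the demo should admit without installing anything, here is a rough standalone sketch. It does not use ragna at all: the tuple encoding of the filters, the `matches` and `select` helpers, and the rule that a missing key never matches are all assumptions made for illustration, not the behavior of `MetadataFilter` or of any source storage.

```python
from __future__ import annotations

import operator
from typing import Any

# Hypothetical tuple encoding, NOT ragna's MetadataFilter:
#   ("and" | "or", [child, ...])   or   (op, key, value)
COMPARATORS = {
    "eq": operator.eq,
    "ge": operator.ge,
    "lt": operator.lt,
    "in": lambda value, options: value in options,
}


def matches(filter_: tuple, metadata: dict[str, Any]) -> bool:
    """Recursively evaluate a nested filter against one metadata dict."""
    op = filter_[0]
    if op == "and":
        return all(matches(child, metadata) for child in filter_[1])
    if op == "or":
        return any(matches(child, metadata) for child in filter_[1])
    _, key, value = filter_
    # Assumption: a document that lacks the key never matches.
    return key in metadata and COMPARATORS[op](metadata[key], value)


# Same metadata as in the demo script, with the name and idx folded in.
metadatas = [
    {"priority": "low", "department": "legal"},
    {"priority": "medium", "department": "legal"},
    {"priority": "low", "department": "marketing"},
    {"priority": "high", "department": "marketing"},
    {"priority": "medium", "department": "marketing"},
]
documents = [
    {"document_name": f"document{idx}.txt", "idx": idx, **metadata}
    for idx, metadata in enumerate(metadatas)
]


def select(filter_: tuple) -> list[int]:
    """Return the idx of every document the filter admits."""
    return [doc["idx"] for doc in documents if matches(filter_, doc)]


nested = (
    "and",
    [
        ("or", [("eq", "priority", "medium"), ("eq", "department", "marketing")]),
        ("lt", "idx", 4),
    ],
)
print(select(nested))  # [1, 2, 3]
```

This only says which documents pass the filter. What a real source storage returns can differ, since retrieval may also rank or truncate the candidates.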
def _parse_documents(self, documents: Iterable[Any]) -> list[Document]:
    documents_ = []
    for document in documents:
def _parse_input(
@nenb @blakerosenthal Loosely leaning on #256 (comment), input is overloaded with 3 values:
1. If it's explicitly None, i.e. no default value and actually calling Chat(input=None), we don't want any metadata filtering and thus also metadata_filter=None. When we pass this down to the source storage, filtering should be disabled and thus the whole index should be used. This is not implemented yet in this PR.
2. If it's a MetadataFilter, we use it.
3. If it's anything else, we assume it is a list of documents or paths to create documents from. For a better UX, we might also allow a single path or document, but this is not critical.
For 1. and 2. we assume the chat is already prepared and only require preparation for 3.
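To make the three cases concrete, here is a minimal sketch of the dispatch. It is not the code in this PR: `MetadataFilter` below is a bare stand-in dataclass, `ParsedInput` and `parse_input` are hypothetical names, and building an "in" filter on `document_name` for case 3 is an assumption about how the chat could restrict retrieval to the passed documents.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterable, Optional, Union


@dataclass(frozen=True)
class MetadataFilter:
    """Hypothetical stand-in for the real filter class."""

    operator: str
    key: str
    value: Any


@dataclass
class ParsedInput:
    """Hypothetical result of parsing the overloaded `input` argument."""

    metadata_filter: Optional[MetadataFilter]
    documents: list[str] = field(default_factory=list)
    prepared: bool = True


def parse_input(
    input: Union[None, MetadataFilter, str, Path, Iterable[Union[str, Path]]],
) -> ParsedInput:
    # Case 1: explicit None disables filtering, so the whole index is used.
    if input is None:
        return ParsedInput(metadata_filter=None)
    # Case 2: a ready-made filter is used as is.
    if isinstance(input, MetadataFilter):
        return ParsedInput(metadata_filter=input)
    # Case 3: anything else is treated as documents or paths.
    # A single path is wrapped into a list for convenience.
    if isinstance(input, (str, Path)):
        input = [input]
    names = [Path(item).name for item in input]
    return ParsedInput(
        # Assumption: restrict retrieval to the passed documents by name.
        metadata_filter=MetadataFilter("in", "document_name", names),
        documents=names,
        prepared=False,  # only this case still needs preparation
    )
```

Only case 3 comes back with `prepared=False`, which matches the note that 1. and 2. assume an already prepared chat.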
@pmeier @nenb What remains on the source storage work in particular? It looks like the chat object performs the logic of constructing a MetadataFilter object to pass along to source storage, and the source storage needs to handle the metadata_filter=None case and return the entire index. Is there anything else?
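As a rough illustration of the metadata_filter=None case on the storage side, here is a toy in-memory storage. Everything in it is hypothetical: `InMemorySourceStorage` is not one of the builtin source storages, and the filter is modeled as a plain predicate over a metadata dict rather than a MetadataFilter object.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

Metadata = dict[str, Any]
# Hypothetical: a filter modeled as a predicate over one metadata dict.
Predicate = Callable[[Metadata], bool]


class InMemorySourceStorage:
    """Toy storage that only keeps metadata dicts in a list."""

    def __init__(self) -> None:
        self._index: list[Metadata] = []

    def store(self, entries: list[Metadata]) -> None:
        self._index.extend(entries)

    def retrieve(self, metadata_filter: Optional[Predicate]) -> list[Metadata]:
        # No filter means filtering is disabled: return the entire index.
        if metadata_filter is None:
            return list(self._index)
        return [entry for entry in self._index if metadata_filter(entry)]


storage = InMemorySourceStorage()
storage.store(
    [
        {"idx": 0, "department": "legal"},
        {"idx": 1, "department": "legal"},
        {"idx": 2, "department": "marketing"},
    ]
)
print(len(storage.retrieve(None)))  # 3
print(storage.retrieve(lambda m: m["department"] == "marketing"))
```

A real backend would translate the filter into its own query syntax instead of scanning a list, but the None branch would presumably look much the same: skip the filter clause entirely.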
Not sure if that is implied, but we need handling of the MetadataFilter in Chroma and LanceDB. But I think these are the open tasks right now.
@pmeier @blakerosenthal I had some uncertainty about the task as well.
I've opened #460. We can discuss what needs to be done there. And @blakerosenthal, you can push the lancedb changes to the same branch.
Closes #423.