Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor document registering and upload #441

Merged
merged 3 commits into from
Jul 9, 2024
Merged

Conversation

pmeier
Copy link
Member

@pmeier pmeier commented Jul 8, 2024

Follow-up to #434 (comment). Highlights:

  • Before this PR our upload method is a two step process: at first we register the documents with Ragna and return upload parameters. In a second step we just act on the return value and send the documents. This process is very flexible as it allows uploading the document to almost everywhere, e.g. an S3 bucket. However, this process is also opaque and for most use cases unnecessary complex: from what we have seen so far, users either upload to the local disk or don't upload at all (see "managed" Ragna #256) and just want to register. In a proper deployment it is also possible to just mount an external storage on the path that Ragna tries to store the documents and thus eliminating the need to upload to a different sink directly.

    Thus, this PR simplifies a lot of things. The register endpoint now only returns the document schema, which includes its ID. With that one can hit the upload endpoint, which will unconditionally upload to the local disk. This allows us to remove the Document.get_upload_info and LocalDocument.decode_upload_token methods.

  • The document register endpoint now also accepts metadata to be stored alongside the document. Previously, metadata could only be attached by the the document class itself by adding it to the upload information and decoding it again on the upload (yes this is complex and a reason why I refactored it). In addition to the metadata passed to the endpoint, the document class now also can update the metadata in its constructor for default behavior. This is what we do for the path of a LocalDocument

  • Add endpoint for batch uploading document metadata #404 added support to register multiple documents at the same time. This PR closes Complete support for batch document upload #407 by also supporting batch upload of documents. In fact, we remove the single register / upload endpoints, because they make little sense now.

  • The Document.is_readable abstract method was removed. This was only used to check before we try to read and be able to raise a good error if something is wrong. However, this check might be expensive. Plus, it is reasonable to assume that Document.read() will raise a proper error message and thus circumventing the need for a pre-check all together.

Since these are quite a few changes already, I refrained from also touching the UI as well. I have the code ready to be reviewed as soon as this PR is accepted. Note that the change to the documents currently leaves the UI defunct.

Copy link
Contributor

@nenb nenb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm less familiar with the parts that touch the new 'engine', but the rest of the PR seemed sensible to me. I left some clarification comments.

ragna/core/_document.py Outdated Show resolved Hide resolved
ragna/core/_document.py Outdated Show resolved Hide resolved

@router.put("/documents")
async def upload_documents(
user: UserDependency, documents: list[UploadFile]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Does this load all files into memory on the server?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not. The UploadFile from FastAPI actually wraps a SpooledTemporaryFile, i.e. it will only be kept in memory for small files and otherwise temporarily be stored on disk.

@pmeier pmeier merged commit 40f0b6c into deploy-dev Jul 9, 2024
3 of 9 checks passed
@pmeier pmeier deleted the document-engine branch July 9, 2024 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants