-
-
Notifications
You must be signed in to change notification settings - Fork 335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Externalized attachments #1021
Comments
Thanks for opening this @fflorent ! Going to dump in some thoughts a developer prepared about this in 2022. Grist has changed somewhat since, and also this was not a plan just one person's thoughts (I personally disagree with some of it), but the set of concerns raised could be helpful for things about this project. Externalizing attachmentsExternal storeWe need a generic interface for storing and retrieving file data that can be implemented in different ways. One obvious data store is S3-compatible stores. Theoretically, the local filesystem might also work. MigrationOnce at least one store is implemented, _gristsys_Files could be deprecated, and a special migration could move the data from there to the external store. The usual Python migration system won’t be enough on its own because the data engine doesn’t see _gristsys_Files, but maybe it could make an external call to node to deal with that. Downloading documentsWe still want to be able to download a single self-contained .grist sqlite file containing all the attachments. When this happens, we’d need to:
This would also allow using downloaded documents in older versions of Grist. Serving attachments without the DocWorkerCurrently the client uses a special DocWorker API to view and download attachments. To serve the files, the DocWorker retrieves them from _gristsys_Files. In the first iteration of work, this would be changed to retrieving them from the external store instead. But in the long term, it would be nice if the client could bypass the DocWorker and retrieve the files directly from the store. S3 would work well for this, but other types of store may not allow this. Deleting externalized attachmentsAttachments are likely to contain sensitive data, and storing them longer than necessary is a security risk. When a user deletes an attachment, it’s reasonable for them to expect it to actually be deleted eventually, just like any other data, so that it can’t be leaked. This applies whether they deleted a row, a document, or an entire organisation. We can’t actually fully delete the data immediately in the first case because deleted rows still live in the snapshot history, but we should delete them eventually. This is like the problem of tracking attachments referenced within a document, on a much larger scale. In this case actually tracking the references (or maybe just their counts) from documents to the external store seems essential. These would need to be updated whenever a document is copied or deleted within a Grist installation. We’d need to consider:
Downloading a document ‘disconnects’ it from the Grist installation so it doesn’t need to be counted. It has its own copy of the attachments so it should either delete or ignore the metadata about the externalized data. EncryptionAn alternative to tracking attachment references to allow deleting them is to encrypt the attachment data to avoid the need to delete it. Each attachment file would have a unique encryption key stored only in the corresponding row of _grist_Attachments . Once all copies of that row are fully deleted, the encryption key should be lost, and decrypting the data in the external store should become impossible. That means we don’t ever have to delete the actual data, so we don’t need to keep track of references to it. Another security benefit of encryption is that if someone gains access to the data in the external attachments store, they can’t actually read it unless they also have the referencing documents. One downside is that serving attachments directly from S3 instead of the DocWorker becomes more tricky. Decrypting and displaying a single encrypted file in the browser using SubtleCrypto and createObjectURL seems straightforward. But it’s a lot more delicate to handle a user scrolling through a grid filled with thumbnails, displaying them all efficiently and then reclaiming memory after they disappear from view. Access ControlWould need thinking about. Important to preserve the property that the existing metadata (particularly fileIdent) is not enough to download the file, so that access is properly revoked even if someone has a past copy of the metadata. It might also be nice if the download URL couldn’t be computed purely from the file content, so that someone with a local copy of a file can’t test whether it exists in the document. |
I added this to the other issue already. If you are looking for inspiration, HackMD has a pretty nice integration of Drag&Drop Image upload, which just puts files in a folder on the server using a random hash. UX-wise it is a pretty nice seamless integration and works flawlessly. |
Just to note that architecture work for this feature has started, with some initial thinking from @Spoffy here: There are trade-offs to be made since Grist will no longer be a standalone single-file format. We're trying to come up with a design that still makes moving Grist docs between installations practical. |
@Spoffy I updated the proposed change to _grist_Files as follows:
If you combine these two changes, you end up with a very harmless low-commitment schema change that seems safe to do even without all details worked out? |
Paul and I have an initial design for this now, found here. The latest design / implementation is in the doc, so I won't copy and paste it here. Next steps are prototyping this approach and run some basic tests to ensure we haven't missed anything that would cause a big problem. |
Quick update on this. The prototype is underway now, and we've got basic attachment functionality working with MinIO and Filesystem storage. |
The first PR for this is open, albeit not yet ready for full review and merging: #1320 It covers the necessary core document changes to make external attachments function, with just about enough configuration to let it be testable. The full scope of that PR is in the PR description 🙂 @fflorent @vviers @hexaltation - if any of you want to have a first look, now is a good time 🙂 |
The idea has been proposed by @paulfitz #886 (comment)
If I understand correctly, and I tend to think that's also something we would like on the ANCT's side: the idea would be to allow creating an attachment column where the file would not be injected in the document but rather in an external storage like S3.
There are several advantages, especially reducing the grist document size and thus allowing increasing the attachment size limit.
The text was updated successfully, but these errors were encountered: