Store dependency table as parquet on backend #398

Merged: 5 commits, May 8, 2024

@hagenw hagenw commented May 8, 2024

Closes #397

In #372 we switched the format of the dependency table from CSV to PARQUET, which already uses the fast SNAPPY compression algorithm to reduce its size slightly. We nevertheless compressed the file further as a ZIP archive before uploading it to the server. The advantage was that the file was smaller, and that we could download the same ZIP file from the server regardless of whether the dependency table was stored as PARQUET or CSV.

In #397 we show that reading and writing the file is much faster when the PARQUET file is not zipped for storage on the backend.
Hence, this pull request removes the zipping and puts the PARQUET file directly on the server.
To have a single implementation as the source of truth, it introduces the download_dependencies() and upload_dependencies() functions (not part of the public API), which are then used internally by audb.dependencies(), audb.publish(), and audb.remove_media().

@hagenw hagenw marked this pull request as draft May 8, 2024 08:34
@hagenw hagenw marked this pull request as ready for review May 8, 2024 08:43
@hagenw hagenw requested a review from ChristianGeng May 8, 2024 08:46
@ChristianGeng ChristianGeng left a comment

Description

This stores newly loaded dependencies in parquet format.
This should work, as deps.load() and deps.save() infer what to read or write from the file extension, so it will not raise compatibility issues.
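To illustrate the extension-based dispatch mentioned above, here is a minimal sketch. It is not audb's actual implementation: the `load()` and `save()` functions, the `HANDLERS` table, and the list-of-dicts table representation are all hypothetical, and the parquet branch assumes pyarrow is installed.

```python
import csv
import os
import pickle


def _load_csv(path):
    # csv returns every value as a string
    with open(path, newline="") as fp:
        return list(csv.DictReader(fp))


def _save_csv(rows, path):
    # Assumes a non-empty list of dicts sharing the same keys
    with open(path, "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def _load_pickle(path):
    with open(path, "rb") as fp:
        return pickle.load(fp)


def _save_pickle(rows, path):
    with open(path, "wb") as fp:
        pickle.dump(rows, fp)


def _load_parquet(path):
    import pyarrow.parquet as pq  # optional dependency, imported lazily

    return pq.read_table(path).to_pylist()


def _save_parquet(rows, path):
    import pyarrow as pa  # optional dependency, imported lazily
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)


# Hypothetical mapping: file extension -> (loader, saver)
HANDLERS = {
    ".csv": (_load_csv, _save_csv),
    ".pkl": (_load_pickle, _save_pickle),
    ".parquet": (_load_parquet, _save_parquet),
}


def _handlers_for(path):
    ext = os.path.splitext(path)[1].lower()
    if ext not in HANDLERS:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return HANDLERS[ext]


def load(path):
    """Read a table, choosing the format from the file extension."""
    return _handlers_for(path)[0](path)


def save(rows, path):
    """Write a table, choosing the format from the file extension."""
    _handlers_for(path)[1](rows, path)
```

With a dispatch of this kind, callers only change the file name to switch formats, which is why storing a different format on the backend does not have to affect the code that reads and writes the table.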

In addition, the MR refactors an ugly loop in audb.core.api in favor of two functions in the dependency module, download_dependencies and upload_dependencies. This is nicer to read, and in particular download_dependencies contains the handling of define.LEGACY_DEPENDENCIES_FILE, i.e. it implements backward compatibility with the CSV and pickle files used so far.
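The fallback logic described above could look roughly like the following sketch. It is a simplified illustration, not the code from this pull request: the `InMemoryBackend` class, the function signatures, and the file names `db.parquet`, `db.zip`, and `db.csv` are assumptions made for the example.

```python
import io
import zipfile

# File names assumed for illustration only
DEPENDENCIES_FILE = "db.parquet"
LEGACY_ARCHIVE = "db.zip"
LEGACY_DEPENDENCIES_FILE = "db.csv"


class InMemoryBackend:
    """Hypothetical stand-in for a storage backend."""

    def __init__(self):
        self._files = {}

    def exists(self, path, version):
        return (path, version) in self._files

    def put(self, path, version, data):
        self._files[(path, version)] = data

    def get(self, path, version):
        return self._files[(path, version)]


def upload_dependencies(backend, name, version, data):
    """Store the parquet bytes as they are, without a ZIP wrapper."""
    backend.put(f"{name}/{DEPENDENCIES_FILE}", version, data)


def download_dependencies(backend, name, version):
    """Return (file_name, content), preferring parquet over the legacy ZIP."""
    remote_file = f"{name}/{DEPENDENCIES_FILE}"
    if backend.exists(remote_file, version):
        # New layout: the parquet file is stored directly
        return DEPENDENCIES_FILE, backend.get(remote_file, version)
    # Legacy layout: the dependency table is inside a ZIP archive
    archive = backend.get(f"{name}/{LEGACY_ARCHIVE}", version)
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return LEGACY_DEPENDENCIES_FILE, zf.read(LEGACY_DEPENDENCIES_FILE)
```

Keeping both branches in one function means every caller gets the backward-compatible behavior without repeating the check for the legacy file.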

I understand that unit testing these new functions will be quite hard, so it is fair to cover them through integration tests that call api.dependencies.

So without further ado, I would say that this can be approved.

@hagenw hagenw merged commit 0737cf9 into dev May 8, 2024
7 checks passed
@hagenw hagenw deleted the store-deps-as-parquet branch May 8, 2024 13:01
hagenw added a commit that referenced this pull request May 8, 2024
* Store dependency table as parquet file on backend

* Reuse code for down/upload of deps

* Improve comment

* Improve docstrings

* Undo unrelated changes