Store dependency table as parquet on backend #398

hagenw · 2024-05-08T08:34:13Z

Closes #397

In #372 we switched the format of the dependency table from CSV to PARQUET, which already uses the fast SNAPPY compression algorithm to reduce its size slightly. We still compressed the file further as a ZIP archive before uploading it to the server. The advantage was that the file was smaller, and that we could download the same ZIP file from the server regardless of whether the dependency table was stored as PARQUET or CSV.

In #397 we show that file reading and writing is much faster when not zipping the PARQUET file for storage on the backend.
Hence, this pull request removes zipping and puts the PARQUET file directly on the server.
To have a single implementation as the source of truth, it introduces the download_dependencies() and upload_dependencies() functions (not part of the public API), which are then used internally by audb.dependencies(), audb.publish(), and audb.remove_media().

ChristianGeng

Description

This stores newly loaded dependency tables in parquet format.
This should work because deps.load() and deps.save() infer what to read or write from the file extension, so it will not raise compatibility issues.
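
The extension-based dispatch described above can be sketched roughly as follows. This is only an illustration, not the audb implementation: `save_table()` and `load_table()` are hypothetical helper names, rows are plain dictionaries and assumed to be non-empty, and the parquet branch assumes `pyarrow` is installed.

```python
import csv
import os
import pickle


def save_table(rows, path):
    """Write rows (a non-empty list of dicts) in the format implied by the extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        with open(path, "w", newline="") as fp:
            writer = csv.DictWriter(fp, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    elif ext == ".pkl":
        with open(path, "wb") as fp:
            pickle.dump(rows, fp)
    elif ext == ".parquet":
        # Imported lazily so the CSV and pickle branches work without pyarrow
        import pyarrow as pa
        import pyarrow.parquet as pq

        pq.write_table(pa.Table.from_pylist(rows), path, compression="snappy")
    else:
        raise ValueError(f"Unsupported extension: {ext!r}")


def load_table(path):
    """Read rows back, again choosing the reader from the extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        with open(path, newline="") as fp:
            # Note: CSV has no type information, all values come back as strings
            return list(csv.DictReader(fp))
    if ext == ".pkl":
        with open(path, "rb") as fp:
            return pickle.load(fp)
    if ext == ".parquet":
        import pyarrow.parquet as pq

        return pq.read_table(path).to_pylist()
    raise ValueError(f"Unsupported extension: {ext!r}")
```

Because callers only pass a path, switching the stored format means changing a file name rather than the calling code, which is why no compatibility issues are expected here.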

In addition, the MR refactors an ugly loop in audb.core.api into two functions in the dependency module, download_dependencies and upload_dependencies. This is nicer to read, and download_dependencies in particular handles define.LEGACY_DEPENDENCIES_FILE, i.e. it implements backward compatibility with the CSV and pickle files used hitherto.
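
A minimal sketch of what such a helper pair might look like, with a fallback to the legacy file for older database versions. Everything here is an assumption for illustration: the `InMemoryBackend` class, the function signatures, and the file names `db.parquet`, `db.zip`, and `db.csv` are hypothetical and may differ from what audb actually uses.

```python
import io
import zipfile

# Hypothetical file names; the real values live in audb's `define` module
DEPENDENCIES_FILE = "db.parquet"
LEGACY_DEPENDENCIES_ARCHIVE = "db.zip"
LEGACY_DEPENDENCIES_FILE = "db.csv"


class InMemoryBackend:
    """Toy stand-in for a storage backend, keyed by (path, version)."""

    def __init__(self):
        self._files = {}

    def exists(self, path, version):
        return (path, version) in self._files

    def put(self, path, version, data):
        self._files[(path, version)] = data

    def get(self, path, version):
        return self._files[(path, version)]


def upload_dependencies(backend, name, version, data):
    """Store the dependency table as is, without wrapping it in a ZIP archive."""
    backend.put(f"/{name}/{DEPENDENCIES_FILE}", version, data)


def download_dependencies(backend, name, version):
    """Return (file name, content), preferring parquet over the legacy ZIP."""
    remote = f"/{name}/{DEPENDENCIES_FILE}"
    if backend.exists(remote, version):
        return DEPENDENCIES_FILE, backend.get(remote, version)
    # Older database versions: the table was uploaded as a zipped CSV file
    archive = backend.get(f"/{name}/{LEGACY_DEPENDENCIES_ARCHIVE}", version)
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return LEGACY_DEPENDENCIES_FILE, zf.read(LEGACY_DEPENDENCIES_FILE)
```

Keeping the legacy check in one place means every caller gets the same backward-compatible behavior without repeating the branching logic.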

I understand that unit testing these new functions will be quite hard, so it is fair to cover them with an integration testing approach that calls api.dependencies.

So without further ado, I would say that this can be approved.


hagenw added 2 commits May 8, 2024 10:12

Store dependency table as parquet file on backend

a9d4974

Reuse code for down/upload of deps

dc99407

hagenw marked this pull request as draft May 8, 2024 08:34

hagenw added 3 commits May 8, 2024 10:39

Improve comment

9309945

Improve docstrings

e2fa2de

Undo unrelated changes

a945527

hagenw marked this pull request as ready for review May 8, 2024 08:43

hagenw mentioned this pull request May 8, 2024

Investigate if we should skip zipping of parquet dependency table #397

Closed

hagenw requested a review from ChristianGeng May 8, 2024 08:46

ChristianGeng approved these changes May 8, 2024


hagenw merged commit 0737cf9 into dev May 8, 2024
7 checks passed

hagenw deleted the store-deps-as-parquet branch May 8, 2024 13:01

hagenw added a commit that referenced this pull request May 8, 2024

Store dependency table as parquet on backend (#398)

ce57149

* Store dependency table as parquet file on backend
* Reuse code for down/upload of deps
* Improve comment
* Improve docstrings
* Undo unrelated changes

This was referenced May 10, 2024

Dependency file error reported when trying to build the documentation locally #402

Closed

Fix comment using old dependency file format #405

Merged
