Is your feature request related to a problem? Please describe.
Zip files always retain an index (the central directory) located separately from each entry's possibly-compressed data. This allows performing high-level split/merge operations without de/recompressing file contents. In benchmarks, this outperforms serially iterating over each entry to extract it, or over each file to compress it.
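As a concrete illustration, here is a minimal sketch (not part of this proposal's API surface) of what that separate index makes possible: copying every entry from one archive into another without de/recompressing any file contents. It uses the zip crate's existing by_index_raw() and raw_copy_file() methods; the file paths are hypothetical.

```rust
use std::fs::File;
use std::io::{BufReader, BufWriter};

use zip::{result::ZipResult, ZipArchive, ZipWriter};

fn raw_copy_all() -> ZipResult<()> {
    let mut src = ZipArchive::new(BufReader::new(File::open("child.zip")?))?;
    let mut dst = ZipWriter::new(BufWriter::new(File::create("parent.zip")?));
    for i in 0..src.len() {
        // by_index_raw() yields the entry's still-compressed byte stream, so
        // raw_copy_file() is a plain byte copy plus header bookkeeping.
        let entry = src.by_index_raw(i)?;
        dst.raw_copy_file(entry)?;
    }
    dst.finish()?;
    Ok(())
}
```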
Describe the solution you'd like
It's possible to extract zip files in parallel (see #72) as well as merge them to create archives in parallel (see discussion in #73).
Describe alternatives you've considered
While parallel zip extraction as in #72 has likely been implemented elsewhere, to my knowledge the parallel split/merge technique in #73 (researched for pex-tool/pex#2175 and prototyped in https://github.com/cosmicexplorer/medusa-zip) has not been discussed or implemented in other zip tooling before (please let me know of any prior art for this!).
Additional context
TODO:
- Parallel extraction as in #72 (requires Send bounds).
- As in that pex change (pex-tool/pex#2175), bulk copy with renaming enables reconstituting a "parent" zip file from an ordered sequence of "child" zips, which may be used to very quickly reconstruct large zip files from immutable cached components.
- When renaming is not required, ZipWriter::merge_contents() already works with a single io::copy() call. Bulk copy with rename avoids de/recompression of file data, but must edit each renamed local file header and therefore requires O(n) io::copy() calls (see the first sketch after this list).
- This zip crate should probably not get into the weeds of crawling the filesystem, which keeps medusa-zip useful as a separate crate and ensures we don't add too much extraneous code to this one.
- However, the process of merging an ordered sequence of "child" zips with ZipWriter::merge_contents() can be parallelized, and this is something the zip crate should be able to do (see the parallel sketch below).
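To make the O(n) io::copy() point above concrete, here is a hedged sketch of the two bulk-copy paths. merge_contents() is the name used in this issue for the whole-archive splice (released versions of the crate expose this operation as ZipWriter::merge_archive()), and the rename path uses the existing raw_copy_file_rename(); the file names and the vendored/ prefix are hypothetical.

```rust
use std::fs::File;
use std::io::BufReader;

use zip::{result::ZipResult, ZipArchive, ZipWriter};

fn merge_children(mut parent: ZipWriter<File>) -> ZipResult<()> {
    // No renaming: the child's entry data and central directory records are
    // spliced into the parent with one bulk copy.
    let child = ZipArchive::new(BufReader::new(File::open("child-a.zip")?))?;
    parent.merge_contents(child)?;

    // With renaming: each entry's local file header has to be rewritten to
    // hold its new name, so this path needs O(n) io::copy() calls instead.
    let mut renamed = ZipArchive::new(BufReader::new(File::open("child-b.zip")?))?;
    for i in 0..renamed.len() {
        let entry = renamed.by_index_raw(i)?;
        let new_name = format!("vendored/{}", entry.name());
        parent.raw_copy_file_rename(entry, new_name)?;
    }
    parent.finish()?;
    Ok(())
}
```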
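And a rough sketch of the parallelism the last TODO item asks for, as a caller could approximate it today with rayon (this is not an existing zip-crate API): the per-child work runs concurrently, while the ordered splice into the parent stays serial. The child paths are hypothetical, and merge_contents() is again the name used in this issue.

```rust
use rayon::prelude::*;
use std::fs::File;
use std::io::BufReader;

use zip::{result::ZipResult, ZipArchive, ZipWriter};

fn parallel_merge(child_paths: &[&str], mut parent: ZipWriter<File>) -> ZipResult<()> {
    // Open (or, as in medusa-zip, build) every child concurrently; for
    // freshly-built children this is where the compression work happens.
    let children: Vec<ZipArchive<BufReader<File>>> = child_paths
        .par_iter()
        .map(|path| ZipArchive::new(BufReader::new(File::open(path)?)))
        .collect::<ZipResult<_>>()?;

    // The parent must receive the children in their original order, so the
    // final splice stays serial; it is only bulk io::copy() work, though.
    for child in children {
        parent.merge_contents(child)?;
    }
    parent.finish()?;
    Ok(())
}
```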