
parallel/pipelined extraction #208

Closed
wants to merge 3 commits

Conversation

@cosmicexplorer (Contributor) commented Jul 6, 2024

Problem

ZipArchive::extract() corresponds to the way most zip implementations perform the task, but it's single-threaded. This is appropriate under the assumptions imposed by Rust's Read and Seek traits, where mutable access is necessary and only one reader can extract file contents at a time. However, most Unix-like operating systems offer a pread() operation, which avoids mutating OS state such as the file offset, so multiple threads can read from the same file handle at once. The Go programming language offers io.ReaderAt in its standard library to codify this ability.
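
Rust's standard library already exposes this primitive on Unix: std::os::unix::fs::FileExt::read_at() wraps pread(2). A minimal sketch of the idea, with illustrative offsets rather than real zip entry locations:

use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_at() wraps pread(2)
use std::sync::Arc;
use std::thread;

fn main() -> io::Result<()> {
    // A single shared handle; no &mut needed, so no Seek contention.
    let file = Arc::new(File::open("archive.zip")?);
    let mut handles = Vec::new();
    // Each thread reads its own region concurrently via pread().
    for i in 0..4u64 {
        let file = Arc::clone(&file);
        handles.push(thread::spawn(move || -> io::Result<usize> {
            let mut buf = vec![0u8; 4096];
            // Offsets are illustrative; real code would use entry offsets.
            file.read_at(&mut buf, i * 4096)
        }));
    }
    for h in handles {
        println!("read {} bytes", h.join().unwrap()?);
    }
    Ok(())
}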

Solution

This is a rework of #72 which avoids introducing unnecessary thread pools and creates all output file handles and containing directories up front. For large zips, we want to:

  • create output handles and containing directories up front,
  • split the input file handle into chunks to process the constituent file entries in parallel (see the sketch below),
  • for large compressed entries, pipe their content into a dedicated stream, so that i/o and decompression do not get intermixed and do not block quick small entries later in the file.

src/read/split.rs was created to cover pread() and related operations, while src/read/pipelining.rs performs the high-level logic of splitting up entries and running the pipelined extraction.
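
As a rough illustration of the chunking step (hypothetical EntrySpan type and helper, not the PR's actual code in src/read/pipelining.rs), entries already in file order can be partitioned greedily into contiguous runs of approximately equal total compressed size, one per worker:

/// A (start offset, compressed length) pair for one zip entry,
/// in file order. Hypothetical, for illustration only.
type EntrySpan = (u64, u64);

/// Greedily split `entries` (already in file order) into at most
/// `num_chunks` contiguous runs of roughly equal total byte size.
fn split_into_chunks(entries: &[EntrySpan], num_chunks: usize) -> Vec<&[EntrySpan]> {
    let total: u64 = entries.iter().map(|&(_, len)| len).sum();
    let target = total / num_chunks.max(1) as u64;
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut acc = 0u64;
    for (i, &(_, len)) in entries.iter().enumerate() {
        acc += len;
        if acc >= target && chunks.len() + 1 < num_chunks {
            chunks.push(&entries[start..=i]);
            start = i + 1;
            acc = 0;
        }
    }
    if start < entries.len() {
        chunks.push(&entries[start..]);
    }
    chunks
}

Contiguity matters here: each worker then owns one sequential region of the underlying file, which keeps per-thread pread() access patterns mostly sequential.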

Result

  • The parallelism feature was added to the crate to gate the newly added code + API.
  • A dependency on the libc crate was added for #[cfg(all(unix, feature = "parallelism"))] in order to make use of OS-specific functionality.
  • zip::read::split_extract() was added as a new external API to extract &ZipArchive<fs::File> when #[cfg(all(unix, feature = "parallelism"))].

Note that this does not yet handle symlinks; I plan to add that in a followup PR.
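
To make the shape of the new API concrete, here is a hedged usage sketch. The exact signature of split_extract(), the import paths, and a Default impl for ExtractionParameters are assumptions based on the description above, not confirmed API:

#[cfg(all(unix, feature = "parallelism"))]
fn extract_in_parallel() -> Result<(), Box<dyn std::error::Error>> {
    use std::fs;
    use zip::read::{split_extract, ExtractionParameters};

    // split_extract() takes a shared reference, so the archive can be
    // reused afterwards; only fs::File-backed archives are supported.
    let archive = zip::ZipArchive::new(fs::File::open("large.zip")?)?;
    let params = ExtractionParameters {
        decompression_threads: 12, // matches the benchmark configuration
        ..Default::default()       // assumed: Default provides the documented defaults
    };
    split_extract(&archive, std::path::Path::new("out/"), &params)?;
    Ok(())
}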

CURRENT BENCHMARK STATUS

On a Linux host (with splice() and optionally copy_file_range()), we get about a 6.5x speedup with 12 decompression threads:

> cargo bench --features parallelism -- extract
running 2 tests
test extract_basic           ... bench: 104,389,978 ns/iter (+/- 5,715,453) = 85 MB/s
test extract_split           ... bench:  16,274,974 ns/iter (+/- 1,530,257) = 546 MB/s

Performance should keep increasing as we increase the thread count, up to the number of available CPU cores (this run used a parallelism of 12 on my 16-core laptop). This also works on macOS, the BSDs, and other #[cfg(unix)] platforms.

@cosmicexplorer force-pushed the pipelining branch 12 times, most recently from 5c99c65 to 94e6940, on July 11, 2024
@cosmicexplorer force-pushed the pipelining branch 2 times, most recently from 6e730d0 to 2beba90, on July 12, 2024
@cosmicexplorer changed the title from "pipelined extraction" to "parallel/pipelined extraction" on Jul 14, 2024
@cosmicexplorer added the "enhancement" (New feature or request) and "rust" (Pull requests that update Rust code) labels on Jul 14, 2024
@cosmicexplorer (Contributor, author) commented:

@Pr0methean this is obviously quite a large change, but it's completely separate from any existing APIs and should hopefully remain that way. I was able to avoid adding any new dependencies except for libc to make use of pread() and pipe(), and the external API changes are all gated behind #[cfg(feature = "parallelism")]. Please let me know if there are further types of benchmarks or test cases you would like me to add before taking a further look at it, or if you'd prefer to put it under crate::read::unstable for now to encourage experimentation. I'm quite enthusiastic that this seems to work for any #[cfg(unix)] platform and can be applied to a &ZipArchive<fs::File> instead of requiring users to go through any really complex new API.

@Pr0methean (Member) commented:

I'll take a look at this once we're caught up on the smaller PRs.

- initial sketch of lexicographic trie for pipelining
- move path splitting into a submodule
- lex trie can now propagate entry data
- outline handle allocation
- mostly handle files
- mostly handle dirs
- clarify symlink FIXMEs
- do symlink validation
- extract writable dir setting to helper method
- modify args to handle allocation method
- handle allocation test passes
- simplify perms a lot
- outline evaluation
- handle symlinks
- BIGGER CHANGE! add EntryReader/etc
- make initial pipelined extract work
- fix file perms by writing them after finishing the file write
- support directory entries by unix mode as well
- impl split extraction
- remove dependency on reader refactoring
@cosmicexplorer (Contributor, author) commented:

cc @matthewgapp: let me know if this addresses your needs for parallel extraction, and please feel free to ping others who might be interested in this as well. I know @sluongng was interested in functionality to modify entry names, e.g. stripping a prefix before extraction; I'm hoping to add functionality like that, plus symlink support, in followup changes, but would like to hear perf feedback and any further benchmarks I should add. Please feel free to leave comments here or on #193 for further work.

@sluongng commented:
Without much context on this PR, here is the context of what I am trying to build:

In some advanced build tools such as Bazel, Buck2, etc., there is a common pattern: fetch a source archive from GitHub, unpack the archive, and use the source inside to build things. Example archives could be declared like this:

http_archive(
    name = "platforms",
    sha256 = "5308fc1d8865406a49427ba24a9ab53087f17f5266a7aabbfc28823f3916e1ca",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/platforms/releases/download/0.0.6/platforms-0.0.6.tar.gz",
        "https://github.com/bazelbuild/platforms/releases/download/0.0.6/platforms-0.0.6.tar.gz",
    ],
)

http_archive(
    name = "com_google_absl",
    sha256 = "987ce98f02eefbaf930d6e38ab16aa05737234d7afbab2d5c4ea7adbe50c28ed",
    strip_prefix = "abseil-cpp-20230802.1",
    urls = ["https://github.com/abseil/abseil-cpp/archive/refs/tags/20230802.1.tar.gz"],
)

Note that the archive could be in multiple different formats: zip, tar.gz, tar.zst, etc...

The build tool is then supposed to fetch the archive from one of the URLs, verify the archive's checksum, and extract the archive with the prefix trimmed. Traditionally this is done with the host machine's tar command.

I am trying to replicate this feature in https://github.com/sluongng/downloader/ with a few additional improvements:

  • Check if upstream URLs support multi-part download and fetch multiple parts in parallel.
  • Hedge the download request between multiple URLs.
  • Parallel archive extraction with prefix trimming.
  • Do not depend on the host's tar command.

I have been playing with the zip2 API for the last few days trying to get prefix trimming to work. I hope that this PR, or part of it, can help me solve my problem more easily 🤗

@cosmicexplorer (Contributor, author) commented:

@sluongng thanks so much! I asked for your input here precisely because it might be useful in determining the most appropriate way to introduce functionality like prefix trimming. I have introduced a struct ExtractionParameters for the newly added split_extract() method:

zip2/src/read/pipelining.rs, lines 700 to 738 at bf43e53:

/// Parameters to control the degree of parallelism used for extraction.
#[derive(Debug, Clone)]
pub struct ExtractionParameters {
    /// Number of threads used for decompression.
    ///
    /// Default value: 4.
    ///
    /// Note that multiple times this many threads will be spawned by [`split_extract()`] as
    /// part of the pipelined process. Only this many threads will be used to perform
    /// decompression in rust code, but other threads will be used to wait on I/O from
    /// the kernel.
    pub decompression_threads: usize,
    /// Size of buffer used to copy a decompressed entry into the corresponding output pipe.
    ///
    /// Default value: 1MB.
    pub decompression_copy_buffer_length: usize,
    /// Size of buffer used to copy stored entries into the output file.
    ///
    /// Used on non-Linux platforms without
    /// [`copy_file_range()`](https://www.gnu.org/software/libc/manual/html_node/Copying-File-Data.html),
    /// as well as on Linux when the input and output file handles are on separate devices.
    ///
    /// Default value: 1MB.
    pub file_range_copy_buffer_length: usize,
    /// Size of buffer used to splice contents from a pipe into an output file handle.
    ///
    /// Used on non-Linux platforms without [`splice()`](https://en.wikipedia.org/wiki/Splice_(system_call)).
    ///
    /// Default value: 1MB.
    #[cfg(not(target_os = "linux"))]
    pub splice_read_buffer_length: usize,
    /// Size of buffer used to splice contents from an input file handle into a pipe.
    ///
    /// Used on non-Linux platforms without [`splice()`](https://en.wikipedia.org/wiki/Splice_(system_call)).
    ///
    /// Default value: 1MB.
    #[cfg(not(target_os = "linux"))]
    pub splice_write_buffer_length: usize,
}

It's used for perf knobs right now, but could probably be extended to support other configuration, still without messing with the more stable API of the existing zip crate.

Regarding tar archives: I have been tossing around with @jonjohnsonjr the idea of a higher-level library for manipulating zip and tar archives, using some of the indexing tools he's been developing for tars/tgzs (https://github.com/jonjohnsonjr/targz), which make them seekable/splittable the way we can do with zip files in this PR. I suspect @Pr0methean is interested in keeping the zip crate focused on zip files, so in that case what might make sense is for me to take everything @Pr0methean isn't interested in maintaining and pull it out into that separate library for your purposes.

While I was able to avoid introducing too many extra dependencies (zero except for libc) to achieve split extraction in this PR, I am also totally ready for @Pr0methean to say this is too much new code to feel comfortable adding. This is especially reasonable since one of his goals for this crate is safety and support for more features of the zip format, whereas I'm more interested in finding a common subset of zips and tars which allows us to do high-performance splitting and merging (see #193). If so, I would then work to figure out the minimal set of APIs we would want to expose from the zip crate to enable this sort of parallelized work in a separate crate. I created medusa-zip for this purpose previously, when the zip crate was unmaintained, and while I've been able to make a lot of changes to the zip crate itself already, this might be the point where we find it's time to break off.

@Pr0methean is a fantastic maintainer and is super focused on making this crate very robust, so I'm saying this out loud so we can figure out where to split responsibility for maintenance so nobody gets overwhelmed.

@cosmicexplorer force-pushed the pipelining branch 2 times, most recently from 9067c37 to 0081371, on July 17, 2024
@cosmicexplorer (Contributor, author) commented:

Requirements

I've been thinking about how to extract this kind of splitting (and all the libc stuff with pipe() and pread()) into a separate crate, and about which parts we'd need to expose.

In general, we would also need to expose methods of ZipFileData without actually exposing the ZipFileData struct itself, so that we could perform the work needed for lexicographic_entry_trie(): https://github.com/zip-rs/zip2/blob/9067c37078922578a8be395310feadab4045afe7/src/read/pipelining.rs#L126-L196

Implementation

I think this could all be satisfied with the following:

  1. Expose crate::spec::is_dir().
  2. Expose a method ZipArchive::get_reader(&self) -> &R which provides a const reference to the internal Read handle.
  3. Expose a new method ZipArchive::iter_data_entries(), which:
    • returns an iterator of opaque ZipFileData handles (using the semi-hidden trait ArchiveEntry from #207, "refactor readers to use type parameters and not concrete vtables", to avoid leaking ZipFileData),
    • populates ZipFileData::data_start() for each entry as it's handed out by the iterator (or exposes an interface to let the user calculate this; this would be preferable as it would avoid taking an exclusive lock on the ZipArchive file handle).
  4. Expose zip::unstable::read::construct_decompressing_reader() from #207 ("refactor readers to use type parameters and not concrete vtables"), so that an external crate can decompress the contents of a zip entry, e.g. via a pipe, as is done in this PR:
    pub(crate) fn construct_decompressing_reader<R>(
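
To make that concrete, a hypothetical sketch of the exposed surface might look as follows; all names and signatures here are illustrative, not the actual API of #207 or this PR:

use std::io::{self, Read};

/// Hypothetical opaque entry handle standing in for ZipFileData.
pub trait ArchiveEntry {
    fn name(&self) -> &str;
    fn is_dir(&self) -> bool;
    /// Offset of the entry's compressed data within the archive.
    fn data_start(&self) -> u64;
    fn compressed_size(&self) -> u64;
}

/// Hypothetical additions to ZipArchive (sketch only).
pub trait SplitExtractSupport<R> {
    /// Const access to the underlying reader, e.g. for pread()-style reads.
    fn get_reader(&self) -> &R;
    /// Iterate opaque entry handles with data_start populated.
    fn iter_data_entries(&self) -> Box<dyn Iterator<Item = Box<dyn ArchiveEntry>> + '_>;
}

/// Hypothetical wrapper: wrap a raw compressed stream in a decompressor.
pub fn construct_decompressing_reader<R: Read + 'static>(
    _compression_method: u16,
    _raw: R,
) -> io::Result<Box<dyn Read>> {
    // Sketch only: real code would dispatch on the crate's
    // CompressionMethod enum and its feature-gated decoders.
    unimplemented!("illustrative only")
}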

Result

  • We wouldn't need to modify any of the existing ZipArchive API.
  • We would be able to demonstrate the performance improvements for split extraction in a separate crate, without introducing unsafe code and libc dependencies to the base zip crate.
  • We would be able to kick the tires on the unstable API from #207 and get user feedback before making the breaking change from ZipFile => ZipEntry.

@Pr0methean I have no problem putting in the work to make this change acceptable for the zip crate itself, but after the brainstorming above, I think it's likely to be more effective if we instead expose just enough functionality to implement this from a separate crate, especially given this PR's reliance on platform-specific unsafe syscalls from libc. If you agree on those broad strokes, I will instead begin to rework this change to expose just the functionality needed to implement split extraction from a separate crate. Does that sound like a good idea?

@Pr0methean (Member) left a review comment:

Some of the assumptions this PR makes about filesystems are incorrect. (It's possible that one or two of these bugs are pre-existing, but that's no excuse not to fix them.) This is just a partial review.

Comment on lines +1086 to +1088
/* TODO: do we want to automatically make the directory writable? Wouldn't we prefer to
* respect the write permissions of the extraction dir? Pipelined extraction does not
* mutate permissions like this. */
@Pr0methean (Member):

Suggested change: delete this TODO comment.

We need to make the directory temporarily writable, in case it contains files that we need to extract. https://github.com/zip-rs/zip2/blob/master/tests/repro_old423.rs would break otherwise, because it contains a non-empty and non-writable folder.
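
For reference, a minimal sketch of that "temporarily writable" approach on Unix, with a hypothetical helper name (not code from this PR):

use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Make `dir` writable, run `f`, then restore the original mode.
/// Hypothetical helper illustrating the review suggestion.
fn with_writable_dir<T>(dir: &Path, f: impl FnOnce() -> io::Result<T>) -> io::Result<T> {
    let orig = fs::metadata(dir)?.permissions();
    let mut writable = orig.clone();
    // Add owner-write (0o200) on top of the existing mode bits.
    writable.set_mode(orig.mode() | 0o200);
    fs::set_permissions(dir, writable)?;
    let result = f();
    // Restore the original permissions even if `f` failed.
    fs::set_permissions(dir, orig)?;
    result
}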

@cosmicexplorer (Contributor, author):

Ok, that makes sense. I was being lazy here 😅 the permissions mechanism perms_todo totally works to solve this (patterned after the existing .extract() code), but I was hoping to avoid handling perms like that. However, since this is explicitly only supporting #[cfg(unix)] targets for now (I'm not sure how to achieve something like pread() on windows), it might not be as hard as I thought. This should be easy to integrate, thanks.

@cosmicexplorer (Contributor, author):

Ok, so I just added a pipelined version of that test, and it seems to work (4bbc351). In both the current .extract() method and pipelined extraction, we only apply perms after all the files and directories are created and written. I'm leaning towards not trying to circumvent the permissions of existing directories on disk, since I think it would be very surprising for a non-writable directory to become writable just because we (e.g. accidentally) extracted a zip file into it.

@Pr0methean (Member):

Makes sense; I agree that only new directories should be temporarily writable.

}
if ret.is_empty() {
    return Err(PathSplitError::PathFormat(format!(
        "path {:?} resolves to the top-level directory",
@Pr0methean (Member) commented Jul 19, 2024:

This is valid if the destination is the root directory -- we just shouldn't update properties in that case.

@cosmicexplorer (Contributor, author):

Ah ok, that makes sense. I'll also need to update this code anyway to address making directories temporarily writable as you mentioned in the other comment.

@cosmicexplorer (Contributor, author):

Hmm, actually, clarification: when you say "destination", you're referring to the extraction dir? It's probably not clear, but this method normalize_parent_dirs() accepts the entire path string (including final filename component), and uses the is_dir boolean tuple argument to indicate whether to split off the final component as a filename. So if we have ret.is_empty(), what that means is that the entire path resolves to the equivalent of ./. Are you saying that ./ directory entries are valid, but we should just ignore them instead of erroring here? That would make sense, just want to make sure.

@Pr0methean (Member):

Yes, that's right.

"." => (),
/* If ".." is present, pop off the last element or return an error. */
".." => {
if ret.pop().is_none() {
@Pr0methean (Member) commented Jul 19, 2024:

This is valid if the destination is the root directory, because /.. is the same as / -- although again, we shouldn't update properties.

@cosmicexplorer (Contributor, author):

Ah ok, so this is literally handling the case where instead of a path relative to the extraction dir, the entry is supposed to expand into an absolute path? From the current .extract() code where we call .enclosed_name(), it looks like we currently don't support absolute extraction paths? I think my intent here was to mimic that logic, but are you saying instead that we should convert absolute entry names beginning with / as relative to the extraction dir?

This code is purely processing entry names right now (not yet conjoined to the extraction dir), so I was under the impression they should all be relative paths, and that (like .enclosed_name()), if they use too many ..s, that we should error out. I'm not sure how to square that with "if the destination is the root directory". This is probably pretty simple but I would appreciate further clarification here (thanks!).

@Pr0methean (Member):

Well, the POSIX standard states that /.. and / are the same directory. Unless the ZIP APPNOTE says otherwise, I want to follow that standard.
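
A minimal sketch of the semantics under discussion, with a hypothetical function name (not this PR's normalize_parent_dirs()): "." is a no-op, ".." pops a component, and, per the POSIX rule that /.. and / are the same directory, a ".." at the top level is ignored rather than rejected:

/// Normalize an entry path's components, treating excess ".." as staying
/// at the top level (per POSIX, "/.." == "/"). Hypothetical sketch.
fn normalize_components(path: &str) -> Vec<&str> {
    let mut ret: Vec<&str> = Vec::new();
    for component in path.split('/').filter(|c| !c.is_empty()) {
        match component {
            "." => (),              // no-op
            ".." => { ret.pop(); }  // saturates at the top level
            name => ret.push(name),
        }
    }
    // An empty result means the path resolved to "./": a valid entry,
    // but one for which no properties should be updated.
    ret
}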

#[derive(PartialEq, Eq, Debug, Clone)]
pub(crate) enum FSEntry<'a, Data> {
    Dir(DirEntry<'a, Data>),
    File(Data),
@Pr0methean (Member):

ZIP files can also contain symlinks.

@cosmicexplorer (Contributor, author):

I wanted to avoid that at first (because it would require reading the symlink targets from the ZipArchive), but I think I'm ready to implement that pass now.

@Pr0methean added the "Please fix failing tests" (Tests are failing with this change; please fix them.) and "Please address review comments" (Some review comments are still open.) labels, and removed the "review in progress" label, on Jul 19, 2024
@cosmicexplorer mentioned this pull request on Aug 12, 2024
@Pr0methean (Member) commented Aug 17, 2024:

[Screenshot taken 2024-08-16 at 20:06 showing GitHub's failed rebase dialog]

Looks like we've managed to break GitHub's rebase logic. (I've already tried clicking the "Try again" button several times.)

@cosmicexplorer mentioned this pull request on Aug 17, 2024
@cosmicexplorer (Contributor, author) commented:

I think that might have been because I started this from my fork of zip-old (https://github.com/cosmicexplorer/zip), instead of zip2 (https://github.com/cosmicexplorer/zip2) (or at least, that caused an error in a different PR, and I'm pretty sure it's true for this one, since I don't think I've touched it since I switched my origin remote to my zip2 fork).

I'm also going to focus on this PR now (I've just converted my other two open PRs to drafts), so I'm going to try opening a new version of it against my zip2 fork. I will not miss any of your comments; I will open new inline review comments wherever I haven't been able to unambiguously resolve them.

@cosmicexplorer (Contributor, author) commented:

This is now available at #236! I will ping when this is again ready for review (hoping to make progress on it today!).
