parallel/pipelined extraction #208
Conversation
Force-pushed from f988e77 to c2c1f2b.
src/unstable/read.rs (outdated):

    let mut data = ZipFileData::from_local_block(block, reader)?;

    match parse_extra_field(&mut data) {
        /* FIXME: check for the right error type here instead of accepting any old i/o

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/unstable/read.rs (outdated):

    // We can't use the typical ::parse() method, as we follow separate code paths depending
    // on the "magic" value (since the magic value will be from the central directory header
    // if we've finished iterating over all the actual files).
    /* TODO: smallvec? */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/unstable/read.rs (outdated):

    }
    if let Some(info) = data.aes_mode {
        #[cfg(not(feature = "aes-crypto"))]
        /* TODO: make this into its own EntryReadError error type! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
    #[allow(dead_code)]
    pub(crate) fn is_symlink(&self) -> bool {
        self.unix_mode()
        /* TODO: could this just be != 0? */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
Force-pushed from 5c99c65 to 94e6940.
src/read/split.rs (outdated):

    use std::ops;

    pub trait FixedFile {
        /* FIXME: use a type alias instead of raw `u64`? */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/read/split.rs (outdated):

    }

    pub trait InputFile: FixedFile {
        /* FIXME: this should be MaybeUninit! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/read/pipelining.rs (outdated):

    }

    let block = {
        /* FIXME: MaybeUninit! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/read/split.rs (outdated):

        &mut *(slice as *mut [MaybeUninit<T>] as *mut [T])
    }

    /* TODO: replace with MaybeUninit::copy_from_slice() when stabilized! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/read/split.rs (outdated):

        fn pread(&self, start: u64, buf: &mut [MaybeUninit<u8>]) -> io::Result<usize>;
    }

    /* TODO: replace with MaybeUninit::slice_assume_init_mut() when stabilized! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/read/pipelining.rs (outdated):

        perms_todo,
    } = transform_entries_to_allocated_handles(top_level_extraction_dir, trie)?;

    /* TODO: Split up the entries into approximately equal size sequential chunks according to

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/read/pipelining.rs (outdated):

        }
    }

    /* FIXME: remove this to avoid the dependency on crate::unstable::read! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
Force-pushed from 28de81b to df0f6ff.
src/read/pipelining.rs (outdated):

    /* Create test archive. */
    let mut zip = ZipWriter::new(tempfile::tempfile().unwrap());
    /* FIXME: add a compressed file to test pipelined decompression. */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
Force-pushed from e619a05 to 6bea7ad.
src/read/pipelining.rs (outdated):

    use super::*;

    /* FIXME: add a compressed file to test pipelined decompression. */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
Force-pushed from 6e730d0 to 2beba90.
src/read/pipelining.rs (outdated):

    let (compressed_read_end, compressed_write_end) = create_pipe()?;
    let (uncompressed_read_end, uncompressed_write_end) = create_pipe()?;

    /* FIXME: Split up the entries into approximately equal size sequential chunks according

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
Force-pushed from 2beba90 to 1534ba9.
src/read/split.rs (outdated):

    pub trait OutputFile: FixedFile {
        fn pwrite(&mut self, start: u64, buf: &[u8]) -> io::Result<usize>;

    /* TODO: test this! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
src/read/pipelining.rs (outdated):

    uncompressed_sender
        .send((entry, output_file))
        .map_err(|_| SplitExtractionError::SendFailed)?;
    /* FIXME: use persistent buffer instead of whatever io::copy uses! */

Check notice (code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
Force-pushed from 2f40573 to 148d88f.
@Pr0methean this is obviously quite a large change, but it's completely separate from any existing APIs and should hopefully remain that way. I was able to avoid adding any new dependencies except for
I'll take a look at this once we're caught up on the smaller PRs.
Force-pushed commits:

- initial sketch of lexicographic trie for pipelining
- move path splitting into a submodule
- lex trie can now propagate entry data
- outline handle allocation
- mostly handle files
- mostly handle dirs
- clarify symlink FIXMEs
- do symlink validation
- extract writable dir setting to helper method
- modify args to handle allocation method
- handle allocation test passes
- simplify perms a lot
- outline evaluation
- handle symlinks
- BIGGER CHANGE! add EntryReader/etc
- make initial pipelined extract work
- fix file perms by writing them after finishing the file write
- support directory entries by unix mode as well
- impl split extraction
- remove dependency on reader refactoring
Force-pushed from 148d88f to bf43e53.
cc @matthewgapp lmk if this addresses your needs for parallel extraction, please feel free to ping others who might be interested in this as well. I know @sluongng was interested in functionality to modify entry names e.g. stripping a prefix before extraction; I'm hoping to add additional functionality like that + symlink support in followup changes but would like to hear perf feedback + any further benchmarks I should add. Please feel free to leave comments here or on #193 for further work.
Without much context into this PR, here is the context of what I am trying to build: in some advanced build tools such as Bazel, Buck2, etc., there is a common pattern: fetch a source archive from GitHub, unpack the archive, and use the source inside to build things. Example archives could be declared like this:
Note that the archive could be in multiple different formats: zip, tar.gz, tar.zst, etc. The build tool is supposed to fetch the archive from one of the URLs, verify the archive's checksum, and extract the archive with the prefix trimmed. Traditionally this is done with tools on the host machine. I am trying to replicate this feature in https://github.com/sluongng/downloader/ with a few additional improvements:
I have been playing with the zip2 API for the last few days to get the prefix trimming to work trivially. I hope that this PR, or part of this PR, can help me solve my problem more easily 🤗
@sluongng thanks so much! I asked for your input here especially so it might be useful in determining the most appropriate way to introduce functionality like prefix trimming. I have introduced a struct for this (lines 700 to 738 in bf43e53).
It's used for perf knobs right now, but could probably be extended to support other configuration, still without messing with the more stable API of the existing code. While I was able to avoid introducing too many extra dependencies, @Pr0methean is a fantastic maintainer and is super focused on making this crate very robust, so I'm saying this out loud so we can figure out where to split responsibility for maintenance so nobody gets overwhelmed.
Force-pushed from 9067c37 to 0081371.

Force-pushed from 0081371 to 90a882c.
Requirements

Thinking about how to extract this kind of splitting (and all the

In general, we would also need to expose methods of

Implementation

I think this could all be satisfied with the following:

Result
@Pr0methean I have no problem putting in the work to make this change acceptable for the
Some of the assumptions this PR makes about filesystems are incorrect. (It's possible that one or two of these bugs are pre-existing, but that's no excuse not to fix them.) This is just a partial review.
    /* TODO: do we want to automatically make the directory writable? Wouldn't we prefer to
     * respect the write permissions of the extraction dir? Pipelined extraction does not
     * mutate permissions like this. */
We need to make the directory temporarily writable, in case it contains files that we need to extract. https://github.com/zip-rs/zip2/blob/master/tests/repro_old423.rs would break otherwise, because it contains a non-empty and non-writable folder.
Ok, that makes sense. I was being lazy here 😅 the permissions mechanism perms_todo totally works to solve this (patterned after the existing .extract() code), but I was hoping to avoid handling perms like that. However, since this is explicitly only supporting #[cfg(unix)] targets for now (I'm not sure how to achieve something like pread() on windows), it might not be as hard as I thought. This should be easy to integrate, thanks.
Ok, so I just added a pipelined version of that test, and it seems to work? (4bbc351) In both the current .extract() method and pipelined extraction, we only apply perms after all the files and directories are created and written. I'm leaning more towards not trying to circumvent the permissions of existing directories on disk, since I think it's very surprising that a non-writable directory would become writable just because we (e.g. accidentally) extracted a zip file into it.
Makes sense; I agree that only new directories should be temporarily writable.
    }
    if ret.is_empty() {
        return Err(PathSplitError::PathFormat(format!(
            "path {:?} resolves to the top-level directory",
This is valid if the destination is the root directory -- we just shouldn't update properties in that case.
Ah ok, that makes sense. I'll also need to update this code anyway to address making directories temporarily writable as you mentioned in the other comment.
Hmm, actually, clarification: when you say "destination", you're referring to the extraction dir? It's probably not clear, but this method normalize_parent_dirs() accepts the entire path string (including the final filename component), and uses the is_dir boolean tuple argument to indicate whether to split off the final component as a filename. So if we have ret.is_empty(), what that means is that the entire path resolves to the equivalent of ./. Are you saying that ./ directory entries are valid, but we should just ignore them instead of erroring here? That would make sense, just want to make sure.
Yes, that's right.
    "." => (),
    /* If ".." is present, pop off the last element or return an error. */
    ".." => {
        if ret.pop().is_none() {
This is valid if the destination is the root directory, because /.. is the same as / -- although again, we shouldn't update properties.
Ah ok, so this is literally handling the case where instead of a path relative to the extraction dir, the entry is supposed to expand into an absolute path? From the current .extract() code where we call .enclosed_name(), it looks like we currently don't support absolute extraction paths? I think my intent here was to mimic that logic, but are you saying instead that we should treat absolute entry names beginning with / as relative to the extraction dir?

This code is purely processing entry names right now (not yet conjoined to the extraction dir), so I was under the impression they should all be relative paths, and that (like .enclosed_name()), if they use too many ..s, we should error out. I'm not sure how to square that with "if the destination is the root directory". This is probably pretty simple but I would appreciate further clarification here (thanks!).
Well, the POSIX standard states that /.. and / are the same directory. Unless the ZIP APPNOTE says otherwise, I want to follow that standard.
    #[derive(PartialEq, Eq, Debug, Clone)]
    pub(crate) enum FSEntry<'a, Data> {
        Dir(DirEntry<'a, Data>),
        File(Data),
ZIP files can also contain symlinks.
I wanted to avoid that at first (because it would require reading the symlink targets from the ZipArchive), but I think I'm ready to implement that pass now.
I think that might have been because I started this from my fork. I'm also going to focus on this PR now (I've just converted my other two open PRs to drafts), so I'm going to also try opening a new version of it against my fork.
This is now available at #236! I will ping when this is again ready for review (hoping to make progress on it today!).
Problem

ZipArchive::extract() corresponds to the way most zip implementations perform the task, but it's single-threaded. This is appropriate under the assumptions imposed by Rust's Read and Seek traits, where mutable access is necessary and only one reader can extract file contents at a time. But most unix-like operating systems offer a pread() operation, which avoids mutating OS state like the file offset, so multiple threads can read from a file handle at once. The Go programming language offers io.ReaderAt in the stdlib to codify this ability.

Solution

This is a rework of #72 which avoids introducing unnecessary thread pools and creates all output file handles and containing directories up front. For large zips, we want to:

src/read/split.rs was created to cover pread() and other operations, while src/read/pipelining.rs was created to perform the high-level logic to split up entries and perform pipelined extraction.

Result

- A parallelism feature was added to the crate to gate the newly added code + API.
- The libc crate was added for #[cfg(all(unix, feature = "parallelism"))] in order to make use of OS-specific functionality.
- zip::read::split_extract() was added as a new external API to extract &ZipArchive<fs::File> when #[cfg(all(unix, feature = "parallelism"))].

Note that this does not handle symlinks yet, which I plan to add in a followup PR.
CURRENT BENCHMARK STATUS

On a linux host (with splice() and optionally copy_file_range()), we get about a 6.5x speedup with 12 decompression threads. The performance should keep increasing as we increase thread count, up to the number of available CPU cores (this was running with a parallelism of 12 on my 16-core laptop). This also works on macOS and BSDs, and other #[cfg(unix)] platforms.