Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: prototype async API, with demonstrable perf improvements via benchmark #73

Closed
wants to merge 162 commits into from

Conversation

Pr0methean
Copy link
Member

This replaces zip-rs/zip-old#409.

/* } */

/* let write_data = ready!(s.ring.poll_write(cx, s.ring.capacity())); */
/* /\* FIXME: i'm pretty sure this can cause a segfault with the current */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
use std::{cmp, task::ready};

pub trait KnownExpanse {
/* TODO: make this have a parameterized Self::Index type, used e.g. with RangeInclusive or

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
let fname = file.file_name.as_bytes();

if writer.is_write_vectored() {
/* TODO: zero-copy!! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
let fname = file.file_name.as_bytes();

if writer.is_write_vectored() {
/* TODO: zero-copy!! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
let mut name_as_string = name.into();
// Append a slash to the filename if it does not end with it.
if !name_as_string.ends_with('/') {
/* TODO: ends_with('\\') as well? */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
};

if writer.is_write_vectored() {
/* TODO: zero-copy?? */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
};

if writer.is_write_vectored() {
/* TODO: zero-copy?? */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
) -> impl Stream<Item = ZipResult<Pin<Box<ZipFile<'_, S, Limiter<S>, Sh>>>>> + '_ {
let len = self.shared.len();
let s = std::cell::UnsafeCell::new(self);
/* FIXME: make this a stream with a known length! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
};

if writer.is_write_vectored() {
/* TODO: zero-copy!! */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment

#[inline]
pub fn is_dir(&self) -> bool {
/* TODO: '\\' too? */

Check notice

Code scanning / devskim

A "TODO" or similar was left in source code, possibly indicating incomplete functionality Note

Suspicious comment
@matthewgapp
Copy link

matthewgapp commented Jun 12, 2024

is this still in progress @Pr0methean? Would love to see parallelization in this crate! Thanks for your work :D

@Pr0methean
Copy link
Member Author

You'll have to ask @cosmicexplorer, since he's the PR author.

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Jun 13, 2024

Thanks for the ping! I think #72 is probably worth pursuing first, as it achieves parallelism without the separate effort of an async API (which via tokio includes Send and some Sync bounds). This should be easier to reliably benchmark and avoids some of the slowdown achieved by moving to async i/o, which was only regained with quite a bit of extra work in this change. When originally uploaded, this PR was largely a chance for me to learn more about async implementation, but at this point I think the amount of changes necessary ends up making this almost a completely new crate. I was ready to do that before @Pr0methean took this over, but I think making ZipArchive Send is probably sufficient to use it in most async code in a blocking thread, and (unless that's already been done in the meantime) that requires a few very small changes instead of this entire massive one.

I will definitely take a look at updating #72, since it makes use of several techniques I initially developed for pex-tool/pex#2175 (which I should also get back to) and which I don't believe have yet been implemented elsewhere. @matthewgapp let me know if synchronous parallel extraction/zipping serves your purposes -- I think an async version of that (which is what this change does) would be very cool, but it requires a lot of extra code, and is slower on all the benchmarks I've tried. If there is some user that's performing a very large number of parallel zip extractions and finds this PR to be more performant than #72, I think it might be worth taking a look at again, but I would at least want to know what performance scenario could possibly produce that, because I wasn't able to figure out how to create such a scenario on my laptop.

@cosmicexplorer
Copy link
Contributor

The massive amount of changes required here mean I would prefer not to maintain this myself or force @Pr0methean to do so unless there is a very specific use case for which this achieves superior performance to the synchronous work in #72. In general, I'm thinking that if #72 does its job right, it should be fast enough that we don't need to rely on async, which in rust is better suited for doing less work at a time rather than doing work faster. I would love to be proven wrong here, but otherwise this work was a fun experiment and probably not suitable for production. Please anyone feel free to comment should there be any performance scenario for which async would be better suited.

auto-merge was automatically disabled June 13, 2024 14:34

Pull request was closed

@matthewgapp
Copy link

@cosmicexplorer sync parallel execution is just fine! We'd be using it in async code so it would have to be Send.

@cosmicexplorer
Copy link
Contributor

Ok, great!! Now that I have @Pr0methean reviewing my work I'm thinking a full async API might not be a terrible idea at some point, but I might want to try an mmap-based implementation first. I started playing around with linux-specific splice and copy_file_range and I suspect that would accelerate the split/merge operations used for bulk creation/extraction, but mmap may be more performant for random access operations. In general the performance improvements from here on out probably require using a higher-level and less-general API (as per my research in https://github.com/cosmicexplorer/medusa-zip), and I don't expect they'd replace the standard API of this crate.

@cosmicexplorer
Copy link
Contributor

Actually, it looks like #72 only contains the (comparatively simpler) pipelined extraction work, without the more experimental (but extremely useful) split/merged zip creation work. I've created #193 to track this work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants