-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: rewrite the EOCD/EOCD64 detection to fix extreme performance regression #247
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is on the right track; please fix the failing tests.
@@ -56,6 +54,8 @@ pub(crate) mod zip_archive { | |||
// This isn't yet used anywhere, but it is here for use cases in the future. | |||
#[allow(dead_code)] | |||
pub(super) config: super::Config, | |||
pub(crate) comment: Box<[u8]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an Option
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is unnecessary. Technically, the comment section is always present, however it can have length zero. This is a state that can be represented by an empty box slice. The ZIP64 comment, however, might not be present at all (when it's actually a ZIP32), so there the distinction makes sense.
src/read.rs
Outdated
Ok((Rc::try_unwrap(footer).unwrap(), shared.build())) | ||
pub(crate) fn get_metadata(config: Config, reader: &mut R) -> ZipResult<Shared> { | ||
// Find the EOCD and possibly EOCD64 entries and determine the archive offset. | ||
let cde = spec::find_central_directory(reader, config.archive_offset)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if something that looks like a valid EOCD or EOCD64 block, but doesn't have a valid central directory in front of it and thus fails try_from
, is included in the file comment of a valid ZIP file? We should keep looking for the real one in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll wrap this in a loop and allow the find_central_directory
function to continue from the previous EOCD candidate backwards.
src/read/magic_finder.rs
Outdated
|
||
// Smaller buffer size would be unable to locate bytes. | ||
// Equal buffer size would stall (the window could not be moved). | ||
debug_assert!(BUFFER_SIZE > magic_bytes.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should actually be 2 * BUFFER_SIZE - 1
, to ensure that if the entire magic couldn't fit into the window before shifting the window, it can afterward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On each windows pass, the cursor is moved by BUFFER_SIZE - magic_bytes.len()
back, so it will contain magic bytes at the boundary. Looking at it now, we actually must move the window by BUFFER_SIZE - magic_bytes.len() + 1
to not count magic bytes exactly at the start of the window twice. The actuall assertion for the window to move should then be BUFFER_SIZE >= magic_bytes.len()
.
pub fn next_back<R: Read + Seek>(&mut self, reader: &mut R) -> ZipResult<Option<u64>> { | ||
loop { | ||
if self.cursor < self.bounds.0 { | ||
// The finder is consumed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should actually be self.cursor <= self.bounds.0
, since we set them equal at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.cusor == self.bounds.0
is a valid state. It is achievable by having data of size BUFFER_SIZE + (BUFFER_SIZE - MAGIC_SIZE + 1) * n
for some n
. What we set equal at the end are the bounds, to essentially make it a cusor in an empty region.
src/spec.rs
Outdated
} | ||
|
||
pub(crate) struct CentralDirectoryEndInfo { | ||
pub eocd: (Zip32CentralDirectoryEnd, u64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split this into two fields for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rather introduce DataWithPosition<T>
to keep it as a single field, because the eocd64
is an Option
and having to match two Option
s when they both must be either Some
or None
at the same time would be tedious.
src/spec.rs
Outdated
pub fn write<T: Write>(self, writer: &mut T) -> ZipResult<()> { | ||
let (block, comment) = self.block_and_comment()?; | ||
block.write(writer)?; | ||
writer.write_all(&comment)?; | ||
Ok(()) | ||
} | ||
|
||
pub fn is_zip64(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this may_be_zip64
instead, because a ZIP32 file may happen to have u16::MAX files or u32::MAX bytes before the central directory.
continue; | ||
} | ||
|
||
// Attempt to find the first CDFH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you mean the last one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this might actually be an error in the algorithm, because what I should be looking for is the first file (to correctly determine the archive offset). No idea how this gets past the tests, I will look into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is incorrect, the reason the test pass is that the initial guess is always right. I will add a test case where this is not true.
Edit: fuzz tests don't pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up implementing both directions for the MagicFinder and implemented the correct search direction. I think the only thing remaining is the may_be_zip64 logic for edge case zip32s.
src/spec.rs
Outdated
continue; | ||
} | ||
|
||
// Branch out for zip32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle the case of u16::MAX files in a ZIP32 I mentioned above. This may mean changing is_zip64()
to return a yes/no/maybe enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we support these edge cases, then the function can only return maybe/no, because it cannot verify the zip is actually zip64 locally. Changing the name of the function to may_be_zip64
sounds like the better option.
The tests in CI seem to fail due to clippy lints enforce in the parts of the codebase I did not even touch. The same seems to happen to other PRs in this repository. At a first glance, this is caused by a few clippy defaults being changed in nightly. I will submit another PR to resolve those and then I'll rebase this branch on that one. |
This has helped a user pretty greatly when extracting from an network file share (NFS) - I believe |
Alright, I finally got to finish the edge case ZIP32 detection. This caused the fuzzer to detect some cases where the library would try to allocate too much data. I handled this by adding an EOCD64 consistency check that invalidates the entry if the number of files would not fit in the central directory. If the tests pass, I think all of the features are now implemented. |
In this PR, a new way of finding the EOCD/EOCD64 blocks is introduced. The motivation is to fix the extreme performance regression introduced in cb2d7ab. This PR is also expected to close #231.
What was wrong
In the previous iteration, the
ZipArchive::get_metadata
function scanned the entire contents of the archive (in extreme cases multiple times) and contained some amount of backtracking even in the best case scenario of no prepended junk data. There was also a lot of code duplication in the functions for finding EOCD and EOCD64 blocks.What this PR introduces
A
MagicFinder
pseudo-iterator is implemented for generic magic needle search from the end of a seekable reader. An additionalOptimisticMagicFinder
is implemented for the best case scenario, where the archive offset is either known exactly or is equal to zero, so that no scanning is performed, as it would be unnecessary.The
find_and_parse
methods ofZip32CentralDirectoryEnd
andZip64CentralDirectoryEnd
are replaced with a commonfind_central_directory
function. The function employs the following strategy to locate EOCD and EOCD64 if it is expected to be present:MagicFinder
is used to find EOCD magic bytesThe code for
get_metadata
was simplified, because additional information is now available after finding the EOCDs (the archive offset in particular).Performance
In my internal testing (reading a 44MB ZIP with 30 files to obtain the file count), the extreme performance regression is gone, while still satisfying all of the tests.
Clearly there are still some performance regressions left, but according to my testing, the code path I spent optimizing takes only 0.3ms out of the 13.5ms, so the regression must lie elsewhere.
What remains to be done
ArchiveOffset::FromCentralDirectory
option?Naming conventions for byte blocks
In this PR I use the original naming convention for the different type of blocks (End Of Central Directory instead of Central Directory End, etc.). If this is wrong, please, let me know, I will revert this. If you'd like me to change the rest of the errors to match the official convention, I'd also be happy to do that.
ArchiveOffset::FromCentralDirectory
deprecationThis option was previously respected for ZIP32 only, but the logic did not make much sense to me. GitHub search reveals that there is not public repository using this option. Could you, please, clarify, what the intent with this feature was? The way I see it, it's enough to have an option to do offset detection if the initial guess fails and also to have the
Known
variant to opt-out of the detection mechanism.Archive comments duality
Previously, this PR introduced a breaking change, where the
ZipArchive
comment for ZIP64 is now read from EOCD64 instead of EOCD. Instead, I introduced thezip64_comment
field inShared
and made the field available separately. Now both can be used by the users and it should be clear which is which.