Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: rewrite the EOCD/EOCD64 detection to fix extreme performance regression #247

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

RisaI
Copy link
Contributor

@RisaI RisaI commented Sep 27, 2024

In this PR, a new way of finding the EOCD/EOCD64 blocks is introduced. The motivation is to fix the extreme performance regression introduced in cb2d7ab. This PR is also expected to close #231.

What was wrong

In the previous iteration, the ZipArchive::get_metadata function scanned the entire contents of the archive (in extreme cases multiple times) and contained some amount of backtracking even in the best case scenario of no prepended junk data. There was also a lot of code duplication in the functions for finding EOCD and EOCD64 blocks.

What this PR introduces

A MagicFinder pseudo-iterator is implemented for generic magic needle search from the end of a seekable reader. An additional OptimisticMagicFinder is implemented for the best case scenario, where the archive offset is either known exactly or is equal to zero, so that no scanning is performed, as it would be unnecessary.

The find_and_parse methods of Zip32CentralDirectoryEnd and Zip64CentralDirectoryEnd are replaced with a common find_central_directory function. The function employs the following strategy to locate EOCD and EOCD64 if it is expected to be present:

  • A MagicFinder is used to find EOCD magic bytes
  • The EOCD block is parsed and it's internal validity is checked, discarding the entry if it's invalid
  • It is determined whether the EOCD contents indicate the file is ZIP64
  • In the ZIP32 case:
    • An empty archive is assumed to always be correct and is returned
    • A non empty archive attempts to look for the first CDFH between the relative offset in EOCD and the EOCD offset
    • If found, the archive offset is determined and the entry is returned, otherwise it is discarded
  • In the ZIP64 case:
    • The EOCD64 Locator is parsed at it's expected position
    • Its internal validity is checked
    • EOCD64 magic bytes are searched for between the relative offset in the Locator and Locator's own offset
    • If EOCD64 is found and is internally valid, the archive offset is determined and the EOCD + EOCD64 are returned

The code for get_metadata was simplified, because additional information is now available after finding the EOCDs (the archive offset in particular).

Performance

In my internal testing (reading a 44MB ZIP with 30 files to obtain the file count), the extreme performance regression is gone, while still satisfying all of the tests.

2.1.3: ~6.2ms
2.2.0: ~14s
This PR: ~13.5ms

Clearly there are still some performance regressions left, but according to my testing, the code path I spent optimizing takes only 0.3ms out of the 13.5ms, so the regression must lie elsewhere.

What remains to be done

  • get feedback for the EOCD/CFHD naming convention in ZipErrors
  • deprecating the ArchiveOffset::FromCentralDirectory option?
  • reconciling EOCD and EOCD64 archive comments

Naming conventions for byte blocks

In this PR I use the original naming convention for the different type of blocks (End Of Central Directory instead of Central Directory End, etc.). If this is wrong, please, let me know, I will revert this. If you'd like me to change the rest of the errors to match the official convention, I'd also be happy to do that.

ArchiveOffset::FromCentralDirectory deprecation

This option was previously respected for ZIP32 only, but the logic did not make much sense to me. GitHub search reveals that there is not public repository using this option. Could you, please, clarify, what the intent with this feature was? The way I see it, it's enough to have an option to do offset detection if the initial guess fails and also to have the Known variant to opt-out of the detection mechanism.

Archive comments duality

Previously, this PR introduced a breaking change, where the ZipArchive comment for ZIP64 is now read from EOCD64 instead of EOCD. Instead, I introduced the zip64_comment field in Shared and made the field available separately. Now both can be used by the users and it should be clear which is which.

Copy link
Member

@Pr0methean Pr0methean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is on the right track; please fix the failing tests.

@@ -56,6 +54,8 @@ pub(crate) mod zip_archive {
// This isn't yet used anywhere, but it is here for use cases in the future.
#[allow(dead_code)]
pub(super) config: super::Config,
pub(crate) comment: Box<[u8]>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be an Option as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unnecessary. Technically, the comment section is always present, however it can have length zero. This is a state that can be represented by an empty box slice. The ZIP64 comment, however, might not be present at all (when it's actually a ZIP32), so there the distinction makes sense.

src/read.rs Outdated
Ok((Rc::try_unwrap(footer).unwrap(), shared.build()))
pub(crate) fn get_metadata(config: Config, reader: &mut R) -> ZipResult<Shared> {
// Find the EOCD and possibly EOCD64 entries and determine the archive offset.
let cde = spec::find_central_directory(reader, config.archive_offset)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if something that looks like a valid EOCD or EOCD64 block, but doesn't have a valid central directory in front of it and thus fails try_from, is included in the file comment of a valid ZIP file? We should keep looking for the real one in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll wrap this in a loop and allow the find_central_directory function to continue from the previous EOCD candidate backwards.


// Smaller buffer size would be unable to locate bytes.
// Equal buffer size would stall (the window could not be moved).
debug_assert!(BUFFER_SIZE > magic_bytes.len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should actually be 2 * BUFFER_SIZE - 1, to ensure that if the entire magic couldn't fit into the window before shifting the window, it can afterward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On each windows pass, the cursor is moved by BUFFER_SIZE - magic_bytes.len() back, so it will contain magic bytes at the boundary. Looking at it now, we actually must move the window by BUFFER_SIZE - magic_bytes.len() + 1 to not count magic bytes exactly at the start of the window twice. The actuall assertion for the window to move should then be BUFFER_SIZE >= magic_bytes.len().

src/read/magic_finder.rs Outdated Show resolved Hide resolved
pub fn next_back<R: Read + Seek>(&mut self, reader: &mut R) -> ZipResult<Option<u64>> {
loop {
if self.cursor < self.bounds.0 {
// The finder is consumed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should actually be self.cursor <= self.bounds.0, since we set them equal at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.cusor == self.bounds.0 is a valid state. It is achievable by having data of size BUFFER_SIZE + (BUFFER_SIZE - MAGIC_SIZE + 1) * n for some n. What we set equal at the end are the bounds, to essentially make it a cusor in an empty region.

src/spec.rs Outdated
}

pub(crate) struct CentralDirectoryEndInfo {
pub eocd: (Zip32CentralDirectoryEnd, u64),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split this into two fields for readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rather introduce DataWithPosition<T> to keep it as a single field, because the eocd64 is an Option and having to match two Options when they both must be either Some or None at the same time would be tedious.

src/spec.rs Show resolved Hide resolved
src/spec.rs Outdated
pub fn write<T: Write>(self, writer: &mut T) -> ZipResult<()> {
let (block, comment) = self.block_and_comment()?;
block.write(writer)?;
writer.write_all(&comment)?;
Ok(())
}

pub fn is_zip64(&self) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this may_be_zip64 instead, because a ZIP32 file may happen to have u16::MAX files or u32::MAX bytes before the central directory.

continue;
}

// Attempt to find the first CDFH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you mean the last one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this might actually be an error in the algorithm, because what I should be looking for is the first file (to correctly determine the archive offset). No idea how this gets past the tests, I will look into this.

Copy link
Contributor Author

@RisaI RisaI Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is incorrect, the reason the test pass is that the initial guess is always right. I will add a test case where this is not true.

Edit: fuzz tests don't pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up implementing both directions for the MagicFinder and implemented the correct search direction. I think the only thing remaining is the may_be_zip64 logic for edge case zip32s.

src/spec.rs Outdated
continue;
}

// Branch out for zip32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle the case of u16::MAX files in a ZIP32 I mentioned above. This may mean changing is_zip64() to return a yes/no/maybe enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we support these edge cases, then the function can only return maybe/no, because it cannot verify the zip is actually zip64 locally. Changing the name of the function to may_be_zip64 sounds like the better option.

@RisaI
Copy link
Contributor Author

RisaI commented Oct 20, 2024

The tests in CI seem to fail due to clippy lints enforce in the parts of the codebase I did not even touch. The same seems to happen to other PRs in this repository. At a first glance, this is caused by a few clippy defaults being changed in nightly. I will submit another PR to resolve those and then I'll rebase this branch on that one.

@wolfv
Copy link

wolfv commented Oct 31, 2024

This has helped a user pretty greatly when extracting from an network file share (NFS) - I believe seek's are very expensive on NFS. Here is the before and after: prefix-dev/rattler-build#1045 (comment)

@RisaI
Copy link
Contributor Author

RisaI commented Nov 5, 2024

Alright, I finally got to finish the edge case ZIP32 detection. This caused the fuzzer to detect some cases where the library would try to allocate too much data. I handled this by adding an EOCD64 consistency check that invalidates the entry if the number of files would not fit in the central directory. If the tests pass, I think all of the features are now implemented.

@RisaI RisaI requested a review from Pr0methean November 5, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: opening large zip files is slow since 2.1.4 because the entire file is scanned
3 participants