fix: rewrite the EOCD/EOCD64 detection to fix extreme performance regression #247

RisaI · 2024-09-27T16:58:36Z

In this PR, a new way of finding the EOCD/EOCD64 blocks is introduced. The motivation is to fix the extreme performance regression introduced in cb2d7ab. This PR is also expected to close #231.

What was wrong

In the previous iteration, the ZipArchive::get_metadata function scanned the entire contents of the archive (in extreme cases multiple times) and contained some amount of backtracking even in the best case scenario of no prepended junk data. There was also a lot of code duplication in the functions for finding EOCD and EOCD64 blocks.

What this PR introduces

A MagicFinder pseudo-iterator is implemented for generic magic needle search from the end of a seekable reader. An additional OptimisticMagicFinder is implemented for the best case scenario, where the archive offset is either known exactly or is equal to zero, so that no scanning is performed, as it would be unnecessary.
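To illustrate the idea of searching for a magic needle from the end of a seekable reader, here is a minimal sketch. It is not the MagicFinder from this PR: the function name `rfind_magic`, the 2048-byte window and the single-result return type are assumptions made for the example.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper (not the PR's MagicFinder): returns the offset of the
/// last occurrence of `magic` in `reader`, reading fixed-size windows
/// starting from the end of the data.
fn rfind_magic<R: Read + Seek>(reader: &mut R, magic: &[u8]) -> io::Result<Option<u64>> {
    const BUFFER_SIZE: usize = 2048; // assumed window size
    assert!(!magic.is_empty() && magic.len() <= BUFFER_SIZE);

    let len = reader.seek(SeekFrom::End(0))?;
    let mut buf = [0u8; BUFFER_SIZE];
    // `end` is the exclusive upper bound of the next window to read.
    let mut end = len;
    while end >= magic.len() as u64 {
        let start = end.saturating_sub(BUFFER_SIZE as u64);
        let window = &mut buf[..(end - start) as usize];
        reader.seek(SeekFrom::Start(start))?;
        reader.read_exact(window)?;
        // Search the window from its end so the last match wins.
        if let Some(pos) = window.windows(magic.len()).rposition(|w| w == magic) {
            return Ok(Some(start + pos as u64));
        }
        if start == 0 {
            break;
        }
        // Keep magic.len() - 1 bytes of overlap so a needle straddling the
        // window boundary is seen whole in the next window.
        end = start + magic.len() as u64 - 1;
    }
    Ok(None)
}
```

Each byte range is read roughly once, so the cost is proportional to the distance between the end of the file and the match rather than to the size of the archive.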

The find_and_parse methods of Zip32CentralDirectoryEnd and Zip64CentralDirectoryEnd are replaced with a common find_central_directory function. The function employs the following strategy to locate the EOCD and, if it is expected to be present, the EOCD64:

A MagicFinder is used to find EOCD magic bytes
The EOCD block is parsed and its internal validity is checked, discarding the entry if it's invalid
It is determined whether the EOCD contents indicate the file is ZIP64
In the ZIP32 case:
- An empty archive is assumed to always be correct and is returned
- For a non-empty archive, the first CDFH is searched for between the relative offset in EOCD and the EOCD offset
- If found, the archive offset is determined and the entry is returned, otherwise it is discarded
In the ZIP64 case:
- The EOCD64 Locator is parsed at its expected position
- Its internal validity is checked
- EOCD64 magic bytes are searched for between the relative offset in the Locator and Locator's own offset
- If EOCD64 is found and is internally valid, the archive offset is determined and the EOCD + EOCD64 are returned
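The ZIP32 branch of the strategy can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the struct `Eocd` and the functions `parse_eocd` and `archive_offset` are hypothetical names, and the internal validity checks are left out. The field offsets follow the fixed 22-byte EOCD layout of the ZIP format.

```rust
const EOCD_MAGIC: [u8; 4] = [0x50, 0x4b, 0x05, 0x06]; // "PK\x05\x06"
const EOCD_SIZE: usize = 22; // fixed part, without the trailing comment

/// Hypothetical, reduced view of an EOCD block.
#[derive(Debug, PartialEq)]
struct Eocd {
    number_of_files: u16,
    central_directory_size: u32,
    central_directory_offset: u32, // relative to the start of the archive
    comment_length: u16,
}

/// Parses the fixed part of an EOCD block, or returns None if the slice is
/// too short or does not start with the EOCD magic bytes.
fn parse_eocd(block: &[u8]) -> Option<Eocd> {
    if block.len() < EOCD_SIZE || !block.starts_with(&EOCD_MAGIC) {
        return None;
    }
    let u16_at = |i: usize| u16::from_le_bytes([block[i], block[i + 1]]);
    let u32_at =
        |i: usize| u32::from_le_bytes([block[i], block[i + 1], block[i + 2], block[i + 3]]);
    Some(Eocd {
        number_of_files: u16_at(10),
        central_directory_size: u32_at(12),
        central_directory_offset: u32_at(16),
        comment_length: u16_at(20),
    })
}

/// The archive offset is the difference between where the first CDFH was
/// actually found and where the EOCD says the central directory starts.
/// None means the candidate is inconsistent and should be discarded.
fn archive_offset(eocd: &Eocd, first_cdfh_position: u64) -> Option<u64> {
    first_cdfh_position.checked_sub(u64::from(eocd.central_directory_offset))
}
```

For an archive with 100 bytes of prepended data, a central directory recorded at relative offset 5000 would be found at absolute position 5100, giving an archive offset of 100.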

The code for get_metadata was simplified, because additional information is now available after finding the EOCDs (the archive offset in particular).

Performance

In my internal testing (reading a 44MB ZIP with 30 files to obtain the file count), the extreme performance regression is gone, and all of the tests still pass.

2.1.3: ~6.2ms
2.2.0: ~14s
This PR: ~13.5ms

Clearly there are still some performance regressions left, but according to my testing, the code path I spent time optimizing takes only 0.3ms out of the 13.5ms, so the remaining regression must lie elsewhere.

What remains to be done

get feedback for the EOCD/CDFH naming convention in ZipErrors
deprecating the ArchiveOffset::FromCentralDirectory option?
reconciling EOCD and EOCD64 archive comments

Naming conventions for byte blocks

In this PR I use the original naming convention for the different types of blocks (End Of Central Directory instead of Central Directory End, etc.). If this is wrong, please let me know and I will revert it. If you'd like me to change the rest of the errors to match the official convention, I'd also be happy to do that.

`ArchiveOffset::FromCentralDirectory` deprecation

This option was previously respected for ZIP32 only, but the logic did not make much sense to me. A GitHub search reveals no public repository using this option. Could you please clarify what the intent of this feature was? The way I see it, it's enough to have an option to do offset detection if the initial guess fails, plus the Known variant to opt out of the detection mechanism.

Archive comments duality

An earlier revision of this PR introduced a breaking change, where the ZipArchive comment for ZIP64 was read from EOCD64 instead of EOCD. Instead, I have now introduced the zip64_comment field in Shared and made the field available separately. Both comments can now be used and it should be clear which is which.

Pr0methean

Looks like this is on the right track; please fix the failing tests.

Pr0methean · 2024-10-19T22:11:31Z

src/read.rs

@@ -56,6 +54,8 @@ pub(crate) mod zip_archive {
        // This isn't yet used anywhere, but it is here for use cases in the future.
        #[allow(dead_code)]
        pub(super) config: super::Config,
+        pub(crate) comment: Box<[u8]>,


Can this be an Option as well?

I think this is unnecessary. Technically, the comment section is always present, however it can have length zero. This is a state that can be represented by an empty box slice. The ZIP64 comment, however, might not be present at all (when it's actually a ZIP32), so there the distinction makes sense.
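The distinction argued for in this reply can be sketched as follows. This is a reduced illustration, not the real Shared struct: only the two comment fields are shown, and the accessor names are assumptions.

```rust
/// Hypothetical, reduced version of the shared metadata.
struct Shared {
    /// The EOCD comment always exists; "no comment" is simply length zero.
    comment: Box<[u8]>,
    /// The EOCD64 comment is absent when the archive is a plain ZIP32.
    zip64_comment: Option<Box<[u8]>>,
}

impl Shared {
    fn comment(&self) -> &[u8] {
        &self.comment
    }

    fn zip64_comment(&self) -> Option<&[u8]> {
        self.zip64_comment.as_deref()
    }
}
```

With this shape, an empty ZIP32 comment and a missing ZIP64 comment stay distinguishable without wrapping the former in an Option.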

Pr0methean · 2024-10-19T22:17:33Z

src/read.rs

-        Ok((Rc::try_unwrap(footer).unwrap(), shared.build()))
+    pub(crate) fn get_metadata(config: Config, reader: &mut R) -> ZipResult<Shared> {
+        // Find the EOCD and possibly EOCD64 entries and determine the archive offset.
+        let cde = spec::find_central_directory(reader, config.archive_offset)?;


What happens if something that looks like a valid EOCD or EOCD64 block, but doesn't have a valid central directory in front of it and thus fails try_from, is included in the file comment of a valid ZIP file? We should keep looking for the real one in that case.

I'll wrap this in a loop and allow the find_central_directory function to continue from the previous EOCD candidate backwards.

Pr0methean · 2024-10-19T22:25:42Z

src/read/magic_finder.rs

+
+        // Smaller buffer size would be unable to locate bytes.
+        // Equal buffer size would stall (the window could not be moved).
+        debug_assert!(BUFFER_SIZE > magic_bytes.len());


Should actually be 2 * BUFFER_SIZE - 1, to ensure that if the entire magic couldn't fit into the window before shifting the window, it can afterward.

On each window pass, the cursor is moved back by BUFFER_SIZE - magic_bytes.len(), so the next window still contains any magic bytes that straddle the boundary. Looking at it now, we actually must move the window by BUFFER_SIZE - magic_bytes.len() + 1 to not count magic bytes exactly at the start of the window twice. The actual assertion for the window to move should then be BUFFER_SIZE >= magic_bytes.len().
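The window arithmetic in this reply can be checked with a small model. The sketch below is an assumption-laden illustration rather than the PR's code: `scan_windows` and `times_seen` are hypothetical helpers, and windows are modelled as half-open `(start, end)` ranges walked from the end of the data.

```rust
/// Half-open (start, end) ranges visited when scanning `len` bytes backwards
/// with windows of at most `buffer_size` bytes for a needle of `magic_len`
/// bytes. Consecutive window starts differ by buffer_size - magic_len + 1.
fn scan_windows(len: u64, buffer_size: u64, magic_len: u64) -> Vec<(u64, u64)> {
    // Equality is allowed: the step is then 1, so the window still moves.
    assert!(magic_len >= 1 && buffer_size >= magic_len);
    let mut out = Vec::new();
    let mut end = len;
    while end >= magic_len {
        let start = end.saturating_sub(buffer_size);
        out.push((start, end));
        if start == 0 {
            break;
        }
        // Overlap of magic_len - 1 bytes: enough to catch a needle that
        // straddles the boundary, too little to hold a whole needle twice.
        end = start + magic_len - 1;
    }
    out
}

/// Number of windows in which a needle starting at `pos` is fully visible.
fn times_seen(pos: u64, len: u64, buffer_size: u64, magic_len: u64) -> usize {
    scan_windows(len, buffer_size, magic_len)
        .into_iter()
        .filter(|&(start, end)| pos >= start && pos + magic_len <= end)
        .count()
}
```

For 10 bytes of data, a 4-byte buffer and a 2-byte needle, the windows are (6, 10), (3, 7) and (0, 4), and every possible needle position falls into exactly one of them.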

Pr0methean · 2024-10-19T22:30:15Z

src/read/magic_finder.rs

+    pub fn next_back<R: Read + Seek>(&mut self, reader: &mut R) -> ZipResult<Option<u64>> {
+        loop {
+            if self.cursor < self.bounds.0 {
+                // The finder is consumed


Should actually be self.cursor <= self.bounds.0, since we set them equal at the end.

self.cursor == self.bounds.0 is a valid state. It is achievable by having data of size BUFFER_SIZE + (BUFFER_SIZE - MAGIC_SIZE + 1) * n for some n. What we set equal at the end are the bounds, to essentially make it a cursor in an empty region.

Pr0methean · 2024-10-19T22:37:21Z

src/spec.rs

+}
+
+pub(crate) struct CentralDirectoryEndInfo {
+    pub eocd: (Zip32CentralDirectoryEnd, u64),


Split this into two fields for readability.

I'd rather introduce DataWithPosition<T> to keep it as a single field, because the eocd64 is an Option, and having to match two Options that must both be Some or both be None would be tedious.
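A rough sketch of what such a wrapper could look like is given below. Only the names DataWithPosition and CentralDirectoryEndInfo come from this thread; the field names, the placeholder `Eocd` and `Eocd64` structs and both methods are assumptions made for illustration.

```rust
/// A parsed value together with the absolute offset at which it was found.
#[derive(Debug, Clone, PartialEq)]
struct DataWithPosition<T> {
    data: T,
    position: u64,
}

// Placeholder blocks; the real structs carry many more fields.
#[derive(Debug, Clone, PartialEq)]
struct Eocd {
    number_of_files: u16,
}

#[derive(Debug, Clone, PartialEq)]
struct Eocd64 {
    number_of_files: u64,
}

struct CentralDirectoryEndInfo {
    eocd: DataWithPosition<Eocd>,
    // One Option covers both the block and its position, so they cannot
    // disagree about being present.
    eocd64: Option<DataWithPosition<Eocd64>>,
}

impl CentralDirectoryEndInfo {
    /// Prefers the 64-bit count when an EOCD64 block was found.
    fn number_of_files(&self) -> u64 {
        match &self.eocd64 {
            Some(eocd64) => eocd64.data.number_of_files,
            None => u64::from(self.eocd.data.number_of_files),
        }
    }

    /// Offset of the earliest end-of-central-directory block.
    fn first_block_position(&self) -> u64 {
        self.eocd64
            .as_ref()
            .map_or(self.eocd.position, |eocd64| eocd64.position)
    }
}
```

A single match on `eocd64` then yields the block and its position together.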

Pr0methean · 2024-10-19T22:42:04Z

src/spec.rs

    pub fn write<T: Write>(self, writer: &mut T) -> ZipResult<()> {
        let (block, comment) = self.block_and_comment()?;
        block.write(writer)?;
        writer.write_all(&comment)?;
        Ok(())
    }
+
+    pub fn is_zip64(&self) -> bool {


Call this may_be_zip64 instead, because a ZIP32 file may happen to have u16::MAX files or u32::MAX bytes before the central directory.

Pr0methean · 2024-10-19T22:45:53Z

src/spec.rs

+                continue;
+            }
+
+            // Attempt to find the first CDFH


Don't you mean the last one?

Yeah, this might actually be an error in the algorithm, because what I should be looking for is the first file (to correctly determine the archive offset). No idea how this gets past the tests, I will look into this.

Yes, this is incorrect; the reason the tests pass is that the initial guess is always right. I will add a test case where this is not true.

Edit: fuzz tests don't pass.

I ended up implementing both directions for the MagicFinder and implemented the correct search direction. I think the only thing remaining is the may_be_zip64 logic for edge case zip32s.

Pr0methean · 2024-10-19T22:48:36Z

src/spec.rs

+            continue;
+        }
+
+        // Branch out for zip32


Handle the case of u16::MAX files in a ZIP32 I mentioned above. This may mean changing is_zip64() to return a yes/no/maybe enum.

If we support these edge cases, then the function can only return maybe/no, because it cannot verify the zip is actually zip64 locally. Changing the name of the function to may_be_zip64 sounds like the better option.
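A minimal sketch of such a hint-style check follows. It assumes a reduced `Zip32Eocd` struct with only two fields and tests only the two saturated values mentioned in this review; the real check may look at more fields.

```rust
/// Hypothetical, reduced view of the ZIP32 EOCD block.
struct Zip32Eocd {
    number_of_files: u16,
    central_directory_offset: u32,
}

impl Zip32Eocd {
    /// Returns true when a field holds its maximum value, which is how a
    /// ZIP64 archive marks "look in the EOCD64 instead". A genuine ZIP32
    /// archive can hit the same values by coincidence, so `true` only means
    /// "maybe": the caller still has to look for the EOCD64 Locator and fall
    /// back to ZIP32 handling when it is not there.
    fn may_be_zip64(&self) -> bool {
        self.number_of_files == u16::MAX || self.central_directory_offset == u32::MAX
    }
}
```

A `false` result is definitive under these assumptions, which is why a plain bool is sufficient here instead of a yes/no/maybe enum.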

RisaI · 2024-10-20T09:32:06Z

The tests in CI seem to fail due to clippy lints enforced in parts of the codebase I did not even touch. The same seems to happen to other PRs in this repository. At first glance, this is caused by a few clippy defaults being changed in nightly. I will submit another PR to resolve those and then I'll rebase this branch on that one.

wolfv · 2024-10-31T13:43:28Z

This has helped a user pretty greatly when extracting from a network file share (NFS) - I believe seeks are very expensive on NFS. Here is the before and after: prefix-dev/rattler-build#1045 (comment)

RisaI · 2024-11-05T12:50:13Z

Alright, I finally got to finish the edge case ZIP32 detection. This caused the fuzzer to detect some cases where the library would try to allocate too much data. I handled this by adding an EOCD64 consistency check that invalidates the entry if the number of files would not fit in the central directory. If the tests pass, I think all of the features are now implemented.
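One way such a consistency check could look is sketched below. It is an assumption about the shape of the check, not the code in this PR: the function name and the exact comparison are hypothetical. The only fact relied on is that the fixed part of a central directory file header is 46 bytes, so every entry needs at least that much space.

```rust
/// Size of the fixed part of a central directory file header (CDFH).
const CDFH_MIN_SIZE: u64 = 46;

/// Returns true if `number_of_files` entries could possibly fit into a
/// central directory of `central_directory_size` bytes. Rejecting
/// impossible counts early avoids reserving memory for entries that
/// cannot exist.
fn file_count_fits(number_of_files: u64, central_directory_size: u64) -> bool {
    number_of_files
        .checked_mul(CDFH_MIN_SIZE) // an overflow means it cannot fit either
        .map_or(false, |min_size| min_size <= central_directory_size)
}
```

A forged EOCD64 claiming billions of files in a few kilobytes of central directory is discarded by this test before any allocation based on the file count happens.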

Pr0methean reviewed Oct 19, 2024

fix: resolve clippy warning in nightly

605f243

RisaI force-pushed the eocd_rewrite branch from cb97144 to e6141ac Compare October 20, 2024 10:06

RisaI added 10 commits October 20, 2024 15:41

wip: major rework of cde location

1c00e11

wip: rework CDE lookup

157da75

refactor: magic finder, eocd lookup retry

a982a59

wip: handle empty zips

15bd50a

fix: satisfy tests, add documentation

91be415

chore: remove unused dependencies

10ca275

feat: support both zip32 and zip64 comments

9b5a987

feat: add zip64 comment functions to ZipWriter

eed5788

fix: first pass on maintainer comments

5050e1f

fix: continue searching for EOCD when the central directory is invalid

63b959b

RisaI force-pushed the eocd_rewrite branch from e6141ac to 63b959b Compare October 20, 2024 13:41

RisaI added 6 commits October 20, 2024 15:52

chore: satisfy clippy lints

4e6700a

chore: satisfy style_and_docs

0f7b326

feat: support both directions in MagicFinder, correctly find first CDFH

37a431b

fix: more checks to EOCD parsing, move comment size error from parse …

eb0a2b8

…to write

fix: use saturating add when checking eocd64 record_size upper bound

bc27ed0

fix: correctly handle mid window offsets in forward mode

1435707

sylvestre mentioned this pull request Oct 22, 2024

Bump zip from 0.6.6 to 2.1.3 mozilla/sccache#2227

Open

smklein mentioned this pull request Oct 31, 2024

Upgrade zip to 2.1.3 oxidecomputer/omicron#6964

Merged

RisaI added 4 commits November 5, 2024 11:15

fix: compare maximum possible comment length against file size, not s…

3c04567

…earch region end

feat: handle zip64 detection as a hint

8d542b6

fix: detect oversized central directories when locating EOCD64

300810f

fix: oopsie

24ad65f

RisaI requested a review from Pr0methean November 5, 2024 18:40
