feat: Interpret data descriptors when reading zip file from (read, nonseek) stream #197
base: master
Conversation
…reader: &mut S) that reads the next zipfile entry from a stream, potentially parsing the data descriptor.

This is an alternative method for reading a zip file. If possible, use the ZipArchive functions instead, as some information will be missing when reading in this manner. This method extends the existing read_zipfile_from_stream method when the stream is seekable. This method is superior to ZipArchive in the special case that the underlying stream implementation would have to buffer all data seen so far in order to provide seek-back support, and memory consumption must be kept small.

This could be the case when reading a zip file B.zip nested within a zip file A.zip that is stored on disk. Since A.zip is seekable when stored as a file, it can be read using ZipArchive. Let B.zip be a zip file stored inside A.zip; we can then read B's content using a decompressing reader. The problem is that the decompressing reader is not seekable (due to decompression). If one wants to read the contents of B.zip without extracting A.zip to disk, the file B.zip must be buffered in RAM. When using ZipArchive to read B.zip from RAM, the whole B.zip file must be buffered, because the central directory of B.zip is located at the end of the file.

This method instead reads B.zip from the start of the file and returns the first file entry found in B.zip. After this function has run and the returned ZipFile value is dropped, the reader is positioned at the end of that file entry. Since this function never seeks back to before the position the stream had when the function was called, the underlying stream implementation may, after the ZipFile is dropped, discard all buffered data before the current position of the stream.

Summarizing: in the given scenario, this method does not have to buffer the whole B.zip file in RAM, only the first file entry.
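To make the intended call pattern concrete, here is a minimal sketch of the nested-zip scenario. It uses the crate's existing zip::read::read_zipfile_from_stream entry point to illustrate the loop; the new data-descriptor-aware function added by this PR is not named in this excerpt, and the file names A.zip / B.zip are placeholders.

```rust
use std::fs::File;
use std::io::Read;
use zip::read::read_zipfile_from_stream;
use zip::ZipArchive;

fn read_nested_zip() -> zip::result::ZipResult<()> {
    // A.zip is on disk and therefore seekable: read it with ZipArchive.
    let mut outer = ZipArchive::new(File::open("A.zip")?)?;

    // B.zip is stored inside A.zip; the decompressing reader returned here
    // is not seekable, so the central directory of B.zip cannot be used
    // without buffering all of B.zip in RAM.
    let mut inner_stream = outer.by_name("B.zip")?;

    // Walk B.zip entry by entry from the front of the stream, so only one
    // entry at a time needs to be kept in memory.
    while let Some(mut entry) = read_zipfile_from_stream(&mut inner_stream)? {
        let mut contents = Vec::new();
        entry.read_to_end(&mut contents)?;
        println!("{}: {} bytes", entry.name(), contents.len());
    }
    Ok(())
}
```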
…sulate potentially unsafe data
…plate T: Read in internal data structures. Added UntrustedValue and MaybeUntrusted data types.
# Conflicts:
#   src/read.rs
Before: the CRC32 checksum had to be supplied before the stream was read, although it was only checked once EOF occurred. Now: the CRC32 checksum can be supplied either before starting to read or after finishing reading from the stream.
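As an illustration of the "supply the checksum before or after reading" idea, here is a hedged sketch of a CRC-tracking reader. The type and method names (CheckedReader, set_expected, verify) are hypothetical and not the PR's actual API; only crc32fast::Hasher is an existing dependency of the crate.

```rust
use std::io::{self, Read};
use crc32fast::Hasher;

/// Hypothetical CRC-tracking reader: the expected CRC32 may be supplied up
/// front or only after the entry has been read (e.g. once the data
/// descriptor at the end of the stream has been parsed).
struct CheckedReader<R: Read> {
    inner: R,
    hasher: Hasher,
    expected: Option<u32>, // may be filled in later via set_expected()
}

impl<R: Read> CheckedReader<R> {
    fn new(inner: R, expected: Option<u32>) -> Self {
        Self { inner, hasher: Hasher::new(), expected }
    }

    /// Supply the checksum after reading, once it is known.
    fn set_expected(&mut self, crc: u32) {
        self.expected = Some(crc);
    }

    /// Verify at EOF; an unknown checksum is treated as "nothing to check".
    fn verify(&self) -> io::Result<()> {
        match self.expected {
            Some(crc) if crc != self.hasher.clone().finalize() => Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "CRC32 mismatch",
            )),
            _ => Ok(()),
        }
    }
}

impl<R: Read> Read for CheckedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.hasher.update(&buf[..n]);
        Ok(n)
    }
}
```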
The changes kind of exploded 🙈 😅 - but now it finally works. Looking forward to a review.
Decompressing readers only exist when the feature flag for their compression method is enabled. I may have more comments later.
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Seems mostly functionally correct, but I have some performance concerns.
}

let limit_reader = (reader as &'a mut dyn Read).take(result.compressed_size);

fn read_a_byte(&mut self) -> io::Result<Option<u8>> {
Reading one byte at a time like this won't be efficient. Instead, read a block into the look-ahead buffer and use memchr::memmem::find to check for a data descriptor signature. (Be sure to leave size_of::<Magic>() bytes in the buffer, in case the descriptor signature straddles two blocks!)
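A minimal sketch of the block-scanning approach the reviewer describes, assuming memchr as a dependency; the function name, block size, and return value (the stream offset of a candidate signature) are illustrative rather than the PR's actual implementation:

```rust
use std::io::{self, Read};
use memchr::memmem;

/// ZIP data descriptor signature 0x08074b50 as it appears in the byte
/// stream (little-endian: "PK\x07\x08").
const DESCRIPTOR_SIG: [u8; 4] = [0x50, 0x4b, 0x07, 0x08];
const BLOCK_SIZE: usize = 8 * 1024;

/// Fill a buffer block by block, search each block with memmem::find, and
/// carry the last few bytes over to the next block so a signature that
/// straddles a block boundary is still found. Returns the stream offset of
/// the first candidate signature, or None at EOF.
fn scan_for_descriptor<R: Read>(reader: &mut R) -> io::Result<Option<u64>> {
    let mut buf = vec![0u8; BLOCK_SIZE];
    let mut carried = 0usize; // bytes kept from the previous block
    let mut offset = 0u64;    // stream offset corresponding to buf[0]

    loop {
        let n = reader.read(&mut buf[carried..])?;
        if n == 0 {
            return Ok(None); // EOF without a full signature
        }
        let filled = carried + n;
        if let Some(pos) = memmem::find(&buf[..filled], &DESCRIPTOR_SIG) {
            return Ok(Some(offset + pos as u64));
        }
        // Keep the tail in case the signature is split across blocks.
        carried = filled.min(DESCRIPTOR_SIG.len() - 1);
        let tail_start = filled - carried;
        buf.copy_within(tail_start..filled, 0);
        offset += tail_start as u64;
    }
}
```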
src/read.rs
if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer)
    == spec::Magic::DATA_DESCRIPTOR_SIGNATURE
{
    // potentially found a data descriptor
    // check if size matches
    let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) {
        Ok(data_descriptor) => data_descriptor,
        Err(_) => return None,
    };
    let data_descriptor_size = data_descriptor.compressed_size;

    if data_descriptor_size == self.number_read_total_actual as u32 {
        return Some(data_descriptor);
    }
}
interpret already checks whether the magic matches.
Suggested change:

- if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer)
-     == spec::Magic::DATA_DESCRIPTOR_SIGNATURE
- {
-     // potentially found a data descriptor
-     // check if size matches
-     let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) {
-         Ok(data_descriptor) => data_descriptor,
-         Err(_) => return None,
-     };
-     let data_descriptor_size = data_descriptor.compressed_size;
-     if data_descriptor_size == self.number_read_total_actual as u32 {
-         return Some(data_descriptor);
-     }
- }
+ let Ok(data_descriptor) = ZipDataDescriptor::interpret(&self.look_ahead_buffer) else {
+     return None;
+ };
+ let data_descriptor_size = data_descriptor.compressed_size;
+ if data_descriptor_size == self.number_read_total_actual as u32 {
+     Some(data_descriptor)
+ } else {
+     None
+ }
Check the CRC32 here as well.
This probably needs some architectural changes. The function above operates on the compressed zip data, while the CRC is calculated over the uncompressed data, so with the current architecture checking the CRC in this function is not possible.
Currently I have no good idea how to solve this. Still, the CRC is checked: if a spurious "data descriptor" is found in the stream (which only happens if the file size in the data descriptor matches), the CRC is likely wrong. In that case only the first part of the file is returned to the user, but an error is then thrown when the CRC check fails on reaching the stream EOF.
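To illustrate the layering problem: the descriptor scan sees compressed bytes, while the CRC32 covers uncompressed bytes, so the check can only happen above the decompressor. A small sketch using flate2 and crc32fast (both existing dependencies of the crate); the helper function is illustrative only:

```rust
use std::io::{self, Read};
use flate2::read::DeflateDecoder;

/// The descriptor scan runs on the *compressed* byte stream, but the CRC32
/// in the data descriptor covers the *uncompressed* data, so the checksum
/// can only be computed above the decompressor, once the entry has been
/// read to its end.
fn crc32_of_deflated_entry<R: Read>(compressed: R) -> io::Result<u32> {
    let mut decoder = DeflateDecoder::new(compressed); // uncompressed view
    let mut hasher = crc32fast::Hasher::new();
    let mut buf = [0u8; 4096];
    loop {
        let n = decoder.read(&mut buf)?;
        if n == 0 {
            break; // EOF of the decompressed entry
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize())
}
```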
Review todos:
…m' into feature-read-from-seekable-stream
let data_descriptor_size = data_descriptor.compressed_size;

if data_descriptor_size == self.number_read_total_actual as u32 {
    // TODO: check CRC32 here as well
Check notice (Code scanning / devskim): A "TODO" or similar was left in source code, possibly indicating incomplete functionality.
run cargo fmt & clippy
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
When reading a zip file from within another zip file, the inner zip file's stream is not seekable. Given that we may be dealing with large files, it might be unfeasible to buffer the whole nested file in RAM. It may be better to stream the zip file instead, relying on the local file headers and data descriptors to determine the entries of the zip file and completely ignoring the central directory at the end of the file.
Current project state:
My contribution:
Current state:
Security Considerations:
Reading a zip archive this way, without knowing the file sizes in advance, may result in parsing inconsistencies. Since the content of a single file may be attacker controlled, we must assume that an attacker may craft a file that contains a sequence that looks like a data descriptor, followed by the header of a new zip file entry.
Parsing the file in a streaming fashion will then return two files, while parsing with the central directory information will result in just one file being read. This should be regarded as a security risk. Still, being able to read zip files with data descriptors in a streamed fashion is useful for some applications. I therefore introduced the types UntrustedValue and MaybeUntrusted. See https://github.com/0xCCF4/zip2/blob/f6b5da958028c5cb90d6de7b5b56a378de52fc95/src/result.rs#L110
If a library user would like to use this streaming functionality, this security risk must be explicitly accepted.
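For illustration, a hedged sketch of what such wrapper types might look like; the actual definitions live in src/result.rs on the PR branch (linked above) and may differ in names and methods:

```rust
/// Hypothetical sketch of the wrapper idea: a value parsed from
/// attacker-controllable bytes is wrapped so the caller must explicitly
/// acknowledge the risk before using it.
pub struct UntrustedValue<T> {
    value: T,
}

impl<T> UntrustedValue<T> {
    pub fn wrap(value: T) -> Self {
        Self { value }
    }

    /// Explicitly accept that the value may be attacker controlled.
    pub fn use_untrusted_value(self) -> T {
        self.value
    }
}

/// A value that is either known to be trustworthy or explicitly untrusted.
pub enum MaybeUntrusted<T> {
    Ok(T),
    Untrusted(UntrustedValue<T>),
}
```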
Related issues:
#162