LoadPackageArchive

This driver uses MiniZip to read the ZIP directory information and the package signature for each .nupkg on NuGet.org. This information is stored in Azure Table Storage for other drivers to use. This is an optimization to reduce the amount of data downloaded from NuGet.org's public APIs in later steps.

If a driver does not need the actual package content and only needs the ZIP file listing (relatively small data), it can use the `PackageFileService`, which reads the data produced by this driver.


`CatalogScanDriverType` enum value	`LoadPackageArchive`
Driver implementation	`LoadPackageArchiveDriver`
Processing mode	process latest catalog leaf per package ID and version
Cursor dependencies	V3 package content: this driver needs the .nupkg from the package content resource
Components using driver output	`PackageAssemblyToCsv`: uses ZIP file listing to check for assemblies; `PackageAssetToCsv`: uses ZIP file listing to determine package assets; `PackageCertificateToCsv`: needs the package signature; `PackageCompatibilityToCsv`: needs the ZIP file listing for compatibility computation; `PackageContentToCsv`: needs the ZIP file listing to skip packages without the desired files; `PackageFileToCsv`: uses ZIP file listing for hashed entry list; `PackageSignatureToCsv`: needs the package signature
Temporary storage config	none
Persistent storage config	Table Storage: `PackageArchiveTableName`: ZIP directory and signature bytes are stored using MessagePack and `WideEntityStorageService`
Output CSV tables	none

Algorithm

A batch of catalog leaf items is passed to the driver. For each catalog leaf, MiniZip is used to fetch just the ZIP directory data of the NuGet package (.nupkg) and the (uncompressed) NuGet package signature. This is done with a small number of HTTP HEAD and GET Range requests so that the whole package is not downloaded.
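
The ranged-read idea can be sketched in Python without any network access. This is a minimal illustration, not MiniZip's implementation: `fetch_central_directory` and `read_range` are hypothetical names, `read_range(first, last)` stands in for an HTTP GET with a `Range: bytes=first-last` header, and `total_length` stands in for the `Content-Length` returned by an HTTP HEAD request. The sketch assumes a plain (non-ZIP64) archive.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4s4H2LH"  # fixed part of the end of central directory record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes
MAX_COMMENT = 0xFFFF  # the archive comment may follow the record


def fetch_central_directory(read_range, total_length):
    """Return (offset, directory_bytes) using two ranged reads."""
    # Read 1: the tail of the file, large enough to contain the EOCD record.
    tail_start = max(0, total_length - (EOCD_SIZE + MAX_COMMENT))
    tail = read_range(tail_start, total_length - 1)
    position = tail.rfind(EOCD_SIGNATURE)
    if position < 0:
        raise ValueError("end of central directory record not found")
    fields = struct.unpack_from(EOCD_FORMAT, tail, position)
    directory_size, directory_offset = fields[5], fields[6]
    # Read 2: only the central directory itself.
    directory = read_range(directory_offset, directory_offset + directory_size - 1)
    return directory_offset, directory


def build_sample_package():
    """Build an in-memory ZIP that plays the role of a .nupkg (made-up content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("Example.nuspec", "<package />")
        archive.writestr("lib/net8.0/Example.dll", bytes(200_000))
    return buffer.getvalue()


package = build_sample_package()
fetched_byte_counts = []


def read_range(first, last):
    # Simulated ranged GET against the in-memory package; records bytes served.
    fetched_byte_counts.append(last - first + 1)
    return package[first:last + 1]


offset, directory = fetch_central_directory(read_range, len(package))
print(f"fetched {sum(fetched_byte_counts)} of {len(package)} bytes")
```

A real client would also reuse the tail bytes when the central directory already falls inside them, and would handle ZIP64 records, both of which this sketch leaves out.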

These two pieces of information (ZIP directory and signature) are serialized and compressed using MessagePack and then written into Azure Table Storage as "wide entities". A wide entity is a concept invented for this project: a small blob is split into segments that are stored in Azure Table Storage. This is an alternative to writing blobs to Azure Blob Storage. Wide entities can be read and written in batches whereas blobs cannot. This improves performance and cost, with the drawback that the size of each stored item is bounded.
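
The segmentation idea can be sketched as follows. This is an illustration only and does not reflect the actual layout used by `WideEntityStorageService`: the row key scheme, the `SegmentCount` and `Data` property names, and the size limits passed in are all assumptions made for the example.

```python
def split_into_segments(data, max_segment_size):
    """Split a byte string into chunks of at most max_segment_size bytes."""
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    segments = [
        data[start:start + max_segment_size]
        for start in range(0, len(data), max_segment_size)
    ]
    return segments or [b""]  # an empty blob still occupies one segment


def to_wide_entity(partition_key, row_key_prefix, data, max_segment_size, max_segments):
    """Represent one small blob as a list of table rows (hypothetical layout)."""
    segments = split_into_segments(data, max_segment_size)
    if len(segments) > max_segments:
        # The size bound mentioned above: a wide entity cannot grow without limit.
        raise ValueError("data is too large to store as a wide entity")
    return [
        {
            "PartitionKey": partition_key,
            "RowKey": f"{row_key_prefix}~{index:03d}",
            "SegmentCount": len(segments),
            "Data": segment,
        }
        for index, segment in enumerate(segments)
    ]


def from_wide_entity(rows):
    """Reassemble the original bytes from the rows of one wide entity."""
    ordered = sorted(rows, key=lambda row: row["RowKey"])
    if not ordered or len(ordered) != ordered[0]["SegmentCount"]:
        raise ValueError("wide entity is missing segments")
    return b"".join(row["Data"] for row in ordered)


payload = bytes(range(256)) * 10  # 2,560 bytes of sample data
rows = to_wide_entity("example.package", "1.0.0", payload, 1000, 5)
restored = from_wide_entity(rows)
print(len(rows), "rows,", len(restored), "bytes restored")
```

Because every segment is an ordinary table row, several rows can be written or read together in one table operation, which is the batching advantage described above.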

The ZIP directory is represented using the MZip format, which was introduced by the MiniZip library. It is essentially all of the bytes of the ZIP directory plus the offset of those bytes in the original ZIP file. From this, a virtual Stream of the ZIP file can be created, which lets ZIP reader APIs fully explore the ZIP directory but not read any ZIP entry content. The purpose of the format is to minimize the amount of data stored for each ZIP file while still allowing full inspection of some of the most important data in the ZIP, i.e. the ZIP central directory.
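
The "directory bytes plus offset" idea can be demonstrated with Python's standard `zipfile` module. This sketch does not implement the real MZip byte layout; `to_directory_snapshot` and `open_directory_snapshot` are hypothetical helpers, and the sketch assumes an archive with no comment and no ZIP64 records. Instead of a true virtual stream, it pads the missing entry content with zeros in memory, which is simpler to show but wasteful for large packages.

```python
import io
import struct
import zipfile

EOCD_FORMAT = "<4s4H2LH"  # fixed part of the end of central directory record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)


def to_directory_snapshot(zip_bytes):
    """Keep only the bytes from the central directory to the end of the file."""
    fields = struct.unpack_from(EOCD_FORMAT, zip_bytes, len(zip_bytes) - EOCD_SIZE)
    if fields[0] != b"PK\x05\x06":
        raise ValueError("unsupported ZIP layout for this sketch")
    directory_offset = fields[6]
    return directory_offset, zip_bytes[directory_offset:]


def open_directory_snapshot(offset, tail):
    """Open the snapshot as a ZIP whose entry content is absent."""
    # Zero padding stands in for the entry data that was never stored, so the
    # directory sits at the same file offset as in the original archive.
    return zipfile.ZipFile(io.BytesIO(bytes(offset) + tail))


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("Example.nuspec", "<package />")
    archive.writestr("lib/net8.0/Example.dll", bytes(50_000))
original = buffer.getvalue()

offset, tail = to_directory_snapshot(original)
snapshot = open_directory_snapshot(offset, tail)
print(snapshot.namelist())
print(f"kept {len(tail)} of {len(original)} bytes")
```

Listing entries and reading their metadata works on the snapshot, while attempting to read an entry's content fails because the local file headers and data are not present.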
