LoadPackageArchive

This driver uses MiniZip to read the ZIP directory information and the package signature for each .nupkg on NuGet.org. This information is stored in Azure Table Storage for other drivers to use. This is an optimization that reduces the amount of data downloaded from NuGet.org's public APIs in later steps.

If a driver does not need actual package content and only needs the ZIP file listing (relatively small data), it can use the PackageFileService which reads the data produced by this driver.

|                                    |      |
| ---------------------------------- | ---- |
| CatalogScanDriverType enum value   | LoadPackageArchive |
| Driver implementation              | LoadPackageArchiveDriver |
| Processing mode                    | process latest catalog leaf per package ID and version |
| Cursor dependencies                | V3 package content: this driver needs the .nupkg from the package content resource |
| Components using driver output     | PackageAssemblyToCsv: uses ZIP file listing to check for assemblies<br />PackageAssetToCsv: uses ZIP file listing to determine package assets<br />PackageCertificateToCsv: needs the package signature<br />PackageCompatibilityToCsv: needs the ZIP file listing for compatibility computation<br />PackageContentToCsv: needs the ZIP file listing to skip packages without the desired files<br />PackageFileToCsv: uses ZIP file listing for hashed entry list<br />PackageSignatureToCsv: needs the package signature |
| Temporary storage config           | none |
| Persistent storage config          | Table Storage:<br />PackageArchiveTableName: ZIP directory and signature bytes are stored using MessagePack and WideEntityStorageService |
| Output CSV tables                  | none |

Algorithm

A batch of catalog leaf items is passed to the driver. For each catalog leaf, MiniZip is used to fetch just the ZIP directory data of the NuGet package (.nupkg) and the (uncompressed) NuGet package signature. This is done with a minimal number of HTTP HEAD and GET Range requests so that the whole package is not downloaded.
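The ranged-read idea can be sketched in Python as follows. This is an illustration under stated assumptions, not MiniZip's implementation: `read_central_directory` and `read_range` are hypothetical names, ZIP64 archives are not handled, and the end of central directory record is located with a simple backwards search for its signature.

```python
import struct
from typing import Callable, Tuple

EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory record signature
EOCD_MIN_SIZE = 22              # size of the record without a trailing comment
MAX_COMMENT_SIZE = 0xFFFF       # the comment length field is 16 bits


def read_central_directory(
    size: int, read_range: Callable[[int, int], bytes]
) -> Tuple[int, bytes]:
    """Return (offset, bytes) of the ZIP central directory using ranged reads.

    size is the total archive length and read_range(offset, length) returns
    that slice of the archive. Both are hypothetical stand-ins for what an
    HTTP client would provide.
    """
    # Read the tail of the file, where the end of central directory lives.
    tail_length = min(size, EOCD_MIN_SIZE + MAX_COMMENT_SIZE)
    tail_start = size - tail_length
    tail = read_range(tail_start, tail_length)

    # Simplification: take the last occurrence of the signature.
    index = tail.rfind(EOCD_SIGNATURE)
    if index < 0 or index + EOCD_MIN_SIZE > len(tail):
        raise ValueError("end of central directory record not found")

    # Central directory size and offset sit at bytes 12 and 16 of the record.
    cd_size, cd_offset = struct.unpack_from("<II", tail, index + 12)
    if cd_size == 0xFFFFFFFF or cd_offset == 0xFFFFFFFF:
        raise NotImplementedError("ZIP64 is not handled in this sketch")

    # Reuse the tail when it already contains the whole central directory.
    if cd_offset >= tail_start:
        start = cd_offset - tail_start
        return cd_offset, tail[start:start + cd_size]
    return cd_offset, read_range(cd_offset, cd_size)
```

Over HTTP, `size` would typically come from the `Content-Length` of a HEAD response, and `read_range` would issue a GET request with a `Range: bytes=start-end` header. For a small package the first tail read already covers the central directory, so a single ranged GET is enough.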

These two pieces of information (ZIP directory and signature) are serialized and compressed using MessagePack and then written into Azure Table Storage as "wide entities". A wide entity is a concept invented for this project that allows a small blob to be split into segments and stored in Azure Table Storage instead of Azure Blob Storage. Wide entities can be read and written in batches, whereas blobs cannot. This improves performance and reduces cost, with the drawback that a wide entity is bounded in size.
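A minimal Python sketch of the segmentation idea follows. It does not reproduce the layout used by WideEntityStorageService: the property names (`C0`, `C1`, ...), the row key suffix, the `SegmentCount` field, and the default limits are all assumptions made for illustration. The 64 KiB chunk size reflects the Azure Table Storage limit on a single binary property.

```python
from typing import Dict, List

MAX_CHUNK_SIZE = 64 * 1024  # Azure Table Storage limit for one binary property


def to_segments(
    partition_key: str,
    row_key_prefix: str,
    payload: bytes,
    chunk_size: int = MAX_CHUNK_SIZE,
    chunks_per_entity: int = 8,   # hypothetical, keeps each entity well under 1 MiB
    max_segments: int = 16,       # hypothetical bound on the total size
) -> List[Dict[str, object]]:
    """Split payload into table entities that each hold a few binary chunks."""
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    chunks = chunks or [b""]
    entities: List[Dict[str, object]] = []
    for first in range(0, len(chunks), chunks_per_entity):
        group = chunks[first:first + chunks_per_entity]
        entity: Dict[str, object] = {
            "PartitionKey": partition_key,
            "RowKey": f"{row_key_prefix}~{len(entities):02d}",
            "ChunkCount": len(group),
        }
        for index, chunk in enumerate(group):
            entity[f"C{index}"] = chunk
        entities.append(entity)
    if len(entities) > max_segments:
        raise ValueError("payload is too large for a wide entity")
    # The first segment records how many segments make up the whole payload.
    entities[0]["SegmentCount"] = len(entities)
    return entities


def from_segments(entities: List[Dict[str, object]]) -> bytes:
    """Reassemble the payload from all segments of one wide entity."""
    ordered = sorted(entities, key=lambda e: e["RowKey"])
    if ordered[0].get("SegmentCount") != len(ordered):
        raise ValueError("missing or extra segments")
    return b"".join(
        e[f"C{index}"] for e in ordered for index in range(e["ChunkCount"])
    )
```

Because every segment shares a partition key, the segments of one or more wide entities could be written or read together in a single table batch, which is the property the paragraph above contrasts with blobs. The `max_segments` check shows where the size bound comes from.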

The ZIP directory is represented using the MZip format, which was invented by the MiniZip library. It is essentially all of the bytes of the ZIP directory plus a file offset in the original ZIP file. This allows a virtual Stream of the ZIP file to be created, which lets ZIP reader APIs fully explore the ZIP directory but not read any ZIP entry content. The purpose of the format is to minimize the amount of data stored for each ZIP file while still allowing full inspection of some of the most important data in the ZIP, i.e. the ZIP central directory.
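The virtual stream idea can be illustrated in Python with the standard `zipfile` module. This is a sketch of the concept only and does not reproduce the actual MZip byte layout or the MiniZip API: `DirectoryOnlyZipFile` and `list_entries` are hypothetical names, and `tail` is assumed to hold every byte of the original archive from the start of the central directory to the end of the file.

```python
import io
import zipfile


class DirectoryOnlyZipFile(io.RawIOBase):
    """Seekable read-only view of a ZIP file of which only the tail is stored.

    Reads that fall before tail_offset return zero bytes, so a ZIP reader can
    parse the central directory but cannot read any entry content.
    """

    def __init__(self, tail: bytes, tail_offset: int):
        super().__init__()
        self._tail = tail
        self._tail_offset = tail_offset
        self._size = tail_offset + len(tail)  # length of the original archive
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}
        position = base[whence] + offset
        if position < 0:
            raise OSError("negative seek position")
        self._pos = position
        return position

    def readinto(self, buffer) -> int:
        count = min(len(buffer), self._size - self._pos)
        if count <= 0:
            return 0
        start, end = self._pos, self._pos + count
        data = bytearray(count)  # zeros stand in for the missing content
        stored_from = max(start, self._tail_offset)
        if stored_from < end:
            data[stored_from - start:] = self._tail[
                stored_from - self._tail_offset:end - self._tail_offset
            ]
        buffer[:count] = data
        self._pos = end
        return count


def list_entries(tail: bytes, tail_offset: int):
    """List (name, uncompressed size) pairs using only the stored tail."""
    with zipfile.ZipFile(DirectoryOnlyZipFile(tail, tail_offset)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

Listing entries works because `zipfile` only reads the end of the file and the central directory when it opens an archive. Attempting to read an entry fails, since the local file header at the entry's offset is not part of the stored data.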