This driver uses MiniZip to read the ZIP directory information and the package signature for each .nupkg on NuGet.org. This information is stored in Azure Table Storage for other drivers to use. This is an optimization that reduces the amount of data downloaded from NuGet.org's public APIs in later steps.
If a driver does not need actual package content and only needs the ZIP file listing (relatively small data), it can use the PackageFileService, which reads the data produced by this driver.
CatalogScanDriverType enum value | LoadPackageArchive |
Driver implementation | LoadPackageArchiveDriver |
Processing mode | process latest catalog leaf per package ID and version |
Cursor dependencies | V3 package content: this driver needs the .nupkg from the package content resource |
Components using driver output | PackageAssemblyToCsv: uses ZIP file listing to check for assemblies; PackageAssetToCsv: uses ZIP file listing to determine package assets; PackageCertificateToCsv: needs the package signature; PackageCompatibilityToCsv: needs the ZIP file listing for compatibility computation; PackageContentToCsv: needs the ZIP file listing to skip packages without the desired files; PackageFileToCsv: uses ZIP file listing for hashed entry list; PackageSignatureToCsv: needs the package signature |
Temporary storage config | none |
Persistent storage config | Table Storage: PackageArchiveTableName: ZIP directory and signature bytes are stored using MessagePack and WideEntityStorageService |
Output CSV tables | none |
A batch of catalog leaf items is passed to the driver. For each catalog leaf, MiniZip is used to fetch just the NuGet package (.nupkg) ZIP directory data and the (uncompressed) NuGet package signature. This is done via minimal HTTP HEAD and GET Range requests so that the whole package is not downloaded.
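The range-based approach can be illustrated with a minimal Python sketch. It is not MiniZip's implementation: `FakeRangeServer` and `locate_central_directory` are hypothetical names, the "server" is an in-memory stand-in for HTTP HEAD and GET Range, and ZIP64 archives are ignored. The idea is to read only the tail of the file, find the end of central directory record, and then fetch exactly the central directory bytes.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory record signature
EOCD_MIN_SIZE = 22              # fixed part of the record, without comment
MAX_COMMENT_SIZE = 0xFFFF       # the comment length is a 16-bit field


class FakeRangeServer:
    """In-memory stand-in for a server that supports HEAD and GET Range."""

    def __init__(self, content):
        self.content = content
        self.requests = []

    def head(self):
        # Plays the role of the Content-Length header from a HEAD response.
        return len(self.content)

    def get_range(self, first, last):
        # Plays the role of "Range: bytes=first-last" (both ends inclusive).
        self.requests.append((first, last))
        return self.content[first:last + 1]


def locate_central_directory(get_range, total_size):
    """Return (offset, size, entry_count) of the central directory."""
    tail_length = min(total_size, EOCD_MIN_SIZE + MAX_COMMENT_SIZE)
    tail = get_range(total_size - tail_length, total_size - 1)
    index = tail.rfind(EOCD_SIGNATURE)
    if index < 0 or index + EOCD_MIN_SIZE > len(tail):
        raise ValueError("end of central directory record not found")
    # Fields: signature, disk number, disk with directory, entries on disk,
    # total entries, directory size, directory offset, comment length.
    fields = struct.unpack_from("<4sHHHHIIH", tail, index)
    return fields[6], fields[5], fields[4]


# Build a small archive in memory to act as the remote .nupkg.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Example.nuspec", "<package />")
    archive.writestr("lib/net6.0/Example.dll", b"\x00" * 1024)

server = FakeRangeServer(buffer.getvalue())
total_size = server.head()
offset, size, entry_count = locate_central_directory(server.get_range, total_size)
directory = server.get_range(offset, offset + size - 1)
```

For a real package the tail read is small relative to the full .nupkg, so two range requests are typically enough to obtain the whole file listing in this sketch.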
These two pieces of information (ZIP directory and signature) are serialized and compressed using MessagePack and then written into Azure Table Storage as "wide entities". A wide entity is a concept invented for this project so that small blobs can be segmented into Azure Table Storage. This is an alternative to writing blobs to Azure Blob Storage. Wide entities can be read and written in batches whereas blobs cannot. Batching improves performance and reduces cost, but a wide entity is bounded in size.
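The segmentation idea behind wide entities can be sketched in Python as follows. This is an illustration only, not the WideEntityStorageService implementation: the function names, the `C0`, `C1`, ... property names, the `ChunkCount` property, the row key suffix format, and the chunk sizes are all assumptions made for the example. The payload is treated as opaque bytes, standing in for the MessagePack output.

```python
MAX_CHUNK_BYTES = 64 * 1024  # assumed limit for a single binary property
CHUNKS_PER_ENTITY = 8        # assumed, keeps each entity well under 1 MiB


def to_segments(partition_key, row_key_prefix, payload):
    """Split a payload into table entities (plain dicts in this sketch)."""
    chunks = [
        payload[i:i + MAX_CHUNK_BYTES]
        for i in range(0, len(payload), MAX_CHUNK_BYTES)
    ] or [b""]
    entities = []
    for segment_index, start in enumerate(range(0, len(chunks), CHUNKS_PER_ENTITY)):
        group = chunks[start:start + CHUNKS_PER_ENTITY]
        entity = {
            "PartitionKey": partition_key,
            # Zero-padded suffix so segments sort in order by row key.
            "RowKey": f"{row_key_prefix}~{segment_index:02d}",
            "ChunkCount": len(group),
        }
        for chunk_index, chunk in enumerate(group):
            entity[f"C{chunk_index}"] = chunk
        entities.append(entity)
    return entities


def from_segments(entities):
    """Reassemble the original payload from its segments, in any order."""
    ordered = sorted(entities, key=lambda entity: entity["RowKey"])
    return b"".join(
        entity[f"C{chunk_index}"]
        for entity in ordered
        for chunk_index in range(entity["ChunkCount"])
    )


# A payload of 1,024,000 bytes needs 16 chunks, so two entities here.
payload = bytes(range(256)) * 4000
entities = to_segments("example.package", "1.0.0", payload)
restored = from_segments(entities)
```

Because all segments of one payload share a partition key in this sketch, they could be written or read together in a single batch operation, which is where the size bound comes from: a batch can only hold so many entities and bytes.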
The ZIP directory is represented using the MZip format, which was invented by the MiniZip library. It is essentially all bytes of the ZIP directory plus the file offset of those bytes in the original ZIP file. From this, a virtual Stream of the ZIP file can be created, which lets ZIP reader APIs fully explore the ZIP directory but not read any ZIP entry content. The purpose of the format is to minimize the amount of data stored for each ZIP file while still allowing full inspection of some of the most important data in the ZIP, i.e. the ZIP central directory.
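The virtual stream idea can be demonstrated with a short Python sketch. It does not reproduce the actual MZip byte layout or the MiniZip API: `VirtualZipStream` is a hypothetical class that keeps only an offset and the bytes from that offset to the end of the file, and Python's standard `zipfile` module plays the role of the ZIP reader. The sketch assumes a small archive with no ZIP64 records and no archive comment.

```python
import io
import struct
import zipfile


class VirtualZipStream(io.RawIOBase):
    """Read-only stream that only knows the bytes from `offset` onward."""

    def __init__(self, offset, tail):
        super().__init__()
        self._offset = offset
        self._tail = tail
        self._length = offset + len(tail)  # length of the original file
        self._position = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._position

    def seek(self, position, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            position += self._position
        elif whence == io.SEEK_END:
            position += self._length
        elif whence != io.SEEK_SET:
            raise ValueError("unsupported whence value")
        if position < 0:
            raise OSError("negative seek position")
        self._position = position
        return position

    def readinto(self, buffer):
        count = min(len(buffer), self._length - self._position)
        if count <= 0:
            return 0
        if self._position < self._offset:
            # Entry content was never stored, only the directory.
            raise OSError("ZIP entry content is not available")
        start = self._position - self._offset
        buffer[:count] = self._tail[start:start + count]
        self._position += count
        return count


# Build a small archive, then keep only its central directory and trailer.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as archive:
    archive.writestr("lib/net6.0/Example.dll", b"\x00" * 4096)
    archive.writestr("Example.nuspec", "<package />")
raw = original.getvalue()

# With no archive comment, the last 22 bytes are the end of central
# directory record; the directory offset is its second-to-last field.
directory_offset = struct.unpack("<4sHHHHIIH", raw[-22:])[6]
stored_tail = raw[directory_offset:]

stream = VirtualZipStream(directory_offset, stored_tail)
with zipfile.ZipFile(stream) as virtual_archive:
    names = virtual_archive.namelist()
    sizes = {info.filename: info.file_size for info in virtual_archive.infolist()}
```

In this sketch, listing entries and inspecting their metadata works because the reader only touches the end of the file, while the stored data is much smaller than the original archive. Reading from a position before `directory_offset` raises an error, which mirrors the limitation that entry content cannot be read.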