DB v6 distribution approach #2125

wagoodman · 2024-09-17T19:21:02Z

Today the grype DB is distributed via a hosted listing.json file containing URLs to DBs, including historical entries going back N days. A few points about this approach:

The listing file serves two purposes: finding the latest DB and accessing historical DBs. The former is the primary use case of the listing file; the latter is added weight.
The listing file takes absolute URLs, not relative paths. This makes a listing file non-portable between environments, so it needs to be rebuilt for each environment it is deployed to.
Distribution definitions are not tied to a DB schema version, and the listing contains entries for all schema versions. This prevents us from making breaking changes to the listing file format itself. It also requires more coordination when updating the listing file nightly (upload all databases in a fan-out, then fan in to update the listing; ideally there would be no need to fan in, which causes lots of fun failure modes to think through). This is a grype-db repo concern, but it motivates the changes here.
We do not check the DB checksum on start, mostly because this is expensive, but there have also been inefficiencies in this load path.

Based on these points, here are the suggested changes:

we support listings for a single DB schema only -- each new schema will be hosted in a new location.
replace the single listing file with two files: latest.json and history.json, split based on use case. This means that the most common use case (latest.json) is as small as possible, removing pressure from the CDN.
split the db.Curator by use case: DB distribution vs access to an already installed DB.
use xxh64 (not sha256) for the DB checksum, since xxh64 is rather quick when checking large DB files
use relative URLs (relative to where the latest.json/history.json files are hosted, not absolute ones). Note: we should still be able to express absolute URLs for operational fallback positions, but this should be the exception, not the norm.
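As a rough illustration of the relative-URL point, here is a minimal Python sketch of how a client might resolve an archive path against the location of the listing file it just fetched. The host names and file names are hypothetical, and `resolve_archive_url` is an illustrative helper, not part of grype.

```python
from urllib.parse import urljoin

def resolve_archive_url(listing_url, path):
    """Resolve an archive path against the URL the listing file was fetched from."""
    # urljoin returns `path` unchanged when it is already an absolute URL,
    # which covers the "operational fallback" case described above.
    return urljoin(listing_url, path)

# Hypothetical hosting location, for illustration only.
latest_url = "https://example.com/databases/v6/latest.json"

print(resolve_archive_url(latest_url, "grype-db_v6.tar.gz"))
# https://example.com/databases/v6/grype-db_v6.tar.gz
print(resolve_archive_url(latest_url, "https://mirror.example.org/v6/grype-db_v6.tar.gz"))
# https://mirror.example.org/v6/grype-db_v6.tar.gz
```

Because the path is resolved at read time, the same listing file can be copied to another environment without being rebuilt.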

`latest.json` file

{
  "schemaVersion": 6,
  "status": "active",
  "archive": {
    "database": {
      "built": "2024-08-23T12:34:56Z",
      "checksum": "xxhash64:1a2b3c4d5e6f7g8h", 
      "providers": [
        {
          "name": "nvd",
          "compiled": "2024-08-23T08:00:00Z"
        },
        {
          "name": "github",
          "compiled": "2024-08-23T09:00:00Z"
        },
        ...
      ]
    },
    "path": "databases/v6/grype-db_v6_2024-08-23T11:22:22Z_1724213998.tar.gz",
    "checksum": "sha256:dd0e762e39a5905f9a622f00a361b6036c811b33bf9c5139fddaf5013db904d9"
  }
}

This file would describe only a single DB. This also combines the metadata.json and provider-metadata.json concerns (so only metadata.json needs to be packaged into the tar).
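To make the `algorithm:digest` checksum convention concrete, the following Python sketch checks downloaded archive bytes against the `archive.checksum` field. It only handles the sha256 archive checksum (xxh64 is not in the Python standard library), and the helper names and sample document are assumptions made for illustration.

```python
import hashlib

def parse_checksum(value):
    """Split an "algorithm:hexdigest" string into its two parts."""
    algorithm, sep, digest = value.partition(":")
    if not sep or not algorithm or not digest:
        raise ValueError(f"malformed checksum: {value!r}")
    return algorithm, digest.lower()

def verify_archive(data, expected):
    """Return True when the sha256 of `data` matches the expected checksum string."""
    algorithm, digest = parse_checksum(expected)
    if algorithm != "sha256":
        raise ValueError(f"unsupported archive checksum algorithm: {algorithm}")
    return hashlib.sha256(data).hexdigest() == digest

# Stand-in for downloaded archive bytes; a real client would read the file from disk.
payload = b"example archive bytes"
latest = {
    "schemaVersion": 6,
    "status": "active",
    "archive": {
        "path": "grype-db_v6.tar.gz",  # hypothetical file name
        "checksum": "sha256:" + hashlib.sha256(payload).hexdigest(),
    },
}

print(verify_archive(payload, latest["archive"]["checksum"]))         # True
print(verify_archive(payload + b"x", latest["archive"]["checksum"]))  # False
```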

There is a status field with possible values:

active: the database is actively being maintained and distributed
deprecated: the database is still being distributed but is approaching end of life. Upgrade grype to avoid future disruptions.
inactive: the database is no longer being distributed. Users must build their own databases or upgrade grype.
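One way a client could act on the status field is sketched below in Python. The mapping from status to client behavior ("proceed", "warn", "fail") is an assumption for illustration; the proposal above defines only the three status values and their meaning.

```python
# Hypothetical client-side actions; only the status names come from the proposal.
ACTIONS = {
    "active": "proceed",   # use the database as-is
    "deprecated": "warn",  # use the database, but tell the user to upgrade grype
    "inactive": "fail",    # no database is being distributed for this schema
}

def action_for_status(status):
    """Map a listing status to the action a client might take."""
    try:
        return ACTIONS[status]
    except KeyError:
        # Treat unknown values as an error rather than guessing.
        raise ValueError(f"unknown status: {status!r}") from None

for status in ("active", "deprecated", "inactive"):
    print(status, "->", action_for_status(status))
```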

`history.json` file

{
  "schemaVersion": 6,
  "status": "active",
  "archives": [
    {
      // same entry as in latest.json for "archive"
    },
    ...
  ]
}
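As an example of the historical use case, this Python sketch picks the most recent archive built at or before a given time from a history.json document. The sample entries and file names are made up, and `latest_before` is an illustrative helper, not an existing grype function.

```python
from datetime import datetime, timezone

def parse_time(value):
    """Parse an RFC 3339 UTC timestamp such as 2024-08-23T12:34:56Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def latest_before(history, cutoff):
    """Return the newest archive built at or before `cutoff`, or None."""
    candidates = [
        a for a in history["archives"]
        if parse_time(a["database"]["built"]) <= cutoff
    ]
    return max(candidates, key=lambda a: parse_time(a["database"]["built"]), default=None)

# Hypothetical history document with trimmed-down archive entries.
history = {
    "schemaVersion": 6,
    "status": "active",
    "archives": [
        {"database": {"built": "2024-08-23T12:34:56Z"}, "path": "db-2024-08-23.tar.gz"},
        {"database": {"built": "2024-08-22T12:30:00Z"}, "path": "db-2024-08-22.tar.gz"},
        {"database": {"built": "2024-08-21T12:31:10Z"}, "path": "db-2024-08-21.tar.gz"},
    ],
}

cutoff = datetime(2024, 8, 23, tzinfo=timezone.utc)  # midnight UTC on Aug 23
print(latest_before(history, cutoff)["path"])  # db-2024-08-22.tar.gz
```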

How these distribution files relate to one another...

Another way to look at the contained information and how it is produced/consumed:

metadata.json (output from grype-db build) is made up of a single “database description”... used to generate a latest.json later in the process
- Schema version
- Built time
- Checksum (xxh64)
- List of provider info (name and compiled time)
latest.json (output from grype-db package) is made up of a single “archive description”, schema info, and the contained “database description”... used to populate/update history.json in the future:
- Schema version
- Status (e.g. active)
- Archive
  - Path
  - Checksum (sha256)
  - Database description
    - (same as metadata.json, except schema-version is left blank)
history.json is an array of “archive descriptions”, but otherwise is just like latest.json
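The flow described above (metadata.json feeds latest.json, which in turn feeds history.json) can be sketched in Python as follows. The function names, the choice to omit the schema version key from the nested database description, and the newest-first ordering of history entries are all assumptions; the checksum values are placeholders.

```python
import copy

def make_latest(metadata, archive_path, archive_checksum, status="active"):
    """Wrap a database description (metadata.json) in an archive description."""
    # Assumption: "left blank" is modeled by dropping the key from the nested description.
    database = {k: v for k, v in metadata.items() if k != "schemaVersion"}
    return {
        "schemaVersion": metadata["schemaVersion"],
        "status": status,
        "archive": {"database": database, "path": archive_path, "checksum": archive_checksum},
    }

def add_to_history(history, latest):
    """Return a copy of history.json with the latest archive added first."""
    updated = copy.deepcopy(history)
    updated["status"] = latest["status"]
    updated["archives"].insert(0, copy.deepcopy(latest["archive"]))  # newest first (assumption)
    return updated

metadata = {
    "schemaVersion": 6,
    "built": "2024-08-23T12:34:56Z",
    "checksum": "xxh64:0000000000000000",  # placeholder digest
    "providers": [{"name": "nvd", "compiled": "2024-08-23T08:00:00Z"}],
}
latest = make_latest(metadata, "grype-db_v6.tar.gz", "sha256:" + "0" * 64)  # placeholder values
history = add_to_history({"schemaVersion": 6, "status": "active", "archives": []}, latest)
print(len(history["archives"]))  # 1
```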

Comments / open questions

(from earlier conversations with @anchore/tools about this topic)

should we remove the providers data entirely from the listing use case, so that end users must query the DB for this info?
- edit: yes
should we leverage the CDN for archive compression concerns? (over compressing the payload ourselves, or even packaging into a tar)
should we get rid of the metadata.json and require clients to get this kind of information directly from the DB?
- edit: yes

Prototype branch for reference: https://github.com/anchore/grype/tree/db-v6-blob-store

wagoodman · 2024-11-13T22:46:45Z

Two of the open questions have been addressed and incorporated:

should we get rid of the metadata.json and require clients to get this kind of information directly from the DB?

In one of the latest grype store PRs we've done just this. Now there is a vulnerability.db.checksum file that is generated by grype (not included in the distributed tar), but otherwise there is no metadata.json anymore.

should we remove the providers data entirely from the listing use case, so that end users must query the DB for this info?

The providers information has been removed from latest.json.

wagoodman added the enhancement New feature or request label Sep 17, 2024

wagoodman added this to the DB v6 milestone Sep 17, 2024

anchoretoolsops added this to OSS Sep 17, 2024

wagoodman mentioned this issue Sep 17, 2024

Configure and use DB distribution URLs #2126

Open

wagoodman added the planning high level epic that should be broken into smaller tasks label Sep 17, 2024

This was referenced Sep 17, 2024

Split DB v6 Curator object #2127

Open

Change DB publish workflow to account for V6 anchore/grype-db#387

Open

wagoodman moved this to Ready in OSS Sep 17, 2024

wagoodman self-assigned this Sep 26, 2024

wagoodman moved this from Ready to In Progress in OSS Sep 26, 2024

This was referenced Sep 30, 2024

Add v6 distribution client #2150

Merged

Add v6 DB curator #2151

Merged

wagoodman mentioned this issue Nov 14, 2024

Add processor to workspace state anchore/vunnel#730

Merged
