Merritt Collection Health Analysis and Reporting

Overview

As evidenced by our recent recertification with CoreTrustSeal, in-depth curation across the wide range of collections being preserved in Merritt is a task that stands to benefit from the assistance of an automated process capable of providing regular reports to depositors with key insights into collection composition.

Unlike a traditional curation workflow, where archivists work to define metadata and Submission Information Package (SIP) composition in advance of submission to a repository, enabling the generation of a collection “health” report will provide a baseline status for existing collections whose establishment may or may not predate current depositors at the variety of libraries and memory organizations that use Merritt as a digital preservation repository.

The Merritt repository has been operational for over ten years. In its initial years, although object guidelines were provided by CDL, many objects that do not adhere to these were submitted. For example, an object may incorporate object-level ERC metadata but lacks internal metadata sidecar files that serve to further define its collection context and provenance. A retroactive process through which such characteristics are flagged will assist owners to assess whether a specific collection is in need of additional curation, resulting in the creation of new object versions.

We intend the collection health check process to examine a number of file and object conditions when it is executed. Checks will be based on both file and object datasets, each of which will gather characteristics of the data type at hand.

The definition of these conditions are intended to incorporate both internal best practices as defined at campus libraries and the CDL, as well as external community resources such as the Library of Congress Recommended Formats Statement (RFS) and the Digital Preservation Coalition’s Bit-List. Community resources provide input on preferred and acceptable formats within high-level categories, as well as cite a range of “endangered” file formats. While others, such as the DigiPres Commons Formats and MIME Types lists provide references linked to backing information sourced from WikiData and Apache Tika.

Use Cases

Configure

Collection check configuration:

Enable the programmatic configuration of collection checks including result statuses: pass, fail, warn, skip, info.
Enable configuration of mime type usage.

Enhance

Collection check enhancements:

Provide the ability to enhance collection checks over time according to campus partner feedback.

Examine and Report: Files

Collection analysis at the file level:

Enable data extraction at the file level for analysis.
Provide for bulk extraction as well as daily extraction.
Generate collection-specific report with results of analysis.

Examine and Report: Objects

Collection analysis at the object level:

Enable data extraction at the object level for analysis.
Provide for bulk extraction as well as daily extraction.
Generate collection-specific report with results of analysis.

Usage

The collection health process will center on scanning selected collections for depositors. As it entails generating datasets based on both collection files and objects, it is preferred from a compute and infrastructure perspective to scan individual collections on request, rather than scan all holdings that reside in the repository.

Collection Health Checks

At the core of the health check process is a series of customizable candidate tests that are to be run on the file and object datasets generated for a collection. These tests are enumerated in an easily editable YAML-based analysis schema which defines the conditions for whether a test passes or fails, or a warning or informational status message is rendered. Examples of checks to run include but are not limited to:

File level

MIME type sustainability
Filename validation
File extension validation
Empty file detection

Object level

Presence of meaningful object-level ERC metadata
Presence of sidecar metadata file(s)
Local ID conforms to established collection-specific convention
Presence and count of content files
Version count and range

Implementation Strategy

Communications:

Messaging regarding the Collection Health project to take place:
- Directly with campus partners
- Through DPLG overview
- Use and review of file level data dashboard

Prototyping:

File level data extraction and visualization in OpenSearch. Extract file level data and generate a dashboard in OpenSearch that reveals this for review.
Object level data extraction
- Storage of object data as JSON in MySQL records
- Process to update object data records in OpenSearch

Leveraging OpenSearch

The OpenSearch frontend will provide the visual analysis tooling necessary to visualize and build dashboards within for communicating health check reports to campus partners. Integration points and uses of OpenSearch infrastructure include but are not limited to:

Generating JSON files from MySQL records
Capturing collection name, owner, mime-type, bytes, etc.
Creating OpenSearch visualizations
Linking visualizations through a dashboard

Pre-processing data to:

Enable the ability for us to recognize specific mime-types according to preferred, acceptable formats in the LOC RFS
Flag outliers mime-types in a collection
Recognize objects that:
- Have metadata sidecar files
- Meaningful object-level ERC metadata
Recognize files that:
- Are within a typical byte size (file and object basis)
- Adhere to naming conventions in a collection

Provide a means to:

Review collection report results
Annotate/comment on results
Lend enough information for owners to make decisions on file migration

Collection Health Object Analysis

Understanding the Object Health Data Model
Object Resolution

Set environment (DEV)

The DEV version of this application should run against a docker stack that uses the Merritt DEV database.

docker-compose up -d

New Database Table

Copy to https://github.com/CDLUC3/mrt-admin-lambda/blob/main/merrit-billing/schema.sql when complete.

/*
DROP TABLE IF EXISTS object_health_json;
*/
CREATE TABLE object_health_json (
  inv_object_id int,
  build json,
  build_updated datetime,
  analysis json,
  analysis_updated datetime,
  tests json,
  tests_updated datetime,
  exported datetime,
  UNIQUE INDEX object_id(inv_object_id),
  INDEX i_build(build_updated),
  INDEX i_analysis(analysis_updated),
  INDEX i_tests(tests_updated),
  INDEX i_exported(exported)
);

Install

bundle install

Invocation

bundle exec ruby object_health.rb

System Design

Initial Analysis - Built from RDS Data

  graph TD;
      subgraph Object Health Publishing Process
        GATHER(Gather Objects)
        BUILD(Build Objects)
        ANALYZE(Analyze Objects)
        TEST(Test Objects)
        Publish(Publish Changes)
        CODE["Analysis Code
          - Rules File - Yaml
          - Object Analysis Tasks
          - Object Tests
        "]
      end
      subgraph InventoryDatabase
        INVO>inv.inv_objects]
        OM[/Object Metadata/]
        OF[/Object Files/]
      end
      subgraph Billing Database 
        subgraph billing.object_health_json
          JB[/Object Build Json/]
          JA[/Object Analysis Json/]
          JT[/Object Tests Json/]
        end
      end
      subgraph OpenSearch
        OSOH[\OpenSearch Object Health Index\]
      end
      INVO-->GATHER
      OM-.->GATHER
      OF-.->GATHER
      GATHER-->BUILD
      GATHER-->ANALYZE
      GATHER-->TEST
      BUILD<-->JB
      ANALYZE<-->JA
      TEST<-->JT
      GATHER-->Publish
      Publish-->OSOH

Loading

Statistical Analysis

This analysis will require complex queries to run weekly to support the analysis.

  graph TD;
      subgraph Object Health Publishing Process
        GATHER(Gather Objects)
      end
      subgraph InventoryDatabase
        OF[/Object Files/]
      end
      subgraph Billing Database 
        AQ[/"Analysis Queries (run weekly from INV DB)
        - duplicate checksum
        - statistically unusual file size"/]
      end
      OF-.->GATHER
      AQC(Analysis Queries Run by Cron)
      AQC-->AQ
      OF-->AQC
      AQ-->GATHER

Loading

Bitstream Analysis

These components will be more compuationally expensive to implement.
The results of these analyses should feed into the existing Object Health process.

  graph TD;
      subgraph Object Health Publishing Process
        GATHER(Gather Objects)
      end
      subgraph InventoryDatabase
        OF[/Object Files/]
      end
      subgraph Billing Database 
        BT[/"Bitstream Test Results (future)
        - format identification
        - PII scan
        - accessiblity scan
        "/]
      end
      OF-.->GATHER
      BITSCAN("Bitstream Scan Process
      assumes a cloud solution will exist")
      OF-->BITSCAN
      BITSCAN<-->BT
      CLOUD((Cloud Storage))
      CLOUD-->BITSCAN
      BT-->GATHER

Loading

Estimates - Initial Analysis

Retrospective Json Generation

4M object * 8KB/object = 32GB
- or 3M if we exclude specific collections
32GB RDS storage (billing.object_health_json)
32GB OpenSearch storage (objhealth index)
Compute to produce 4M JSON documents
- use available compute on Merritt batch server
- run within Lambda

Prospective Json Generation (adds and updates to objects)

8K/object added to RDS and OpenSearch
Compute to process updates

Test refinement

Compute to partially re-process 4M objects

Estimates - Statistical Analysis

Weekly cron jobs
Additional join tables
- repeated checksums (1M rows)
- average/mean size per mime per collection (may already exist in the billing database)

Estimates - Bitstream Analysis

Tasks to run

File identification
- All files 34M
PII scan?
- Data, Text, Image? (30M)
Accessibility scan?
- Text, Video? (20M)

Content Breakdown

audio - 174,608
container - 448,449
data -- Total -- 1,579,481
image -- Total -- 11,689,378
text -- Total -- 19,797,445
video -- Total -- 68,931

Cost

Assumption: 1K json per bitstream = 34G in RDS
Compute per bitstream - will need to calculate by service
- File Id - open source
- Accessibility - open source?
- PII scan - vendor solution
Bitstream retrieval - 500T

Script Notes

Build Sample Objects Across All Collections

bundle exec ruby object_health.rb -bat --limit 10 --tag=tag_test_set --loop=5

Build Sample Objects in Default Collection (merritt_demo)

bundle exec ruby object_health.rb -bat --limit 10

Build Sample Objects in a Specified Collection (merritt_demo)

bundle exec ruby object_health.rb -bat --mnemonic=escholarship --limit 10

Rebuild Objects by id

bundle exec ruby object_health.rb -bat --id=10

Re-build all objects

WARNING. This will take a long time

Clear built objects in a collection and rebuild 50

bundle exec ruby object_health.rb -bat --limit 50 --mnemonic=escholarship --clear-build

Resume build/rebuild of objects in a collection

bundle exec ruby object_health.rb -bat --limit 50 --mnemonic=escholarship

Clear built objects and rebuild

bundle exec ruby object_health.rb -bat --limit 500 --query=has-build --clear-build

Continue rebuild without clearing

bundle exec ruby object_health.rb -bat --limit 10000 --query=has-build

Re-analyze all objects

Clear analysis objects and re-analyze

bundle exec ruby object_health.rb -at --limit 10000 --query=has-build --clear-analysis

Re-analyze without clearing

bundle exec ruby object_health.rb -at --limit 10000 --query=has-build

Re-test

bundle exec ruby object_health.rb -t --limit 10000 --query=has-build --clear-tests

Files

README.md

Latest commit

History

README.md

File metadata and controls

Merritt Collection Health Analysis and Reporting

Overview

Use Cases

Configure

Enhance

Examine and Report: Files

Examine and Report: Objects

Usage

Collection Health Checks

File level

Object level

Implementation Strategy

Leveraging OpenSearch

Collection Health Object Analysis

Set environment (DEV)

New Database Table

Install

Invocation

System Design

Initial Analysis - Built from RDS Data

Statistical Analysis

Bitstream Analysis

Estimates - Initial Analysis

Retrospective Json Generation

Prospective Json Generation (adds and updates to objects)

Test refinement

Estimates - Statistical Analysis

Estimates - Bitstream Analysis

Tasks to run

Content Breakdown

Cost

Script Notes

Build Sample Objects Across All Collections

Build Sample Objects in Default Collection (merritt_demo)

Build Sample Objects in a Specified Collection (merritt_demo)

Rebuild Objects by id

Re-build all objects

Re-analyze all objects

Re-test