HPC pipeline to integrate knowledge graphs from EMBL-EBI resources, the MONARCH Initiative KG, ROBOKOP, Ubergraph, and other sources into giant (multi-terabyte) materialised, clique-merged Neo4j+Solr+RocksDB databases.
The resulting graphs can be downloaded from https://ftp.ebi.ac.uk/pub/databases/spot/kg/ebi/
The pipeline is implemented as Rust programs with simple CLIs, orchestrated with Nextflow.
The primary output of the pipeline is a property graph for Neo4j. The input format (after ingest steps that extract data from KGX, RDF, and bespoke DB formats) is a set of simple JSONL files, to which "bruteforce" integration is applied:
- All strings that begin with any IRI or CURIE prefix from the Bioregistry are canonicalised to the standard CURIE form
- All property values that are the identifier of another node in the graph become edges
- Cliques of equivalent nodes are merged into single nodes
- Cliques of equivalent properties are merged into single properties (and for ontology-defined properties, the qualified safe labels are used)
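
The first two steps above (CURIE canonicalisation and turning identifier-valued properties into edges) can be illustrated with a minimal Rust sketch. It is not the pipeline's actual code: the `PREFIXES` table is a tiny hand-written stand-in for the Bioregistry, and `Node`, `Edge`, `canonicalise`, and `derive_edges` are hypothetical names chosen for this example. It also assumes that the longest matching prefix wins and that all property values are plain strings.

```rust
use std::collections::HashSet;

/// Hypothetical excerpt of a prefix table: (IRI or CURIE prefix, canonical prefix).
/// The real pipeline takes its prefixes from the Bioregistry; these are illustrative.
const PREFIXES: &[(&str, &str)] = &[
    ("http://purl.obolibrary.org/obo/MONDO_", "MONDO"),
    ("MONDO:", "MONDO"),
    ("mondo:", "MONDO"),
    ("http://purl.obolibrary.org/obo/HP_", "HP"),
    ("HP:", "HP"),
];

/// Rewrite `value` to canonical CURIE form if it starts with a known prefix;
/// otherwise return it unchanged. The longest matching prefix wins (an assumption).
fn canonicalise(value: &str) -> String {
    PREFIXES
        .iter()
        .filter(|(raw, _)| value.starts_with(*raw))
        .max_by_key(|(raw, _)| raw.len())
        .map(|(raw, curie)| format!("{}:{}", curie, &value[raw.len()..]))
        .unwrap_or_else(|| value.to_string())
}

/// A node as it might look after parsing one JSONL line (hypothetical shape).
struct Node {
    id: String,
    props: Vec<(String, String)>,
}

/// An edge derived from a property whose value names another node.
#[derive(Debug, PartialEq)]
struct Edge {
    subject: String,
    predicate: String,
    object: String,
}

/// Canonicalise every id and property value, then emit one edge for each
/// property value that equals the id of some node in the graph.
fn derive_edges(nodes: &[Node]) -> Vec<Edge> {
    let ids: HashSet<String> = nodes.iter().map(|n| canonicalise(&n.id)).collect();
    let mut edges = Vec::new();
    for node in nodes {
        let subject = canonicalise(&node.id);
        for (key, value) in &node.props {
            let object = canonicalise(value);
            if ids.contains(&object) {
                edges.push(Edge {
                    subject: subject.clone(),
                    predicate: key.clone(),
                    object,
                });
            }
        }
    }
    edges
}
```

In this sketch, calling `derive_edges` on a node `HP:0000819` with a property `related_to` whose value is `http://purl.obolibrary.org/obo/MONDO_0005148` yields an edge to `MONDO:0005148`, provided a node with that id exists. Property values that match no node id are left as ordinary properties.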
In addition to Neo4j, the nodes and edges are loaded into Solr for full-text search and RocksDB for id->object resolution.
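
The clique-merging step listed above can be pictured as a union-find (disjoint-set) pass over equivalence assertions. The sketch below is an assumption about one reasonable way to do this, not a description of the pipeline's implementation. The `Cliques` type and its methods are hypothetical, and choosing the first-seen identifier as the clique representative is an arbitrary choice made here for determinism.

```rust
use std::collections::{BTreeMap, HashMap};

/// Union-find (disjoint set) over node identifiers. Hypothetical helper type.
struct Cliques {
    parent: Vec<usize>,            // parent[i] == i means i is a representative
    index: HashMap<String, usize>, // identifier -> slot
    ids: Vec<String>,              // slot -> identifier
}

impl Cliques {
    fn new() -> Self {
        Cliques { parent: Vec::new(), index: HashMap::new(), ids: Vec::new() }
    }

    /// Return the slot for `id`, registering it as a singleton clique if unseen.
    fn slot(&mut self, id: &str) -> usize {
        if let Some(&i) = self.index.get(id) {
            return i;
        }
        let i = self.ids.len();
        self.ids.push(id.to_string());
        self.parent.push(i);
        self.index.insert(id.to_string(), i);
        i
    }

    /// Find the clique representative, halving paths along the way.
    fn find(&mut self, mut i: usize) -> usize {
        while self.parent[i] != i {
            let grandparent = self.parent[self.parent[i]];
            self.parent[i] = grandparent;
            i = grandparent;
        }
        i
    }

    /// Record that `a` and `b` are equivalent.
    fn union(&mut self, a: &str, b: &str) {
        let (sa, sb) = (self.slot(a), self.slot(b));
        let (ra, rb) = (self.find(sa), self.find(sb));
        if ra != rb {
            // Keep the smaller slot as root so the first-seen id represents the clique.
            let (lo, hi) = if ra < rb { (ra, rb) } else { (rb, ra) };
            self.parent[hi] = lo;
        }
    }

    /// Group every known id under its clique representative.
    fn groups(&mut self) -> BTreeMap<String, Vec<String>> {
        let mut out: BTreeMap<String, Vec<String>> = BTreeMap::new();
        for i in 0..self.ids.len() {
            let root = self.find(i);
            out.entry(self.ids[root].clone())
                .or_default()
                .push(self.ids[i].clone());
        }
        out
    }
}
```

As a usage illustration, feeding in the pairs (`MONDO:0005148`, `DOID:9352`) and (`DOID:9352`, `UMLS:C0011860`) via `union` and then calling `groups` places all three identifiers in one clique keyed by `MONDO:0005148`. Each clique would then be emitted as a single merged node. The identifiers here are only examples, and how properties of the merged members are combined is left out of the sketch.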