This project is intended to validate, normalize, and score evidence for the Open Targets (OT) Platform, and is structured as a counterpart to genetics-pipe. The pipeline here replaces much of what was originally maintained in data_pipeline, moving it to a more concise and efficient Scala/Spark framework. It is also intended to make scoring far more configurable for clients that may wish to tune scoring coefficients for their own purposes.
There are two primary steps in this pipeline:
- Evidence Preparation
  - See EvidencePreparationPipeline.scala for implementation details.
  - This phase of the pipeline validates and normalizes the json evidence strings associated with OT data sources. These files are generated in large part by platform-input-support and are primarily stored in Google Storage (GS). Links to the files for each source are maintained in a pipeline-configuration file that is updated with each new release.
  - At a high level, this step requires Elasticsearch index dumps (for gene/disease metadata) as well as GS files, and produces a single parquet dataset (schema).
  - Notable operations performed in this phase include:
    - Validation of evidence strings against the OT Evidence Schema
    - Normalization of UniProt and non-reference targets (i.e. genes defined against non-reference assemblies, typically in highly polymorphic regions)
    - Evidence code aggregation and static scoring; i.e. some scores are defined purely based on evidence codes and need to be overridden in this phase (see here for details)
    - Target and disease validation based on Ensembl and EFO accessions, respectively
    - Aggregation of all filtering and nearly all mutation operations into ancillary datasets that can be used to trace why records were lost or altered; see this Validation Error Report for an example summary.
- Evidence Scoring
  - See ScoringPreparationPipeline.scala and ScoringCalculationPipeline.scala for implementation details.
  - This phase of the pipeline scores the evidence created by the preparation step. While this will likely expand in the future to include more of the parameters used in scoring, the data source weights, at least, are configurable as shown here in application.conf.
  - Most of the trickier details related to per-source handling of evidence can be found in Scoring.scala.
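Each phase maps onto one or more of the pipeline sub-commands (prepare-evidence for preparation; prepare-scores and calculate-scores for scoring). As a quick orientation, a minimal invocation of the preparation phase alone is sketched below; it assumes the same jar and spark-submit setup described in the run instructions later in this README:

```
# Sketch: run only the evidence preparation phase; see the full loop later in
# this README for all three sub-commands and the recommended driver-memory flags
APP=$REPOS/platform-pipe/target/scala-2.12/platform-pipe.jar
/usr/spark-2.4.4/bin/spark-submit \
  --class com.relatedsciences.opentargets.etl.Main $APP prepare-evidence \
  --config $REPOS/platform-pipe/src/main/resources/application.conf
```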
Both evidence preparation and scoring can be validated against output from the original data_pipeline implementation in evidence-prep-validation.ipynb and scoring-validation.ipynb, respectively. These notebooks contain checks for target/disease presence and field equality across all data sources. Several issues encountered during validation are noted in the notebooks, but all data was found to be equivalent apart from the issues raised on GitHub.
There are also tests like this one intended to preserve a subset of these checks as part of the CI build.
The configuration for the pipeline is determined entirely by application.conf (modify as necessary for your use case).
A key configuration property to keep in mind is pipeline.decorators.dataset-summary.enabled. When this is true, provenance around evidence record mutation and filtering is preserved, at the expense of making the evidence prep pipeline take over 2x longer (~47 min vs ~22 min). This is disabled by default. The input-dir, output-dir, and data-resources.local-dir properties also need to be changed if you are NOT using the provided ot-client docker container.
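For orientation, a minimal sketch of how these properties might look in application.conf is shown below. The nesting of input-dir, output-dir, and data-resources.local-dir here is illustrative only; consult the application.conf shipped in src/main/resources for the authoritative layout:

```
# Sketch only -- key layout below is illustrative, not authoritative
pipeline {
  decorators {
    # Preserve mutation/filtering provenance at the cost of a ~2x slower evidence prep run
    dataset-summary.enabled = false
  }
}
input-dir = "/path/to/extract"              # change if not using the ot-client container
output-dir = "/path/to/results"             # change if not using the ot-client container
data-resources.local-dir = "/path/to/data"  # change if not using the ot-client container
```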
Note: All of the below are specific to the 19.11 OT release
This project expects two primary sources of input information. While the process outlined below is somewhat cumbersome at the moment, we expect to improve it once the scope of the project is solidified:
- Metadata index extracts from Elasticsearch
  - The gene, eco, and efo indexes are currently required
  - These can be created in one of the following two ways:
    - By setting up and running data_pipeline yourself
      - See data_pipeline#overview for general instructions
      - See scripts/data_pipeline_exec.sh for a script that will run the necessary data_pipeline steps (only everything up to the "association" step is needed)
      - See scripts/data_pipeline_extract.py for instructions on how to create the json dumps from ES
    - By downloading and decompressing the files at https://storage.googleapis.com/platform-pipe/extract/{gene,eco,efo}.json.gz
      - An example script to do this is:
mkdir -p $DATA_DIR/extract; cd $DATA_DIR/extract
for index in gene eco efo; do
  mkdir ${index}.json
  wget -P ${index}.json https://storage.googleapis.com/platform-pipe/extract/${index}.json.gz
  gzip -d ${index}.json/${index}.json.gz
done
- Evidence files
  - See download_evidence_files.sh for a script that will download this information
  - These files will collectively occupy about 23G of space (17G of which is from a single source, europepmc, so developers may find it convenient to remove or subset this file for testing; see the sketch after this list)
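One way to subset the europepmc file for faster local runs is sketched below (paths follow the directory layout shown in the next section). The file name is illustrative and will depend on what download_evidence_files.sh actually fetched for the release; this also assumes one json evidence string per line:

```
# Illustrative only: keep the first 100k europepmc evidence records for local testing
cd $DATA_DIR/extract/evidence_raw.json
mv europepmc-<release>.json europepmc-full.json         # substitute the real file name
head -n 100000 europepmc-full.json > europepmc-<release>.json
rm europepmc-full.json                                   # optional: reclaim ~17G
```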
The expected directory structure is shown below (once the pipeline has been run):
# Inputs
$DATA_DIR/extract/
$DATA_DIR/extract/gene.json
$DATA_DIR/extract/eco.json
$DATA_DIR/extract/efo.json
$DATA_DIR/extract/evidence_raw.json/{atlas-*.json, gwas-*.json, etc.}
# Outputs
$DATA_DIR/results/score_source.parquet
$DATA_DIR/results/score_association.parquet
$DATA_DIR/results/evidence_raw.parquet
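After a run, a quick way to sanity-check any of these outputs is to pipe a one-liner into spark-shell (Spark path as used in the ot-client container; adjust to your installation):

```
# Print the schema of the association score output as a basic sanity check
echo "spark.read.parquet(\"$DATA_DIR/results/score_association.parquet\").printSchema()" \
  | /usr/spark-2.4.4/bin/spark-shell
```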
Some expected times for pipeline runs are shown below for various configurations (local mode on Ubuntu 18.04 with 8 CPUs and 128G RAM):
- Evidence Preparation
  - ~12 minutes with raw evidence pre-serialized as parquet (no mutation/filtering provenance)
  - ~22 minutes with evidence files read as uncompressed json (no mutation/filtering provenance)
  - ~47 minutes with mutation/filtering provenance and json evidence file sources
- Score Calculation
  - Both score preparation and calculation take around 2 minutes each
To build the project, run:
sbt clean assembly
# or for no tests: sbt "set test in assembly := {}" clean assembly
This will produce target/scala-2.12/platform-pipe.jar, which can be deployed or run locally.
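The test suite can also be run on its own with the usual sbt task:

```
sbt test
```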
To execute all pipeline steps, run the following using the provided ot-client docker container (see docker/README.md) or your own cluster:
APP=$REPOS/platform-pipe/target/scala-2.12/platform-pipe.jar
for cmd in prepare-evidence prepare-scores calculate-scores; do
echo "Running command: $cmd"
# Note: The high driver-memory below is not necessary on a cluster -- this is for runs in local mode
/usr/spark-2.4.4/bin/spark-submit \
--driver-memory 64g \
--class com.relatedsciences.opentargets.etl.Main $APP $cmd \
--config $HOME/repos/platform-pipe/src/main/resources/application.conf
done
Evidence test data generation script:
# Run script on ot-client container
/usr/spark-2.4.4/bin/spark-shell --driver-memory 12g \
--jars $HOME/data/ot/apps/platform-pipe.jar \
-i $HOME/data/ot/apps/scripts/create_evidence_test_datasets.sc \
--conf spark.ui.enabled=false --conf spark.sql.shuffle.partitions=1 \
--conf spark.driver.args="\
extractDir=$HOME/data/ot/extract,\
testInputDir=$HOME/repos/platform-pipe/src/test/resources/pipeline_test/input,\
testExpectedDir=$HOME/repos/platform-pipe/src/test/resources/pipeline_test/expected"
A pre-commit hook to run scalafmt is recommended for this repo, though installation of scalafmt is left to developers. The Installation Guide has simple instructions, and the process used for Ubuntu 18.04 was:
cd /tmp/
curl -Lo coursier https://git.io/coursier-cli &&
chmod +x coursier &&
./coursier --help
sudo ./coursier bootstrap org.scalameta:scalafmt-cli_2.12:2.2.1 \
-r sonatype:snapshots \
-o /usr/local/bin/scalafmt --standalone --main org.scalafmt.cli.Cli
scalafmt --version # "scalafmt 2.2.1" at time of writing
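With the CLI installed, Scala sources can also be formatted manually; running scalafmt with no arguments from the project root should format the sources it finds there:

```
cd $REPOS/platform-pipe
scalafmt
```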
The pre-commit hook can then be installed using:
cd $REPOS/platform-pipe
chmod +x hooks/pre-commit.scalafmt
ln -s $PWD/hooks/pre-commit.scalafmt .git/hooks/pre-commit
After this, every commit will trigger scalafmt to run, and --no-verify can be used to skip that step if absolutely necessary.
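For example:

```
# Skip the scalafmt hook for a single commit (use sparingly)
git commit --no-verify -m "WIP: formatting to follow"
```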