PhEval Pipeline

TLDR

The Pipeline presented on PhEval preprint was moved to a new repository - Monarch PhEval.

NOTE: The default Monarch PhEval pipeline, as proposed in the paper preprint, requires approximately 1 TB of disk space. Learn how to modify the pipeline configuration here to customize the experiments.

1. Clone Monarch PhEval

git clone https://github.com/monarch-initiative/monarch_pheval.git

2. Installing PhEval Pipeline dependencies

Enter in the cloned folder and enter the following commands:

poetry shell
poetry install

3. Executing Pipeline

make pheval

Pipeline Description

The Pipeline is divided in three main steps

1. Data Preparation Phase

The data preparation phase, checks the completeness of the disease, gene and variant input data and optionally preparing simulated VCF files if required, gives the user the ability to randomise phenotypic profiles using the PhEval corpus scramble command utility, allowing for the assessment of how well VGPAs handle noise and less specific phenotypic profiles when making predict.

2. Runner Phase

The runner phase is structured into three stages: prepare, run, and post-process.

The prepare step plays a crucial role in adapting the input data to meet the specific requirements of the tool.
In the run step, the VGPA is executed, applying the selected algorithm to the prepared data and generating the tool-specific outputs. Within the run stage, an essential task is the generation of input command files for the algorithm. These files serve as collections of individual commands, each tailored to run the targeted VGPA on specific samples. These commands are configured with the appropriate inputs, outputs and specific configuration settings, allowing for the automated and efficient processing of large corpora.
Finally, the post-processing step takes care of harmonising the tool-specific outputs into standardised PhEval TSV format, ensuring uniformity and ease of analysis of results from all VGPAs. In this context, the tool-specific output is condensed to provide only two essential elements, the entity of interest, which can either be a variant, gene, or disease, and its corresponding score. PhEval then assumes the responsibility of subsequent standardisation processes. This involves the reranking of the results in a uniform manner, ensuring that fair and comprehensive comparisons can be made between tools.

3. Analysis Phase

In the analysis phase, PhEval generates comprehensive statistical reports based on standardised outputs from the runner phase.

Customising PhEval Pipeline Experiments

The PhEval pipeline is orchestrated using a Makefile Jinja template strategy. Therefore, to describe a new experiment in the pipeline, the user needs to generate a Makefile workflow based on a configuration file.

In the resources folder are the following files responsible for Makefile generation:

📦resources
┣ 📜Makefile.j2
┣ 📜custom.Makefile
┣ 📜generatemakefile.sh
┗ 📜pheval-config.yaml

Let's begin by describing the pheval-config.yaml file and its structure.

PhEval Configuration File

This file is responsible define the experiment settings and will be used to generate the Makefile using a Jinja template which consumes this YAML configuration file.

Directories Section

The data and tmp properties are mandatory and must be specified in this section.

data property refers to the folder location where the necessary phenotypic data for the pipeline will be downloaded and extracted.
tmp property points to the folder where all temporary intermediate files will be generated.

directories:
  data: data
  tmp: data/tmp

Corpora Section

The corpora section specifies which corpus will be used in the experiment. In this example is defined LIRICAL corpus, A small comparison corpus created for benchmarking the LIRICAL system which contains 385 case reports.

The user needs to specify corpus id and it must be equals to the corpora folder structure, e.g.

📦corpora
┃ ┣ 📂lirical
┃ ┣ ┣ 📂small_version
┃ ┣ ┣ ┣ 📂phenopackets
┃ ┣ ┣ ┣ ┣ 📜PATIENT1.json
┃ ┣ ┣ ┣ ┣ 📜PATIENT2.json
┃ ┣ ┣ ┣ 📂vcf
┃ ┣ ┣ ┣ ┣ 📜PATIENT1.vcf.gz
┃ ┣ ┣ ┣ ┣ 📜PATIENT2.vcf.gz
┃ ┣ ┣ ┣ 📜corpus.yml
┃ ┣ ┣ ┣ 📜template_exome_hg19.vcf.gz

corpora:
  - id: lirical
    variant: small_version

Configs Section

The configs section holds all custom configurations for the different VGPAs. It must declare: - tool: VGPA tool name. - id: it's an arbiratry unique identifier that will be used in the runs section - version: VGPA tool version

configs:
  - tool: phen2gene
    id: phen2gene-1.2.3
    version: 1.2.3

configs section can also deal with special VGPA data preparation steps, for example, Semantic Similarity ingestions into Exomiser phenotypic database e.g.

configs:
  - tool: exomiser
    id: exomiser-semsim-ingest-13.3.0
    version: 13.3.0
    phenotype: 2309
    preprocessing:
      - phenio-monarch-hp-hp.0.4.semsimian.sql

phenotype property describes the Exomiser phenotype database version and the preprocessing section will execute SQL scripts into that phenotypic database.

Runs Section

The "runs" section will integrate all previously described sections and pass them to pheval VGPA for concrete execution.

tool property specifies which runner will be called
corpus and corpusvariant must match properties declared on the corpora section.
version should correspond to the tool version
configuration must match the id described on the configuration section.

runs:
  - tool: exomiser
    corpus: lirical
    corpusvariant: small_version
    version: 13.3.0
    configuration: exomiser-semsim-ingest-13.3.0

Generating new Makefile based on PhEval configuration file

📦resources
┣ 📜generatemakefile.sh
┗ 📜pheval-config.yaml

To generate a new Makefile, simply execute the generatemakefile.sh script, which encapsulates the Makefile rendering process dynamically filling it using the pheval-config.yaml configuration file.

./resources/generatemakefile.sh