Data preparation utilities
This page documents data preparation utilities provided by PhEval. These commands are used to prepare, normalise, and organise input data before running phenotype-driven tools via plugins.
Only commands related to data preparation are covered here. Variant spiking and other specialised workflows are documented elsewhere.
Purpose
Data preparation utilities help to:
- Construct phenopacket corpora for evaluation
- Normalise gene identifiers
- Ensure consistent input structure across cohorts
- Reduce technical variability unrelated to tool performance
These steps are particularly important when benchmarking across tools, versions, or knowledge resources.
Preparing a phenopacket corpus
The prepare-corpus command prepares a directory of phenopackets for downstream analysis.
Typical use cases include:
- Validating that phenopackets contain the required records
- Preparing separate corpora for gene-, disease-, or variant-based analyses
- Optionally generating associated VCFs for variant-based workflows
Basic examples
Prepare a corpus of phenopackets for gene-based analysis:
pheval-utils prepare-corpus \
--phenopacket-dir phenopackets/ \
--gene-analysis \
--output-dir prepared_corpus/
Prepare a corpus of phenopackets for gene-based analysis and update all gene identifiers to Ensembl IDs:
pheval-utils prepare-corpus \
--phenopacket-dir phenopackets/ \
--gene-analysis \
--gene-identifier ensembl_id \
--output-dir prepared_corpus/
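To prepare separate corpora for more than one analysis type, the basic command can be repeated with a different analysis flag and output directory each time. The following is a minimal sketch, not part of PhEval: the `run` helper, the `DRY_RUN`, `PHENOPACKET_DIR` and `OUTPUT_ROOT` variables, and the `prepared_gene/` and `prepared_disease/` directory names are assumptions made for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
set -eu

# Hypothetical locations; adjust to your own layout.
PHENOPACKET_DIR="${PHENOPACKET_DIR:-phenopackets/}"
OUTPUT_ROOT="${OUTPUT_ROOT:-prepared}"

# Print the command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One prepare-corpus call per analysis type, each with its own output directory.
prepare_separate_corpora() {
  local analysis
  for analysis in gene disease; do
    run pheval-utils prepare-corpus \
      --phenopacket-dir "$PHENOPACKET_DIR" \
      "--${analysis}-analysis" \
      --output-dir "${OUTPUT_ROOT}_${analysis}/"
  done
}

prepare_separate_corpora
```

Variant-based analysis is left out of the loop because it may also need a VCF template or directory.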
Variant-based analysis example
Prepare a corpus for variant-based analysis:
pheval-utils prepare-corpus \
--phenopacket-dir phenopackets/ \
--variant-analysis \
--output-dir prepared_corpus/
Prepare a corpus for variant-based analysis and spike variants into an hg38 VCF template:
pheval-utils prepare-corpus \
--phenopacket-dir phenopackets/ \
--variant-analysis \
--hg38-template-vcf hg38_template.vcf \
--output-dir prepared_corpus/
Notes:
- At least one of --variant-analysis, --gene-analysis, or --disease-analysis should be specified.
- For variant-based analysis, a VCF template or directory is required.
- The prepared output directory is used as input to runners provided by plugins.
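The first two notes can be checked before a long preparation run starts. The function below is a hypothetical helper, not a PhEval command. It recognises only the flags that appear on this page, so any other template or VCF-directory option would need adding to the `case` statement. This page does not say whether prepare-corpus itself enforces the VCF requirement, so the sketch reports a missing VCF source as a warning and not as an error.

```shell
#!/bin/sh
# Check a prepare-corpus argument list against the notes on this page.
# Returns 1 if no analysis flag is present; warns if --variant-analysis
# is given without a VCF template.
check_prepare_corpus_args() {
  local has_analysis=0 has_variant=0 has_vcf_source=0 arg
  for arg in "$@"; do
    case "$arg" in
      --gene-analysis|--disease-analysis) has_analysis=1 ;;
      --variant-analysis) has_analysis=1; has_variant=1 ;;
      --hg38-template-vcf) has_vcf_source=1 ;;
    esac
  done
  if [ "$has_analysis" -eq 0 ]; then
    echo "error: specify --variant-analysis, --gene-analysis, or --disease-analysis" >&2
    return 1
  fi
  if [ "$has_variant" -eq 1 ] && [ "$has_vcf_source" -eq 0 ]; then
    echo "warning: --variant-analysis given without a VCF template or directory" >&2
  fi
  return 0
}
```

A possible use is to run the check first and call the real command only if it passes, for example `check_prepare_corpus_args "$@" && pheval-utils prepare-corpus "$@"` inside a wrapper script.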
Updating phenopackets and identifiers
The update-phenopackets command updates gene symbols and identifiers in existing phenopackets.
This is useful when:
- Phenopackets contain outdated gene identifiers
- A consistent identifier scheme is required across a cohort
- Benchmarking is performed across different database or ontology versions
Example: update a directory of phenopackets
Update phenopackets to include Ensembl gene identifiers:
pheval-utils update-phenopackets \
--phenopacket-dir phenopackets/ \
--gene-identifier ensembl_id \
--output-dir updated_phenopackets/
Example: update a single phenopacket
pheval-utils update-phenopackets \
--phenopacket-path case_001.json \
--gene-identifier hgnc_id \
--output-dir updated_case/
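After an update it can help to see which phenopackets were actually modified. The sketch below uses only standard tools (`cmp`, `basename`) and a hypothetical function name. It assumes that updated files keep their original file names, which this page does not state. The comparison is byte-for-byte, so a file that was merely re-serialised with different whitespace is also reported as changed.

```shell
#!/bin/sh
# List phenopackets that differ between an original and an updated directory.
# Assumption: updated files keep the same file names as the originals.
list_changed_phenopackets() {
  local original_dir=$1 updated_dir=$2 original name
  for original in "$original_dir"/*.json; do
    [ -e "$original" ] || continue   # the glob matched no JSON files
    name=$(basename "$original")
    if [ ! -e "$updated_dir/$name" ]; then
      printf 'missing\t%s\n' "$name"
    elif ! cmp -s "$original" "$updated_dir/$name"; then
      printf 'changed\t%s\n' "$name"
    fi
  done
}
```

With the directory names from the example above, the call would be `list_changed_phenopackets phenopackets/ updated_phenopackets/`.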
How data preparation fits into a workflow
A typical workflow using data preparation utilities looks like:
- Collect or generate phenopackets
- Prepare and normalise phenopackets using data preparation utilities
- Run tools via plugin-provided runners using pheval run
- Benchmark and analyse the resulting outputs
Not all workflows require all preparation steps, but these utilities help ensure reproducibility and consistency.
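The preparation and run steps of this workflow could be chained in a small script. This is a sketch under stated assumptions: the `run` helper and the `DRY_RUN` and `PHEVAL_RUN_ARGS` variables are hypothetical. The options of `pheval run` are not covered on this page, so the script passes through whatever is supplied in `PHEVAL_RUN_ARGS` unchanged. By default it only prints the commands.

```shell
#!/bin/sh
set -eu

# Print each command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

prepare_and_run() {
  # Prepare and normalise phenopackets (flags as in the examples on this page).
  run pheval-utils prepare-corpus \
    --phenopacket-dir phenopackets/ \
    --gene-analysis \
    --gene-identifier ensembl_id \
    --output-dir prepared_corpus/

  # Run a tool through a plugin-provided runner.
  # PHEVAL_RUN_ARGS is a hypothetical variable; its contents are split on
  # whitespace on purpose so that several options can be supplied.
  run pheval run ${PHEVAL_RUN_ARGS:-}
}

prepare_and_run
```

Because the script uses `set -e`, a failed preparation step stops it before any tool is run when `DRY_RUN=0`.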