Variant Predicates

Variant predicate is a core component to classify a cohort using the genomic variants identified in the cohort members. A VariantPredicate tests if a Variant meets the inclusion criteria. For instance, a predicate can test if a variant is a deletion, leads to a missense change, or overlaps with a protein domain.

An array of builtin variant predicates is available as functions of the gpsea.analysis.predicate module.

The predicates operate on several lines of information:

Information

Example

Allele

variant is e.g. a deletion of >50 bases

Functional annotation

variant leads to a missense change or affects n-th exon of a transcript

Protein data

variant is located in a region encoding a protein domain, protein feature type

Genome

overlap with a genomic region of interest

The scope of the builtin predicates is fairly narrow and likely insufficient for real-life analyses. However, the predicates can be chained into a compound predicate to achive more expressivity for testing complex conditions, such as “variant is a missense or synonymous variant located in exon 6 of NM_013275.6”.

Examples

Here we show examples of several simple variant predicates and how to chain them for testing complex conditions.

Load cohort

Let’s start by loading a Cohort of 19 individuals with mutations in RERE leading to Holt-Oram syndrome. The cohort was prepared from phenopackets as described in Create a cohort section, and then serialized as a JSON file following the instructions in Persist the cohort for later section.

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/RERE.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
19

Some individuals were found to harbor the variant 1_8358231_8358231_T_C that corresponds to a pathogenic mutation VCV000522858.5 that replaces the histidine encoded by the 1435th codon of NM_001042681.2 with arginine: NM_001042681.2(RERE):c.4304A>G (p.His1435Arg). We can retrieve the variant by querying the cohort by the variant key:

>>> variant_key_of_interest = '1_8358231_8358231_T_C'
>>> variant = cohort.get_variant_by_key(variant_key_of_interest)

Builtin predicates

Let’s use builtin predicates to verify the properties of the variant 1_8358231_8358231_T_C.

We can check that the variant overlaps with RERE

>>> import gpsea.analysis.predicate as vp
>>> gene = vp.gene('RERE')
>>> gene.test(variant)
True

it overlaps with the MANE transcript

>>> rere_mane_tx_id = 'NM_001042681.2'
>>> tx = vp.transcript(rere_mane_tx_id)
>>> tx.test(variant)
True

it in fact overlaps with the exon 20,

>>> exon20 = vp.exon(exon=20, tx_id=rere_mane_tx_id)
>>> exon20.test(variant)
True

and leads to a missense mutation with respect to the MANE transcript

>>> from gpsea.model import VariantEffect
>>> missense = vp.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=rere_mane_tx_id)
>>> missense.test(variant)
True

See the gpsea.analysis.predicate module for a complete list of the builtin predicates.

Predicate chain

Using the builtin predicates, we can build a logical chain to test complex conditions. For instance, we can test if the variant meets any of several conditions:

>>> import gpsea.analysis.predicate as vp
>>> nonsense = vp.variant_effect(VariantEffect.STOP_GAINED, tx_id=rere_mane_tx_id)
>>> missense_or_nonsense = missense | nonsense
>>> missense_or_nonsense.test(variant)
True

or all conditions:

>>> missense_and_exon20 = missense & exon20
>>> missense_and_exon20.test(variant)
True

All variant predicates overload Python & (AND) and | (OR) operators, to allow chaining.

Therefore, there is nothing that prevents us to combine the predicates into multi-level tests, e.g. to test if the variant is a “chromosomal deletion” or a deletion which removes at least 50 bp:

>>> from gpsea.model import VariantClass
>>> chromosomal_deletion = "SO:1000029"
>>> predicate = vp.structural_type(chromosomal_deletion) | (vp.variant_class(VariantClass.DEL) & vp.change_length("<=", -50))
>>> predicate.description
'(structural type is SO:1000029 OR (variant class is DEL AND change length <= -50))'

Inverting conditions

Sometimes we may want to test the variant for a condition that must not be met. For instance, we may want to test if the variant is a deletion that is not predicted to shift the transcript reading frame. One of doing this would be to build a compound predicates for all variant effects except of FRAMESHIFT_VARIANT:

>>> non_frameshift_effects = (
...   VariantEffect.SYNONYMOUS_VARIANT, VariantEffect.MISSENSE_VARIANT, VariantEffect.INTRON_VARIANT,
...   # and many more effects..
... )
>>> non_frameshift_predicate = vp.allof(vp.variant_effect(eff, tx_id=rere_mane_tx_id) for eff in non_frameshift_effects)

However, this is clearly much better implemented by a logical not of a “is frameshift” predicate.

Therefore, all variant predicates implement logical inversion which corresponds to Python’s ~ operator (tilde), and results in an inverted predicate.

This is how we can use the predicate inversion to build the predicate for non-frameshift deletions:

>>> non_frameshift_del = ~vp.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id=rere_mane_tx_id) & vp.variant_class(VariantClass.DEL)
>>> non_frameshift_del.description
'(NOT FRAMESHIFT_VARIANT on NM_001042681.2 AND variant class is DEL)'

Note the presence of a tilde ~ before the variant effect predicate and resulting NOT in the predicate question.

Need more?

The builtin predicates should cover majority of use cases. However, if a predicate seems to be missing, feel free to submit an issue in our GitHub tracker, or to implement a custom predicate by extending the VariantPredicate class 😎.

The variant predicate offers a flexible API for testing if variants meet a condition. However, the genotype phenotype correlations are done on the individual level and the variant predicates are used as a component of the genotype predicate. The next sections show how to use variant predicates to assign individuals into groups.