Variant Predicates

Variant predicate is a core component to classify a cohort using the genomic variants identified in the cohort members. A VariantPredicate tests if a Variant meets the inclusion criteria. For instance, a predicate can test if a variant is a deletion, leads to a missense change, or overlaps with a protein domain.

An array of builtin variant predicates is available as functions of the gpsea.analysis.predicate module.

The predicates operate on several lines of information:

Information

Example

Allele

variant is e.g. a deletion of >50 bases

Functional annotation

variant leads to a missense change or affects n-th exon of a transcript

Protein data

variant is located in a region encoding a protein domain, protein feature type

The scope of the builtin predicates is fairly narrow and likely insufficient for real-life analyses. However, several predicates can be “chained” into a compound predicate using a boolean logic, to achive more expressivity for testing complex conditions, such as “variant is a missense or synonymous variant located in exon 6 of NM_013275.6”.

Examples

Here we show how to use the builtin predicates for simple tests and how to build a compound predicate from the builtin predicates, for testing complex conditions.

Load cohort

Let’s start by loading a Cohort of 19 individuals with mutations in RERE leading to Holt-Oram syndrome. The cohort was prepared from phenopackets as described in Create a cohort section, and then serialized as a JSON file following the instructions in Persist the cohort for later section.

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/RERE.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
19

Some individuals were found to harbor the variant 1_8358231_8358231_T_C that corresponds to a pathogenic mutation VCV000522858.5 that replaces the histidine encoded by the 1435th codon of NM_001042681.2 with arginine: NM_001042681.2(RERE):c.4304A>G (p.His1435Arg). We can retrieve the variant by querying the cohort by the variant key:

>>> variant_key_of_interest = '1_8358231_8358231_T_C'
>>> variant = cohort.get_variant_by_key(variant_key_of_interest)

Builtin predicates

Let’s use builtin predicates to verify the properties of the variant 1_8358231_8358231_T_C.

We can check that the variant overlaps with RERE

>>> import gpsea.analysis.predicate as vp
>>> gene = vp.gene('RERE')
>>> gene.test(variant)
True

it overlaps with the MANE transcript

>>> rere_mane_tx_id = 'NM_001042681.2'
>>> tx = vp.transcript(rere_mane_tx_id)
>>> tx.test(variant)
True

it in fact overlaps with the exon 20,

>>> exon20 = vp.exon(exon=20, tx_id=rere_mane_tx_id)
>>> exon20.test(variant)
True

and leads to a missense mutation with respect to the MANE transcript

>>> from gpsea.model import VariantEffect
>>> missense = vp.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=rere_mane_tx_id)
>>> missense.test(variant)
True

See the gpsea.analysis.predicate module for a complete list of the builtin predicates.

Compound predicates

A compound predicate for testing complex conditions can be built from two or more predicates. For instance, we can test if the variant meets any of several conditions:

>>> import gpsea.analysis.predicate as vp
>>> nonsense = vp.variant_effect(VariantEffect.STOP_GAINED, tx_id=rere_mane_tx_id)
>>> missense_or_nonsense = missense | nonsense
>>> missense_or_nonsense.test(variant)
True

or all conditions:

>>> missense_and_exon20 = missense & exon20
>>> missense_and_exon20.test(variant)
True

All variant predicates overload Python & (AND) and | (OR) operators, to combine a predicate pair into a compound predicate.

Note

Combining three or or more predicates can be achieved with allof() and anyof() functions.

Therefore, there is nothing that prevents us to combine the predicates into multi-level tests, e.g. to test if the variant is a “chromosomal deletion” or a deletion which removes at least 50 bp:

>>> from gpsea.model import VariantClass
>>> chromosomal_deletion = "SO:1000029"
>>> predicate = vp.structural_type(chromosomal_deletion) | (vp.variant_class(VariantClass.DEL) & vp.change_length("<=", -50))
>>> predicate.description
'(structural type is SO:1000029 OR (variant class is DEL AND change length <= -50))'

Inverting conditions

Sometimes we may want to test the variant for a condition that must not be met. For instance, we may want to test if the variant is a deletion that is not predicted to shift the transcript reading frame. One of doing this would be to build a compound predicates for all variant effects except of FRAMESHIFT_VARIANT:

>>> non_frameshift_effects = (
...   VariantEffect.SYNONYMOUS_VARIANT, VariantEffect.MISSENSE_VARIANT, VariantEffect.INTRON_VARIANT,
...   # and many more effects..
... )
>>> non_frameshift_predicate = vp.allof(vp.variant_effect(eff, tx_id=rere_mane_tx_id) for eff in non_frameshift_effects)

However, this is clearly much better implemented by a logical not of a “is frameshift” predicate.

Therefore, all variant predicates implement logical inversion which corresponds to Python’s ~ operator (tilde), and results in an inverted predicate.

This is how we can use the predicate inversion to build the predicate for non-frameshift deletions:

>>> non_frameshift_del = ~vp.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id=rere_mane_tx_id) & vp.variant_class(VariantClass.DEL)
>>> non_frameshift_del.description
'(NOT FRAMESHIFT_VARIANT on NM_001042681.2 AND variant class is DEL)'

Note the presence of a tilde ~ before the variant effect predicate and resulting NOT in the predicate question.

Need more?

The builtin predicates should cover majority of use cases. However, if a predicate seems to be missing, feel free to submit an issue in our GitHub tracker, or implement your own predicate by following the Variant predicate guide.

Next

The variant predicate offers a flexible API for testing if variants meet a condition. However, the genotype phenotype correlations are studied on the level of individuals. As described in Genotype classifiers, GPSEA uses the GenotypeClassifier API to assign individuals into non-overlapping classes. Variant predicates are essential for creating such classifier. We explain the details in the following sections.