Variant Predicates
GPSEA uses the VariantPredicate
class
to test if a Variant
meets the inclusion criteria.
The variant predicate can leverage multiple primary data:
Primary data source |
Example |
---|---|
Allele |
the variant being a deletion or a single nucleotide variant (SNV) |
Genome |
overlaps of a target genomic region |
Functional annotation |
variant is predicted to lead to a missense change or affect an exon of certain transcript |
Protein data |
variant is located in a region encoding a protein domain, protein feature type |
As a rule of thumb, the predicates for testing basic conditions are available off the shelf, and they can be used as building block for testing for more complex conditions, such as testing if the variant is “a missense or synonymous variant located in exon 6 of transcript NM_013275.6”.
Let’s demonstrate the variant predicate usage on a few examples.
Load cohort
For the purpose of this example, we will load a Cohort
of 19 individuals with mutations in RERE leading to Holt-Oram syndrome.
The cohort was prepared from phenopackets as described in Create a cohort from GA4GH phenopackets section,
and then serialized as
a JSON file
following the instructions in Persist the cohort for later section.
>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/RERE.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
... cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
19
To demonstrate the predicate API, we will use the variant 1_8358231_8358231_T_C
that corresponds
to a pathogenic variant VCV000522858.5
that replaces the histidine encoded by the 1435th codon of NM_001042681.2 with arginine: NM_001042681.2(RERE):c.4304A>G (p.His1435Arg)
.
>>> variant_key_of_interest = '1_8358231_8358231_T_C'
>>> variant = cohort.get_variant_by_key(variant_key_of_interest)
Building blocks
We can check that the variant overlaps with RERE:
>>> from gpsea.analysis.predicate.genotype import VariantPredicates
>>> gene = VariantPredicates.gene('RERE')
>>> gene.test(variant)
True
it overlaps with the MANE transcript:
>>> rere_mane_tx_id = 'NM_001042681.2'
>>> tx = VariantPredicates.transcript(rere_mane_tx_id)
>>> tx.test(variant)
True
it in fact overlaps with the exon 20:
>>> exon20 = VariantPredicates.exon(exon=20, tx_id=rere_mane_tx_id)
>>> exon20.test(variant)
True
and leads to a missense mutation with respect to the MANE transcript:
>>> from gpsea.model import VariantEffect
>>> missense = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=rere_mane_tx_id)
>>> missense.test(variant)
True
See VariantPredicates
for more info on the predicates available off the shelf.
Complex conditions
We can combine the building blocks to test for more elaborate conditions. For instance, we can test if the variant meets any or several conditions:
>>> nonsense = VariantPredicates.variant_effect(VariantEffect.STOP_GAINED, tx_id=rere_mane_tx_id)
>>> missense_or_nonsense = missense | nonsense
>>> missense_or_nonsense.test(variant)
True
or all conditions:
>>> missense_and_exon20 = missense & exon20
>>> missense_and_exon20.test(variant)
True
The VariantPredicate overloads Python &
(AND) and |
(OR) operators to build a compound predicate from lower level building blocks.
Therefore, there is nothing that prevents us to combine the predicates into multi-level tests, such as testing if the variant is a “chromosomal deletion” or a deletion which removes at least 50 bp:
>>> from gpsea.model import VariantClass
>>> chromosomal_deletion = "SO:1000029"
>>> predicate = VariantPredicates.structural_type(chromosomal_deletion) | (VariantPredicates.variant_class(VariantClass.DEL) & VariantPredicates.change_length("<=", -50))
>>> predicate.description
'(structural type is SO:1000029 OR (variant class is DEL AND change length <= -50))'
Inverting conditions
Sometimes we may want to test the variant for a condition that must not be met.
For instance, we may want to test if the variant is a deletion
that is not predicted to shift the transcript reading frame.
One of doing this would be to build a compound predicates
for all variant effects except of FRAMESHIFT_VARIANT
:
>>> non_frameshift_effects = (
... VariantEffect.SYNONYMOUS_VARIANT, VariantEffect.MISSENSE_VARIANT, VariantEffect.INTRON_VARIANT,
... # and many more effects..
... )
>>> non_frameshift_predicate = VariantPredicates.all(VariantPredicates.variant_effect(eff, tx_id=rere_mane_tx_id) for eff in non_frameshift_effects)
However, this is clearly tedious and it would be much better implemented by a simple logical not of a predicate for a frameshift variant effect.
To support this, VariantPredicate implements logical inversion
which corresponds to Python’s ~
operator (tilde), to wrap
the underlying predicate and to invert its test result.
This is how we can use the predicate inversion to build the predicate for non-frameshift deletions:
>>> non_frameshift_del = ~VariantPredicates.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id=rere_mane_tx_id) & VariantPredicates.variant_class(VariantClass.DEL)
>>> non_frameshift_del.description
'(NOT FRAMESHIFT_VARIANT on NM_001042681.2 AND variant class is DEL)'
Note the presence of a tilde ~
before the variant effect predicate and resulting NOT
in the predicate question.
The variant predicate offers a flexible API for testing if variants meet a condition. However, the genotype phenotype correlations are done on the individual level and the variant predicates are used as a component of the genotype predicate. The next sections show how to use variant predicates to assign individuals into groups.