Compare phenotype scores

In this section, we show how to test for an association between a genotype class and a phenotype score. We assume a cohort was preprocessed following the Input data section, and we use a genotype classifier and a phenotype scorer from the Partitioning section to obtain a genotype class and phenotype score for each cohort member. We use Mann-Whitney U test to test for differences in phenotype score distributions between the classes.

Mann-Whitney U Test

We may want to compare the total number of occurences of a specific set of phenotypic features between two different genotype classes. For instance, Jordan et al (2018) found that the total number of structural defects of the brain, eye, heart, and kidney and sensorineural hearing loss seen in individuals with point mutations in the Atrophin-1 domain of RERE is significantly higher than expected based on the number of similar defects seen in individuals with putative loss-of-function variants. Since there are five potential defects, each individual has a count ranging between 0 and 5.

We perform a Mann-Whitney U Test (or Wilcoxon Rank-Sum Test) to compare the distribution of such counts between genotype classes. This is a non-parametric test that compares the medians of the two classes to determine if they come from the same distribution.

>>> import scipy.stats as stats
>>> class1 = [0, 0, 1, 0, 2, 0, 1, 1, 1, 0, 2, 0, 0, 3, 1, 1, 1, 0]
>>> class2 = [4, 5, 3, 4, 3, 3, 3, 4, 4, 5, 5, 2, 3, 0, 3, 5, 2, 3]
>>> r = stats.mannwhitneyu(x=class1, y=class2, alternative = 'two-sided')
>>> p_value = r.pvalue
>>> float(p_value)
6.348081479150902e-06

p value of 6.348081479150901e-06 suggests a significant difference between the classes.

Example analysis

Load HPO

We will start the analysis with loading HPO v2024-07-01:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')
>>> hpo.version
'2024-07-01'

Load cohort

Let’s now analyze the subjects reported in Jordan et al. We will load 19 phenopackets that represent individuals with mutations in RERE whose signs and symptoms were encoded into HPO terms and deposited into Phenopacket Store. We will load the Cohort from a JSON file. The cohort was prepared from phenopackets as described in Create a cohort section, and then serialized as a JSON file following the instructions in Persist the cohort for later section.

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/RERE.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
19

Configure analysis

Now we can set up the analysis of genotype and phenotype. We will perform the analysis using the RERE transcript selected as the “main” biologically relevant by the MANE consortium.

>>> tx_id = 'NM_001042681.2'

Genotype predicate

Jordan et al. compare phenotype of individuals harboring point mutations with the individuals carrying loss of function mutations. Let’s create a predicate for testing if the variant is a point mutation or a loss of function mutation.

In this example, the point mutation is a mutation that meets the following conditions:

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate import change_length, ref_length, anyof, variant_effect
>>> point_mutation_effects = (
...     VariantEffect.MISSENSE_VARIANT,
... )
>>> point_mutation = change_length('==', 0) \
...     & ref_length('==', 1) \
...     & anyof(variant_effect(effect, tx_id) for effect in point_mutation_effects)
>>> point_mutation.description
'((change length == 0 AND reference allele length == 1) AND MISSENSE_VARIANT on NM_001042681.2)'

For the loss of function predicate, the following variant effects are considered loss of function:

>>> lof_effects = (
...     VariantEffect.TRANSCRIPT_ABLATION,
...     VariantEffect.FRAMESHIFT_VARIANT,
...     VariantEffect.START_LOST,
...     VariantEffect.STOP_GAINED,
... )
>>> lof_mutation = anyof(variant_effect(eff, tx_id) for eff in lof_effects)
>>> lof_mutation.description
'(TRANSCRIPT_ABLATION on NM_001042681.2 OR FRAMESHIFT_VARIANT on NM_001042681.2 OR START_LOST on NM_001042681.2 OR STOP_GAINED on NM_001042681.2)'

The genotype predicate will bin the patient into two classes: a point mutation or the loss of function:

>>> from gpsea.analysis.clf import monoallelic_classifier
>>> gt_clf = monoallelic_classifier(
...     a_predicate=point_mutation,
...     b_predicate=lof_mutation,
...     a_label="Point", b_label="LoF",
... )
>>> gt_clf.class_labels
('Point', 'LoF')

Phenotype score

This component is responsible for computing a phenotype score for an individual. As far as GPSEA framework is concerned, the phenotype score must be a floating point number or a NaN value if the score cannot be computed for an individual.

Here we use the CountingPhenotypeScorer for scoring the individuals based on the number of structural defects from the following 5 categories:

  • Brain anomalies

  • Eye anomalies

  • Congenital heart defects

  • Renal anomalies

  • Sensorineural hearing loss

For example, an individual with a congenital heart defect would be assigned a score of 1, an individual with congenital heart defect and a renal anomaly would be assigned a score of 2, and so on. If an individual had two heart defects (e.g., atrial septal defect and ventricular septal defect), a score of 1 (not 2) would be assigned for the heart defect category.

The CountingPhenotypeScorer automatizes this scoring method by encoding the categories into HPO terms:

>>> structural_defects = (
...     'HP:0012443',  # Abnormal brain morphology (Brain anomalies)
...     'HP:0012372',  # Abnormal eye morphology (Eye anomalies)
...     'HP:0001627',  # Abnormal heart morphology (Congenital heart defects)
...     'HP:0012210',  # Abnormal renal morphology (Renal anomalies)
...     'HP:0000407',  # Sensorineural hearing impairment (Sensorineural hearing loss)
... )

and then tests the individuals for presence of at least one HPO term that corresponds to the structural defect (e.g. Abnormal brain morphology, exact match) or that is its descendant (e.g. Cerebellar atrophy).

We construct the scorer with from_query_curies() function:

>>> from gpsea.analysis.pscore import CountingPhenotypeScorer
>>> pheno_scorer = CountingPhenotypeScorer.from_query_curies(
...     hpo=hpo,
...     query=structural_defects,
... )
>>> pheno_scorer.description
'Assign a phenotype score that is equivalent to the count of present phenotypes that are either an exact match to the query terms or their descendants'

Statistical test

We will use Mann-Whitney U Test to test for differences between scores of the different genotype groups

>>> from gpsea.analysis.pscore.stats import MannWhitneyStatistic
>>> score_statistic = MannWhitneyStatistic()

Tip

See gpsea.analysis.pscore.stats module for more statistical tests available for using with phenotype scores.

Final analysis

We will put the final analysis together into PhenotypeScoreAnalysis.

>>> from gpsea.analysis.pscore import PhenotypeScoreAnalysis
>>> score_analysis = PhenotypeScoreAnalysis(
...     score_statistic=score_statistic,
... )

Analysis

We execute the analysis by running

>>> result = score_analysis.compare_genotype_vs_phenotype_score(
...     cohort=cohort,
...     gt_clf=gt_clf,
...     pheno_scorer=pheno_scorer,
... )

In case of the RERE cohort, the analysis shows a significant difference between the number of structural defects in individuals with point vs. loss-of-function mutations.

>>> result.pval
0.012074957610483744

To explore further, we can access a data frame with genotype categories and phenotype counts:

>>> scores = result.data.sort_index()
>>> scores.head()  
                                      genotype  phenotype
patient_id
Subject 10[PMID_27087320_Subject_10]         1          0
Subject 1[PMID_27087320_Subject_1]           0          4
Subject 1[PMID_29330883_Subject_1]           1          0
Subject 2[PMID_27087320_Subject_2]        None          4
Subject 2[PMID_29330883_Subject_2]           1          1

The data frame provides a genotype category and a phenotype_score for each patient. The genotype category should be interpreted in the context of the genotype predicate:

>>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_clf.get_categorizations()}
>>> gt_id_to_name
{0: 'Point', 1: 'LoF'}

The genotype code 0 is assigned to patients with a point mutation, 1 corresponds to the loss-of-function mutations, and None is assigned to patients who cannot be assigned into any of the groups.

Last, let’s use plot_boxplots() method to visualize the phenotype score distributions:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(6, 4), dpi=120)
>>> result.plot_boxplots(
...     ax=ax,
... )
>>> _ = ax.grid(axis="y")
>>> _ = ax.set(
...     ylabel="Phenotype score", ylim=(-0.5, len(structural_defects) + 0.5)
... )
Phenotype score distribution

We see that the individuals with the point mutations feature structural defects than the individuals with the loss-of-function mutations.

The box extends from the first quartile (Q1) to the third quartile (Q3) of the data, with a red line at the median. The whiskers extend from the box to the farthest data point lying within 1.5x the inter-quartile range (IQR) from the box.