Cohort exploratory analysis

As a general rule for statistical testing, it is preferable to formulate one or several hypotheses and to test these in a targeted way. Performing numerous tests without any specific plan comes with an increased danger of false-positive results (although it many be appropriate to generate hypotheses if a validation cohort is available).

Types of statistical test

GPSEA offers four statistical tests.

  • Fisher Exact Test (FET): Association between two categorical variables. In this case, the first categorical variable is presence/absense of a genotype and the second is presence/absense of annotation to an HPO term.

  • Mann Whitney U test: Association between presence/absence of a genotype with a phenotype score (e.g., devries).

  • t test: Association between presence/absence of a genotype with a numerical test result (e.g., potassium level)

  • log rank score: Association between presence/absence of a genotype with age-of-onset of disease or of an HPO feature or with mortality

Please consult the documentation pages for each of these tests if you are unsure which to use.

Power

Two things must be true to identify a valid genotype-phenotype correlation. First, and obviously, the correlation must actually exist in the dataset being analyzed. Second, there must be statistical power to identify the effect. Power is a measure of the ability of an experimental design and hypothesis testing setup to detect a particular effect if it is truly present (Wikipedia).

It will generally not be possible to perform a formal power calculation because the effect size of most genotype-phenotype correlations is not well known. But, for instance it would not make much sense to test for correlation with a specific variant that occurs in only one of 50 individuals in a cohort.

Exploration

The purpose of the exploratory analysis is thus to decide which tests to perform. GPSEA provides tables and graphics that visualize some of the salient aspects of the cohort and the distribution of the identified variants. We exemplify the exploratory analysis on a cohort of 156 individuals with mutations in TBX5 from Phenopacket Store 0.1.20. The cohort was preprocessed as described in the Input data section and stored at docs/cohort-data/TBX5.0.1.20.json in JSON format.

We start by loading the cohort from the JSON file:

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort = 'docs/cohort-data/TBX5.0.1.20.json'
>>> with open(fpath_cohort) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
156

then we will choose the transcript and protein identifiers, and we fetch the corresponding data: (see Choose the transcript and protein of interest for more info):

>>> from gpsea.model.genome import GRCh38
>>> from gpsea.preprocessing import VVMultiCoordinateService, configure_default_protein_metadata_service
>>> tx_id = "NM_181486.4"
>>> pt_id = "NP_852259.1"
>>> tx_service = VVMultiCoordinateService(genome_build=GRCh38)
>>> tx_coordinates = tx_service.fetch(tx_id)
>>> pm_service = configure_default_protein_metadata_service()
>>> protein_meta = pm_service.annotate(pt_id)

and we will load HPO v2024-07-01 for using in the exploratory analysis:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')

Interactive exploration

We designed GPSEA to integrate with interactive Python with a widespread use in contemporary scientific workflows.

The code for exploration is located in gpsea.view module. As a rule of thumb, the reports are provided as GpseaReport which leverages IPython’s “magic” to integrate with environments such as Jupyter notebook.

Cohort summary

We recommend that users start be generating a cohort summary with an overview about the HPO terms, variants, diseases, and variant effects that occur most frequently:

>>> from gpsea.view import CohortViewer
>>> viewer = CohortViewer(hpo)
>>> report = viewer.process(cohort=cohort, transcript_id=tx_id)
>>> report  
Cohort

GPSEA cohort analysis

Successfully loaded 156 individuals. 46 were recorded as male, 56 as female, and 54 as unknown sex. 156 were reported to be alive at the time of last encounter, and 0 were deceased. 1 had disease onset information and 51 had information about the age at last encounter. In the cohort, there were a total of 117 HPO terms used for annotation.

No errors encountered.

Top 10 HPO Terms

A total of 117 HPO terms were used to annotated the cohort.
HPO Term ID Seen in n individuals
Atrial septal defect HP:0001631 50
Ventricular septal defect HP:0001629 41
Hypoplasia of the radius HP:0002984 40
Triphalangeal thumb HP:0001199 36
Absent thumb HP:0009777 32
Short thumb HP:0009778 32
Abnormal carpal morphology HP:0001191 30
Secundum atrial septal defect HP:0001684 27
Absent radius HP:0003974 15
Cardiac conduction abnormality HP:0031546 14

Top 10 Variants

Variants are shown according to NM_181486.4. A total of 53 unique variants were identified in the cohort.
Seen in n individuals Variant key Variant Name Protein Variant Variant Class
22 12_114385521_114385521_C_T c.710G>A p.Arg237Gln MISSENSE_VARIANT
20 12_114401830_114401830_C_T c.238G>A p.Gly80Arg MISSENSE_VARIANT
8 12_114385563_114385563_G_A c.668C>T p.Thr223Met MISSENSE_VARIANT
6 12_114398675_114398675_G_T c.408C>A p.Tyr136Ter STOP_GAINED
5 12_114398682_114398682_C_CG c.400dup p.Arg134ProfsTer49 FRAMESHIFT_VARIANT
5 12_114399514_114399514_A_C c.361T>G p.Trp121Gly MISSENSE_VARIANT, SPLICE_REGION_VARIANT
5 12_114403792_114403792_C_CG c.106_107insC p.Ser36ThrfsTer25 FRAMESHIFT_VARIANT
4 12_114385522_114385522_G_A c.709C>T p.Arg237Trp MISSENSE_VARIANT
4 12_114398656_114398656_C_CG c.426dup p.Ala143ArgfsTer40 FRAMESHIFT_VARIANT
4 12_114403798_114403798_G_GC c.100dup p.Ala34GlyfsTer27 FRAMESHIFT_VARIANT

Diseases

Disease Name Disease ID Annotation Count
Holt-Oram syndrome OMIM:142900 156

Variant categories for NM_181486.4

Variant effect Annotation Count Percent
MISSENSE_VARIANT 85 50%
FRAMESHIFT_VARIANT 38 22%
STOP_GAINED 19 11%
SPLICE_REGION_VARIANT 10 6%
SPLICE_DONOR_VARIANT 7 4%
SPLICE_ACCEPTOR_VARIANT 2 1%
SPLICE_DONOR_5TH_BASE_VARIANT 2 1%
INTRON_VARIANT 2 1%
INFRAME_INSERTION 2 1%
STOP_RETAINED_VARIANT 2 1%
INFRAME_DELETION 1 1%

Distribution of variants across protein domains

GPSEA gathers information about protein domains from the UniProt API, and alternatively allows users to enter domain information manually (See Fetch protein data). Protein domains are distinct functional or structural units in a protein. For instance, the following graphic shows domains of the PLD1 protein. The HKD domains contribute to the catalytic activity of the protein whereas the PX and PH domains regulation PLD1 localization within the cell. Observations such as this may suggest testable hypotheses.

PLD1

Human PLD1 with PX, PH, and two HKD domains.

Users can create a table to display the protein domains and the variants located in them in order to decide whether it might be sensible to test for correlation between variants located in one or more protein domains and a certain phenotype.

This code will produce the following table on the basis of a cohort of individuals with variants in the TBX5 gene:

>>> from gpsea.view import ProteinVariantViewer
>>> cpd_viewer = ProteinVariantViewer(tx_id=tx_id, protein_metadata=protein_meta)
>>> report = cpd_viewer.process(cohort)
>>> report  
Cohort

T-box transcription factor TBX5: NP_852259.1

T-box transcription factor TBX5 (NP_852259.1) has 2 annotated protein features. 32 were found to be located within these features, and 113 were located elsewhere. The total length of the protein is 518 amino acid residues.
Name Type Coordinates Count Variants
Disordered REGION 1 - 46 13 p.Ala34GlyfsTer27; p.Ala34ProfsTer32; p.Ser36ThrfsTer25; p.Pro14Thr
Disordered REGION 250 - 356 19 p.Gln302ArgfsTer92; p.Gln315ArgfsTer79; p.Arg279Ter; p.Tyr342ThrfsTer52; p.Val263Met; p.Ser261Cys; p.Val267TrpfsTer127; p.Tyr291Ter; p.Glu294Ter

Plot distribution of variants with respect to the protein sequence

We use Matplotlib to plot the distribution of variants on a protein diagram:

>>> import matplotlib.pyplot as plt
>>> from gpsea.view import ProteinVisualizer
>>> fig, ax = plt.subplots(figsize=(15, 8))
>>> visualizer = ProteinVisualizer()
>>> visualizer.draw_protein_diagram(
...     tx_coordinates,
...     protein_meta,
...     cohort,
...     ax=ax,
... )
TBX5 protein diagram