Cohort exploratory analysis

Background

As a general rule for statistical testing, it is preferable to formulate one or several hypotheses and to test these in a targeted way. Performing numerous tests without any specific plan comes with an increased danger of false-positive results (although it may be appropriate to generate hypotheses if a validation cohort is available).

Types of statistical tests

GPSEA offers four statistical tests:

  • Fisher Exact Test (FET): Association between two categorical variables. In this case, the first categorical variable is presence/absense of a genotype and the second is presence/absense of annotation to an HPO term.

  • Mann Whitney U test: Association between presence/absence of a genotype with a phenotype score (e.g., De Vries Score).

  • t test: Association between presence/absence of a genotype with a numerical test result (e.g., potassium level).

  • log rank test: Association between presence/absence of a genotype with age-of-onset of a disease or of an HPO feature or with mortality.

Please consult the documentation pages for each of these tests if you are unsure which to use.

Power

Two things must be true to identify a valid genotype-phenotype correlation. First, and obviously, the correlation must actually exist in the dataset being analyzed. Second, there must be statistical power to identify the effect. Power is a measure of the ability of an experimental design and hypothesis testing setup to detect a particular effect if it is truly present (Wikipedia).

It will generally not be possible to perform a formal power calculation because the effect size of most genotype-phenotype correlations is not well known. But, for instance it would not make much sense to test for correlation with a specific variant that occurs in only one of 50 individuals in a cohort.

Exploration

The purpose of the exploratory analysis is thus to decide which tests to perform. GPSEA provides tables and graphics that visualize some of the salient aspects of the cohort and the distribution of the identified variants. We exemplify the exploratory analysis on a cohort of 156 individuals with mutations in TBX5 from Phenopacket Store 0.1.20. The cohort was preprocessed as described in the Input data section and stored at docs/cohort-data/TBX5.0.1.20.json in JSON format.

We start by loading the cohort from the JSON file:

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort = 'docs/cohort-data/TBX5.0.1.20.json'
>>> with open(fpath_cohort) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
156

Then we choose the transcript and protein identifiers, and we fetch the corresponding data: (see Choose the transcript and protein of interest for more info):

>>> from gpsea.preprocessing import configure_default_tx_coordinate_service, configure_default_protein_metadata_service
>>> tx_id = "NM_181486.4"
>>> pt_id = "NP_852259.1"
>>> tx_service = configure_default_tx_coordinate_service(genome_build="GRCh38.p13")
>>> tx_coordinates = tx_service.fetch(tx_id)
>>> pm_service = configure_default_protein_metadata_service()
>>> protein_meta = pm_service.annotate(pt_id)

Last, we load HPO v2024-07-01 to use in the exploratory analysis:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')

Interactive exploration

The code for generating reports with tables and figures is located in gpsea.view module. As a rule of thumb, the reports leverage IPython’s “magic” to integrate with the environments such as Jupyter notebook. Returning the report object as the last cell expression will render the report in the cell output.

Cohort summary

We recommend that users start be generating a cohort summary with an overview about the HPO terms, variants, diseases, and variant effects that occur most frequently. We create the viewer and we generate the report for a cohort and transcript. The implicit return of the report at the end of the cell renders the report in the cell output:

>>> from gpsea.view import CohortViewer
>>> viewer = CohortViewer(hpo)
>>> report = viewer.process(cohort=cohort, transcript_id=tx_id)
>>> report  

GPSEA cohort analysis

Successfully loaded 156 individuals. 46 were recorded as male, 56 as female, and 54 as unknown sex. No information about individuals' vital status was reported. 52 individuals had disease onset information and 0 had information about the age of last encounter.

No errors encountered.

Measurements

No data regarding measurement assays were provided.

Diseases

The cohort members were diagnosed with 1 disease.
n Disease
156 Holt-Oram syndrome

Variants

A total of 53 unique variants were identified in the cohort. Variants were annotated with respect to NM_181486.4.

Top 10 variants

n Variant key HGVS Variant Class
22 12_114385521_114385521_C_T c.710G>A (p.Arg237Gln) MISSENSE_VARIANT
20 12_114401830_114401830_C_T c.238G>A (p.Gly80Arg) MISSENSE_VARIANT
8 12_114385563_114385563_G_A c.668C>T (p.Thr223Met) MISSENSE_VARIANT
6 12_114398675_114398675_G_T c.408C>A (p.Tyr136Ter) STOP_GAINED
5 12_114398682_114398682_C_CG c.400dup (p.Arg134ProfsTer49) FRAMESHIFT_VARIANT
5 12_114399514_114399514_A_C c.361T>G (p.Trp121Gly) MISSENSE_VARIANT, SPLICE_REGION_VARIANT
5 12_114403792_114403792_C_CG c.106_107insC (p.Ser36ThrfsTer25) FRAMESHIFT_VARIANT
4 12_114385522_114385522_G_A c.709C>T (p.Arg237Trp) MISSENSE_VARIANT
4 12_114398656_114398656_C_CG c.426dup (p.Ala143ArgfsTer40) FRAMESHIFT_VARIANT
4 12_114403798_114403798_G_GC c.100dup (p.Ala34GlyfsTer27) FRAMESHIFT_VARIANT

Variant effects

The effects were predicted for NM_181486.4

Variant effect Count
MISSENSE_VARIANT 85 (50%)
FRAMESHIFT_VARIANT 38 (22%)
STOP_GAINED 19 (11%)
SPLICE_REGION_VARIANT 10 (6%)
SPLICE_DONOR_VARIANT 7 (4%)
INFRAME_INSERTION 2 (1%)
STOP_RETAINED_VARIANT 2 (1%)
SPLICE_ACCEPTOR_VARIANT 2 (1%)
SPLICE_DONOR_5TH_BASE_VARIANT 2 (1%)
INTRON_VARIANT 2 (1%)
INFRAME_DELETION 1 (1%)

Distribution of variants across protein domains

GPSEA gathers information about protein domains from the UniProt API, and alternatively allows users to enter domain information manually (See Fetch protein data). Protein domains are distinct functional or structural units in a protein. For instance, the following graphic shows domains of the PLD1 protein. The HKD domains contribute to the catalytic activity of the protein whereas the PX and PH domains regulation PLD1 localization within the cell. Observations such as this may suggest testable hypotheses.

PLD1

Human PLD1 with PX, PH, and two HKD domains.

Users can create a table to display the protein domains and the variants located in them in order to decide whether it might be sensible to test for correlation between variants located in one or more protein domains and a certain phenotype.

This code will produce the following table on the basis of a cohort of individuals with variants in the TBX5 gene:

>>> from gpsea.view import ProteinVariantViewer
>>> cpd_viewer = ProteinVariantViewer(tx_id=tx_id, protein_metadata=protein_meta)
>>> report = cpd_viewer.process(cohort)
>>> report  

T-box transcription factor TBX5: NP_852259.1

T-box transcription factor TBX5 has 5 annotated protein featureses. 139 were found to be located within these features, and 6 were located elsewhere. The total length of the protein is 518 amino acid residues.

Count Name Type Coordinates Variants
13 Disordered Region 2 - 46 p.Ala34GlyfsTer27; p.Ala34ProfsTer32; p.Ser36ThrfsTer25; p.Pro14Thr
107 T-box DNA binding 59 - 238 p.Pro139GlnfsTer11; p.Ile106Val; p.Phe168LeufsTer6; p.Ser196Ter; p.Arg237Trp; p.Arg237Pro; p.Gly195Ala; p.Ala143ArgfsTer40; p.Thr161Pro; p.Leu65GlnfsTer10; p.Met83PhefsTer6; p.Leu94Arg; p.His220del; p.Tyr136Ter; p.Gln151Ter; p.Ile227_Glu228insTer; p.Arg237Gln; p.Arg81Trp; p.Trp121Gly; p.Thr72Lys; p.Lys226Asn; p.Met74Ile; p.Gly125AlafsTer25; p.Gly80Arg; p.Arg134ProfsTer49; p.Pro85Thr; p.Val214GlyfsTer12; p.Lys88Ter; p.Val153SerfsTer21; p.Thr223Met
19 Disordered Region 251 - 356 p.Tyr291Ter; p.Gln302ArgfsTer92; p.Val267TrpfsTer127; p.Val263Met; p.Gln315ArgfsTer79; p.Ser261Cys; p.Glu294Ter; p.Arg279Ter; p.Tyr342ThrfsTer52
8 Polar residues Compositional bias 264 - 299 p.Arg279Ter; p.Tyr291Ter; p.Glu294Ter; p.Val267TrpfsTer127
1 Basic and acidic residues Compositional bias 321 - 346 p.Tyr342ThrfsTer52

Plot distribution of variants with respect to the protein sequence

We use Matplotlib to plot the distribution of variants on a protein diagram:

>>> import matplotlib.pyplot as plt
>>> from gpsea.view import ProteinVisualizer
>>> fig, ax = plt.subplots(figsize=(15, 8))
>>> visualizer = ProteinVisualizer()
>>> visualizer.draw_protein_diagram(
...     tx_coordinates,
...     protein_meta,
...     cohort,
...     ax=ax,
... )
TBX5 protein diagram