Cohort exploratory analysis
As a general rule for statistical testing, it is preferable to formulate one or several hypotheses and to test these in a targeted way. Performing numerous tests without any specific plan comes with an increased danger of false-positive results (although it many be appropriate to generate hypotheses if a validation cohort is available).
Types of statistical test
GPSEA offers four statistical tests.
Fisher Exact Test (FET): Association between two categorical variables. In this case, the first categorical variable is presence/absense of a genotype and the second is presence/absense of annotation to an HPO term.
Mann Whitney U test: Association between presence/absence of a genotype with a phenotype score (e.g., devries).
t test: Association between presence/absence of a genotype with a numerical test result (e.g., potassium level)
log rank score: Association between presence/absence of a genotype with age-of-onset of disease or of an HPO feature or with mortality
Please consult the documentation pages for each of these tests if you are unsure which to use.
Power
Two things must be true to identify a valid genotype-phenotype correlation. First, and obviously, the correlation must actually exist in the dataset being analyzed. Second, there must be statistical power to identify the effect. Power is a measure of the ability of an experimental design and hypothesis testing setup to detect a particular effect if it is truly present (Wikipedia).
It will generally not be possible to perform a formal power calculation because the effect size of most genotype-phenotype correlations is not well known. But, for instance it would not make much sense to test for correlation with a specific variant that occurs in only one of 50 individuals in a cohort.
Exploration
The purpose of the exploratory analysis is thus to decide which tests to perform. GPSEA provides tables and graphics that visualize some of the salient aspects of the cohort and the distribution of the identified variants. We exemplify the exploratory analysis on a cohort of 156 individuals with mutations in TBX5 from Phenopacket Store 0.1.20. The cohort was preprocessed as described in the Input data section and stored at docs/cohort-data/TBX5.0.1.20.json in JSON format.
We start by loading the cohort from the JSON file:
>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort = 'docs/cohort-data/TBX5.0.1.20.json'
>>> with open(fpath_cohort) as fh:
... cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
156
then we will choose the transcript and protein identifiers, and we fetch the corresponding data: (see Choose the transcript and protein of interest for more info):
>>> from gpsea.model.genome import GRCh38
>>> from gpsea.preprocessing import VVMultiCoordinateService, configure_default_protein_metadata_service
>>> tx_id = "NM_181486.4"
>>> pt_id = "NP_852259.1"
>>> tx_service = VVMultiCoordinateService(genome_build=GRCh38)
>>> tx_coordinates = tx_service.fetch(tx_id)
>>> pm_service = configure_default_protein_metadata_service()
>>> protein_meta = pm_service.annotate(pt_id)
and we will load HPO v2024-07-01 for using in the exploratory analysis:
>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')
Interactive exploration
We designed GPSEA to integrate with interactive Python with a widespread use in contemporary scientific workflows.
The code for exploration is located in gpsea.view
module.
As a rule of thumb, the reports are provided as GpseaReport
which leverages IPython’s “magic” to integrate with environments such as Jupyter notebook.
Cohort summary
We recommend that users start be generating a cohort summary with an overview about the HPO terms, variants, diseases, and variant effects that occur most frequently:
>>> from gpsea.view import CohortViewer
>>> viewer = CohortViewer(hpo)
>>> report = viewer.process(cohort=cohort, transcript_id=tx_id)
>>> report
GPSEA cohort analysis
Successfully loaded 156 individuals. 46 were recorded as male, 56 as female, and 54 as unknown sex. 156 were reported to be alive at the time of last encounter, and 0 were deceased. 1 had disease onset information and 51 had information about the age at last encounter. In the cohort, there were a total of 117 HPO terms used for annotation.
No errors encountered.
HPO Term | ID | Seen in n individuals |
---|---|---|
Atrial septal defect | HP:0001631 | 50 |
Ventricular septal defect | HP:0001629 | 41 |
Hypoplasia of the radius | HP:0002984 | 40 |
Triphalangeal thumb | HP:0001199 | 36 |
Absent thumb | HP:0009777 | 32 |
Short thumb | HP:0009778 | 32 |
Abnormal carpal morphology | HP:0001191 | 30 |
Secundum atrial septal defect | HP:0001684 | 27 |
Absent radius | HP:0003974 | 15 |
Cardiac conduction abnormality | HP:0031546 | 14 |
Seen in n individuals | Variant key | Variant Name | Protein Variant | Variant Class |
---|---|---|---|---|
22 | 12_114385521_114385521_C_T | c.710G>A | p.Arg237Gln | MISSENSE_VARIANT |
20 | 12_114401830_114401830_C_T | c.238G>A | p.Gly80Arg | MISSENSE_VARIANT |
8 | 12_114385563_114385563_G_A | c.668C>T | p.Thr223Met | MISSENSE_VARIANT |
6 | 12_114398675_114398675_G_T | c.408C>A | p.Tyr136Ter | STOP_GAINED |
5 | 12_114398682_114398682_C_CG | c.400dup | p.Arg134ProfsTer49 | FRAMESHIFT_VARIANT |
5 | 12_114399514_114399514_A_C | c.361T>G | p.Trp121Gly | MISSENSE_VARIANT, SPLICE_REGION_VARIANT |
5 | 12_114403792_114403792_C_CG | c.106_107insC | p.Ser36ThrfsTer25 | FRAMESHIFT_VARIANT |
4 | 12_114385522_114385522_G_A | c.709C>T | p.Arg237Trp | MISSENSE_VARIANT |
4 | 12_114398656_114398656_C_CG | c.426dup | p.Ala143ArgfsTer40 | FRAMESHIFT_VARIANT |
4 | 12_114403798_114403798_G_GC | c.100dup | p.Ala34GlyfsTer27 | FRAMESHIFT_VARIANT |
Disease Name | Disease ID | Annotation Count |
---|---|---|
Holt-Oram syndrome | OMIM:142900 | 156 |
Variant effect | Annotation Count | Percent |
---|---|---|
MISSENSE_VARIANT | 85 | 50% |
FRAMESHIFT_VARIANT | 38 | 22% |
STOP_GAINED | 19 | 11% |
SPLICE_REGION_VARIANT | 10 | 6% |
SPLICE_DONOR_VARIANT | 7 | 4% |
SPLICE_ACCEPTOR_VARIANT | 2 | 1% |
SPLICE_DONOR_5TH_BASE_VARIANT | 2 | 1% |
INTRON_VARIANT | 2 | 1% |
INFRAME_INSERTION | 2 | 1% |
STOP_RETAINED_VARIANT | 2 | 1% |
INFRAME_DELETION | 1 | 1% |
Distribution of variants across protein domains
GPSEA gathers information about protein domains from the UniProt API, and alternatively allows users to enter domain information manually (See Fetch protein data). Protein domains are distinct functional or structural units in a protein. For instance, the following graphic shows domains of the PLD1 protein. The HKD domains contribute to the catalytic activity of the protein whereas the PX and PH domains regulation PLD1 localization within the cell. Observations such as this may suggest testable hypotheses.
Users can create a table to display the protein domains and the variants located in them in order to decide whether it might be sensible to test for correlation between variants located in one or more protein domains and a certain phenotype.
This code will produce the following table on the basis of a cohort of individuals with variants in the TBX5 gene:
>>> from gpsea.view import ProteinVariantViewer
>>> cpd_viewer = ProteinVariantViewer(tx_id=tx_id, protein_metadata=protein_meta)
>>> report = cpd_viewer.process(cohort)
>>> report
T-box transcription factor TBX5: NP_852259.1
Name | Type | Coordinates | Count | Variants |
---|---|---|---|---|
Disordered | REGION | 1 - 46 | 13 | p.Ala34GlyfsTer27; p.Ala34ProfsTer32; p.Ser36ThrfsTer25; p.Pro14Thr |
Disordered | REGION | 250 - 356 | 19 | p.Gln302ArgfsTer92; p.Gln315ArgfsTer79; p.Arg279Ter; p.Tyr342ThrfsTer52; p.Val263Met; p.Ser261Cys; p.Val267TrpfsTer127; p.Tyr291Ter; p.Glu294Ter |
Plot distribution of variants with respect to the protein sequence
We use Matplotlib to plot the distribution of variants on a protein diagram:
>>> import matplotlib.pyplot as plt
>>> from gpsea.view import ProteinVisualizer
>>> fig, ax = plt.subplots(figsize=(15, 8))
>>> visualizer = ProteinVisualizer()
>>> visualizer.draw_protein_diagram(
... tx_coordinates,
... protein_meta,
... cohort,
... ax=ax,
... )