Tutorial
Here we demonstrate an end-to-end genotype-phenotype analysis with GPSEA. The tutorial illustrates just one of the many ways GPSEA can be used to characterize genotype-phenotype correlations. The User guide contains details about additional methods and functionalities.
The tutorial presents an analysis of a cohort of individuals with pathogenic variants in TBX5 leading to Holt-Oram syndrome MIM:142900.
Holt-Oram syndrome is an autosomal dominant disorder characterized by upper limb defects, congenital heart defects, and arrhythmias (PMID:38336121). It has been observed in the literature that congenital defects of the ventricular and atrial septum are more common in the truncating than in the missense variants (PMID:30552424). Additionally, upper limb defects are more frequent in patients with protein-truncating variants (PMID:38336121).
To perform the analysis, we curated the literature and created on GA4GH phenopacket for each affected individual. The phenopackets are made available in Phenopacket Store.
The analysis
For the analysis, the MANE transcript (i.e., the “main” biomedically relevant transcript of a gene) should be chosen unless there is a specific reason not to (which should occur rarely if at all).
In the case of TBX5 the MANE transcript is NM_181486.4. Note that the trascript identifier (NM_181486) and the version (4) are both required. A good way to find the MANE transcript is to search on the gene symbol (e.g., TBX5) in ClinVar and to choose a variant that is specifically located in the gene. The MANE transcript will be displayed here (e.g., NM_181486.4(TBX5):c.1221C>G (p.Tyr407Ter)).
We additionally need the corresponding protein identifier. A good way to find this is to search on the transcript id in NCBI Nucleotide. In our case, search on NM_181486.4 will bring us to this page. If we search within this page for “NP_”, this will bring us to the corresponding protein accession NP_852259.1.
>>> cohort_name = 'TBX5'
>>> tx_id = 'NM_181486.4'
>>> px_id = 'NP_852259.1'
Load HPO
GPSEA needs HPO to do the analysis. We use HPO toolkit to load HPO version v2024-07-01:
>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')
Tip
Use the latest HPO release by omitting the release option.
Prepare cohort
Now we will load the samples to analyze. We will use the cohort of 156 individuals with mutations in TBX5 whose clinical signs and symptoms were encoded into HPO terms and stored in Phenopacket Store.
>>> from ppktstore.registry import configure_phenopacket_registry
>>> phenopacket_registry = configure_phenopacket_registry()
>>> with phenopacket_registry.open_phenopacket_store("0.1.20") as ps:
... phenopackets = tuple(ps.iter_cohort_phenopackets(cohort_name))
>>> len(phenopackets)
156
We loaded 156 phenopackets which need further preprocessing to prepare for the analysis.
We will compute functional annotations for the mutations and then include the individuals into
a Cohort
:
>>> from gpsea.preprocessing import configure_caching_cohort_creator, load_phenopackets
>>> cohort_creator = configure_caching_cohort_creator(hpo)
>>> cohort, validation = load_phenopackets(
... phenopackets=phenopackets,
... cohort_creator=cohort_creator,
... )
Individuals Processed: ...
and we will check that there are no Q/C issues:
>>> validation.summarize()
Validated under none policy
No errors or warnings were found
We loaded the patient data into a cohort which is ready for the next steps.
See also
Here we show how to create a Cohort
from phenopackets.
See Input data section to learn how to create a cohort from another inputs.
Explore cohort
GPSEA helps with gaining insight into the cohort by providing
Show cohort summary
The summary report provides an overview about the HPO terms, variants, diseases, and variant effects that occur most frequently:
>>> from gpsea.view import CohortViewer
>>> viewer = CohortViewer(hpo)
>>> report = viewer.process(cohort=cohort, transcript_id=tx_id)
>>> report
GPSEA cohort analysis
Successfully loaded 156 individuals. 46 were recorded as male, 56 as female, and 54 as unknown sex. 156 were reported to be alive at the time of last encounter, and 0 were deceased. 1 had disease onset information and 51 had information about the age at last encounter. In the cohort, there were a total of 117 HPO terms used for annotation.
No errors encountered.
HPO Term | ID | Seen in n individuals |
---|---|---|
Atrial septal defect | HP:0001631 | 50 |
Ventricular septal defect | HP:0001629 | 41 |
Hypoplasia of the radius | HP:0002984 | 40 |
Triphalangeal thumb | HP:0001199 | 36 |
Short thumb | HP:0009778 | 32 |
Absent thumb | HP:0009777 | 32 |
Abnormal carpal morphology | HP:0001191 | 30 |
Secundum atrial septal defect | HP:0001684 | 27 |
Absent radius | HP:0003974 | 15 |
Cardiac conduction abnormality | HP:0031546 | 14 |
Seen in n individuals | Variant key | Variant Name | Protein Variant | Variant Class |
---|---|---|---|---|
22 | 12_114385521_114385521_C_T | c.710G>A | p.Arg237Gln | MISSENSE_VARIANT |
20 | 12_114401830_114401830_C_T | c.238G>A | p.Gly80Arg | MISSENSE_VARIANT |
8 | 12_114385563_114385563_G_A | c.668C>T | p.Thr223Met | MISSENSE_VARIANT |
6 | 12_114398675_114398675_G_T | c.408C>A | p.Tyr136Ter | STOP_GAINED |
5 | 12_114399514_114399514_A_C | c.361T>G | p.Trp121Gly | MISSENSE_VARIANT, SPLICE_REGION_VARIANT |
5 | 12_114398682_114398682_C_CG | c.400dup | p.Arg134ProfsTer49 | FRAMESHIFT_VARIANT |
5 | 12_114403792_114403792_C_CG | c.106_107insC | p.Ser36ThrfsTer25 | FRAMESHIFT_VARIANT |
4 | 12_114385474_114385474_A_G | c.755+2T>C | None | SPLICE_DONOR_VARIANT |
4 | 12_114398656_114398656_C_CG | c.426dup | p.Ala143ArgfsTer40 | FRAMESHIFT_VARIANT |
4 | 12_114403798_114403798_G_GC | c.100dup | p.Ala34GlyfsTer27 | FRAMESHIFT_VARIANT |
Disease Name | Disease ID | Annotation Count |
---|---|---|
Holt-Oram syndrome | OMIM:142900 | 156 |
Variant effect | Annotation Count | Percent |
---|---|---|
MISSENSE_VARIANT | 85 | 50% |
FRAMESHIFT_VARIANT | 38 | 22% |
STOP_GAINED | 19 | 11% |
SPLICE_REGION_VARIANT | 10 | 6% |
SPLICE_DONOR_VARIANT | 7 | 4% |
INFRAME_INSERTION | 2 | 1% |
STOP_RETAINED_VARIANT | 2 | 1% |
SPLICE_ACCEPTOR_VARIANT | 2 | 1% |
SPLICE_DONOR_5TH_BASE_VARIANT | 2 | 1% |
INTRON_VARIANT | 2 | 1% |
INFRAME_DELETION | 1 | 1% |
Plot distribution of variants with respect to the protein sequence
Now we can show the distribution of variants with respect to the encoded protein.
We first obtain tx_coordinates (TranscriptCoordinates
)
and protein_meta (ProteinMetadata
)
with information about the transcript and protein “anatomy”:
>>> from gpsea.model.genome import GRCh38
>>> from gpsea.preprocessing import configure_protein_metadata_service, VVMultiCoordinateService
>>> txc_service = VVMultiCoordinateService(genome_build=GRCh38)
>>> pms = configure_protein_metadata_service()
>>> tx_coordinates = txc_service.fetch(tx_id)
>>> protein_meta = pms.annotate(px_id)
and we follow with plotting the diagram of the mutations on the protein:
>>> from gpsea.view import ProteinVisualizer
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(15, 8))
>>> visualizer = ProteinVisualizer()
>>> visualizer.draw_protein_diagram(
... tx_coordinates,
... protein_meta,
... cohort,
... ax=ax,
... )
Summarize all variant alleles
We can prepare a table of all variant alleles that occurr in the cohort. Each table row corresponds to a single allele and lists the variant key, the predicted effect on the transcript (cDNA) and protein of interest, the variant effects, and the number of patients who present with one or more variant alleles (Count):
>>> from gpsea.view import CohortVariantViewer
>>> viewer = CohortVariantViewer(tx_id=tx_id)
>>> report = viewer.process(cohort=cohort)
>>> report
GPSEA cohort analysis: All variant alleles
Variant key | Variant (cDNA) | Variant (protein) | Effects | Count |
---|---|---|---|---|
12_114385521_114385521_C_T | c.710G>A | p.Arg237Gln | missense | 22 |
12_114401830_114401830_C_T | c.238G>A | p.Gly80Arg | missense | 20 |
12_114385563_114385563_G_A | c.668C>T | p.Thr223Met | missense | 8 |
12_114398675_114398675_G_T | c.408C>A | p.Tyr136Ter | stop gained | 6 |
12_114399514_114399514_A_C | c.361T>G | p.Trp121Gly | missense, splice region | 5 |
12_114403792_114403792_C_CG | c.106_107insC | p.Ser36ThrfsTer25 | frameshift | 5 |
12_114398682_114398682_C_CG | c.400dup | p.Arg134ProfsTer49 | frameshift | 5 |
12_114385474_114385474_A_G | c.755+2T>C | None | splice donor | 4 |
12_114398656_114398656_C_CG | c.426dup | p.Ala143ArgfsTer40 | frameshift | 4 |
12_114366360_114366360_C_T | c.787G>A | p.Val263Met | missense | 4 |
12_114385522_114385522_G_A | c.709C>T | p.Arg237Trp | missense | 4 |
12_114403798_114403798_G_GC | c.100dup | p.Ala34GlyfsTer27 | frameshift | 4 |
12_114399613_114399613_T_A | c.262A>T | p.Lys88Ter | stop gained | 3 |
12_114401827_114401827_T_A | c.241A>T | p.Arg81Trp | missense, splice region | 3 |
12_114366312_114366312_G_A | c.835C>T | p.Arg279Ter | stop gained | 3 |
12_114366366_114366366_T_A | c.781A>T | p.Ser261Cys | missense | 3 |
12_114403798_114403799_GC_G | c.100del | p.Ala34ProfsTer32 | frameshift | 3 |
12_114385475_114385475_C_T | c.755+1G>A | None | splice donor | 3 |
12_114401853_114401853_G_T | c.215C>A | p.Thr72Lys | missense | 3 |
12_114385521_114385521_C_G | c.710G>C | p.Arg237Pro | missense | 2 |
12_114398568_114398568_C_A | c.510+5G>T | None | splice donor 5th base, intronic | 2 |
12_114366207_114366208_GC_G | c.939del | p.Gln315ArgfsTer79 | frameshift | 2 |
12_114366274_114366274_G_T | c.873C>A | p.Tyr291Ter | stop gained | 2 |
12_114366267_114366267_C_A | c.880G>T | p.Glu294Ter | stop gained | 2 |
12_114398578_114398579_CA_C | c.504del | p.Phe168LeufsTer6 | frameshift | 2 |
12_114394762_114394763_CA_C | c.641del | p.Val214GlyfsTer12 | frameshift | 2 |
12_114398666_114398667_TG_T | c.416del | p.Pro139GlnfsTer11 | frameshift | 2 |
12_114403754_114403754_G_T | c.145C>A | p.Gln49Lys | missense, splice region | 2 |
12_114385550_114385550_A_AATTATTCTCAG | c.680_681insCTGAGAATAAT | p.Ile227_Glu228insTer | inframe insertion, stop retainined | 2 |
12_114366348_114366349_CT_C | c.798del | p.Val267TrpfsTer127 | frameshift | 1 |
12_114399559_114399559_T_C | c.316A>G | p.Ile106Val | missense | 1 |
12_114356064_114356065_TA_T | c.1024del | p.Tyr342ThrfsTer52 | frameshift | 1 |
12_114355755_114355756_TG_T | c.1333del | p.His445MetfsTer137 | frameshift | 1 |
12_114398632_114398632_G_A | c.451C>T | p.Gln151Ter | stop gained | 1 |
12_114401907_114401907_A_G | c.161T>C | p.Ile54Thr | missense | 1 |
12_114355723_114355723_G_A | c.1366C>T | p.Gln456Ter | stop gained | 1 |
12_114398602_114398602_T_G | c.481A>C | p.Thr161Pro | missense | 1 |
12_114398708_114398709_GC_G | c.374del | p.Gly125AlafsTer25 | frameshift | 1 |
12_114385553_114385553_C_A | c.678G>T | p.Lys226Asn | missense | 1 |
12_114399633_114399633_C_G | c.243-1G>C | None | splice acceptor | 1 |
12_114401873_114401874_TA_T | c.194del | p.Leu65GlnfsTer10 | frameshift | 1 |
12_114399594_114399594_A_C | c.281T>G | p.Leu94Arg | missense | 1 |
12_114399622_114399622_G_T | c.253C>A | p.Pro85Thr | missense | 1 |
12_114394743_114394746_TGTG_T | c.658_660del | p.His220del | inframe deletion | 1 |
12_114394820_114394820_C_G | c.584G>C | p.Gly195Ala | missense | 1 |
12_114401846_114401846_C_G | c.222G>C | p.Met74Ile | missense | 1 |
12_114366241_114366242_CT_C | c.905del | p.Gln302ArgfsTer92 | frameshift | 1 |
12_114355784_114355785_CA_C | c.1304del | p.Leu435ArgfsTer147 | frameshift | 1 |
12_114398626_114398627_CG_C | c.456del | p.Val153SerfsTer21 | frameshift | 1 |
12_114403859_114403859_G_T | c.40C>A | p.Pro14Thr | missense | 1 |
12_114399625_114399629_ACATC_A | c.246_249del | p.Met83PhefsTer6 | frameshift | 1 |
12_114394817_114394817_G_C | c.587C>G | p.Ser196Ter | stop gained | 1 |
12_114401921_114401921_C_G | c.148-1G>C | None | splice acceptor | 1 |
Prepare genotype and phenotype predicates
We will create a predicate to bin patients into group depending on presence of a single allele of a missense or frameshift variant to test if there is a difference between frameshift and non-frameshift variants in the individuals of the TBX5 cohort.
>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate.genotype import VariantPredicates, monoallelic_predicate
>>> gt_predicate = monoallelic_predicate(
... a_predicate=VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id),
... b_predicate=VariantPredicates.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id),
... a_label="Missense", b_label="Frameshift",
... )
>>> gt_predicate.group_labels
('Missense', 'Frameshift')
Note
There are many other ways to set up a predicate for testing for a GP correlation. See the Predicates section to learn more about building a predicate of interest.
The phenotype grouping is based on presence or absence of an HPO term. We take advantage of the ontology “true path rule” to infer presence of the ancestor terms for all present HPO terms (e.g. presence of Abnormal ventricular septum morphology in an individual annotated with Ventricular septal defect) and exclusion of the descendant terms for all excluded terms (e.g. exclusion of Motor seizure in an individual where Seizure was excluded):
>>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
>>> pheno_predicates = prepare_predicates_for_terms_of_interest(
... cohort=cohort,
... hpo=hpo,
... )
By default, GPSEA will perform one hypothesis test for each HPO term used to annotate at least one individual in the cohort. Testing multiple hypothesis on the same dataset increases the chance of receiving false positive result. However, GPSEA simplifies the application of an appropriate multiple testing correction.
For general use, we recommend using a combination
of a phenotype MT filter (PhenotypeMtcFilter
) with a multiple testing correction.
Phenotype MT filter chooses the HPO terms to test according to several heuristics, which
reduce the multiple testing burden and focus the analysis
on the most interesting terms (see HPO MT filter for more info).
Then the multiple testing correction, such as Bonferroni or Benjamini-Hochberg,
is used to control the family-wise error rate or the false discovery rate.
See Multiple-testing correction for more information.
>>> from gpsea.analysis.pcats import configure_hpo_term_analysis
>>> analysis = configure_hpo_term_analysis(hpo)
configure_hpo_term_analysis()
configures the analysis
that uses HPO MTC filter (HpoMtcFilter
) for selecting HPO terms of interest,
Fisher Exact test for computing nominal p values, and Benjamini-Hochberg for multiple testing correction.
Now we can perform the analysis and investigate the results.
>>> result = analysis.compare_genotype_vs_phenotypes(
... cohort=cohort,
... gt_predicate=gt_predicate,
... pheno_predicates=pheno_predicates,
... )
>>> result.total_tests
16
We only tested 16 HPO terms. This is despite the individuals being collectively annotated with 369 direct and indirect HPO terms
>>> len(result.phenotypes)
369
We can show the reasoning behind not testing 353 (369 - 16) HPO terms by exploring the phenotype MTC filtering report.
>>> from gpsea.view import MtcStatsViewer
>>> mtc_viewer = MtcStatsViewer()
>>> mtc_report = mtc_viewer.process(result)
>>> mtc_report
Phenotype testing report
Phenotype MTC filter: HPO MTC filter
Multiple testing correction: fdr_bh
Performed statistical tests for 16 out of the total of 369 HPO terms.
Code | Reason | Count |
---|---|---|
HMF01 | Skipping term with maximum frequency that was less than threshold 0.4 | 94 |
HMF08 | Skipping general term | 48 |
HMF09 | Skipping term with maximum annotation frequency that was less than threshold 0.4 | 211 |
and these are the tested HPO terms ordered by the p value corrected with the Benjamini-Hochberg procedure:
>>> from gpsea.view import summarize_hpo_analysis
>>> summary_df = summarize_hpo_analysis(hpo, result)
>>> summary_df
Allele group |
Missense |
Missense |
Frameshift |
Frameshift |
||
---|---|---|---|---|---|---|
Count |
Percent |
Count |
Percent |
Corrected p values |
p values |
|
Ventricular septal defect [HP:0001629] |
31/60 |
52% |
19/19 |
100% |
0.0008990549794102921 |
5.6190936213143254e-05 |
Absent thumb [HP:0009777] |
12/71 |
17% |
14/31 |
45% |
0.04502808125400047 |
0.005628510156750059 |
Triphalangeal thumb [HP:0001199] |
13/72 |
18% |
13/32 |
41% |
0.13654199434471745 |
0.02560162393963452 |
Hypoplasia of the radius [HP:0002984] |
30/62 |
48% |
6/14 |
43% |
1.0 |
0.7735491022101784 |
Abnormal cardiac septum morphology [HP:0001671] |
62/62 |
100% |
28/28 |
100% |
1.0 |
1.0 |
Abnormal hand morphology [HP:0005922] |
53/53 |
100% |
20/20 |
100% |
1.0 |
1.0 |
Atrial septal defect [HP:0001631] |
42/44 |
95% |
20/20 |
100% |
1.0 |
1.0 |
Abnormal atrial septum morphology [HP:0011994] |
43/43 |
100% |
20/20 |
100% |
1.0 |
1.0 |
Abnormal cardiac atrium morphology [HP:0005120] |
43/43 |
100% |
20/20 |
100% |
1.0 |
1.0 |
Abnormal appendicular skeleton morphology [HP:0011844] |
64/64 |
100% |
34/34 |
100% |
1.0 |
1.0 |
Aplasia/hypoplasia of the extremities [HP:0009815] |
55/55 |
100% |
22/22 |
100% |
1.0 |
1.0 |
Aplasia/hypoplasia involving the skeleton [HP:0009115] |
56/56 |
100% |
23/23 |
100% |
1.0 |
1.0 |
Aplasia/hypoplasia involving bones of the upper limbs [HP:0006496] |
55/55 |
100% |
22/22 |
100% |
1.0 |
1.0 |
Aplasia/hypoplasia involving bones of the extremities [HP:0045060] |
55/55 |
100% |
22/22 |
100% |
1.0 |
1.0 |
Abnormal finger morphology [HP:0001167] |
36/36 |
100% |
31/31 |
100% |
1.0 |
1.0 |
Abnormal digit morphology [HP:0011297] |
38/38 |
100% |
33/33 |
100% |
1.0 |
1.0 |
We see that several HPO terms are significantly associated with presence of a frameshift variant in TBX5. For example, Ventricular septal defect was observed in 31/60 (52%) patients with a missense variant but it was observed in 19/19 (100%) patients with a frameshift variant. Fisher exact test computed a p value of ~0.0000562 and the p value corrected by Benjamini-Hochberg procedure is ~0.000899.