Tutorial

Here we demonstrate an end-to-end genotype-phenotype analysis with GPSEA. The tutorial illustrates just one of the many ways GPSEA can be used to characterize genotype-phenotype correlations. The User guide contains details about additional methods and functionalities.

The tutorial presents an analysis of a cohort of individuals with pathogenic variants in TBX5 leading to Holt-Oram syndrome MIM:142900.

Holt-Oram syndrome is an autosomal dominant disorder characterized by upper limb defects, congenital heart defects, and arrhythmias (PMID:38336121). It has been observed in the literature that congenital defects of the ventricular and atrial septum are more common in the truncating than in the missense variants (PMID:30552424). Additionally, upper limb defects are more frequent in patients with protein-truncating variants (PMID:38336121).

To perform the analysis, we curated the literature and created on GA4GH phenopacket for each affected individual. The phenopackets are made available in Phenopacket Store.

The analysis

For the analysis, the MANE transcript (i.e., the “main” biomedically relevant transcript of a gene) should be chosen unless there is a specific reason not to (which should occur rarely if at all).

In the case of TBX5 the MANE transcript is NM_181486.4. Note that the trascript identifier (NM_181486) and the version (4) are both required. A good way to find the MANE transcript is to search on the gene symbol (e.g., TBX5) in ClinVar and to choose a variant that is specifically located in the gene. The MANE transcript will be displayed here (e.g., NM_181486.4(TBX5):c.1221C>G (p.Tyr407Ter)).

We additionally need the corresponding protein identifier. A good way to find this is to search on the transcript id in NCBI Nucleotide. In our case, search on NM_181486.4 will bring us to this page. If we search within this page for “NP_”, this will bring us to the corresponding protein accession NP_852259.1.

>>> cohort_name = 'TBX5'
>>> tx_id = 'NM_181486.4'
>>> px_id = 'NP_852259.1'

Load HPO

GPSEA needs HPO to do the analysis. We use HPO toolkit to load HPO version v2024-07-01:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')

Tip

Use the latest HPO release by omitting the release option.

Prepare cohort

Now we will load the samples to analyze. We will use the cohort of 156 individuals with mutations in TBX5 whose clinical signs and symptoms were encoded into HPO terms and stored in Phenopacket Store.

>>> from ppktstore.registry import configure_phenopacket_registry
>>> phenopacket_registry = configure_phenopacket_registry()
>>> with phenopacket_registry.open_phenopacket_store('0.1.18') as ps:
...     phenopackets = tuple(ps.iter_cohort_phenopackets(cohort_name))
>>> len(phenopackets)
156

We loaded 156 phenopackets which need further preprocessing to prepare for the analysis. We will compute functional annotations for the mutations and then include the individuals into a Cohort:

>>> from gpsea.preprocessing import configure_caching_cohort_creator, load_phenopackets
>>> cohort_creator = configure_caching_cohort_creator(hpo)
>>> cohort, validation = load_phenopackets(  
...     phenopackets=phenopackets,
...     cohort_creator=cohort_creator,
... )
Individuals Processed: ...

and we will check that there are no Q/C issues:

>>> validation.summarize()  
Validated under none policy
No errors or warnings were found

We loaded the patient data into a cohort which is ready for the next steps.

See also

Here we show how to create a Cohort from phenopackets. See Input data section to learn how to create a cohort from another inputs.

Explore cohort

GPSEA helps with gaining insight into the cohort by providing

Show cohort summary

The summary report provides an overview about the HPO terms, variants, diseases, and variant effects that occurr most frequently:

>>> from gpsea.view import CohortViewable
>>> viewer = CohortViewable(hpo)
>>> report = viewer.process(cohort=cohort, transcript_id=tx_id)
>>> with open('docs/report/tbx5_cohort_info.html', 'w') as fh:  
...     _ = fh.write(report)
Cohort

GPSEA cohort analysis

Successfully loaded 156 individuals.

No errors encountered.

Top 10 HPO Terms

A total of 116 HPO terms were used to annotated the cohort.
HPO Term ID Seen in n individuals
Atrial septal defect HP:0001631 50
Ventricular septal defect HP:0001629 41
Hypoplasia of the radius HP:0002984 40
Triphalangeal thumb HP:0001199 36
Absent thumb HP:0009777 32
Short thumb HP:0009778 32
Abnormal carpal morphology HP:0001191 30
Secundum atrial septal defect HP:0001684 27
Absent radius HP:0003974 15
Cardiac conduction abnormality HP:0031546 14

Top 10 Variants

Variants are shown according to NM_181486.4. A total of 156 unique variants were identified in the cohort.
Count Variant key Variant Name Protein Variant Variant Class
22 12_114385521_114385521_C_T c.710G>A p.Arg237Gln MISSENSE_VARIANT
20 12_114401830_114401830_C_T c.238G>A p.Gly80Arg MISSENSE_VARIANT
8 12_114385563_114385563_G_A c.668C>T p.Thr223Met MISSENSE_VARIANT
6 12_114398675_114398675_G_T c.408C>A p.Tyr136Ter STOP_GAINED
5 12_114398682_114398682_C_CG c.400dup p.Arg134ProfsTer49 FRAMESHIFT_VARIANT
5 12_114403792_114403792_C_CG c.106_107insC p.Ser36ThrfsTer25 FRAMESHIFT_VARIANT
5 12_114399514_114399514_A_C c.361T>G p.Trp121Gly MISSENSE_VARIANT, SPLICE_REGION_VARIANT
4 12_114385474_114385474_A_G c.755+2T>C None SPLICE_DONOR_VARIANT
4 12_114398656_114398656_C_CG c.426dup p.Ala143ArgfsTer40 FRAMESHIFT_VARIANT
4 12_114366360_114366360_C_T c.787G>A p.Val263Met MISSENSE_VARIANT

Diseases

Disease Name Disease ID Annotation Count
Holt-Oram syndrome OMIM:142900 156

Variant categories for NM_181486.4

Variant effect Annotation Count
FRAMESHIFT_VARIANT 38
MISSENSE_VARIANT 85
STOP_GAINED 19
SPLICE_REGION_VARIANT 10
SPLICE_ACCEPTOR_VARIANT 2
SPLICE_DONOR_VARIANT 7
SPLICE_DONOR_5TH_BASE_VARIANT 2
INTRON_VARIANT 2
INFRAME_INSERTION 2
STOP_RETAINED_VARIANT 2
INFRAME_DELETION 1

Note

The report can also be displayed directly in a Jupyter notebook by running:

from IPython.display import HTML, display
display(HTML(report))

Plot distribution of variants with respect to the protein sequence

Now we can show the distribution of variants with respect to the encoded protein. We first obtain tx_coordinates (TranscriptCoordinates) and protein_meta (ProteinMetadata) with information about the transcript and protein “anatomy”:

>>> from gpsea.model.genome import GRCh38
>>> from gpsea.preprocessing import configure_protein_metadata_service, VVMultiCoordinateService
>>> txc_service = VVMultiCoordinateService(genome_build=GRCh38)
>>> pms = configure_protein_metadata_service()
>>> tx_coordinates = txc_service.fetch(tx_id)
>>> protein_meta = pms.annotate(px_id)

and we follow with plotting the diagram of the mutations on the protein:

>>> from gpsea.view import ProteinVisualizer
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(15, 8))
>>> visualizer = ProteinVisualizer()
>>> visualizer.draw_protein_diagram(
...     tx_coordinates,
...     protein_meta,
...     cohort,
...     ax=ax,
... )
>>> fig.tight_layout()
>>> fig.savefig('docs/img/tutorial/tbx5_protein_diagram.png')  
TBX5 protein diagram

Summarize all variant alleles

We can prepare a table of all variant alleles that occurr in the cohort. Each table row corresponds to a single allele and lists the variant key, the predicted effect on the transcript (cDNA) and protein of interest, the variant effects, and the number of patients who present with one or more variant alleles (Count):

>>> from gpsea.view import CohortVariantViewer
>>> viewer = CohortVariantViewer(tx_id=tx_id)
>>> report = viewer.process(cohort=cohort)
>>> with open('docs/report/tbx5_all_variants.html', 'w') as fh:  
...     _ = fh.write(report)
Cohort

GPSEA cohort analysis: All variant alleles

Variant alleles

A total of 53 unique alleles were identified in the cohort.
Variant key Variant (cDNA) Variant (protein) Effects Count
12_114385521_114385521_C_T c.710G>A p.Arg237Gln missense 22
12_114401830_114401830_C_T c.238G>A p.Gly80Arg missense 20
12_114385563_114385563_G_A c.668C>T p.Thr223Met missense 8
12_114398675_114398675_G_T c.408C>A p.Tyr136Ter stop gained 6
12_114399514_114399514_A_C c.361T>G p.Trp121Gly missense, splice region 5
12_114403792_114403792_C_CG c.106_107insC p.Ser36ThrfsTer25 frameshift 5
12_114398682_114398682_C_CG c.400dup p.Arg134ProfsTer49 frameshift 5
12_114385474_114385474_A_G c.755+2T>C None splice donor 4
12_114398656_114398656_C_CG c.426dup p.Ala143ArgfsTer40 frameshift 4
12_114366360_114366360_C_T c.787G>A p.Val263Met missense 4
12_114385522_114385522_G_A c.709C>T p.Arg237Trp missense 4
12_114403798_114403798_G_GC c.100dup p.Ala34GlyfsTer27 frameshift 4
12_114399613_114399613_T_A c.262A>T p.Lys88Ter stop gained 3
12_114401827_114401827_T_A c.241A>T p.Arg81Trp missense, splice region 3
12_114366312_114366312_G_A c.835C>T p.Arg279Ter stop gained 3
12_114366366_114366366_T_A c.781A>T p.Ser261Cys missense 3
12_114403798_114403799_GC_G c.100del p.Ala34ProfsTer32 frameshift 3
12_114385475_114385475_C_T c.755+1G>A None splice donor 3
12_114401853_114401853_G_T c.215C>A p.Thr72Lys missense 3
12_114385521_114385521_C_G c.710G>C p.Arg237Pro missense 2
12_114398568_114398568_C_A c.510+5G>T None splice donor 5th base, intronic 2
12_114366207_114366208_GC_G c.939del p.Gln315ArgfsTer79 frameshift 2
12_114366274_114366274_G_T c.873C>A p.Tyr291Ter stop gained 2
12_114366267_114366267_C_A c.880G>T p.Glu294Ter stop gained 2
12_114398578_114398579_CA_C c.504del p.Phe168LeufsTer6 frameshift 2
12_114394762_114394763_CA_C c.641del p.Val214GlyfsTer12 frameshift 2
12_114398666_114398667_TG_T c.416del p.Pro139GlnfsTer11 frameshift 2
12_114403754_114403754_G_T c.145C>A p.Gln49Lys missense, splice region 2
12_114385550_114385550_A_AATTATTCTCAG c.680_681insCTGAGAATAAT p.Ile227_Glu228insTer inframe insertion, stop retainined 2
12_114366348_114366349_CT_C c.798del p.Val267TrpfsTer127 frameshift 1
12_114399559_114399559_T_C c.316A>G p.Ile106Val missense 1
12_114356064_114356065_TA_T c.1024del p.Tyr342ThrfsTer52 frameshift 1
12_114355755_114355756_TG_T c.1333del p.His445MetfsTer137 frameshift 1
12_114398632_114398632_G_A c.451C>T p.Gln151Ter stop gained 1
12_114401907_114401907_A_G c.161T>C p.Ile54Thr missense 1
12_114355723_114355723_G_A c.1366C>T p.Gln456Ter stop gained 1
12_114398602_114398602_T_G c.481A>C p.Thr161Pro missense 1
12_114398708_114398709_GC_G c.374del p.Gly125AlafsTer25 frameshift 1
12_114385553_114385553_C_A c.678G>T p.Lys226Asn missense 1
12_114399633_114399633_C_G c.243-1G>C None splice acceptor 1
12_114401873_114401874_TA_T c.194del p.Leu65GlnfsTer10 frameshift 1
12_114399594_114399594_A_C c.281T>G p.Leu94Arg missense 1
12_114399622_114399622_G_T c.253C>A p.Pro85Thr missense 1
12_114394743_114394746_TGTG_T c.658_660del p.His220del inframe deletion 1
12_114394820_114394820_C_G c.584G>C p.Gly195Ala missense 1
12_114401846_114401846_C_G c.222G>C p.Met74Ile missense 1
12_114366241_114366242_CT_C c.905del p.Gln302ArgfsTer92 frameshift 1
12_114355784_114355785_CA_C c.1304del p.Leu435ArgfsTer147 frameshift 1
12_114398626_114398627_CG_C c.456del p.Val153SerfsTer21 frameshift 1
12_114403859_114403859_G_T c.40C>A p.Pro14Thr missense 1
12_114399625_114399629_ACATC_A c.246_249del p.Met83PhefsTer6 frameshift 1
12_114394817_114394817_G_C c.587C>G p.Ser196Ter stop gained 1
12_114401921_114401921_C_G c.148-1G>C None splice acceptor 1

Prepare genotype and phenotype predicates

We will create a predicate to bin patients into group depending on presence of a single allele of a missense or frameshift variant to test if there is a difference between frameshift and non-frameshift variants in the individuals of the TBX5 cohort.

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate.genotype import VariantPredicates, monoallelic_predicate
>>> gt_predicate = monoallelic_predicate(
...     a_predicate=VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id),
...     b_predicate=VariantPredicates.variant_effect(VariantEffect.FRAMESHIFT_VARIANT, tx_id),
...     names=('Missense', 'Frameshift')
... )
>>> gt_predicate.display_question()
'Allele group: Missense, Frameshift'

Note

There are many other ways to set up a predicate for testing for a GP correlation. See the Predicates section to learn more about building a predicate of interest.

The phenotype grouping is based on presence or absence of an HPO term. We take advantage of the ontology “true path rule” to infer presence of the ancestor terms for all present HPO terms (e.g. presence of Abnormal ventricular septum morphology in an individual annotated with Ventricular septal defect) and exclusion of the descendant terms for all excluded terms (e.g. exclusion of Motor seizure in an individual where Seizure was excluded):

>>> from gpsea.analysis.predicate.phenotype import prepare_predicates_for_terms_of_interest
>>> pheno_predicates = prepare_predicates_for_terms_of_interest(
...     cohort=cohort,
...     hpo=hpo,
...     min_n_of_patients_with_term=2,
... )

By default, GPSEA will perform one hypothesis test for each HPO term used to annotate two or more individuals in the cohort (see min_n_of_patients_with_term=2 above). Testing multiple hypothesis on the same dataset increases the chance of receiving false positive result. However, GPSEA simplifies the application of an appropriate multiple testing correction.

For general use, we recommend using a combination of a Phenotype MTC filter (PhenotypeMtcFilter) with a multiple testing correction. Phenotype MTC filter chooses the HPO terms to test according to several heuristics, which reduce the multiple testing burden and focus the analysis on the most interesting terms (see HPO MTC filter for more info). Then the multiple testing correction, such as Bonferroni or Benjamini-Hochberg, is used to control the family-wise error rate or the false discovery rate. See Multiple-testing correction for more information.

In this example, we will use a combination of the HPO MTC filter (HpoMtcFilter) with Benjamini-Hochberg procedure (mtc_correction='fdr_bh') with a false discovery control level at (mtc_alpha=0.05):

>>> from gpsea.analysis.mtc_filter import HpoMtcFilter
>>> mtc_filter = HpoMtcFilter.default_filter(hpo, term_frequency_threshold=0.2)
>>> mtc_correction = 'fdr_bh'
>>> mtc_alpha = 0.05

Choosing the statistical procedure for assessment of association between genotype and phenotype groups is the last missing piece of the analysis. We will use Fisher Exact Test:

>>> from gpsea.analysis.pcats.stats import FisherExactTest
>>> count_statistic = FisherExactTest()

and we finalize the analysis setup by putting all components together into HpoTermAnalysis:

>>> from gpsea.analysis.pcats import HpoTermAnalysis
>>> analysis = HpoTermAnalysis(
...     count_statistic=count_statistic,
...     mtc_filter=mtc_filter,
...     mtc_correction=mtc_correction,
...     mtc_alpha=mtc_alpha,
... )

Now we can perform the analysis and investigate the results.

>>> result = analysis.compare_genotype_vs_phenotypes(
...     cohort=cohort,
...     gt_predicate=gt_predicate,
...     pheno_predicates=pheno_predicates,
... )
>>> result.total_tests
16

We only tested 16 HPO terms. This is despite the individuals being collectively annotated with 260 direct and indirect HPO terms

>>> len(result.phenotypes)
260

We can show the reasoning behind not testing 244 (260 - 16) HPO terms by exploring the phenotype MTC filtering report.

>>> from gpsea.view import MtcStatsViewer
>>> mtc_viewer = MtcStatsViewer()
>>> mtc_report = mtc_viewer.process(result)
>>> with open('docs/report/tbx5_frameshift_vs_missense.mtc_report.html', 'w') as fh:  
...     _ = fh.write(mtc_report)
Cohort

Phenotype testing report

Phenotype MTC filter: HPO MTC filter

Multiple testing correction: fdr_bh

Performed statistical tests for 16 out of the total of 260 HPO terms.

Using HPO MTC filter, 244 term(s) were omitted from statistical analysis.
Code Reason Count
HMF01 Skipping term with maximum frequency that was less than threshold 0.2 51
HMF02 Skipping term because no genotype has more than one observed HPO count 3
HMF03 Skipping term because of a child term with the same individual counts 1
HMF04 Skipping term because all genotypes have same HPO observed proportions 41
HMF05 Skipping term because one genotype had zero observations 2
HMF06 Skipping term with less than 7 observations (not powered for 2x2) 102
HMF08 Skipping general term 44

and these are the tested HPO terms ordered by the p value corrected with the Benjamini-Hochberg procedure:

>>> from gpsea.view import summarize_hpo_analysis
>>> summary_df = summarize_hpo_analysis(hpo, result)
>>> summary_df.to_csv('docs/report/tbx5_frameshift_vs_missense.csv')  
TBX5 frameshift vs missense

Allele group

Missense

Missense

Frameshift

Frameshift

Count

Percent

Count

Percent

Corrected p values

p values

Ventricular septal defect [HP:0001629]

31/60

52%

19/19

100%

0.0008990549794102921

5.6190936213143254e-05

Abnormal atrioventricular conduction [HP:0005150]

0/22

0%

3/3

100%

0.003478260869565217

0.00043478260869565214

Atrioventricular block [HP:0001678]

0/22

0%

2/2

100%

0.01932367149758454

0.0036231884057971015

Absent thumb [HP:0009777]

12/71

17%

14/31

45%

0.022514040627000236

0.005628510156750059

Patent ductus arteriosus [HP:0001643]

3/37

8%

2/2

100%

0.04318488529014845

0.01349527665317139

Triphalangeal thumb [HP:0001199]

13/72

18%

13/32

41%

0.06827099717235872

0.02560162393963452

Cardiac conduction abnormality [HP:0031546]

14/36

39%

3/3

100%

0.17007174901911745

0.07440639019586388

Secundum atrial septal defect [HP:0001684]

14/35

40%

4/22

18%

0.2849045166157176

0.1424522583078588

Muscular ventricular septal defect [HP:0011623]

6/59

10%

6/25

24%

0.29999225997208373

0.1687456462342971

Pulmonary arterial hypertension [HP:0002092]

4/6

67%

0/2

0%

0.6857142857142857

0.42857142857142855

Hypoplasia of the ulna [HP:0003022]

1/12

8%

2/10

20%

0.831168831168831

0.5714285714285713

Hypoplasia of the radius [HP:0002984]

30/62

48%

6/14

43%

1.0

0.7735491022101784

Short thumb [HP:0009778]

11/41

27%

8/30

27%

1.0

1.0

Atrial septal defect [HP:0001631]

42/44

95%

20/20

100%

1.0

1.0

Absent radius [HP:0003974]

7/32

22%

6/25

24%

1.0

1.0

Short humerus [HP:0005792]

7/17

41%

4/9

44%

1.0

1.0

We see that several HPO terms are significantly associated with presence of a frameshift variant in TBX5. For example, Ventricular septal defect was observed in 31/60 (52%) patients with a missense variant but it was observed in 19/19 (100%) patients with a frameshift variant. Fisher exact test computed a p value of ~0.0000562 and the p value corrected by Benjamini-Hochberg procedure is ~0.000899.