Tutorial

This tutorial shows how to load an example Phenopacket cohort and perform a genotype-phenotype analysis.

Set up analysis

genophenocorr needs the Human Phenotype Ontology (HPO) to run the analysis. Let’s load the ontology:

>>> import hpotk
>>> hpo = hpotk.load_minimal_ontology('data/hp.toy.json')

Tip

Use the latest HPO release, which is available at http://purl.obolibrary.org/obo/hp.json

Prepare samples

Now we need some samples. To keep things simple in this tutorial, we will use a toy cohort that is shipped with the package:

>>> from genophenocorr.data import get_toy_cohort
>>> cohort = get_toy_cohort()

See also

See the Input data section to learn how to prepare your data for the analysis.

We can then inspect the cohort using its list methods:

>>> sorted(cohort.list_all_patients())
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
>>> sorted(cohort.list_all_phenotypes())
[('HP:0001166', 14), ('HP:0001250', 20), ('HP:0001257', 17)]
>>> sorted(cohort.list_all_variants())
[('HetVar1', 13), ('HetVar2', 11), ('HomVar1', 3), ('HomVar2', 2)]
>>> sorted(cohort.list_all_proteins())
[('NP_09876.5', 26)]
>>> tx_dict = cohort.list_data_by_tx('NM_1234.5')
>>> sorted(tx_dict['NM_1234.5'].items())
[('frameshift_variant', 2), ('missense_variant', 2)]
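
As a small illustration of working with these counts, the sketch below converts the phenotype counts into cohort frequencies. It uses plain Python on literal values copied from the output above (26 patients, three HPO terms); the variable names are illustrative and not part of the genophenocorr API.

```python
# Values copied from the listings above; in practice they would come from
# cohort.list_all_patients() and cohort.list_all_phenotypes().
n_patients = 26
phenotype_counts = [('HP:0001166', 14), ('HP:0001250', 20), ('HP:0001257', 17)]

# Fraction of the cohort annotated with each HPO term.
frequencies = {term_id: count / n_patients for term_id, count in phenotype_counts}

# Print the terms from the most to the least frequent.
for term_id, freq in sorted(frequencies.items(), key=lambda item: item[1], reverse=True):
    print(f"{term_id}: {freq:.1%}")
```

Frequencies like these can help decide whether a term is observed often enough in the cohort to be worth testing.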

With these counts in hand, we can choose which analyses to run. For instance, we can partition the patients into two groups based on the presence or absence of a frameshift variant:

>>> from genophenocorr.analysis import CohortAnalysis
>>> from genophenocorr.constants import VariantEffect
>>> cohort_analysis = CohortAnalysis(cohort, 'NM_1234.5', hpo, include_unmeasured=False)
>>> frameshift = cohort_analysis.compare_by_variant_type(VariantEffect.FRAMESHIFT_VARIANT)
>>> frameshift 
                            With frameshift_variant         Without frameshift_variant
                                              Count Percent                      Count Percent  p-value Corrected p-values
HP:0001166 (Arachnodactyly)                       4  30.77%                         10  76.92%  0.04718            0.14154
HP:0001250 (Seizure)                             11  84.62%                          9  69.23%  0.64472            1.00000
HP:0001257 (Spasticity)                           8  61.54%                          9  69.23%  1.00000            1.00000

We can also partition the patients based on the presence or absence of a missense variant:

>>> missense = cohort_analysis.compare_by_variant_type(VariantEffect.MISSENSE_VARIANT)
>>> missense 
                            With missense_variant         Without missense_variant
                                            Count Percent                    Count Percent   p-value Corrected p-values
HP:0001166 (Arachnodactyly)                    13  81.25%                        1  10.00%  0.000781           0.002342
HP:0001257 (Spasticity)                        11  68.75%                        6  60.00%  0.692449           1.000000
HP:0001250 (Seizure)                           12  75.00%                        8  80.00%  1.000000           1.000000

Each table lists the HPO terms that annotate the cohort members. For each term, it reports the count and percentage of patients in each genotype group, together with the p value and the p value corrected for multiple testing. The rows are sorted by p value in ascending order.
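
To make the p values less of a black box, the sketch below recomputes the Arachnodactyly rows by hand. It assumes a two-sided Fisher's exact test followed by a Bonferroni correction over the three tested terms. The tutorial does not name the tests that genophenocorr uses, so this is an assumption, and `fisher_exact_two_sided` is a hypothetical helper written for this example rather than a library function. The group sizes (13 and 13 for frameshift, 16 and 10 for missense) are derived from the counts and percentages in the tables.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of seeing k in the top-left cell
        # when the row and column totals are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in map(prob, range(k_min, k_max + 1))
        if p <= observed * (1 + 1e-7)
    )


n_tests = 3  # three HPO terms were tested in each table

# Arachnodactyly vs. frameshift: 4 of 13 patients with, 10 of 13 without.
p_frameshift = fisher_exact_two_sided(4, 13 - 4, 10, 13 - 10)
# Arachnodactyly vs. missense: 13 of 16 patients with, 1 of 10 without.
p_missense = fisher_exact_two_sided(13, 16 - 13, 1, 10 - 1)

for label, p in [("frameshift", p_frameshift), ("missense", p_missense)]:
    corrected = min(1.0, p * n_tests)  # Bonferroni correction (assumed)
    print(f"{label}: p={p:.6f}, corrected={corrected:.6f}")
```

Under these assumptions the rounded results agree with the Arachnodactyly rows of both tables (0.04718 and 0.14154 for frameshift, 0.000781 and 0.002342 for missense). The agreement suggests, but does not confirm, that this is how the values were computed.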