Compare measurements

Example analysis

We will analyze 69 individuals reported by Xu et al. (2019). The reported data includes the CYP21A2 mutations, the lab measurement, and other clinical signs and symptoms encoded as HPO terms.

TODO: wordsmith

Load cohort

For the purpose of this analysis, we will load the Cohort from a JSON file. The cohort was prepared from phenopackets as described in Create a cohort section, and then serialized as a JSON file following the instructions in Persist the cohort for later section.

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/CYP21A2.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
69

Configure analysis

MANE transcript of CYP21A2.

>>> tx_id = 'NM_000500.9'

Genotype predicate

For now, just Missense vs rest. Nonsense variants + big SVs, Missense should NOT be severe. TODO - create real predicate.

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate import variant_effect
>>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=tx_id)
>>> is_missense.description
'MISSENSE_VARIANT on NM_000500.9'

Assuming AR inheritance, we compare missense vs. rest:

>>> from gpsea.analysis.clf import biallelic_classifier
>>> gt_clf = biallelic_classifier(
...     a_predicate=is_missense,
...     b_predicate=~is_missense,
...     a_label="Missense", b_label="Other",
...     partitions=({0,}, {1, 2}),
... )
>>> gt_clf.class_labels
('Missense/Missense', 'Missense/Other OR Other/Other')

Phenotype score

We investigate an association between a measurement value and a genotype group.

We use the measurement of Testosterone [Mass/volume] in Serum or Plasma (LOINC:2986-8).

>>> from gpsea.analysis.pscore import MeasurementPhenotypeScorer
>>> testosterone = 'LOINC:2986-8'
>>> pheno_scorer = MeasurementPhenotypeScorer.from_measurement_id(
...     term_id=testosterone,
...     label="Testosterone [Mass/volume] in Serum or Plasma",
... )
>>> pheno_scorer.description
'Value of Testosterone [Mass/volume] in Serum or Plasma [LOINC:2986-8]'

Statistical test

We will use T test to test for differences between scores of the different genotype groups

>>> from gpsea.analysis.pscore.stats import TTestStatistic
>>> score_statistic = TTestStatistic()

Final analysis

We will put the final analysis together into PhenotypeScoreAnalysis.

>>> from gpsea.analysis.pscore import PhenotypeScoreAnalysis
>>> score_analysis = PhenotypeScoreAnalysis(
...     score_statistic=score_statistic,
... )

Analysis

We execute the analysis by running

>>> result = score_analysis.compare_genotype_vs_phenotype_score(
...     cohort=cohort,
...     gt_clf=gt_clf,
...     pheno_scorer=pheno_scorer,
... )

>>> result.pval
0.741216622359659

Show data frame with scores

>>> scores = result.data.sort_index()
>>> scores.head()  
                                            genotype  phenotype
patient_id
individual 10[PMID_30968594_individual_10]         1      614.0
individual 11[PMID_30968594_individual_11]         1      630.0
individual 12[PMID_30968594_individual_12]         1        NaN
individual 13[PMID_30968594_individual_13]         1      303.0
individual 14[PMID_30968594_individual_14]         1      664.0

Prepare genotype category legend:

>>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_clf.get_categorizations()}
>>> gt_id_to_name
{0: 'Missense/Missense', 1: 'Missense/Other OR Other/Other'}

TODO: wordsmith & finish!