`.. _measurement-stat:

Compare measurement values

Example analysis

We will analyze 69 individuals reported by Xu et al. (2019). The reported data includes the CYP21A2 mutations, the lab measurement, and other clinical signs and symptoms encoded as HPO terms.

TODO: wordsmith

Load cohort

For the purpose of this analysis, we will load the Cohort from a JSON file. The cohort was prepared from phenopackets as described in Create a cohort from GA4GH phenopackets section, and then serialized as a JSON file following the instructions in Persist the cohort for later section.

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/CYP21A2.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
69

Configure analysis

MANE transcript of CYP21A2.

>>> tx_id = 'NM_000500.9'

Genotype predicate

For now, just Missense vs rest. Nonsense variants + big SVs, Missense should NOT be severe. TODO - create real predicate.

>>> from gpsea.model import VariantEffect
>>> from gpsea.analysis.predicate.genotype import VariantPredicates
>>> is_missense = VariantPredicates.variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id=tx_id)
>>> is_missense.description
'MISSENSE_VARIANT on NM_000500.9'

Assuming AR inheritance, we compare missense vs. rest:

>>> from gpsea.analysis.predicate.genotype import biallelic_predicate
>>> gt_predicate = biallelic_predicate(
...     a_predicate=is_missense,
...     b_predicate=~is_missense,
...     a_label="Missense", b_label="Other",
...     partitions=({0,}, {1, 2}),
... )
>>> gt_predicate.group_labels
('Missense/Missense', 'Missense/Other OR Other/Other')

Phenotype score

We investigate an association between a measurement value and a genotype group.

We use the measurement of Testosterone [Mass/volume] in Serum or Plasma (LOINC:2986-8).

>>> from gpsea.analysis.pscore import MeasurementPhenotypeScorer
>>> testosterone = 'LOINC:2986-8'
>>> pheno_scorer = MeasurementPhenotypeScorer.from_measurement_id(
...     term_id=testosterone,
...     label="Testosterone [Mass/volume] in Serum or Plasma",
... )
>>> pheno_scorer.description
'Value of Testosterone [Mass/volume] in Serum or Plasma [LOINC:2986-8]'

Statistical test

We will use T test to test for differences between scores of the different genotype groups

>>> from gpsea.analysis.pscore.stats import TTestStatistic
>>> score_statistic = TTestStatistic()

Final analysis

We will put the final analysis together into PhenotypeScoreAnalysis.

>>> from gpsea.analysis.pscore import PhenotypeScoreAnalysis
>>> score_analysis = PhenotypeScoreAnalysis(
...     score_statistic=score_statistic,
... )

Analysis

We execute the analysis by running

>>> result = score_analysis.compare_genotype_vs_phenotype_score(
...     cohort=cohort,
...     gt_predicate=gt_predicate,
...     pheno_scorer=pheno_scorer,
... )
>>> result.pval
0.741216622359659

Show data frame with scores

>>> scores = result.data.sort_index()
>>> scores.head()  
                                            genotype  phenotype
patient_id
individual 10[PMID_30968594_individual_10]         1      614.0
individual 11[PMID_30968594_individual_11]         1      630.0
individual 12[PMID_30968594_individual_12]         1        NaN
individual 13[PMID_30968594_individual_13]         1      303.0
individual 14[PMID_30968594_individual_14]         1      664.0

Prepare genotype category legend:

>>> gt_id_to_name = {c.category.cat_id: c.category.name for c in gt_predicate.get_categorizations()}
>>> gt_id_to_name
{0: 'Missense/Missense', 1: 'Missense/Other OR Other/Other'}

TODO: wordsmith & finish!