Survival analysis

Example analysis

We will analyze the time until end stage renal disease in 207 individuals with mutations in UMOD. Specifically, we will test for difference between the onset of the end stage renal disease in the individuals with mutation in exon 3 of UMOD vs. individuals with other UMOD mutation.

Load cohort

For the purpose of this analysis, we will load the Cohort from a JSON file. The cohort was prepared from phenopackets as described in Create a cohort section, and then serialized as a JSON file following the instructions in Persist the cohort for later section.

>>> import json
>>> from gpsea.io import GpseaJSONDecoder
>>> fpath_cohort_json = 'docs/cohort-data/UMOD.0.1.20.json'
>>> with open(fpath_cohort_json) as fh:
...     cohort = json.load(fh, cls=GpseaJSONDecoder)
>>> len(cohort)
207

Configure analysis

MANE transcript of UMOD.

>>> tx_id = 'NM_003361.4'

Genotype classifier

One allele of exon 3 vs. one allele of elsewhere.

>>> from gpsea.analysis.predicate import exon
>>> is_in_exon3 = exon(exon=3, tx_id=tx_id)
>>> is_in_exon3.description
'overlaps with exon 3 of NM_003361.4'

Monoallelic classifier to compare one allele of UMOD exon 3 variant versus one allele of other UMOD variant:

>>> from gpsea.analysis.clf import monoallelic_classifier
>>> gt_clf = monoallelic_classifier(
...     a_predicate=is_in_exon3,
...     b_predicate=~is_in_exon3,
...     a_label="Exon 3", b_label="Other",
... )
>>> gt_clf.class_labels
('Exon 3', 'Other')

Survival endpoint

The endpoint of our study is defined as development of end stage renal disease. In the UMOD cohort, this is encoded with Stage 5 chronic kidney disease (HP:0003774) HPO term. We need to leverage the HPO hierarchy when computing the onset of an HPO term. Let’s load HPO:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')

and now we can create an Endpoint to compute the time until an individual develops end stage renal disease:

>>> from gpsea.analysis.temporal.endpoint import hpo_onset
>>> term_id = "HP:0003774"  # Stage 5 chronic kidney disease
>>> endpoint = hpo_onset(hpo=hpo, term_id=term_id)
>>> endpoint.description
'Compute time until onset of Stage 5 chronic kidney disease'

Statistical test

We will use Log rank test to compare the age until the endpoint between the genotype groups:

>>> from gpsea.analysis.temporal.stats import LogRankTest
>>> survival_statistic = LogRankTest()

Final analysis

We will put the final analysis together into a SurvivalAnalysis.

>>> from gpsea.analysis.temporal import SurvivalAnalysis
>>> survival_analysis = SurvivalAnalysis(
...     statistic=survival_statistic,
... )

Analysis

We execute the analysis by running

>>> result = survival_analysis.compare_genotype_vs_survival(
...     cohort=cohort,
...     gt_clf=gt_clf,
...     endpoint=endpoint,
... )

>>> result.pval
0.062004258300...

Kaplan-Meier curves

We can plot Kaplan-Meier curves:

>>> from gpsea.model import Age
>>> import matplotlib as mpl
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(figsize=(6, 4), dpi=120)
>>> result.plot_kaplan_meier_curves(
...     ax=ax,
... )
>>> _ = ax.xaxis.set(
...     # Show X axis in years ...
...     major_formatter=mpl.ticker.FuncFormatter(lambda x, pos: f"{x / Age.DAYS_IN_YEAR:.0f}"),
...     # ... with a tick for every decade
...     major_locator=mpl.ticker.MultipleLocator(10 * Age.DAYS_IN_YEAR),
... )
>>> _ = ax.set(
...     xlabel=endpoint.name + " [years]",
...     ylabel="Event-free proportion",
... )
>>> _ = ax.grid(axis="y")

Raw data

The result includes the survival values for all cohort members:

>>> survivals = result.data.sort_index()
>>> survivals.head()
                          genotype    phenotype
patient_id
AII.1[PMID_22034507_AII_1]       0    Survival(value=18262.5, is_censored=True)
AII.2[PMID_22034507_AII_2]       0    None
AII.3[PMID_22034507_AII_3]       0    Survival(value=16436.25, is_censored=True)
AII.5[PMID_22034507_AII_5]       0    Survival(value=22280.25, is_censored=False)
AIII.4[PMID_22034507_AIII_4]     0    Survival(value=19723.5, is_censored=False)

Each line corresponeds to an individual and the dataframe is indexed by the individual’s identifier/label. The genotype column contains the genotype class code, and phenotype column includes a Survival value or None if computing the survival was impossible (see hpo_onset() for details). The Survival reports the number of days until attaining the endpoint, here defined as end stage renal disease (is_censored=False), or until the individual dropped out of the analysis (is_censored=True).

Troubleshooting

Sometimes the survival analysis fails and an AnalysisException is raised. For instance, the current Logrank test implementation reports a p value of NaN if the survival is the same for all individuals. This is unlikely an expected outcome, therefore GPSEA raises an AnalysisException to force the user to troubleshoot.

To help with troubleshooting, the data computed prior detecting the error is included in the exception’s data attribute. In survival analysis, the data should include the identifiers, genotype classes, and survivals of the tested individuals.

Let’s show this on an example. We will create a toy cohort of 10 individuals with onset of Lynch syndrome I (OMIM:120435) at 40 years.

>>> from gpsea.model import Cohort, Patient, Disease, Age
>>> onset = Age.from_iso8601_period("P40Y")
>>> individuals = [
...     Patient.from_raw_parts(
...         labels=label,
...         diseases=(
...             Disease.from_raw_parts(
...                 term_id="OMIM:120435",
...                 name="Lynch syndrome I",
...                 is_observed=True,
...                 onset=onset,
...             ),
...         ),
...     )
...     for label in "ABCDEFGHIJ"  # 10 individuals
... ]
>>> cohort = Cohort.from_patients(individuals)

We will assign them into genotype classes on random, …

>>> from gpsea.analysis.clf import random_classifier
>>> gt_clf = random_classifier(seed=123)
>>> gt_clf.description
'Classify the individual into random classes'

… using the Lynch syndrome I diagnosis as the endpoint …

>>> from gpsea.analysis.temporal.endpoint import disease_onset
>>> endpoint = disease_onset(disease_id="OMIM:120435")
>>> endpoint.description
'Compute time until OMIM:120435 onset'

… and we will use Logrank test for differences in survival.

>>> from gpsea.analysis.temporal.stats import LogRankTest
>>> survival_statistic = LogRankTest()

We put together the survival analysis …

>>> from gpsea.analysis.temporal import SurvivalAnalysis
>>> survival_analysis = SurvivalAnalysis(
...     statistic=survival_statistic,
... )

… which we expect to fail with an AnalysisException:

>>> result = survival_analysis.compare_genotype_vs_survival(
...     cohort=cohort,
...     gt_clf=gt_clf,
...     endpoint=endpoint,
... )
Traceback (most recent call last):
  ...
gpsea.analysis._base.AnalysisException: The survival values did not meet the expectation of the statistical test!

The genotype classes and survival values can be retrieved from the exception:

>>> from gpsea.analysis import AnalysisException
>>> try:
...     result = survival_analysis.compare_genotype_vs_survival(
...         cohort=cohort,
...         gt_clf=gt_clf,
...         endpoint=endpoint,
...     )
... except AnalysisException as ae:
...     genotypes = ae.data["genotype"]
...     survivals = ae.data["survival"]

and the values can come in handy in troubleshooting:

>>> genotypes[:3]
(0, 0, 0)
>>> survivals[:3]
(Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False))