Multiple-testing correction

Background

A p-value is the probability that a test result, under the null hypothesis, assumes the observed or a more extreme value. It is important to realize that if we perform many tests, we are likely to get a “significant” result by chance alone. For instance, if we test a null hypothesis that is true using a significance level of \(\alpha = 0.05\), then there is a probability of \(1-\alpha = 0.95\) of arriving at a correct conclusion of non-significance. If we now test two independent true null hypotheses, the probability that neither test is significant is \(0.95\times 0.95 = 0.90.\) If we test 20 independent null hypotheses, the probability that none will be significant is then \((0.95)^{20}=0.36\). This corresponds to a probability of \(1-0.36=0.64\) of getting at least one spurious significant result, and the expected number of spurious significant results in 20 tests is \(20\times 0.05=1\). If we perform 100 such tests, the probability that none will be significant is \((0.95)^{100}=0.01\) and there is a 99% probability of getting at least one significant result.

Implementation in GPSEA

By default, GPSEA performs a hypothesis test for each HPO term found at least twice in the cohort, meaning that we may perform up to hundreds of tests. Therefore, unless we take into account the fact that multiple statistical tests are being performed, it is likely that we will obtain one or more false-positive results.

GPSEA offers two approaches to mitigate this problem: multiple-testing correction (MTC) procedures and MTC filters to choose the terms to be tested.

Multiple-testing correction procedures

A number of MTC procedures have been developed to limit the probability of false-positive results. The MTC procedures differ in complexity, in their assumptions about the data, and in the type of control they provide.

The GPSEA package uses the Python package statsmodels to implement MTC. See the documentation for details; the following table shows allowable options.

MTC procedure

abbreviation

bonferroni

b

sidak

s

holm-sidak

hs

holm

h

simes-hochberg

sh

hommel

ho

fdr_bh

fdr_by

fdr_tsbh

fdr_tsbky

fdr_gbs

The oldest and simplest MTC procedure is the Bonferroni correction (bonferroni). The Bonferroni procedure thus provides control of the family-wise error rate (FWER), which is the probability of at least one Type I error. The Bonferroni method multiplies the p-value returned by each test (which is call the nominal p-value) by the number of tests performed (the result is capped at 1.0).

Alternatively, procedures that control the false-discovery rate (FDR), limit the proportion of significant results that are type I errors (false discoveries). The Benjamini and Hochberg method (fdr_bh) is probably the most commonly used one. This is the default method in GPSEA.

To set an alternative MTC procedure, we use the mtc_correction option when creating an instance of HpoTermAnalysis:

>>> from gpsea.analysis.mtc_filter import UseAllTermsMtcFilter
>>> from gpsea.analysis.pcats import HpoTermAnalysis
>>> from gpsea.analysis.pcats.stats import FisherExactTest
>>> analysis = HpoTermAnalysis(
...     count_statistic=FisherExactTest(),
...     mtc_filter=UseAllTermsMtcFilter(),
...     mtc_correction='bonferroni',  #      <--- The MTC correction setup
... )

MTC filters: Choosing which terms to test

We can reduce the overall MTC burden by choosing which terms to test. For example, if we choose to test only ten terms out of 450, then the mutliplication factor of the Bonferroni correction is only 10 instead of 450, and more p-values may “survive” the multiple-testing correction.

In the context of GPSEA, we represent the concept of phenotype filtering by PhenotypeMtcFilter. The filter must be chosen before the MultiPhenotypeAnalysis, such as HpoTermAnalysis, is run:

>>> from gpsea.analysis.pcats import HpoTermAnalysis
>>> analysis = HpoTermAnalysis()  
Traceback (most recent call last):
  ...
TypeError: HpoTermAnalysis.__init__() missing 2 required positional arguments: 'count_statistic' and 'mtc_filter'

Note the missing mtc_filter option.

We describe the three filtering strategies in the following sections.

Test all terms

The first MTC filtering strategy is the simplest - do not apply any filtering at all. This will result in testing all terms. We do not recommend this strategy, but it can be useful to disable MTC filtering.

The strategy is implemented in UseAllTermsMtcFilter.

>>> from gpsea.analysis.mtc_filter import UseAllTermsMtcFilter
>>> use_all = UseAllTermsMtcFilter()

Specify terms strategy

In presence of a specific hypothesis as to which terms may be different between groups, then you can specify these terms in SpecifiedTermsMtcFilter.

For example if we want to specifically test Abnormal putamen morphology (HP:0031982) and Abnormal caudate nucleus morphology (HP:0002339) we pass an iterable (e.g. a tuple) with these two terms as an argument:

>>> from gpsea.analysis.mtc_filter import SpecifiedTermsMtcFilter
>>> specified_terms = SpecifiedTermsMtcFilter(
...     terms_to_test=(
...         "HP:0031982",  # Abnormal putamen morphology
...         "HP:0002339",  # Abnormal caudate nucleus morphology
...     )
... )

HPO MTC filter strategy

Last, the HPO MTC strategy involves making several domain judgments to take advantage of the HPO structure. The strategy needs access to HPO:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')

and it is implemented in the HpoMtcFilter class:

>>> from gpsea.analysis.mtc_filter import HpoMtcFilter
>>> hpo_mtc = HpoMtcFilter.default_filter(
...     hpo=hpo,
...     term_frequency_threshold=0.2,
... )

We use static constructor default_filter() for creating HpoMtcFilter. The constructor takes a threshold as an argument (e.g. 20% in the example above) and the method’s logic is made up of 8 individual heuristics designed to skip testing the HPO terms that are unlikely to yield significant or interesting results.

HMF01 - Skip terms that occur very rarely

The term_frequency_threshold determines the mininum proportion of individuals with direct or indirect annotation by the HPO term to test. We check each of the genotype groups (e.g., MISSENSE vs. not-MISSENSE), and we only retain a term for testing if the proportion of individuals in at least one genotype group is greater than or equal to term_frequency_threshold. This is because of our assumption that even if there is statistical significance, if a term is only seen in (for example) 7% of individuals in the MISSENSE group and 2% in the not-MISSENSE group, the term is unlikely to be of great interest because it is rare.

HMF02 - Skip terms if no genotype group has more than one count

In a related heuristic, we skip terms if no genotype group has more than one count. This is not completely redundant with the previous condition, because some terms may have a small number of total observations.

HMF03 - Skip terms if all counts are identical to counts for a child term

Let’s say a term such as Posterior polar cataract (HP:0001115) was observed in 7 of 11 individuals with MISSENSE variants and in 3 of 8 individuals with NONSENSE variants. If we find the same patient counts (7 of 11 and 3 of 8) in the parent term Polar cataract HP:0010696, then we choose to not test the parent term.

This is because the more specific an HPO term is, the more information it has (the more interesting the correlation would be if it exists), and the result of a test, such as the Fisher Exact test, would be exactly the same for Polar cataract as for Posterior polar cataract.

HMF04 - Skip terms if genotypes have same HPO proportions

If both (or all) of the genotype groups have the same proportion of individuals observed to be annotated to an HPO term, e.g., both are 50%, then skip the term, because it is not possible that the Fisher exact test will return a significant result.

HMF05 - Skip term if one of the genotype groups has neither observed nor excluded observations

Skip terms if there are no HPO observations in a group.

HMF06 - Skip term if underpowered for 2x2 or 2x3 analysis

If the individuals are binned into 2 phenotype groups and 2 genotype groups (2x2) and the total count of patients in all genotype-phenotype groups is less than 7, or into 2 phenotype groups and 3 genotype groups (2x3) and the total count of patients is less than 6, then there is a lack even of the nominal statistical power and the counts can never be significant.

HMF07 - Skipping terms that are not descendents of Phenotypic abnormality

The HPO has a number of other branches that describe modes of inheritance, past medical history, and clinical modifiers. We do not think it makes much sense to test for enrichment of these terms, so, all terms that are not descendants of Phenotypic abnormality are filtered out.

HMF08 - Skipping “general” level terms

All the direct children of the root phenotype term Phenotypic abnormality (HP:0000118) are skipped, because of the assumption that if there is a valid signal, it will derive from one of the more specific descendents.

For instance, Abnormality of the nervous system (HP:0000707) is a child of Phenotypic abnormality, and this assumption implies that if there is a signal from the nervous system, it will lead to at least one of the descendents of Abnormality of the nervous system being significant.

See General HPO terms section for details.