gpsea.analysis.mtc_filter package

The gpsea.analysis.mtc_filter package provides strategies for reducing the multiple testing burden by pre-filtering the phenotype terms before testing.

See the MTC filters section for more info.

class gpsea.analysis.mtc_filter.PhenotypeMtcFilter[source]

Bases: Generic[P]

PhenotypeMtcFilter decides which phenotypes should be tested and which phenotypes are not worth testing in order to reduce the multiple testing burden.

Note that the filter works only when phenotypes are represented by HPO terms, so the expected input consists of TermId items. For instance, n_usable is a mapping from an HPO term to an int with the count of the patients categorized according to the HPO term.

PhenotypeMtcFilter.OK is returned for HPO terms that pass MTC filtering and should be included in the analysis.

OK = PhenotypeMtcResult(status=True, issue=None): The MTC result for the phenotypes that pass the filtering and should be included in the analysis.

abstractmethod filter(gt_clf: GenotypeClassifier, pheno_clfs: Sequence[PhenotypeClassifier[P]], counts: Sequence[DataFrame], cohort_size: int) → Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:

gt_clf – the classifier that produced the columns of the counts data frames.
pheno_clfs – the phenotype classifiers that produced the rows of the counts data frames.
counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.
cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

abstractmethod filter_method_name() → str[source]: Get a str with the MTC filter name to display for humans.

abstractmethod possible_results() → Collection[PhenotypeMtcResult][source]: Return all possible result types which the PhenotypeMtcFilter can produce.

class gpsea.analysis.mtc_filter.PhenotypeMtcResult(status: bool, issue: PhenotypeMtcIssue | None)[source]

Bases: object

PhenotypeMtcResult represents a result of PhenotypeMtcFilter for a single phenotype.

The phenotype can either pass the filter and be included in the downstream analysis (is_passed()), or be filtered out (is_filtered_out()), in which case mtc_issue with more context regarding the culprit must be available.

static fail(code: str, reason: str, doclink: str | None = None) → PhenotypeMtcResult[source]

is_filtered_out() → bool[source]

is_passed() → bool[source]

property mtc_issue: PhenotypeMtcIssue | None

static ok() → PhenotypeMtcResult[source]

property reason: str | None
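The pass/fail contract of PhenotypeMtcResult can be pictured with a small stand-in. This is a minimal sketch that does not import gpsea: Issue and Result are hypothetical classes that only mirror the documented member names (ok, fail, is_passed, is_filtered_out, mtc_issue, reason). That reason is None for a passing result is an assumption based on its str | None type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Issue:
    """Hypothetical stand-in for PhenotypeMtcIssue."""
    code: str
    reason: str
    doclink: Optional[str] = None


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for PhenotypeMtcResult."""
    status: bool
    issue: Optional[Issue]

    @staticmethod
    def ok() -> "Result":
        # A passing result carries no issue.
        return Result(status=True, issue=None)

    @staticmethod
    def fail(code: str, reason: str, doclink: Optional[str] = None) -> "Result":
        # A filtered-out result always carries an issue with context.
        return Result(status=False, issue=Issue(code, reason, doclink))

    def is_passed(self) -> bool:
        return self.status

    def is_filtered_out(self) -> bool:
        return not self.status

    @property
    def mtc_issue(self) -> Optional[Issue]:
        return self.issue

    @property
    def reason(self) -> Optional[str]:
        # Assumption: no reason is reported for a passing result.
        return None if self.issue is None else self.issue.reason


passed = Result.ok()
skipped = Result.fail(code="ST1", reason="Non-specified term")
```

In this sketch, a caller would branch on is_passed() and consult mtc_issue only for results that were filtered out.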

class gpsea.analysis.mtc_filter.PhenotypeMtcIssue(code: str, reason: str, doclink: str | None)[source]

Bases: object

The container for data available regarding the reason why a phenotype was filtered out.

code: str: A str with a unique code of the issue.

reason: str: A human-friendly explanation of the issue.

doclink: str | None: An URL of the documentation for the issue.

class gpsea.analysis.mtc_filter.UseAllTermsMtcFilter[source]

Bases: PhenotypeMtcFilter[Any]

UseAllTermsMtcFilter filters out no phenotype terms.

See the Use all terms section for more info.
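The shape of the filter interface, and what "filters out no terms" means in practice, can be sketched without gpsea. MtcFilterSketch and UseAllTermsSketch below are hypothetical stand-ins that mirror the three documented methods; the string "ok" stands in for PhenotypeMtcFilter.OK, and the filter name is made up for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Collection, Sequence


class MtcFilterSketch(ABC):
    """Hypothetical stand-in mirroring the documented abstract methods."""

    OK = "ok"  # stand-in for PhenotypeMtcFilter.OK

    @abstractmethod
    def filter(
        self,
        gt_clf: Any,
        pheno_clfs: Sequence[Any],
        counts: Sequence[Any],
        cohort_size: int,
    ) -> Sequence[str]:
        ...

    @abstractmethod
    def filter_method_name(self) -> str:
        ...

    @abstractmethod
    def possible_results(self) -> Collection[str]:
        ...


class UseAllTermsSketch(MtcFilterSketch):
    """Passes every phenotype, so nothing is removed before testing."""

    def filter(self, gt_clf, pheno_clfs, counts, cohort_size):
        # One OK result per phenotype classifier, in input order.
        return [self.OK for _ in pheno_clfs]

    def filter_method_name(self) -> str:
        return "All terms (sketch)"  # hypothetical display name

    def possible_results(self):
        return (self.OK,)


results = UseAllTermsSketch().filter(
    gt_clf=None,
    pheno_clfs=["term A", "term B", "term C"],
    counts=[None, None, None],
    cohort_size=10,
)
```

The result sequence has one entry per phenotype classifier, which is the invariant any filter following this interface would be expected to keep.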

filter(gt_clf: GenotypeClassifier, pheno_clfs: Sequence[PhenotypeClassifier[P]], counts: Sequence[DataFrame], cohort_size: int) → Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:

gt_clf – the classifier that produced the columns of the counts data frames.
pheno_clfs – the phenotype classifiers that produced the rows of the counts data frames.
counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.
cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

filter_method_name() → str[source]: Get a str with the MTC filter name to display for humans.

possible_results() → Collection[PhenotypeMtcResult][source]: Return all possible result types which the PhenotypeMtcFilter can produce.

class gpsea.analysis.mtc_filter.SpecifiedTermsMtcFilter(terms_to_test: Iterable[str | TermId])[source]

Bases: PhenotypeMtcFilter[TermId]

SpecifiedTermsMtcFilter limits the HPO terms to be tested to a selection of provided terms.

In cases where we have a hypothesis about which phenotypes are relevant for testing genotype-phenotype correlations, we can pass the corresponding terms to the constructor of this class, thereby preventing other terms from being tested and reducing the multiple testing burden.

See the Specified terms section for more info.

NON_SPECIFIED_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='ST1', reason='Non-specified term', doclink='https://monarch-initiative.github.io/gpsea/stable/user-guide/analyses/mtc.html#specified-terms-mt-filter'))

The MTC filtering result returned when an HPO term does not belong among the selection of terms to be tested.

Parameters:

terms_to_test – an iterable of CURIE str or TermId items representing the terms to test.
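The selection logic amounts to a membership test. The following is a minimal sketch of that idea and not the library's implementation: filter_specified_terms is a hypothetical helper, results are plain (passed, issue_code) tuples instead of PhenotypeMtcResult objects, and terms are CURIE strings instead of TermId items. Only the 'ST1' code is taken from NON_SPECIFIED_TERM above.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def filter_specified_terms(
    terms_to_test: Iterable[str],
    candidate_terms: Sequence[str],
) -> List[Tuple[bool, Optional[str]]]:
    """Return a (passed, issue_code) pair per candidate, in input order."""
    selected = frozenset(terms_to_test)
    return [
        (True, None) if term in selected else (False, "ST1")
        for term in candidate_terms
    ]


results = filter_specified_terms(
    terms_to_test=["HP:0001250", "HP:0001166"],
    candidate_terms=["HP:0001250", "HP:0000118", "HP:0001166"],
)

# Only the terms that pass contribute to the multiple testing burden.
n_tests = sum(passed for passed, _ in results)
```

Here three candidate terms shrink to two tests, which is how the filter reduces the number of hypotheses the correction has to account for.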

filter(gt_clf: GenotypeClassifier, pheno_clfs: Sequence[PhenotypeClassifier[P]], counts: Sequence[DataFrame], cohort_size: int) → Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:

gt_clf – the classifier that produced the columns of the counts data frames.
pheno_clfs – the phenotype classifiers that produced the rows of the counts data frames.
counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.
cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

filter_method_name() → str[source]: Get a str with the MTC filter name to display for humans.

possible_results() → Collection[PhenotypeMtcResult][source]: Return all possible result types which the PhenotypeMtcFilter can produce.

property terms_to_test: Collection[TermId]

static verify_term_id(val: str | TermId) → TermId[source]

class gpsea.analysis.mtc_filter.IfHpoFilter(hpo: MinimalOntology, annotation_frequency_threshold: float, general_hpo_terms: Iterable[TermId])[source]

Bases: PhenotypeMtcFilter[TermId]

IfHpoFilter decides which phenotypes should be tested and which phenotypes are not worth testing.

The class leverages a number of heuristics and domain decisions. See the Independent filtering for HPO section for more info.

We recommend creating an instance using the default_filter() static factory method.

SAME_COUNT_AS_THE_ONLY_CHILD = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF03', reason='Skipping term because of a child term with the same individual counts', doclink='https://monarch-initiative.github.io/gpsea/stable/user-guide/analyses/mtc.html/#skip-terms-if-all-counts-are-identical-to-counts-for-a-child-term'))

SKIPPING_GENERAL_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF08', reason='Skipping general term', doclink='https://monarch-initiative.github.io/gpsea/stable/user-guide/analyses/mtc.html/#skipping-general-level-terms'))

SKIPPING_NON_PHENOTYPE_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF07', reason='Skipping non phenotype term', doclink='https://monarch-initiative.github.io/gpsea/stable/user-guide/analyses/mtc.html/#skipping-terms-that-are-not-descendents-of-phenotypic-abnormality'))

SKIPPING_SINCE_ONE_GENOTYPE_HAD_ZERO_OBSERVATIONS = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF05', reason='Skipping term because one genotype had zero observations', doclink='https://monarch-initiative.github.io/gpsea/stable/user-guide/analyses/mtc.html/#skip-term-if-one-of-the-genotype-groups-has-neither-observed-nor-excluded-observations'))
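As an illustration of the HMF05 rule, the sketch below checks whether any genotype group has neither observed nor excluded annotations for a term. It rests on assumptions: the real filter receives pandas data frames, while this stand-in uses a nested mapping counts[phenotype_row][genotype_column], and the helper name and the genotype labels are hypothetical.

```python
from typing import Mapping


def one_genotype_has_no_observations(
    counts: Mapping[str, Mapping[str, int]],
) -> bool:
    """True if some genotype column sums to zero across all phenotype rows."""
    genotypes = {gt for row in counts.values() for gt in row}
    return any(
        sum(row.get(gt, 0) for row in counts.values()) == 0
        for gt in genotypes
    )


# Both genotype groups have usable data for the term.
informative = {
    "observed": {"missense": 8, "other": 3},
    "excluded": {"missense": 2, "other": 7},
}

# The "other" group has neither observed nor excluded annotations.
uninformative = {
    "observed": {"missense": 8, "other": 0},
    "excluded": {"missense": 2, "other": 0},
}
```

A term with counts like uninformative cannot support a comparison between the two genotype groups, so testing it would only add to the multiple testing burden.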

static default_filter(hpo: MinimalOntology, annotation_frequency_threshold: float = 0.4, phenotypic_abnormality: TermId = DefaultTermId(idx=2, value=HP:0000118))[source]

Parameters:

hpo – the Human Phenotype Ontology (HPO).
annotation_frequency_threshold – a float in range (0, 1] with the minimum frequency of annotation in the cohort. For instance, if the cohort consists of 100 individuals, and the term is explicitly observed in 20 and explicitly excluded in 10 individuals, then the annotation frequency is 0.3. The purpose of this threshold is to omit terms for which we simply do not have much data overall. The default threshold is 0.4 (40%).
phenotypic_abnormality – a TermId corresponding to the root of the HPO phenotype hierarchy. Specifying this option should be needed very rarely, if ever.
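The arithmetic behind annotation_frequency_threshold can be worked through in a few lines. This is a sketch under stated assumptions: annotation_frequency is a hypothetical helper, and treating the threshold as inclusive (>=) is an assumption drawn from the phrase "minimum frequency", not a confirmed detail of the library.

```python
def annotation_frequency(n_observed: int, n_excluded: int, cohort_size: int) -> float:
    """Fraction of the cohort with an explicit (observed or excluded) annotation."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return (n_observed + n_excluded) / cohort_size


threshold = 0.4  # the documented default

# The example from the parameter description: 20 observed + 10 excluded of 100.
freq = annotation_frequency(n_observed=20, n_excluded=10, cohort_size=100)

# Assumption: a term is kept when its frequency reaches the threshold.
is_tested = freq >= threshold
```

With 30 of 100 individuals annotated, the frequency falls below the default threshold, so under this reading the term would be omitted from testing.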

filter(gt_clf: GenotypeClassifier, pheno_clfs: Sequence[PhenotypeClassifier[TermId]], counts: Sequence[DataFrame], cohort_size: int) → Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:

gt_clf – the classifier that produced the columns of the counts data frames.
pheno_clfs – the phenotype classifiers that produced the rows of the counts data frames.
counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.
cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

filter_method_name() → str[source]: Get a str with the MTC filter name to display for humans.

static get_number_of_observed_hpo_observations(counts_frame: DataFrame, ph_clf: PhenotypeClassifier[TermId]) → int[source]

static one_genotype_has_zero_hpo_observations(counts: DataFrame, gt_clf: GenotypeClassifier)[source]

possible_results() → Collection[PhenotypeMtcResult][source]: Return all possible result types which the PhenotypeMtcFilter can produce.

class gpsea.analysis.mtc_filter.HpoMtcFilter(hpo: MinimalOntology, annotation_frequency_threshold: float, general_hpo_terms: Iterable[TermId])[source]

Bases: IfHpoFilter

HpoMtcFilter is deprecated and will be removed in 1.0.0.

Use gpsea.analysis.mtc_filter.IfHpoFilter instead.

static default_filter(hpo: MinimalOntology, term_frequency_threshold: float = 0.4, annotation_frequency_threshold: float = 0.4, phenotypic_abnormality: TermId = DefaultTermId(idx=2, value=HP:0000118))[source]

Parameters:

hpo – the Human Phenotype Ontology (HPO).
annotation_frequency_threshold – a float in range (0, 1] with the minimum frequency of annotation in the cohort. For instance, if the cohort consists of 100 individuals, and the term is explicitly observed in 20 and explicitly excluded in 10 individuals, then the annotation frequency is 0.3. The purpose of this threshold is to omit terms for which we simply do not have much data overall. The default threshold is 0.4 (40%).
phenotypic_abnormality – a TermId corresponding to the root of the HPO phenotype hierarchy. Specifying this option should be needed very rarely, if ever.