gpsea.analysis.mtc_filter package

The gpsea.analysis.mtc_filter package provides strategies for reducing the multiple testing burden by pre-filtering the phenotype terms before they are tested.

See MTC filters section for more info.

class gpsea.analysis.mtc_filter.PhenotypeMtcFilter[source]

Bases: Generic[P]

PhenotypeMtcFilter decides which phenotypes should be tested and which phenotypes are not worth testing in order to reduce the multiple testing burden.

Note that the filter works only when HPO terms are used to represent the phenotype, so the expected input consists of TermId items. For instance, n_usable is a mapping from an HPO term to an int with the count of the patients categorized according to the HPO term.

PhenotypeMtcFilter.OK is returned for HPO terms that pass MTC filtering and should be included in the analysis.

OK = PhenotypeMtcResult(status=True, issue=None)

The MTC result for the phenotypes that pass the filtering and should be included in the analysis.

abstract filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame], cohort_size: int) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

  • cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

abstract possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

abstract filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.
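The three abstract methods above form a small contract: filter returns one result per input phenotype, possible_results enumerates every result the filter can emit, and filter_method_name gives a display name. The sketch below illustrates that contract with stand-in types only. SketchMtcFilter and KeepEverything are hypothetical names, results are plain strings rather than PhenotypeMtcResult, and the filter signature is deliberately simplified, so this is not the gpsea implementation.

```python
import abc
from typing import Collection, Sequence


class SketchMtcFilter(abc.ABC):
    """Hypothetical, simplified stand-in for the PhenotypeMtcFilter contract."""

    @abc.abstractmethod
    def filter(self, phenotypes: Sequence[str]) -> Sequence[str]:
        """Return one result per input phenotype, in input order."""

    @abc.abstractmethod
    def possible_results(self) -> Collection[str]:
        """Return every result this filter can produce."""

    @abc.abstractmethod
    def filter_method_name(self) -> str:
        """Return a human-readable name of the filter."""


class KeepEverything(SketchMtcFilter):
    """A filter that lets every phenotype through (assumed behaviour, for illustration)."""

    def filter(self, phenotypes: Sequence[str]) -> Sequence[str]:
        return tuple("OK" for _ in phenotypes)

    def possible_results(self) -> Collection[str]:
        return ("OK",)

    def filter_method_name(self) -> str:
        return "Keep everything (sketch)"


mtc_filter = KeepEverything()
results = mtc_filter.filter(["HP:0001250", "HP:0001166"])
```

Keeping the output aligned with the input order lets a caller pair each phenotype with its result by position.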

class gpsea.analysis.mtc_filter.PhenotypeMtcResult(status: bool, issue: PhenotypeMtcIssue | None)[source]

Bases: object

PhenotypeMtcResult represents a result of PhenotypeMtcFilter for a single phenotype.

The phenotype can either pass the filter and be included in the downstream analysis (is_passed()), or be filtered out (is_filtered_out()), in which case mtc_issue must be available to provide more context about the cause.

static ok() PhenotypeMtcResult[source]
static fail(code: str, reason: str) PhenotypeMtcResult[source]
is_passed() bool[source]
is_filtered_out() bool[source]
property mtc_issue: PhenotypeMtcIssue | None
property reason: str | None
class gpsea.analysis.mtc_filter.PhenotypeMtcIssue(code: str, reason: str)[source]

Bases: object

The container for data available regarding the reason why a phenotype was filtered out.

code: str

A str with a unique code of the issue.

reason: str

A human-friendly explanation of the issue.
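The relationship between a result and its issue can be sketched with two small dataclasses. Result and Issue below are hypothetical stand-ins that mirror the documented members (ok(), fail(), is_passed(), is_filtered_out()); they are not the gpsea classes, and the validation in __post_init__ is an assumption based on the requirement that a filtered-out phenotype must carry an issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Issue:
    """Hypothetical stand-in for PhenotypeMtcIssue."""
    code: str    # unique code of the issue
    reason: str  # human-friendly explanation


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for PhenotypeMtcResult."""
    status: bool
    issue: Optional[Issue]

    def __post_init__(self) -> None:
        # Assumption: a filtered-out phenotype must explain why it was dropped.
        if not self.status and self.issue is None:
            raise ValueError("a failed result needs an issue")

    @staticmethod
    def ok() -> "Result":
        return Result(status=True, issue=None)

    @staticmethod
    def fail(code: str, reason: str) -> "Result":
        return Result(status=False, issue=Issue(code=code, reason=reason))

    def is_passed(self) -> bool:
        return self.status

    def is_filtered_out(self) -> bool:
        return not self.status


passed = Result.ok()
skipped = Result.fail(code="ST1", reason="Non-specified term")
```

Because the dataclasses are frozen and compare by value, a shared constant such as OK can be reused safely across all phenotypes.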

class gpsea.analysis.mtc_filter.UseAllTermsMtcFilter[source]

Bases: PhenotypeMtcFilter[Any]

UseAllTermsMtcFilter filters out no phenotype terms.

See Test all terms section for more info.

filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame], cohort_size: int) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

  • cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.

class gpsea.analysis.mtc_filter.SpecifiedTermsMtcFilter(terms_to_test: Iterable[TermId | str])[source]

Bases: PhenotypeMtcFilter[TermId]

SpecifiedTermsMtcFilter limits the HPO terms to be tested to a selection of provided terms.

When we have a hypothesis about which phenotypes are relevant for testing genotype-phenotype correlations, we can pass the corresponding terms to the constructor of this class, thereby preventing other terms from being tested and reducing the multiple testing burden.

See Specify terms strategy section for more info.

NON_SPECIFIED_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='ST1', reason='Non-specified term'))

The MTC filtering result returned when an HPO term does not belong among the selection of terms to be tested.

Parameters:

terms_to_test – an iterable of items of CURIE str or TermId representing the terms to test.
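The idea behind this strategy is a simple membership test. The sketch below is a minimal illustration using plain CURIE strings and (passed, reason) tuples in place of TermId and PhenotypeMtcResult; filter_specified_terms is a hypothetical helper, not part of gpsea.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# (passed, reason) pairs stand in for PhenotypeMtcResult in this sketch.
FilterResult = Tuple[bool, Optional[str]]
OK: FilterResult = (True, None)
NON_SPECIFIED_TERM: FilterResult = (False, "ST1: Non-specified term")


def filter_specified_terms(
    tested_terms: Sequence[str],
    terms_to_test: Iterable[str],
) -> List[FilterResult]:
    """Return one result per tested term, in input order."""
    allowed = frozenset(terms_to_test)
    return [OK if term in allowed else NON_SPECIFIED_TERM for term in tested_terms]


results = filter_specified_terms(
    tested_terms=["HP:0001250", "HP:0001166", "HP:0000118"],
    terms_to_test=["HP:0001250"],
)
n_tested = sum(1 for passed, _ in results if passed)
```

Here only one of the three candidate terms remains, so a subsequent multiple testing correction would account for one hypothesis instead of three.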

property terms_to_test: Collection[TermId]
filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame], cohort_size: int) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

  • cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.

static verify_term_id(val: str | TermId) TermId[source]
class gpsea.analysis.mtc_filter.HpoMtcFilter(hpo: MinimalOntology, term_frequency_threshold: float, annotation_frequency_threshold: float, general_hpo_terms: Iterable[TermId])[source]

Bases: PhenotypeMtcFilter[TermId]

HpoMtcFilter decides which phenotypes should be tested and which phenotypes are not worth testing.

The class leverages a number of heuristics and domain decisions. See HPO MT filter strategy section for more info.

We recommend creating an instance using the default_filter() static factory method.

NO_GENOTYPE_HAS_MORE_THAN_ONE_HPO = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF02', reason='Skipping term because no genotype has more than one observed HPO count'))
SAME_COUNT_AS_THE_ONLY_CHILD = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF03', reason='Skipping term because of a child term with the same individual counts'))
SKIPPING_SINCE_ONE_GENOTYPE_HAD_ZERO_OBSERVATIONS = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF05', reason='Skipping term because one genotype had zero observations'))
SKIPPING_NON_PHENOTYPE_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF07', reason='Skipping non phenotype term'))
SKIPPING_GENERAL_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF08', reason='Skipping general term'))
static default_filter(hpo: MinimalOntology, term_frequency_threshold: float = 0.4, annotation_frequency_threshold: float = 0.4, phenotypic_abnormality: TermId = DefaultTermId(idx=2, value=HP:0000118))[source]
Parameters:
  • hpo – HPO

  • term_frequency_threshold – a float in range \((0, 1]\) with the minimum frequency for an HPO term to have in at least one of the genotype groups (e.g., 22% in missense and 3% in nonsense genotypes would be OK, but not 13% missense and 10% nonsense genotypes if the threshold is 0.2). The default threshold is 0.4 (40%).

  • annotation_frequency_threshold – a float in range \((0, 1)\) with the minimum frequency of annotation in the cohort. For instance, if the cohort consists of 100 individuals and the term is explicitly observed in 20 and excluded in 10 of them, then the annotation frequency is 0.3. The purpose of this threshold is to omit terms for which we simply do not have much data overall. The default threshold is 0.4 (40%).

  • phenotypic_abnormality – a TermId corresponding to the root of the HPO phenotype hierarchy. This option should rarely, if ever, need to be specified.
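The two thresholds can be checked by hand with the numbers used in the parameter descriptions. The sketch below is a worked example only: passes_term_frequency and annotation_frequency are hypothetical helpers, and treating a frequency equal to the threshold as passing is an assumption drawn from the word "minimum", not a statement about how HpoMtcFilter compares values internally.

```python
from typing import Sequence


def passes_term_frequency(group_frequencies: Sequence[float], threshold: float) -> bool:
    """True if the term reaches the threshold in at least one genotype group."""
    return max(group_frequencies) >= threshold


def annotation_frequency(n_observed: int, n_excluded: int, cohort_size: int) -> float:
    """Fraction of the cohort with an explicit (observed or excluded) annotation."""
    return (n_observed + n_excluded) / cohort_size


# 22% in missense and 3% in nonsense: one group reaches the 0.2 threshold.
kept = passes_term_frequency([0.22, 0.03], threshold=0.2)

# 13% in missense and 10% in nonsense: no group reaches the 0.2 threshold.
dropped = passes_term_frequency([0.13, 0.10], threshold=0.2)

# 20 observed and 10 excluded among 100 individuals.
freq = annotation_frequency(n_observed=20, n_excluded=10, cohort_size=100)
enough_data = freq >= 0.4  # compared against the default threshold of 0.4
```

With the default annotation threshold of 0.4, the term annotated in 30 of 100 individuals would be omitted for lack of data.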

filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame], cohort_size: int) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

  • cohort_size – the size of the cohort.

Returns:

a sequence of filter results for the input phenotypes.

possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.

static get_number_of_observed_hpo_observations(counts_frame: DataFrame, ph_predicate: PhenotypePolyPredicate[TermId]) int[source]
static get_maximum_group_observed_HPO_frequency(counts_frame: DataFrame, ph_predicate: PhenotypePolyPredicate[TermId]) float[source]
Returns:

The maximum frequency of observed HPO annotations across all genotypes.

static one_genotype_has_zero_hpo_observations(counts: DataFrame, gt_predicate: GenotypePolyPredicate)[source]
static some_cell_has_greater_than_one_count(counts: DataFrame, ph_predicate: PhenotypePolyPredicate[TermId]) bool[source]

If no genotype has more than one observed HPO count, we do not want to perform a test. For instance, if MISSENSE has one observed HPO and N excluded, and NOT MISSENSE has zero or one observed HPO, then we will skip the test.

Parameters:
  • counts – pandas DataFrame with counts

  • ph_predicate – the phenotype predicate that produced the counts

Returns:

True if at least one of the genotypes has more than one observed HPO count.
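The rule can be illustrated without a data frame. The sketch below assumes the observed counts per genotype group have already been extracted into a plain mapping; some_group_has_more_than_one_observed is a hypothetical helper that mirrors the described rule, not the gpsea method itself.

```python
from typing import Mapping


def some_group_has_more_than_one_observed(observed_counts: Mapping[str, int]) -> bool:
    """True if at least one genotype group has more than one observed HPO count."""
    return any(count > 1 for count in observed_counts.values())


# One observation per group at most: the term would be skipped.
skip_case = some_group_has_more_than_one_observed({"MISSENSE": 1, "NOT MISSENSE": 1})

# Three observations in one group: the term remains a candidate for testing.
test_case = some_group_has_more_than_one_observed({"MISSENSE": 3, "NOT MISSENSE": 0})
```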