gpsea.analysis.mtc_filter package

The gpsea.analysis.mtc_filter package provides strategies for reducing the multiple testing burden by pre-filtering the phenotype terms before they are tested.

See MTC filters section for more info.

class gpsea.analysis.mtc_filter.PhenotypeMtcFilter[source]

Bases: Generic[P]

PhenotypeMtcFilter decides which phenotypes should be tested and which phenotypes are not worth testing in order to reduce the multiple testing burden.

Note that the filter works only when HPO terms are used to represent the phenotype. The input is therefore expected to consist of hpotk.TermId items. For instance, n_usable is a mapping from an HPO term to an int with the count of the patients categorized according to the HPO term.

PhenotypeMtcFilter.OK is returned for HPO terms that pass MTC filtering and should be included in the analysis.

OK = PhenotypeMtcResult(status=True, issue=None)

The MTC result for the phenotypes that pass the filtering and should be included in the analysis.

abstract filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame]) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

Returns:

a sequence of filter results for the input phenotypes.

abstract possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

abstract filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.

class gpsea.analysis.mtc_filter.PhenotypeMtcResult(status: bool, issue: PhenotypeMtcIssue | None)[source]

Bases: object

PhenotypeMtcResult represents a result of PhenotypeMtcFilter for a single phenotype.

The phenotype can either pass the filter and be included in the downstream analysis (is_passed()), or be filtered out (is_filtered_out()), in which case mtc_issue with more context regarding the culprit must be available.

static ok() PhenotypeMtcResult[source]
static fail(code: str, reason: str) PhenotypeMtcResult[source]
is_passed() bool[source]
is_filtered_out() bool[source]
property mtc_issue: PhenotypeMtcIssue | None
property reason: str | None
class gpsea.analysis.mtc_filter.PhenotypeMtcIssue(code: str, reason: str)[source]

Bases: object

The container for data available regarding the reason why a phenotype was filtered out.

code: str

A str with a unique code of the issue.

reason: str

A human-friendly explanation of the issue.
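The relationship between a result and its issue can be sketched with plain dataclasses. The Issue and Result classes below are simplified stand-ins written for illustration, not the gpsea implementation. Only the member names (ok, fail, is_passed, is_filtered_out, mtc_issue, reason) mirror the API documented above, and the validation in __post_init__ is an assumption derived from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Issue:
    """Illustrative stand-in for PhenotypeMtcIssue."""
    code: str    # unique code of the issue
    reason: str  # human-friendly explanation


@dataclass(frozen=True)
class Result:
    """Illustrative stand-in for PhenotypeMtcResult."""
    status: bool
    issue: Optional[Issue]

    def __post_init__(self):
        # Assumption: a filtered-out result must always carry an issue.
        if not self.status and self.issue is None:
            raise ValueError("a filtered-out result needs an issue")

    @staticmethod
    def ok() -> "Result":
        return Result(status=True, issue=None)

    @staticmethod
    def fail(code: str, reason: str) -> "Result":
        return Result(status=False, issue=Issue(code, reason))

    def is_passed(self) -> bool:
        return self.status

    def is_filtered_out(self) -> bool:
        return not self.status

    @property
    def mtc_issue(self) -> Optional[Issue]:
        return self.issue

    @property
    def reason(self) -> Optional[str]:
        return None if self.issue is None else self.issue.reason


passed = Result.ok()
skipped = Result.fail(code="ST1", reason="Non-specified term")
```

A passed result has no issue, so both mtc_issue and reason are None. A filtered-out result exposes the code and the explanation through mtc_issue.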

class gpsea.analysis.mtc_filter.UseAllTermsMtcFilter[source]

Bases: PhenotypeMtcFilter[Any]

UseAllTermsMtcFilter filters out no phenotype terms.

See Test all terms section for more info.

filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame]) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

Returns:

a sequence of filter results for the input phenotypes.

possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.

class gpsea.analysis.mtc_filter.SpecifiedTermsMtcFilter(terms_to_test: Iterable[TermId])[source]

Bases: PhenotypeMtcFilter[TermId]

SpecifiedTermsMtcFilter limits the HPO terms to be tested to a selection of provided terms.

In cases where we have a hypothesis about which phenotypes are relevant for testing genotype-phenotype correlations, we can pass the corresponding terms to the constructor of this class, thereby preventing other terms from being tested and reducing the multiple testing burden.

See Specify terms strategy section for more info.

NON_SPECIFIED_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='ST1', reason='Non-specified term'))

The MTC filtering result returned when an HPO term does not belong among the selection of terms to be tested.
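The selection logic can be illustrated with a minimal sketch. It assumes that HPO terms are represented as plain CURIE strings instead of hpotk.TermId objects and that a result is reduced to a (passed, issue code) tuple. The function filter_specified is hypothetical and is not part of gpsea, and the two term IDs are arbitrary examples.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# (passed, issue code): a simplified stand-in for PhenotypeMtcResult.
Outcome = Tuple[bool, Optional[str]]


def filter_specified(
    terms_to_test: Iterable[str],
    tested_terms: Sequence[str],
) -> List[Outcome]:
    """Return one outcome per tested term, in the input order."""
    allowed = set(terms_to_test)
    return [
        # "ST1" is the code of NON_SPECIFIED_TERM documented above.
        (True, None) if term in allowed else (False, "ST1")
        for term in tested_terms
    ]


outcomes = filter_specified(
    terms_to_test=["HP:0001250"],
    tested_terms=["HP:0001250", "HP:0001263"],
)
```

Only the first term belongs to the selection, so the second one is reported with the ST1 code.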

filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame]) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

Returns:

a sequence of filter results for the input phenotypes.

possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.

class gpsea.analysis.mtc_filter.HpoMtcFilter(hpo: MinimalOntology, term_frequency_threshold: float, general_hpo_terms: Iterable[TermId])[source]

Bases: PhenotypeMtcFilter[TermId]

HpoMtcFilter decides which phenotypes should be tested and which phenotypes are not worth testing.

The class leverages a number of heuristics and domain decisions. See HPO MTC filter strategy section for more info.

We recommend creating an instance using the default_filter() static factory method.

NO_GENOTYPE_HAS_MORE_THAN_ONE_HPO = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF02', reason='Skipping term because no genotype has more than one observed HPO count'))
SAME_COUNT_AS_THE_ONLY_CHILD = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF03', reason='Skipping term because of a child term with the same individual counts'))
SKIPPING_SAME_OBSERVED_PROPORTIONS = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF04', reason='Skipping term because all genotypes have same HPO observed proportions'))
SKIPPING_SINCE_ONE_GENOTYPE_HAD_ZERO_OBSERVATIONS = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF05', reason='Skipping term because one genotype had zero observations'))
SKIPPING_NON_PHENOTYPE_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF07', reason='Skipping non phenotype term'))
SKIPPING_GENERAL_TERM = PhenotypeMtcResult(status=False, issue=PhenotypeMtcIssue(code='HMF08', reason='Skipping general term'))
static default_filter(hpo: MinimalOntology, term_frequency_threshold: float, phenotypic_abnormality: TermId = DefaultTermId(idx=2, value=HP:0000118))[source]
Parameters:
  • hpo – the HPO ontology, as a MinimalOntology

  • term_frequency_threshold – a float in range \((0, 1]\) with the minimum frequency for an HPO term to have in at least one of the genotype groups (e.g., 22% in missense and 3% in nonsense genotypes would be OK, but not 13% missense and 10% nonsense genotypes if the threshold is 0.2)

  • phenotypic_abnormality – a TermId corresponding to the root of the HPO phenotype hierarchy. Specifying this option should be needed very rarely, if ever.
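The worked example in the term_frequency_threshold description can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper passes_frequency_threshold is hypothetical, it takes the per-group frequencies directly instead of deriving them from a counts data frame, and treating the threshold as inclusive is an assumption.

```python
from typing import Mapping


def passes_frequency_threshold(
    frequencies: Mapping[str, float],
    threshold: float,
) -> bool:
    """True if the term reaches the threshold in at least one genotype group."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    # Assumption: a frequency equal to the threshold is sufficient.
    return max(frequencies.values()) >= threshold


# 22% in missense exceeds the 0.2 threshold, so the term is kept.
kept = passes_frequency_threshold({"missense": 0.22, "nonsense": 0.03}, 0.2)
# Neither 13% nor 10% reaches the 0.2 threshold, so the term is dropped.
dropped = passes_frequency_threshold({"missense": 0.13, "nonsense": 0.10}, 0.2)
```

Only the highest frequency across the genotype groups matters, so a term that is common in a single group is still tested.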

filter(gt_predicate: GenotypePolyPredicate, ph_predicates: Sequence[PhenotypePolyPredicate[P]], counts: Sequence[DataFrame]) Sequence[PhenotypeMtcResult][source]

Test if the phenotype with given counts should be included in the downstream analysis.

Parameters:
  • gt_predicate – the predicate that produced the columns of the count data frame.

  • ph_predicates – the phenotype predicates that produced the rows of the counts data frames.

  • counts – a sequence of 2D data frames for the tested phenotypes. Each data frame corresponds to a genotype/phenotype contingency matrix.

Returns:

a sequence of filter results for the input phenotypes.

possible_results() Collection[PhenotypeMtcResult][source]

Return all possible result types which the PhenotypeMtcFilter can produce.

filter_method_name() str[source]

Get a str with the MTC filter name to display for humans.

static get_number_of_observed_hpo_observations(counts_frame: DataFrame, ph_predicate: PhenotypePolyPredicate[TermId]) int[source]
static get_maximum_group_observed_HPO_frequency(counts_frame: DataFrame, ph_predicate: PhenotypePolyPredicate[TermId]) float[source]
Returns:

The maximum frequency of observed HPO annotations across all genotypes.

static one_genotype_has_zero_hpo_observations(counts: DataFrame, gt_predicate: GenotypePolyPredicate)[source]
static some_cell_has_greater_than_one_count(counts: DataFrame, ph_predicate: PhenotypePolyPredicate[TermId]) bool[source]

If no genotype has more than one HPO count, we do not want to do a test. For instance, if MISSENSE has one observed HPO and N excluded, and NOT MISSENSE has zero or one observed HPO, then we will skip the test.

Parameters:
  • counts – pandas DataFrame with counts

  • ph_predicate – the phenotype predicate that produced the counts

Returns: true if at least one of the genotypes has more than one observed HPO count
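The rule can be sketched without pandas. The function below is a hypothetical illustration, not the gpsea implementation. It assumes that the observed counts have already been extracted from the counts data frame into a mapping from genotype group to count.

```python
from typing import Mapping


def any_genotype_has_more_than_one_observed(
    observed_counts: Mapping[str, int],
) -> bool:
    """True if at least one genotype group has more than one observed HPO count."""
    return any(count > 1 for count in observed_counts.values())


# One observed individual in MISSENSE and none in NOT MISSENSE: skip the test.
skip_case = any_genotype_has_more_than_one_observed(
    {"MISSENSE": 1, "NOT MISSENSE": 0}
)
# Three observed individuals in MISSENSE: the term remains eligible.
test_case = any_genotype_has_more_than_one_observed(
    {"MISSENSE": 3, "NOT MISSENSE": 1}
)
```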

static genotypes_have_same_hpo_proportions(counts: DataFrame, gt_predicate: GenotypePolyPredicate, ph_predicate: PhenotypePolyPredicate[TermId], delta: float = 0.0005) bool[source]

If each genotype has the same proportion of observed HPOs, then we do not want to do a test. For instance, if MISSENSE has 5/5 observed HPOs and NOT MISSENSE has 7/7, it makes no sense to do a statistical test.

Parameters:
  • counts – pandas DataFrame with counts

  • delta – a float with the tolerance for comparing the proportions

Returns: true if all genotypes have the same proportion of observed HPOs, within the delta tolerance
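The proportion comparison can be sketched as follows. The function proportions_match is hypothetical and is not the gpsea implementation. Following the method name, it returns True when the proportions agree within delta. Comparing the largest and the smallest proportion, and treating a group without any individuals as having a proportion of 0.0, are both assumptions.

```python
from typing import Mapping, Tuple


def proportions_match(
    counts: Mapping[str, Tuple[int, int]],
    delta: float = 0.0005,
) -> bool:
    """
    counts maps a genotype group to (observed, total).
    Return True if all observed proportions lie within delta of each other.
    """
    proportions = [
        # Assumption: an empty group contributes a proportion of 0.0.
        observed / total if total > 0 else 0.0
        for observed, total in counts.values()
    ]
    return max(proportions) - min(proportions) <= delta


# 5/5 and 7/7 are both 100%, so a statistical test would be pointless.
same = proportions_match({"MISSENSE": (5, 5), "NOT MISSENSE": (7, 7)})
# 5/10 and 7/8 clearly differ, so the term remains eligible for testing.
different = proportions_match({"MISSENSE": (5, 10), "NOT MISSENSE": (7, 8)})
```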