gpsea.analysis.pcats package

The gpsea.analysis.pcats package tests the association between genotype and phenotype classes when the classes can be defined as discrete, unique, and non-overlapping categories.

Each individual is assigned to a genotype class and a phenotype class using GenotypeClassifier and PhenotypeClassifier, respectively.

A contingency matrix with group counts is prepared and the counts are tested for association using CountStatistic, such as FisherExactTest.
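As a rough illustration of what a count statistic does with such a matrix, the sketch below computes a two-sided Fisher exact test p value for a 2x2 table using only the Python standard library. The function name `fisher_exact_two_sided` and the counts are illustrative assumptions; this is not the FisherExactTest implementation, only the textbook hypergeometric calculation.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact test p value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell
        # when the row and column totals are held fixed.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at most as likely as the observed one.
    # The small relative tolerance guards against floating point ties.
    threshold = p_observed * (1 + 1e-9)
    return sum(p for p in map(prob, range(k_min, k_max + 1)) if p <= threshold)


# Illustrative counts: rows are phenotype categories, columns are genotype categories.
counts = [[1, 13], [7, 5]]
p_value = fisher_exact_two_sided(counts)
print(f"{p_value:.4f}")  # about 0.0093
```

A pure-Python version is used here only to keep the example self-contained; in practice the test is delegated to the configured CountStatistic.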

It is typical to test several phenotype groups at the same time. Therefore, we must correct for multiple testing to prevent false-positive findings. See the MTC section for more information.
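To make the correction step concrete, here is a minimal sketch of the Benjamini-Hochberg procedure (the method behind the fdr_bh code) in plain Python. The helper `benjamini_hochberg` is hypothetical, and the choice to skip NaN entries and leave them as NaN is an assumption made for illustration, not a description of the package internals.

```python
import math


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p values; NaN entries stay NaN."""
    # Keep only the tested (non-NaN) p values together with their positions.
    tested = sorted((p, i) for i, p in enumerate(pvals) if not math.isnan(p))
    m = len(tested)
    adjusted = [math.nan] * len(pvals)
    running_min = 1.0
    # Walk from the largest p value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        p, i = tested[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[i] = running_min
    return adjusted


nominal = [0.01, 0.04, math.nan, 0.03, 0.005]
corrected = benjamini_hochberg(nominal)
```

With four tested values, the two smallest nominal p values are both adjusted to roughly 0.02 and the two largest to roughly 0.04, while the untested entry remains NaN.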

The results are provided as MultiPhenotypeAnalysisResult (or the more specific HpoTermAnalysisResult for HpoTermAnalysis).

Use configure_hpo_term_analysis() to configure the HPO term analysis with the default parameters.
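The sketch below shows how the pieces on this page might fit together. It relies only on the signatures documented here; `run_default_hpo_analysis` is a hypothetical helper, and the ontology, cohort, and classifiers are assumed to have been prepared elsewhere.

```python
def run_default_hpo_analysis(hpo, cohort, gt_clf, pheno_clfs, alpha=0.05):
    """Run the default HPO term analysis and list the significant phenotypes."""
    # Imported lazily so the helper can be defined before gpsea is available.
    from gpsea.analysis.pcats import configure_hpo_term_analysis

    # Default parameters: Fisher exact test and fdr_bh correction.
    analysis = configure_hpo_term_analysis(hpo)
    result = analysis.compare_genotype_vs_phenotypes(
        cohort=cohort,
        gt_clf=gt_clf,
        pheno_clfs=pheno_clfs,
    )
    indices = result.significant_phenotype_indices(alpha=alpha, pval_kind="corrected")
    # The signature allows None; treating None as "nothing to report" is an assumption.
    significant = [] if indices is None else [result.phenotypes[i] for i in indices]
    return result.total_tests, significant
```

Returning the number of tests alongside the significant phenotypes makes it easier to see how many hypotheses the correction had to account for.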

class gpsea.analysis.pcats.MultiPhenotypeAnalysis(count_statistic: CountStatistic, mtc_correction: str | None = 'fdr_bh', mtc_alpha: float = 0.05)[source]

Bases: Generic[P]

compare_genotype_vs_phenotypes(cohort: Iterable[Patient], gt_clf: GenotypeClassifier, pheno_clfs: Iterable[PhenotypeClassifier[P]]) → MultiPhenotypeAnalysisResult[P][source]

class gpsea.analysis.pcats.MultiPhenotypeAnalysisResult(gt_clf: GenotypeClassifier, pheno_clfs: Iterable[PhenotypeClassifier[P]], statistic: Statistic, n_usable: Sequence[int], all_counts: Sequence[DataFrame], statistic_results: Sequence[StatisticResult | None], corrected_pvals: Sequence[float] | None, mtc_correction: str | None)[source]

Bases: Generic[P], AnalysisResult

MultiPhenotypeAnalysisResult reports the outcome of an analysis that tested the association of genotype with two or more phenotypes.

property all_counts: Sequence[DataFrame]

Get a DataFrame sequence where each DataFrame includes the counts of patients in genotype and phenotype groups.

An example for a genotype predicate that bins individuals into two categories (Yes and No) based on the presence of a missense variant in transcript NM_123456.7, and a phenotype predicate that checks for the presence or absence of HP:0001166 (a phenotype term):

           Has MISSENSE_VARIANT in NM_123456.7
           No       Yes
Present
Yes        1        13
No         7        5

The rows correspond to the phenotype categories, and the columns represent the genotype categories.

property corrected_pvals: Sequence[float] | None

Get a sequence with p values for each tested phenotype after multiple testing correction or None if the correction was not applied. The sequence includes a NaN value for each phenotype that was not tested.

property mtc_correction: str | None

Get name/code of the used multiple testing correction (e.g. fdr_bh for Benjamini-Hochberg) or None if no correction was applied.

n_significant_for_alpha(alpha: float = 0.05) → int | None[source]

Get the count of the corrected p values with the value being less than or equal to alpha.

Parameters:

alpha – a float with the significance level.

property n_usable: Sequence[int]

Get a sequence with the number of patients in whom each phenotype was assessable and who are, thus, usable for genotype-phenotype correlation analysis.

property pheno_clfs: Sequence[PhenotypeClassifier[P]]

Get the phenotype classifiers used in the analysis.

property phenotypes: Sequence[P]

Get the phenotypes that were tested for association with genotype in the analysis.

property pvals: Sequence[float]

Get a sequence of nominal p values for each tested phenotype. The sequence includes a NaN value for each phenotype that was not tested.

significant_phenotype_indices(alpha: float = 0.05, pval_kind: Literal['corrected', 'nominal'] = 'corrected') → Sequence[int] | None[source]

Get the indices of phenotypes that attain significance for provided alpha.
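The selection logic can be pictured with a short sketch over a plain list of p values. The helper `significant_indices` is hypothetical; it assumes that NaN marks an untested phenotype, that None means no p values of the requested kind are available, and that a p value less than or equal to alpha counts as significant.

```python
import math
from typing import List, Optional, Sequence


def significant_indices(
    pvals: Optional[Sequence[float]],
    alpha: float = 0.05,
) -> Optional[List[int]]:
    """Return the positions of p values that are <= alpha, skipping NaN."""
    if pvals is None:
        # No p values of the requested kind are available.
        return None
    return [i for i, p in enumerate(pvals) if not math.isnan(p) and p <= alpha]


corrected = [0.02, math.nan, 0.3, 0.05]
indices = significant_indices(corrected)  # [0, 3]
n_significant = len(indices)  # 2
```

The indices refer to positions in the original phenotype sequence, so they can be used to look up the matching entries in phenotypes or all_counts.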

property statistic_results: Sequence[StatisticResult | None]

Get a sequence of StatisticResult items with nominal p values and the associated statistic values for each tested phenotype or None for the untested phenotypes.

property total_tests: int

Get total count of genotype-phenotype associations that were tested in this analysis.

class gpsea.analysis.pcats.DiseaseAnalysis(count_statistic: CountStatistic, mtc_correction: str | None = 'fdr_bh', mtc_alpha: float = 0.05)[source]

Bases: MultiPhenotypeAnalysis[TermId]

class gpsea.analysis.pcats.HpoTermAnalysis(count_statistic: CountStatistic, mtc_filter: PhenotypeMtcFilter, mtc_correction: str | None = 'fdr_bh', mtc_alpha: float = 0.05)[source]

Bases: MultiPhenotypeAnalysis[TermId]

HpoTermAnalysis can be applied if the individual phenotypes are represented as HPO terms.

The analysis applies the genotype and phenotype predicates, computes the nominal p values, and addresses the multiple testing burden by applying the PhenotypeMtcFilter followed by the multiple testing correction method specified by mtc_correction.

The PhenotypeMtcFilter is applied even when no multiple testing correction is applied.
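The interplay of filtering and correction can be sketched in plain Python as follows. Everything here is a simplified stand-in: `filter_then_correct`, `passes_filter`, and `compute_pval` are hypothetical names, and Bonferroni is used in place of the configurable correction purely to keep the example short.

```python
import math


def filter_then_correct(phenotypes, passes_filter, compute_pval, correction=None):
    """Test only phenotypes that pass the filter, then optionally correct."""
    passed = [passes_filter(ph) for ph in phenotypes]
    # Phenotypes that were filtered out are not tested and get NaN.
    pvals = [
        compute_pval(ph) if ok else math.nan
        for ph, ok in zip(phenotypes, passed)
    ]
    n_tested = sum(passed)
    if correction is None:
        # The filter above was still applied, but no correction is done.
        corrected = None
    elif correction == "bonferroni":
        corrected = [
            p if math.isnan(p) else min(1.0, p * n_tested) for p in pvals
        ]
    else:
        raise ValueError(f"Unsupported correction: {correction}")
    n_filtered_out = len(phenotypes) - n_tested
    return pvals, corrected, n_filtered_out


# Made-up phenotype labels and nominal p values.
nominal = {"term_a": 0.01, "term_b": 0.2, "term_c": 0.03}
keep = {"term_a", "term_c"}
pvals, corrected, n_out = filter_then_correct(
    list(nominal), keep.__contains__, nominal.__getitem__, correction="bonferroni"
)
```

Because only two of the three phenotypes are tested, the correction is applied over two hypotheses rather than three, which is the point of filtering before correcting.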

class gpsea.analysis.pcats.HpoTermAnalysisResult(gt_clf: GenotypeClassifier, statistic: CountStatistic, mtc_correction: str | None, pheno_clfs: Iterable[PhenotypeClassifier[TermId]], n_usable: Sequence[int], all_counts: Sequence[DataFrame], statistic_results: Sequence[StatisticResult | None], corrected_pvals: Sequence[float] | None, mtc_filter_name: str, mtc_filter_results: Sequence[PhenotypeMtcResult])[source]

Bases: MultiPhenotypeAnalysisResult[TermId]

HpoTermAnalysisResult includes the HpoTermAnalysis results.

In addition to the attributes of its parent, MultiPhenotypeAnalysisResult, the results include the outcome of PhenotypeMtcFilter.

property mtc_filter_name: str

Get the MTC filter name.

property mtc_filter_results: Sequence[PhenotypeMtcResult]

Get a PhenotypeMtcResult for each of the phenotypes.

n_filtered_out() → int[source]

Get the number of phenotype terms that were filtered out by the MTC filter.

gpsea.analysis.pcats.apply_classifiers_on_individuals(individuals: Iterable[Patient], gt_clf: GenotypeClassifier, pheno_clfs: Sequence[PhenotypeClassifier[P]]) → Tuple[Sequence[int], Sequence[DataFrame]][source]

Classify individuals with the genotype and phenotype classifiers.

Note that it may not be possible to classify all individuals with a genotype/phenotype pair, since a classifier is allowed to return None (e.g. if it assigns the individual into MISSENSE or NONSENSE groups but the patient has no MISSENSE or NONSENSE variants). If this happens, the individual will not be “usable” for the phenotype P.

Parameters:

individuals – a sequence of individuals to classify
gt_clf – classifier to assign a genotype class
pheno_clfs – a sequence of phenotype classifiers to apply

Returns:

a sequence with counts of individuals that could be classified according to the phenotype P.
a sequence with data frames with counts of patients in i-th phenotype category and j-th genotype category where i and j are rows and columns of the data frame.

Return type:

a tuple with 2 items
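The counting step can be mimicked with plain Python callables, as in the sketch below. The classifiers are hypothetical functions that return a category label or None, individuals are plain dictionaries, and a Counter keyed by (phenotype category, genotype category) stands in for the DataFrame. Treating an individual as usable only when both classifiers return a category is an assumption based on the note above.

```python
from collections import Counter


def count_classes(individuals, gt_clf, pheno_clfs):
    """Count individuals per (phenotype, genotype) category for each phenotype."""
    individuals = list(individuals)  # allow repeated iteration over any iterable
    n_usable = []
    all_counts = []
    for pheno_clf in pheno_clfs:
        counts = Counter()
        for individual in individuals:
            gt = gt_clf(individual)
            ph = pheno_clf(individual)
            if gt is None or ph is None:
                # Not classifiable, hence not usable for this phenotype.
                continue
            counts[(ph, gt)] += 1
        n_usable.append(sum(counts.values()))
        all_counts.append(counts)
    return n_usable, all_counts


def gt_clf(individual):
    variant = individual["variant"]
    return variant if variant in {"MISSENSE", "NONSENSE"} else None


def has_phenotype(individual):
    status = individual["phenotype"]
    return None if status is None else ("Yes" if status else "No")


cohort = [
    {"variant": "MISSENSE", "phenotype": True},
    {"variant": "NONSENSE", "phenotype": False},
    {"variant": "SPLICE", "phenotype": True},    # no genotype class
    {"variant": "MISSENSE", "phenotype": None},  # phenotype not assessed
]
n_usable, all_counts = count_classes(cohort, gt_clf, [has_phenotype])
```

Only the first two individuals are counted here, so `n_usable` is `[2]` and each of them contributes one count to its (phenotype, genotype) cell.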

gpsea.analysis.pcats.configure_hpo_term_analysis(hpo: MinimalOntology, count_statistic: CountStatistic = <gpsea.analysis.pcats.stats._stats.FisherExactTest object>, mtc_correction: str | None = 'fdr_bh', mtc_alpha: float = 0.05) → HpoTermAnalysis[source]

Configure HPO term analysis with default parameters.

The default analysis pre-filters HPO terms with IfHpoFilter, then computes nominal p values using count_statistic (Fisher exact test by default), and applies multiple testing correction (Benjamini-Hochberg, fdr_bh, by default) with the target mtc_alpha (0.05 by default).

Subpackages

gpsea.analysis.pcats.stats package
- CountStatistic
- FisherExactTest