gpsea.analysis.pcats package

The gpsea.analysis.pcats package tests the association between genotype and phenotype groups, provided that the groups can be defined in terms of discrete, unique, and non-overlapping categories.

Each individual is assigned to a genotype group and to a phenotype group using GenotypePolyPredicate and PhenotypePolyPredicate, respectively.

A contingency matrix with group counts is prepared and the counts are tested for association using CountStatistic, such as ScipyFisherExact.
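To illustrate what testing a contingency matrix involves, here is a minimal pure-Python sketch of a two-sided Fisher's exact test on a 2x2 table. It is not the ScipyFisherExact implementation; the function name fisher_exact_two_sided and the counts are assumptions made for illustration only.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(x):
        # Unnormalized hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = weight(a)
    # Sum over all tables that are at most as likely as the observed one.
    # Integer arithmetic keeps the comparison exact.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(n, col1)


# Illustrative counts: rows are phenotype categories, columns are genotype categories
counts = [[1, 13], [7, 5]]
p_value = fisher_exact_two_sided(counts)
print(f"p = {p_value:.4f}")
```

A small p value suggests that the phenotype categories are not distributed independently of the genotype categories.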

It is typical to test several phenotype groups at the same time. Therefore, we must correct for multiple testing to limit false positive findings. See the MTC section for more information.
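As a rough illustration of multiple testing correction, the sketch below implements the Benjamini-Hochberg procedure in plain Python. It assumes that the default 'fdr_bh' value of mtc_correction refers to this procedure, as the name does in statsmodels; the helper benjamini_hochberg is hypothetical, is not part of GPSEA, and expects input without NaN values.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvals)
    # Indices of the p values sorted from the smallest to the largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down to keep the adjusted values monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


nominal = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(nominal)
alpha = 0.05  # plays the role of mtc_alpha
significant = [p <= alpha for p in corrected]
```

Each adjusted value is at least as large as its nominal p value, which is why fewer phenotypes tend to pass the mtc_alpha threshold after correction.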

The results are provided as MultiPhenotypeAnalysisResult (or as the more specific HpoTermAnalysisResult in the case of HpoTermAnalysis).

class gpsea.analysis.pcats.MultiPhenotypeAnalysis(count_statistic: CountStatistic, mtc_correction: str | None = 'fdr_bh', mtc_alpha: float = 0.05)

Bases: Generic[P]

compare_genotype_vs_phenotypes(cohort: Iterable[Patient], gt_predicate: GenotypePolyPredicate, pheno_predicates: Iterable[PhenotypePolyPredicate[P]]) -> MultiPhenotypeAnalysisResult[P]

class gpsea.analysis.pcats.MultiPhenotypeAnalysisResult

Bases: Generic[P]

MultiPhenotypeAnalysisResult reports the outcome of MultiPhenotypeAnalysis.

The result consists of several arrays whose items are ordered according to the order of the input phenotypes.

abstract property n_usable: Sequence[int]

Get a sequence with the number of patients for whom each phenotype was assessable and who are, thus, usable for genotype-phenotype correlation analysis.

abstract property all_counts: Sequence[DataFrame]

Get a DataFrame sequence where each DataFrame includes the counts of patients in genotype and phenotype groups.

An example for a genotype predicate that bins individuals into two categories (Yes and No) based on the presence of a missense variant in transcript NM_123456.7, and a phenotype predicate that checks for the presence or absence of HP:0001166 (a phenotype term):

           Has MISSENSE_VARIANT in NM_123456.7
           No       Yes
Present
Yes        1        13
No         7        5

The rows correspond to the phenotype categories, and the columns represent the genotype categories.

abstract property pvals: Sequence[float]

Get a sequence of nominal p values for each tested HPO term. The sequence includes a NaN value for each input phenotype that was not tested.

abstract property corrected_pvals: Sequence[float] | None

Get a sequence with p values for each tested HPO term after multiple testing correction or None if the correction was not applied. The sequence includes a NaN value for each input phenotype that was not tested.

property total_tests: int

Get total count of tests that were run for this analysis.

class gpsea.analysis.pcats.DiseaseAnalysis(count_statistic: CountStatistic, mtc_correction: str | None = 'fdr_bh', mtc_alpha: float = 0.05)

Bases: MultiPhenotypeAnalysis[TermId]

class gpsea.analysis.pcats.HpoTermAnalysis(count_statistic: CountStatistic, mtc_filter: PhenotypeMtcFilter, mtc_correction: str | None = 'fdr_bh', mtc_alpha: float = 0.05)

Bases: MultiPhenotypeAnalysis[TermId]

HpoTermAnalysis can be applied if the individual phenotypes are represented as HPO terms.

The analysis applies the genotype and phenotype predicates, computes the nominal p values, and addresses the multiple testing burden by applying the PhenotypeMtcFilter followed by the multiple testing correction method specified by mtc_correction.

PhenotypeMtcFilter is applied even if no multiple testing correction is requested.
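The filter-then-correct order can be sketched as follows. This is only an illustration under stated assumptions: filter_then_correct, passes_filter, and compute_pval are hypothetical names, the filter rule (a minimum number of usable patients) is invented for the example, and a Bonferroni correction is used for brevity instead of the default method.

```python
import math


def filter_then_correct(items, passes_filter, compute_pval):
    """Test only the items that pass the filter, then Bonferroni-correct them."""
    tested = [passes_filter(item) for item in items]
    n_tests = sum(tested)
    # Phenotypes that did not pass the filter are not tested and get NaN
    nominal = [
        compute_pval(item) if keep else math.nan
        for item, keep in zip(items, tested)
    ]
    # The correction accounts only for the tests that were actually run
    corrected = [
        min(1.0, p * n_tests) if keep else math.nan
        for p, keep in zip(nominal, tested)
    ]
    return nominal, corrected


# Hypothetical input: (number of usable patients, precomputed p value)
items = [(25, 0.01), (3, 0.20), (18, 0.03)]
nominal, corrected = filter_then_correct(
    items,
    passes_filter=lambda item: item[0] >= 10,
    compute_pval=lambda item: item[1],
)
```

Filtering before correction reduces the number of tests and, with it, the size of the penalty applied to the remaining p values.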

class gpsea.analysis.pcats.HpoTermAnalysisResult(pheno_predicates: Iterable[PhenotypePolyPredicate[TermId]], n_usable: Sequence[int], all_counts: Sequence[DataFrame], pvals: Sequence[float], corrected_pvals: Sequence[float] | None, gt_predicate: GenotypePolyPredicate, mtc_filter_name: str, mtc_filter_results: Sequence[PhenotypeMtcResult], mtc_name: str | None)

Bases: BaseMultiPhenotypeAnalysisResult[TermId]

HpoTermAnalysisResult includes the HpoTermAnalysis results.

In addition to the attributes of the parent MultiPhenotypeAnalysisResult, the results include the outcome of PhenotypeMtcFilter.

property mtc_filter_name: str

Get the MTC filter name.

property mtc_filter_results: Sequence[PhenotypeMtcResult]

Get a PhenotypeMtcResult for each of the phenotypes.

property mtc_name: str | None

Get the name of the multiple testing correction (MTC) procedure (e.g. bonferroni, fdr_bh, …) or None if no MTC was performed.

gpsea.analysis.pcats.apply_predicates_on_patients(patients: Iterable[Patient], gt_predicate: GenotypePolyPredicate, pheno_predicates: Sequence[PhenotypePolyPredicate[P]]) -> Tuple[Sequence[int], Sequence[DataFrame]]

Apply the phenotype predicates pheno_predicates and the genotype predicate gt_predicate to bin the patients into categories.

Note that it may not be possible to bin all patients for a given genotype/phenotype pair, since a predicate is allowed to return None (e.g. if it bins the patient into MISSENSE or NONSENSE groups but the patient has no MISSENSE or NONSENSE variants). If this happens, the patient will not be “usable” for the phenotype P.

Parameters:
  • patients – an iterable of the patients to bin into categories

  • gt_predicate – a genotype predicate to apply

  • pheno_predicates – a sequence with the phenotype predicates to apply

Returns:

  • a sequence with counts of patients that could be binned according to the phenotype P.

  • a sequence with data frames with counts of patients in i-th phenotype category and j-th genotype category where i and j are rows and columns of the data frame.

Return type:

a tuple with 2 items
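The binning logic described above can be sketched in plain Python. This is not the library's implementation: the predicates are modeled as ordinary callables that return a category or None, patients are plain dicts, the counts are kept in Counter objects instead of data frames, and all names (bin_patients, has_term, the dict keys) are hypothetical.

```python
from collections import Counter


def bin_patients(patients, gt_predicate, pheno_predicates):
    """Count patients per (phenotype category, genotype category) pair."""
    patients = list(patients)  # allow several passes over an iterable
    n_usable = []
    all_counts = []
    for pheno_predicate in pheno_predicates:
        counts = Counter()
        for patient in patients:
            gt = gt_predicate(patient)
            ph = pheno_predicate(patient)
            if gt is None or ph is None:
                # The patient cannot be binned, hence is not usable here
                continue
            counts[(ph, gt)] += 1
        n_usable.append(sum(counts.values()))
        all_counts.append(counts)
    return n_usable, all_counts


def gt_predicate(patient):
    # Bin into MISSENSE or NONSENSE; any other variant yields None
    variant = patient["variant"]
    return variant if variant in ("MISSENSE", "NONSENSE") else None


def has_term(term_id):
    # True/False if the term was assessed, None if it was not
    return lambda patient: patient["phenotypes"].get(term_id)


patients = [
    {"variant": "MISSENSE", "phenotypes": {"HP:0001166": True}},
    {"variant": "NONSENSE", "phenotypes": {"HP:0001166": False}},
    {"variant": "SYNONYMOUS", "phenotypes": {"HP:0001166": True}},
    {"variant": "MISSENSE", "phenotypes": {}},
]
n_usable, all_counts = bin_patients(patients, gt_predicate, [has_term("HP:0001166")])
```

In this toy cohort, the third patient has no genotype group and the fourth has no phenotype assessment, so only two patients are counted as usable.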

Subpackages