gpsea.model package

The gpsea.model package defines data model classes used in GPSEA. We start with the top-level elements, such as Cohort and Patient, and we follow with data classes for phenotype, genotype, transcript, and protein info.

class gpsea.model.Cohort(members: Iterable[Patient], excluded_member_count: int)[source]

Bases: Sized, Iterable[Patient]

Cohort is a collection of individuals that have been preprocessed and are ready for genotype-phenotype association analysis.

static from_patients(members: Iterable[Patient], include_patients_with_no_HPO: bool = False, include_patients_with_no_variants: bool = False)[source]

Create a cohort from a sequence of patients.

property all_patients: Collection[Patient]

Get a collection of all patients in the cohort.

all_phenotypes() Set[Phenotype][source]

Get a set of all phenotypes (observed or excluded) in the cohort members.

count_distinct_hpo_terms() int[source]

Get count of distinct HPO terms (either in present or excluded state) seen in the cohort members.

all_measurements() Set[Measurement][source]

Get a set of all phenotypes (observed or excluded) in the cohort members.

all_diseases() Set[Disease][source]

Get a set of all diseases (observed or excluded) in the cohort members.

count_with_disease_onset() int[source]

Get the count of individuals with recorded disease onset.

all_variants() Set[Variant][source]

Get a set of all variants observed in the cohort members.

all_variant_infos() Set[VariantInfo][source]

Get a set of variant infos observed in the cohort members.

property all_transcript_ids: Set[str]

Get a set of all transcript IDs affected by the cohort variants.

property total_patient_count

Get the total number of cohort members.

get_patient_ids() Set[str][source]

Get a set of the patient IDs.

list_present_phenotypes(top: int | None = None) Sequence[Tuple[str, int]][source]

Get a sequence with counts of HPO terms used as direct annotations of the cohort members.

Parameters:

typing.Optional[int] – If not given, lists all present phenotypes. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples, formatted (phenotype CURIE,

number of patients with that phenotype)

Return type:

Sequence[Tuple[str, int]]

list_measurements(top: int | None = None) Sequence[Tuple[str, int]][source]

Get a sequence with counts of HPO terms used as direct annotations of the cohort members.

Parameters:

typing.Optional[int] – If not given, lists all present phenotypes. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples, formatted (phenotype CURIE,

number of patients with that phenotype)

Return type:

Sequence[Tuple[str, int]]

list_all_diseases(top=None) Sequence[Tuple[str, int]][source]
list_all_variants(top=None) Sequence[Tuple[str, int]][source]
Parameters:

typing.Optional[int] – If not given, lists all variants. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples, formatted (variant key, number of patients with that variant)

Return type:

list

list_all_proteins(top=None) Sequence[Tuple[str, int]][source]
Parameters:

typing.Optional[int] – If not given, lists all proteins. Otherwise, lists only the top highest counts.

Returns:

A list of tuples, formatted (protein ID string, the count of variants that affect the protein)

Return type:

list

variant_effect_count_by_tx(tx_id: str | None = None) Mapping[str, Mapping[str, int]][source]

Count variant effects for all transcripts or for a transcript tx_id of choice.

Parameters:

tx_id – a str with transcript accession (e.g. NM_123456.5) or None if all transcripts should be listed.

Returns:

Each transcript ID references a Counter(),

with the variant effect as the key and the count of variants with that effect on the transcript id.

Return type:

Mapping[str, Mapping[str, int]]

get_excluded_count() int[source]
get_variant_by_key(variant_key) Variant[source]
count_males() int[source]

Get the number of males in the cohort.

count_females() int[source]

Get the number of females in the cohort.

count_unknown_sex() int[source]

Get the number of individuals with unknown sex in the cohort.

count_alive() int[source]

Get the number of individuals reported to be alive at the time of last encounter.

count_deceased() int[source]

Get the number of individuals reported to be deceased.

count_unknown_vital_status() int[source]

Get the number of individuals with unknown or no reported vital status.

count_with_age_of_last_encounter() int[source]

Get the number of individuals with a known age of last encounter.

class gpsea.model.Patient(labels: SampleLabels, sex: Sex, age: Age | None, vital_status: VitalStatus | None, phenotypes: Iterable[Phenotype], measurements: Iterable[Measurement], diseases: Iterable[Disease], variants: Iterable[Variant])[source]

Bases: object

Patient represents a single investigated individual.

We need to know about the following attributes:

  • identifier(s) formatted as SampleLabels

  • Sex

  • age of last clinical encounter (optional) formatted as Age or None if not available.

  • vital status (optional) formatted as VitalStatus, which reports if the individual is alive or deceased plus (optional) age of death

  • HPO terms to represent the phenotype information, each HPO formatted as an instance of Phenotype

  • numerical measurements

  • disease diagnoses formatted as Disease

  • genotype information as one or more Variant

Note

We strongly recommend using the from_raw_parts() static constructor instead of __init__.

static from_raw_parts(labels: str | SampleLabels, sex: Sex | None = None, age: Age | None = None, vital_status: VitalStatus | None = None, phenotypes: Iterable[Phenotype] = (), measurements: Iterable[Measurement] = (), diseases: Iterable[Disease] = (), variants: Iterable[Variant] = ()) Patient[source]

Create Patient from the primary data.

property patient_id: str

Get a unique patient ID.

property labels: SampleLabels

Get the sample identifiers.

property sex: Sex

Get the “phenotype sex” of the sample.

property age: Age | None

Get age of the individual or None if not available.

property vital_status: VitalStatus | None

Get the vital status information for the individual or None if not available.

property phenotypes: Sequence[Phenotype]

Get the phenotypes observed and excluded in the patient.

phenotype_by_id(term_id: str | TermId) Phenotype | None[source]

Get a phenotype with an identifier or None if the individual has no such phenotype.

property measurements: Sequence[Measurement]

Get the measurements in the patient.

measurement_by_id(term_id: str | TermId) Measurement | None[source]

Get a measurement with an identifier or None if the individual has no such measurement.

Parameters:

term_id – a str with CURIE or a TermId representing the term ID of a measurement (e.g. LOINC:2986-8 for Testosterone[Mass/Vol]).

Returns:

the corresponding Measurement or None if not found in the patient.

property diseases: Sequence[Disease]

Get the diseases the patient has (not) been diagnosed with.

disease_by_id(term_id: str | TermId) Disease | None[source]

Get a disease with an identifier or None if the individual has no such disease.

property variants: Sequence[Variant]

Get a list of variants observed in the patient.

present_phenotypes() Iterator[Phenotype][source]

Get an iterator over the present phenotypes of the patient.

excluded_phenotypes() Iterator[Phenotype][source]

Get an iterator over the excluded phenotypes of the patient.

present_diseases() Iterator[Disease][source]

Get an iterator with diseases the patient was diagnosed with.

excluded_diseases() Iterator[Disease][source]

Get an iterator with diseases whose presence was excluded in the patient.

class gpsea.model.SampleLabels(label: str, meta_label: str | None = None)[source]

Bases: object

A data model for subject identifiers.

The subject has a mandatory label and an optional meta_label.

The identifiers support natural ordering, equality tests, and are hashable.

property label: str
property meta_label: str | None
label_summary() str[source]

Summarize label and meta_label into a str where the sub-parts are inserted as <label>[<meta_label>].

class gpsea.model.Sex(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Sex represents typical “phenotypic sex”, as would be determined by a midwife or physician at birth.

The definition is aligned with Phenopacket Schema

UNKNOWN_SEX = 0

Not assessed or not available. Maps to NCIT:C17998.

FEMALE = 1

Female sex. Maps to NCIT:C46113.

MALE = 2

Male sex. Maps to NCIT:C46112.

is_provided() bool[source]

Return True if the sex is a known value, such as FEMALE or MALE.

is_unknown() bool[source]

Return True if this is an UNKNOWN_SEX.

is_female() bool[source]

Return True if the sex represents a FEMALE.

is_male() bool[source]

Return True if the sex represents a MALE.

class gpsea.model.VitalStatus(status: gpsea.model._cohort.Status, age_of_death: gpsea.model._temporal.Age | None)[source]

Bases: object

status: Status
age_of_death: Age | None
property is_alive: bool
property is_deceased: bool
property is_unknown: bool
class gpsea.model.Status(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

UNKNOWN = 0
ALIVE = 1
DECEASED = 2
class gpsea.model.Age(days: float, timeline: Timeline)[source]

Bases: object

Representation of an age of an individual.

In GPSEA, we model the age at the resolution of a day. The age is either gestational or postnatal. A gestational age is created from the number of weeks and days since the last menstrual period and the postnatal age is created from years, months and days since birth.

Age overloads the comparison operators and can, thus, be compared or sorted. Gestational age is always before the postnatal age.

Internally, the age is always stored as the number of days.

ISO8601PT = re.compile('^P(?P<year>\\d+Y)?(?P<month>\\d+M)?(?P<week>\\d+W)?(?P<day>\\d+D)?(T(\\d+H)?(\\d+M)?(\\d+S)?)?$')
DAYS_IN_YEAR = 365.25
DAYS_IN_MONTH = 30.4375
DAYS_IN_WEEK = 7
static gestational(weeks: int, days: int = 0) Age[source]

Create age from the weeks and days on the gestational timeline.

Parameters:
  • weeks – a non-negative int with the number of completed weeks of gestation.

  • days – an int in range \([0, 6]\) representing the number of completed gestational days.

static last_menstrual_period() Age[source]

Age of an individual at last menstrual period.

static gestational_days(days: int | float) Age[source]

Create a gestational age corresponding to the number of days since the last menstrual period.

static birth() Age[source]

Age of an individual at birth.

static postnatal_days(days: int | float) Age[source]

Create a postnatal age corresponding to the number of days since birth.

static postnatal_years(years: int) Age[source]
static postnatal(years: int, months: int, days: int) Age[source]
static from_iso8601_period(value: str) Age[source]

Create Age from ISO8601 duration.

A value with weeks or days is parsed into a gestational age, while a value with years, months or days is parsed into a postnatal age. An error is raised if a value for weeks and months (or years) is included at the same time.

Examples

Parse gestational age:

>>> from gpsea.model import Age
>>> gestational = Age.from_iso8601_period("P2W6D")
>>> gestational
Age(days=20.0, timeline=Timeline.GESTATIONAL)

Parse postnatal age:

>>> postnatal = Age.from_iso8601_period("P10Y")
>>> postnatal
Age(days=3652.5, timeline=Timeline.POSTNATAL)
Parameters:

value – a str with the duration (e.g. P22W3D for a gestational age or P10Y4M2D for a postnatal age).

property days: float
property timeline: Timeline
property is_gestational: bool

Return True if the age is on gestational timeline.

property is_postnatal: bool

Return True if the age is on postnatal timeline.

class gpsea.model.Timeline(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Timeline represents the stage of temporal development of an organism.

There are two stages: gestational and postnatal.

Gestational timeline starts at the last menstrual period and ends at (but does not include) birth. The postnatal timeline starts at birth and ends at death.

The Timeline overloads comparison operators. Gestational timeline is always before postnatal timeline.

GESTATIONAL = 1

Gestational timeline starts at the last menstrual period and ends at birth (excluded).

POSTNATAL = 2

Postnatal timeline starts at birth and ends at death.

class gpsea.model.Phenotype(term_id: TermId, is_observed: bool, onset: Age | None)[source]

Bases: Identified, ObservableFeature, OnsetAware

Phenotype represents a clinical sign or symptom represented as an HPO term.

The phenotype can be either present in the patient or excluded.

static from_term(term: MinimalTerm, is_observed: bool)[source]
static from_raw_parts(term_id: str | TermId, is_observed: bool, onset: Age | None = None) Phenotype[source]

Create Phenotype from a term ID and observation state.

Parameters:
  • term_id – a str with CURIE (e.g. HP:0001250) or a TermId.

  • is_observedTrue if the term ID was observed in patient or False if it was explicitly excluded.

  • onset – the Age when the phenotype was first observed in patient or None if not available.

property identifier: TermId

Get the phenotype ID; an HPO term ID most of the time.

property is_present: bool

Return True if the phenotype feature was observed in the subject or False if the feature’s presence was explicitly excluded.

property observed: bool | None

Returns a boolean for whether the phenotype is observed.

Returns:

True if this phenotype was observed in the respective patient.

Return type:

bool

property onset: Age | None

Get the age of onset of the phenotype or None if unknown or otherwise not available.

property is_observed: bool

Returns True if the phenotype was present in the respective patient.

class gpsea.model.Disease(term_id: TermId, name: str, is_observed: bool, onset: Age | None)[source]

Bases: Identified, ObservableFeature, Named, OnsetAware

Representation of a disease diagnosed (or excluded) in an investigated individual.

static from_raw_parts(term_id: str | TermId, name: str, is_observed: bool, onset: Age | None = None) Disease[source]
property identifier: TermId

Get the disease ID.

property name

Get the disease label (e.g. LEIGH SYNDROME, NUCLEAR; NULS).

property is_present: bool

Return True if the disease was diagnosed in the individual or False if it was excluded.

property onset: Age | None

Get the age of onset of the disease or None if unknown or otherwise not available.

class gpsea.model.Measurement(test_term_id: TermId, test_name: str, test_result: float, unit: TermId)[source]

Bases: Identified, Named

Representation of a GA4GH Phenopacket Measurement (numerical test result). An intended use case would be to perform a Student’s t test on numerical measurements in individuals with two difference genotype classes.

property identifier: TermId

Get the test ID, e.g. LOINC:2986-8.

property name

Get the test label (e.g. Testosterone [Mass/volume] in Serum or Plasma for LOINC 2986-8).

property test_result: float

Return True if the disease was diagnosed in the individual or False if it was excluded.

property unit: TermId

Return the unit for the test result encoded as a TermId, e.g., UCUM:ng/dL for nanogram per deciliter.

class gpsea.model.Variant(variant_info: VariantInfo, tx_annotations: Iterable[TranscriptAnnotation], genotypes: Genotypes)[source]

Bases: VariantInfoAware, FunctionalAnnotationAware, Genotyped

Variant includes three lines of information:
  • the variant data with coordinates or other info available for large imprecise SVs,

  • results of the functional annotation with respect to relevant transcripts, and

  • the genotypes for the known samples

static create_variant_from_scratch(variant_coordinates: VariantCoordinates, gene_name: str, trans_id: str, hgvs_cdna: str, is_preferred: bool, consequences: Iterable[VariantEffect], exons_effected: Sequence[int], protein_id: str | None, hgvsp: str | None, protein_effect_start: int | None, protein_effect_end: int | None, genotypes: Genotypes)[source]
property variant_info: VariantInfo

Get the representation of the variant data for sequence and symbolic variants, as well as for large imprecise SVs.

property tx_annotations: Sequence[TranscriptAnnotation]

A collection of TranscriptAnnotations that each represent results of the functional annotation of a variant with respect to single transcript of a gene.

Returns:

A sequence of TranscriptAnnotation objects

Return type:

Sequence[TranscriptAnnotation]

property genotypes: Genotypes
class gpsea.model.VariantClass(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

VariantClass represents a high-level variant category which mostly corresponds to the structural variant categories of the Variant Call Format specification, but includes type for single nucleotide variants (SNV) and multi-nucleotide variant (MNV).

DEL = 0

A deletion - a variant with a net loss of sequence from the alternative allele regardless of its size.

Both a deletion of 1 base pair and a deletion of 1000 base pairs are acceptable.

DUP = 1

Duplication (tandem or interspersed).

INS = 2

Insertion of a novel sequence.

INV = 3

Inversion of a chromosome segment.

MNV = 4

Multi-nucleotide variant (e.g. AG>CT) that is not a duplication, deletion, or insertion. May be called INDEL.

SNV = 5

Single nucleotide variant.

TRANSLOCATION = 6

A chromosomal translocation, which occurs when a chromosome breaks and the (typically two) fragmented pieces re-attach to different chromosomes.

class gpsea.model.VariantCoordinates(region: GenomicRegion, ref: str, alt: str, change_length: int)[source]

Bases: object

A representation of coordinates of sequence and symbolic variants.

Note, the breakend variants are not currently supported.

static from_vcf_literal(contig: Contig, pos: int, ref: str, alt: str)[source]

Create VariantCoordinates from a variant in VCF literal notation.

Note, this function must not be used to create a VCF symbolic variant (e.g. <DEL> or translocation). Use from_vcf_symbolic() instead.

Example

Create a variant from a VCF line: ` #CHROM  POS     ID  REF ALT ... chr1    1001    .   C   T `

We first must decide on the genome build. Most of the time, we should use GRCh38:

>>> from gpsea.model.genome import GRCh38
>>> build = GRCh38

Then, we access the contig for 'chr1':

>>> chr1 = build.contig_by_name('chr1')

Last, we create the variant coordinates:

>>> from gpsea.model import VariantCoordinates
>>> vc = VariantCoordinates.from_vcf_literal(
...     contig=chr1, pos=1001, ref='C', alt='T',
... )

Now can test the properties:

>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length
(1000, 1001, 'C', 'T', 1, 0)
Parameters:
  • contig – a Contig for the chromosome

  • pos – a 1-based coordinate of the first base of the reference allele, as described in VCF standard

  • ref – a str with the REF allele. Should meet the requirements of the VCF standard.

  • alt – a str with the ALT allele. Should meet the requirements of the VCF standard.

static from_vcf_symbolic(contig: Contig, pos: int, end: int, ref: str, alt: str, svlen: int)[source]

Create VariantCoordinates from a variant in VCF symbolic notation.

Note, this function must not be used to create a VCF sequence/literal variant. Use from_vcf_literal() instead.

Example

Let’s create a symbolic variant from the line:

` #CHROM   POS      ID   REF   ALT     QUAL   FILTER   INFO 2        321682   .    T     <DEL>   6      PASS     SVTYPE=DEL;END=321887;SVLEN=-205 `

We first must decide on the genome build. Most of the time, we should use GRCh38:

>>> from gpsea.model.genome import GRCh38
>>> contig = GRCh38.contig_by_name('2')

Now, we create the coordinates as:

>>> vc = VariantCoordinates.from_vcf_symbolic(
...     contig=contig, pos=321682, end=321887,
...     ref='T', alt='<DEL>', svlen=-205,
... )

Now can test the properties:

>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length
(321681, 321887, 'T', '<DEL>', 206, -205)
Parameters:
  • contig – a Contig for the chromosome

  • pos – a 1-based coordinate of the first base of the affected reference allele region

  • end – a 1-based coordinate of the last base of the affected reference allele region

  • ref – a str with the REF allele. Most of the time, it is one of {‘N’, ‘A’, ‘C’, ‘G’, ‘T’}

  • alt – a str with the ALT allele, e.g. one of {‘<DEL>’, ‘<DUP>’, ‘<INS>’, ‘<INV>’}

  • svlen – an int with change length (the difference between ref and alt allele lengths)

property chrom: str

Get the label of the chromosome/contig where the variant is located.

property start: int

Get the 0-based start coordinate (excluded) of the first base of the ref allele.

property end: int

Get the 0-based end coordinate (included) of the last base of the ref allele.

property region: GenomicRegion

Get the genomic region spanned by the ref allele.

property ref: str

Get the reference allele (e.g. “A”, “CCT”, “N”). The allele may be an empty string.

property alt: str

Get the alternate allele (e.g. “A”, “GG”, “<DEL>”).

The allele may be an empty string for sequence variants. The symbolic alternate allele follow the VCF notation and use the < and > characters (e.g. “<DEL>”, “<INS:ME:SINE>”).

property change_length: int

Get the change of length between the ref and alt alleles due to the variant presence.

See Change length of an allele for more info.

property variant_key: str

Get a readable representation of the variant’s coordinates.

For instance, X_12345_12345_C_G for a sequence variant or 22_10001_20000_INV for a symbolic variant. If the key is larger than 50 characters, the ‘ref’ and/or ‘alt’ (if over 10 bps) are changed to just show number of bps. Example: X_1000001_1000027_TAAAAAAAAAAAAAAAAAAAAAAAAAA_T -> X_1000001_1000027_--27bp--_T

Note

Both start and end coordinates use 1-based (included) coordinate system.

property variant_class: VariantClass

Get a VariantClass category.

is_structural() bool[source]

Checks if the variant coordinates use structural variant notation as described by Variant Call Format (VCF).

Ane example of structural variant notation:

chr5  101 . N <DEL> .  .  SVTYPE=DEL;END=120;SVLEN=-10

as opposed to the sequence (literal) notation:

chr5  101 . NACGTACGTAC N
Returns:

True if the variant coordinates use structural variant notation.

class gpsea.model.ImpreciseSvInfo(structural_type: TermId, variant_class: VariantClass, gene_id: str, gene_symbol: str)[source]

Bases: object

Data regarding a structural variant (SV) with imprecise breakpoint coordinates.

property structural_type: TermId

Get term ID of the structural type (e.g. SO:1000029 for chromosomal deletion).

property variant_class: VariantClass

Get a VariantClass category.

property gene_id: str

Get a str with gene identifier CURIE (e.g. HGNC:3603) or None if the identifier is not available.

property gene_symbol: str

Get a str with HGVS gene symbol (e.g. FBN1) or None if the symbol is not available.

property variant_key: str

Get a readable representation of the variant.

class gpsea.model.VariantInfo(variant_coordinates: VariantCoordinates | None = None, sv_info: ImpreciseSvInfo | None = None)[source]

Bases: object

VariantInfo consists of either variant coordinates or imprecise SV data.

The class is conceptually similar to Rust enum - only one of the fields can be set at any point in time.

property variant_coordinates: VariantCoordinates | None

Get variant coordinates if available.

has_variant_coordinates() bool[source]

Returns True if the variant coordinates are available.

property sv_info: ImpreciseSvInfo | None

Get information about large imprecise SV.

has_sv_info() bool[source]

Returns True if the variant is a large imprecise SV and the exact coordinates are thus unavailable.

property variant_key: str

Get a readable representation of the variant’s coordinates or the large SV info.

property variant_class: VariantClass

Get a VariantClass category.

is_structural() bool[source]

Test if the variant is a structural variant.

This can either be because the variant coordinates use the structural variant notation (see VariantCoordinates.is_structural()) or if the variant is large imprecise SV.

class gpsea.model.VariantInfoAware[source]

Bases: object

An entity where VariantInfo is available.

abstract property variant_info: VariantInfo

Get the variant data with coordinates or other info available for large imprecise SVs.

class gpsea.model.Genotype(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Genotype represents state of a variable locus in a diploid genome.

NO_CALL = ('.',)
HOMOZYGOUS_REFERENCE = ('0/0',)
HETEROZYGOUS = ('0/1',)
HOMOZYGOUS_ALTERNATE = ('1/1',)
HEMIZYGOUS = ('1',)
property code: str
class gpsea.model.Genotypes(samples: Iterable[SampleLabels], genotypes: Iterable[Genotype])[source]

Bases: Sized, Iterable

Genotypes is a container for mapping between sample ID and its genotype.

Let’s consider a pair of samples:

>>> a = SampleLabels('A')
>>> b = SampleLabels('B')

We can use one of the static methods to create an instance. Either a single genotype:

>>> gt = Genotypes.single(a, Genotype.HETEROZYGOUS)

or genotypes of several samples:

>>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})

There are 2 genotypes in the container:

>>> len(gts)
2

You can get a genotype for a sample ID:

>>> g = gts.for_sample(a)
>>> g.code
'0/1'

You will get None if the sample is not present:

>>> gts.for_sample(SampleLabels('UNKNOWN'))

You can iterate over sample-genotype pairs:

>>> for sample_id, genotype in gts:
...   print(sample_id, genotype)
A 0/1
B 1/1
static empty()[source]
static single(sample_id: SampleLabels, genotype: Genotype)[source]

A shortcut for creating Genotypes for a single sample:

>>> a = SampleLabels('A')
>>> gts = Genotypes.single(a, Genotype.HOMOZYGOUS_ALTERNATE)
>>> assert len(gts) == 1
>>> assert gts.for_sample(a) == Genotype.HOMOZYGOUS_ALTERNATE
static from_mapping(mapping: Mapping[SampleLabels, Genotype])[source]

Create Genotypes from mapping between sample IDs and genotypes.

>>> a = SampleLabels('A')
>>> b = SampleLabels('B')
>>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})
>>> assert len(gts) == 2
for_sample(sample_id: SampleLabels) Genotype | None[source]

Get a genotype for a sample or None if the genotype is not present.

Parameters:

sample_id – a SampleLabels with sample’s identifier.

class gpsea.model.Genotyped[source]

Bases: object

Genotyped entities

abstract property genotypes: Genotypes
genotype_for_sample(sample_id: SampleLabels) Genotype | None[source]

Get a genotype for a sample or None if the genotype is not present.

Parameters:

sample_id – a SampleLabels with sample’s identifier.

class gpsea.model.TranscriptAnnotation(gene_id: str, tx_id: str, hgvs_cdna: str | None, is_preferred: bool, variant_effects: Iterable[VariantEffect], affected_exons: Iterable[int] | None, protein_id: str | None, hgvsp: str | None, protein_effect_coordinates: Region | None)[source]

Bases: TranscriptInfoAware

TranscriptAnnotation represent a result of the functional annotation of a variant with respect to single transcript of a gene.

property gene_id: str

Get the HGVS symbol of the affected gene (e.g. SURF1).

property transcript_id: str

Get the RefSeq identifier of the transcript (e.g. NM_123456.7).

property is_preferred: bool

Return True if the transcript is the preferred transcript of a gene, such as MANE transcript, canonical Ensembl transcript.

property hgvs_cdna: str | None

Get the HGVS description of the sequence variant (e.g. NM_123456.7:c.9876G>T) or None if not available.

property variant_effects: Sequence[VariantEffect]

Get a sequence of the predicted functional variant effects.

property overlapping_exons: Sequence[int] | None

Get a sequence of 1-based exon indices (the index of the 1st exon is 1) that overlap with the variant.

property protein_id: str | None

Get the ID of the protein encoded by the transcript_id or None if the transcript is not protein-coding.

property hgvsp: str | None

Get the HGVS description of the protein sequence variant (e.g. NP_001027559.1:p.Gln421Ter) or None if not available.

property protein_effect_location: Region | None

Get the Region with start and end amino-acid coordinates affected by the variant.

class gpsea.model.VariantEffect(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

VariantEffect represents consequences of a variant on transcript that are supported by GPSEA.

>>> from gpsea.model import VariantEffect
>>> missense = VariantEffect.MISSENSE_VARIANT
>>> print(missense)
missense_variant

The VariantEffect has a curie attribute that represents the ontology class from Sequence Ontology.

>>> missense.curie
'SO:0001583'
TRANSCRIPT_ABLATION = 'SO:0001893'
SPLICE_ACCEPTOR_VARIANT = 'SO:0001574'
SPLICE_DONOR_VARIANT = 'SO:0001575'
STOP_GAINED = 'SO:0001587'
FRAMESHIFT_VARIANT = 'SO:0001589'
STOP_LOST = 'SO:0001578'
START_LOST = 'SO:0002012'
TRANSCRIPT_AMPLIFICATION = 'SO:0001889'
INFRAME_INSERTION = 'SO:0001821'
INFRAME_DELETION = 'SO:0001822'
MISSENSE_VARIANT = 'SO:0001583'
PROTEIN_ALTERING_VARIANT = 'SO:0001818'
SPLICE_REGION_VARIANT = 'SO:0001630'
SPLICE_DONOR_5TH_BASE_VARIANT = 'SO:0001787'
SPLICE_DONOR_REGION_VARIANT = 'SO:0002170'
SPLICE_POLYPYRIMIDINE_TRACT_VARIANT = 'SO:0002169'
INCOMPLETE_TERMINAL_CODON_VARIANT = 'SO:0001626'
START_RETAINED_VARIANT = 'SO:0002019'
STOP_RETAINED_VARIANT = 'SO:0001567'
SYNONYMOUS_VARIANT = 'SO:0001819'
CODING_SEQUENCE_VARIANT = 'SO:0001580'
MATURE_MIRNA_VARIANT = 'SO:0001620'
FIVE_PRIME_UTR_VARIANT = 'SO:0001623'
THREE_PRIME_UTR_VARIANT = 'SO:0001624'
NON_CODING_TRANSCRIPT_EXON_VARIANT = 'SO:0001792'
INTRON_VARIANT = 'SO:0001627'
NMD_TRANSCRIPT_VARIANT = 'SO:0001621'
NON_CODING_TRANSCRIPT_VARIANT = 'SO:0001619'
UPSTREAM_GENE_VARIANT = 'SO:0001631'
DOWNSTREAM_GENE_VARIANT = 'SO:0001632'
TFBS_ABLATION = 'SO:0001895'
TFBS_AMPLIFICATION = 'SO:0001892'
TF_BINDING_SITE_VARIANT = 'SO:0001782'
REGULATORY_REGION_ABLATION = 'SO:0001894'
REGULATORY_REGION_AMPLIFICATION = 'SO:0001891'
FEATURE_ELONGATION = 'SO:0001907'
REGULATORY_REGION_VARIANT = 'SO:0001566'
FEATURE_TRUNCATION = 'SO:0001906'
INTERGENIC_VARIANT = 'SO:0001628'
SEQUENCE_VARIANT = 'SO:0001060'
to_display() str[source]

Get a concise name of the variant effect that is suitable for showing to humans.

Example

>>> from gpsea.model import VariantEffect
>>> VariantEffect.MISSENSE_VARIANT.to_display()
'missense'
>>> VariantEffect.SPLICE_DONOR_5TH_BASE_VARIANT.to_display()
'splice donor 5th base'
Returns:

a str with the name or ‘n/a’ if the variant effect was not assigned a concise name.

static structural_so_id_to_display(so_term: TermId | str) str[source]

Get a str with a concise name for representing a Sequence Ontology (SO) term identifier.

Example

>>> from gpsea.model import VariantEffect
>>> VariantEffect.structural_so_id_to_display('SO:1000029')
'chromosomal deletion'
Parameters:

so_term – a CURIE str or a TermId with the query SO term.

Returns:

a str with the concise name for the SO term or ‘n/a’ if a name has not been assigned yet.

property curie: str

Get a compact URI (CURIE) of the variant effect (e.g. SO:0001583 for a missense variant).

class gpsea.model.TranscriptInfoAware[source]

Bases: object

The implementors know about basic gene/transcript identifiers.

abstract property gene_id: str

Returns: string: The gene symbol (e.g. SURF1)

abstract property transcript_id: str

Returns: string: The transcript RefSeq identifier (e.g. NM_123456.7)

class gpsea.model.FunctionalAnnotationAware[source]

Bases: object

abstract property tx_annotations: Sequence[TranscriptAnnotation]
get_tx_anno_by_tx_id(transcript_id: str) TranscriptAnnotation | None[source]

Given a transcript ID, this will return the TranscriptAnnotation associated with that variant and transcript.

Parameters:

transcript_id (str) – A transcript ID - i.e. ‘NM_170707.4’

Returns:

The Transcript Annotation if available.

Return type:

Optional[TranscriptAnnotation]

get_hgvs_cdna_by_tx_id(transcript_id: str) str | None[source]

Given a transcript ID, will return the hgvs cdna string associated with that variant and transcript.

Parameters:

transcript_id (str) – A transcript ID - i.e. ‘NM_170707.4’

Returns:

The hgvs cdna if available - i.e. ‘NM_170707.4:c.1824C>T’

Return type:

str or None

get_preferred_tx_annotation() TranscriptAnnotation | None[source]

Get the TranscriptAnnotation that represents the result of the functional annotation with respect to the preferred transcript of a gene.

Returns None if transcript annotations is no preferred transcript found.

Returns:

The TranscriptAnnotation with respect

to the preferred transcript or None if the preferred transcript info is not available.

Return type:

Optional[TranscriptAnnotation]

class gpsea.model.TranscriptCoordinates(identifier: str, region: GenomicRegion, exons: Iterable[GenomicRegion], cds_start: int | None, cds_end: int | None, is_preferred: bool | None = None)[source]

Bases: object

TranscriptCoordinates knows about genomic region of the transcript, exonic/intronic regions, as well as the coding and non-coding regions.

If both CDS coordinates are None, then the transcript coordinates are assumed to represent a non-coding transcript.

property identifier: str

Get transcript’s identifier (e.g. ENST00000123456.7, NM_123456.7).

property region: GenomicRegion

Get the genomic region spanned by the transcript, corresponding to 5’UTR, exonic, intronic, and 3’UTR regions.

property exons: Sequence[GenomicRegion]

Get the exon regions.

property cds_start: int | None

Get the 0-based (excluded) start coordinate of the first base of the start codon of the transcript or None if the transcript is not coding.

property cds_end: int | None

Get the 0-based (included) end coordinate of the last base of the termination codon of the transcript or None if the transcript is not coding.

is_coding() bool[source]

Return True if the transcript is coding/translated.

get_coding_base_count() int | None[source]

Calculate the number of coding bases present in the transcript. Note, the count does NOT include the termination codon since it does not code for an aminoacid. Returns: an int with the coding base count or None if the transcript is non-coding.

get_codon_count() int | None[source]

Calculate the count of codons present in the transcript. Note, the count does NOT include the termination codon!

Returns: the number of codons of the transcript or None if the transcript is non-coding.

get_five_prime_utrs() Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to 5’ untranslated regions of the transcript.

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

get_three_prime_utrs() Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to 3’ untranslated regions of the transcript.

Note, the termination codon is NOT included in the regions!

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

get_cds_regions() Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to coding regions of the transcript, including BOTH the initiation and termination codons.

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

property is_preferred: bool | None

Check if the transcript belongs among the preferred transcripts of the gene. This usually means that the transcript was chosen by the MANE project or similar.

Returns None if the info is not available.

class gpsea.model.ProteinMetadata(protein_id: str, label: str, protein_features: Iterable[ProteinFeature], protein_length: int)[source]

Bases: object

An info regarding a protein sequence, including an ID, a label, and location of protein features, such as motifs, domains, or other regions.

The information is usually retrieved from a resource such as UniprotProteinMetadataService, but it can also be created manually using from_feature_frame() function.

Example

Let’s create a protein info with a domain and a region. We must provide protein accession ID, a label, a data frame with protein features, and the number of aminoacids of the protein sequence:

>>> protein_id = 'NP_000129.3'
>>> label = 'fibrillin-1 isoform a preproprotein'
>>> protein_length = 1000

Now let’s prepare a data frame with the protein features. We will prepare a domain and a region:

>>> import pandas as pd
>>> features = [
...     {
...         "region": "Suppresor domain",
...         "category": "domain",
...         "start": 1,
...         "end": 223,
...     },
...     {
...         "region": "IP3 binding",
...         "category": "region",
...         "start": 224,
...         "end": 578,
...     },
... ]
>>> df = pd.DataFrame(features)

last, we can put the protein info together:

>>> from gpsea.model import ProteinMetadata
>>> protein_meta = ProteinMetadata.from_feature_frame(
...     protein_id=protein_id,
...     label=label,
...     features=df,
...     protein_length=protein_length,
... )

and get the expected protein info:

>>> protein_meta.protein_id
'NP_000129.3'
>>> protein_meta.label
'fibrillin-1 isoform a preproprotein'
>>> len(protein_meta.protein_features)
2
static from_feature_frame(protein_id: str, label: str, features: DataFrame, protein_length: int) ProteinMetadata[source]

Create ProteinMetadata from a user-supplied pandas DataFrame. We expect to obtain the gene symbol, protein identifier, and regions

The DataFrame should include the following columns:

region

category

start | end

Suppresor domain

domain

1

223

IP3 binding

region

224

578

The region column includes the protein feature name. The category must be one of ‘repeat’, ‘motif’, ‘domain’, or ‘region’. Use region if no other option fits. Last, start and end denote 1-based start and end coordinates of the aminoacid sequence region described by the feature. For instance, [1, 10] for the first ten aminoacids of the protein.

Parameters:
  • protein_id – the accession id of the protein, e.g. ‘NP_000129.3’.

  • label – human-readable label, e.g. ‘fibrillin-1 isoform a preproprotein’.

  • features – a dataframe with of the protein features.

  • protein_length – a positive int representing the number of aminoacids included in the protein sequence.

Raises:

ValueError – if case of issues during parsing the provided data.

static from_uniprot_json(protein_id: str, label: str, uniprot_json: IOBase | str, protein_length: int) ProteinMetadata[source]

Create ProteinMetadata from a json file that has been downloaded from UniProt.

Go to the UniProt page for the protein of interest, then go to the section “Family & Domains”, and the subsection “Features”. Click on the Download symbol. You will be presented with a JSON file for download. From this, we extract information about the gene symbol, protein identifier, and regions. This method is intended to be a backup if the API call to UniProt fails; the same information should be retrieved. See the test file “test_uniprot_json.py” for details about the JSON parsing etc.

Parameters:
  • protein_id – the accession id of the protein, e.g. ‘NP_000129.3’.

  • label – human-readable label, e.g. ‘fibrillin-1 isoform a preproprotein’.

  • uniprot_json – a str with the path or an IO object with the Uniprot JSON data.

  • protein_length – a positive int representing the number of aminoacids included in the protein sequence.

Raises:

ValueError – if case of issues during parsing the provided data.

property protein_id: str

Get the protein’s accession ID, e.g. NP_000129.3.

property label: str

Get the protein label, e.g. fibrillin-1 isoform a preproprotein.

property protein_features: Sequence[ProteinFeature]

Get a sequence of protein features.

property protein_length: int

Get the number of aminoacids of the protein sequence.

domains() Iterable[ProteinFeature][source]
Returns:

A subgroup of the protein features that correspond to protein domains.

Return type:

Iterable[ProteinFeature]

repeats() Iterable[ProteinFeature][source]
Returns:

A subgroup of the protein features that correspond to repeat regions.

Return type:

Iterable[ProteinFeature]

regions() Iterable[ProteinFeature][source]
Returns:

A subgroup of the protein features that correspond to generic regions.

Return type:

Iterable[ProteinFeature]

motifs() Iterable[ProteinFeature][source]
Returns:

A subgroup of the protein features that correspond to motifs.

Return type:

Iterable[ProteinFeature]

get_features_variant_overlaps(region: Region) Collection[ProteinFeature][source]

Get a collection of protein features that overlap with the region. :param region: the query region.

Returns:

a collection of overlapping protein features.

Return type:

Collection[ProteinFeature]

class gpsea.model.ProteinFeature[source]

Bases: object

static create(info: FeatureInfo, feature_type: FeatureType) ProteinFeature[source]
abstract property info: FeatureInfo
abstract property feature_type: FeatureType
to_string() str[source]
class gpsea.model.FeatureInfo(name: str, region: Region)[source]

Bases: object

FeatureInfo represents a protein feature (e.g. a repeated sequence given the name “ANK 1” in protein “Ankyrin repeat domain-containing protein 11”)

property name: str

Returns: string: the name of the protein feature

property start: int

Returns: integer: A 0-based (excluded) start coordinate of the protein feature.

property end: int

Returns: integer: A 0-based (included) end coordinate of the protein feature.

property region: Region

Returns: Region: a protein region spanned by the feature.

overlaps_with(region: Region) bool[source]

Covenience function to check whether a region overlaps with a protein feature.

class gpsea.model.FeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

An enum representing the protein feature types supported in GPSEA.

REPEAT = 1

A repeated sequence motif or repeated domain within the protein.

MOTIF = 2

A short (usually not more than 20 amino acids) conserved sequence motif of biological significance.

DOMAIN = 3

A specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.

COILED_COIL = 4

a structural motif in proteins, characterized by two or more α-helices wrapped around each other in a supercoil. This structure is often involved in protein-protein interactions

COMPOSITIONAL_BIAS = 5

Compositional bias refers to a region in a protein where certain amino acids are overrepresented compared to the rest of the protein or compared to typical protein composition. These regions tend to have a non-random distribution of amino acids, often leading to specific structural or functional properties.

REGION = 6

A region of interest that cannot be described in other subsections.

ZINC_FINGER = 7

A zinc finger is a small, functional, independently folded domain that coordinates one or more zinc ions to stabilize its structure through cysteine and/or histidine residues.

static from_string(category: str) FeatureType[source]

Subpackages