gpsea.model package

The gpsea.model package defines data model classes used in GPSEA. We start with the top-level elements, such as Cohort and Patient, and we follow with data classes for phenotype, genotype, transcript, and protein info.

class gpsea.model.Cohort(members: Iterable[Patient], excluded_member_count: int)[source]

Bases: Sized, Iterable[Patient]

Cohort is a collection of individuals that have been preprocessed and are ready for genotype-phenotype association analysis.

all_diseases() → Set[Disease][source]: Get a set of all diseases (observed or excluded) in the cohort members.

all_measurements() → Set[Measurement][source]: Get a set of all phenotypes (observed or excluded) in the cohort members.

property all_patients: Collection[Patient]: Get a collection of all patients in the cohort.

all_phenotypes() → Set[Phenotype][source]: Get a set of all phenotypes (observed or excluded) in the cohort members.

property all_transcript_ids: Set[str]: Get a set of all transcript IDs affected by the cohort variants.

all_variant_infos() → Set[VariantInfo][source]: Get a set of variant infos observed in the cohort members.

all_variants() → Set[Variant][source]: Get a set of all variants observed in the cohort members.

count_alive() → int[source]: Get the number of individuals reported to be alive at the time of last encounter.

count_deceased() → int[source]: Get the number of individuals reported to be deceased.

count_distinct_diseases() → int[source]: Get count of distinct disease diagnoses of the cohort members.

count_distinct_hpo_terms() → int[source]: Get count of distinct HPO terms (either in present or excluded state) seen in the cohort members.

count_distinct_measurements() → int[source]: Get count of distinct measurements in the cohort members.

count_females() → int[source]: Get the number of females in the cohort.

count_males() → int[source]: Get the number of males in the cohort.

count_unknown_sex() → int[source]: Get the number of individuals with unknown sex in the cohort.

count_unknown_vital_status() → int[source]: Get the number of individuals with unknown or no reported vital status.

count_with_age_of_last_encounter() → int[source]: Get the number of individuals with a known age of last encounter.

count_with_disease_onset() → int[source]: Get the count of individuals with recorded disease onset.

static from_patients(members: Iterable[Patient], include_patients_with_no_HPO: bool = False, include_patients_with_no_variants: bool = False)[source]: Create a cohort from a sequence of patients.

get_excluded_count() → int[source]

get_patient_ids() → Set[str][source]: Get a set of the patient IDs.

get_variant_by_key(variant_key) → Variant[source]

list_all_diseases(top: int | None = None) → Sequence[Tuple[str, int]][source]

Get a sequence with counts of disease of the cohort members.

Parameters:

typing.Optional[int] – If not given, lists all diseases. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples,: formatted (Disease, number of individuals with the diagnosis)

Return type:

Sequence[Tuple[Measurement, int]]

list_all_proteins(top=None) → Sequence[Tuple[str, int]][source]

Parameters:: typing.Optional[int] – If not given, lists all proteins. Otherwise, lists only the top highest counts.
Returns:: A list of tuples, formatted (protein ID string, the count of variants that affect the protein)
Return type:: list

list_all_variants(top: int | None = None) → Sequence[Tuple[str, int]][source]

Get the number of times a variant has been observed in cohort individuals.

Note, this is not the same as the number of variant alleles in the cohort individuals! This method counts each variant once, regardless of its genotype!

Parameters:: typing.Optional[int] – If not given, lists all variants. Otherwise, lists only the top highest counts
Returns:: A sequence of tuples, formatted (variant key, number of patients with that variant)
Return type:: list

list_measurements(top: int | None = None) → Sequence[Tuple[str, int]][source]

Get a sequence with counts of measurements of the cohort members.

Parameters:

typing.Optional[int] – If not given, lists all measurements. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples,: formatted (Measurement, number of individuals with the measurement)

Return type:

Sequence[Tuple[Measurement, int]]

list_present_phenotypes(top: int | None = None) → Sequence[Tuple[str, int]][source]

Get a sequence with counts of HPO terms used as direct annotations of the cohort members.

Parameters:

typing.Optional[int] – If not given, lists all present phenotypes. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples, formatted (phenotype CURIE,: number of patients with that phenotype)

Return type:

Sequence[Tuple[str, int]]

property total_patient_count: Get the total number of cohort members.

variant_effect_count_by_tx(tx_id: str | None = None) → Mapping[str, Mapping[str, int]][source]

Count variant effects for all transcripts or for a transcript tx_id of choice.

Parameters:

tx_id – a str with transcript accession (e.g. NM_123456.5) or None if all transcripts should be listed.

Returns:

Each transcript ID references a Counter(),: with the variant effect as the key and the count of variants with that effect on the transcript id.

Return type:

Mapping[str, Mapping[str, int]]

class gpsea.model.Patient(labels: SampleLabels, sex: Sex, age: Age | None, vital_status: VitalStatus | None, phenotypes: Iterable[Phenotype], measurements: Iterable[Measurement], diseases: Iterable[Disease], variants: Iterable[Variant])[source]

Bases: object

Patient represents a single investigated individual.

We need to know about the following attributes:

identifier(s) formatted as SampleLabels
Sex
age of last clinical encounter (optional) formatted as Age or None if not available.
vital status (optional) formatted as VitalStatus, which reports if the individual is alive or deceased plus (optional) age of death
HPO terms to represent the phenotype information, each HPO formatted as an instance of Phenotype
numerical measurements
disease diagnoses formatted as Disease
genotype information as one or more Variant

Note

We strongly recommend using the from_raw_parts() static constructor instead of __init__.

property age: Age | None: Get age of the individual or None if not available.

count_unique_diseases() → int[source]: Get the count of unique diseases in this individual.

count_unique_measurements() → int[source]: Get the count of unique measurements in this individual.

count_unique_phenotypes() → int[source]: Get the count of unique HPO terms (present or excluded) in this individual.

disease_by_id(term_id: str | TermId) → Disease | None[source]: Get a disease with an identifier or None if the individual has no such disease.

property diseases: Sequence[Disease]: Get the diseases the patient has (not) been diagnosed with.

excluded_diseases() → Iterator[Disease][source]: Get an iterator with diseases whose presence was excluded in the patient.

excluded_phenotypes() → Iterator[Phenotype][source]: Get an iterator over the excluded phenotypes of the patient.

static from_raw_parts(labels: str | SampleLabels, sex: Sex | None = None, age: Age | None = None, vital_status: VitalStatus | None = None, phenotypes: Iterable[Phenotype] = (), measurements: Iterable[Measurement] = (), diseases: Iterable[Disease] = (), variants: Iterable[Variant] = ()) → Patient[source]: Create Patient from the primary data.

property labels: SampleLabels: Get the sample identifiers.

measurement_by_id(term_id: str | TermId) → Measurement | None[source]

Get a measurement with an identifier or None if the individual has no such measurement.

Parameters:: term_id – a str with CURIE or a TermId representing the term ID of a measurement (e.g. LOINC:2986-8 for Testosterone[Mass/Vol]).
Returns:: the corresponding Measurement or None if not found in the patient.

property measurements: Sequence[Measurement]: Get the measurements in the patient.

property patient_id: str: Get a unique patient ID.

phenotype_by_id(term_id: str | TermId) → Phenotype | None[source]: Get a phenotype with an identifier or None if the individual has no such phenotype.

property phenotypes: Sequence[Phenotype]: Get the phenotypes observed and excluded in the patient.

present_diseases() → Iterator[Disease][source]: Get an iterator with diseases the patient was diagnosed with.

present_phenotypes() → Iterator[Phenotype][source]: Get an iterator over the present phenotypes of the patient.

property sex: Sex: Get the “phenotype sex” of the sample.

property variants: Sequence[Variant]: Get a list of variants observed in the patient.

property vital_status: VitalStatus | None: Get the vital status information for the individual or None if not available.

class gpsea.model.SampleLabels(label: str, meta_label: str | None = None)[source]

Bases: object

A data model for subject identifiers.

The subject has a mandatory label and an optional meta_label.

The identifiers support natural ordering, equality tests, and are hashable.

property label: str

label_summary() → str[source]: Summarize label and meta_label into a str where the sub-parts are inserted as <label>[<meta_label>].

property meta_label: str | None

class gpsea.model.Sex(value)[source]

Bases: Enum

Sex represents typical “phenotypic sex”, as would be determined by a midwife or physician at birth.

The definition is aligned with Phenopacket Schema

is_provided() → bool[source]: Return True if the sex is a known value, such as FEMALE or MALE.

is_unknown() → bool[source]: Return True if this is an UNKNOWN_SEX.

is_female() → bool[source]: Return True if the sex represents a FEMALE.

is_male() → bool[source]: Return True if the sex represents a MALE.

UNKNOWN_SEX = 0: Not assessed or not available. Maps to NCIT:C17998.

FEMALE = 1: Female sex. Maps to NCIT:C46113.

MALE = 2: Male sex. Maps to NCIT:C46112.

class gpsea.model.VitalStatus(status: gpsea.model._cohort.Status, age_of_death: gpsea.model._temporal.Age | None)[source]

Bases: object

property is_alive: bool

property is_deceased: bool

property is_unknown: bool

status: Status

age_of_death: Age | None

class gpsea.model.Status(value)[source]

Bases: Enum

UNKNOWN = 0

ALIVE = 1

DECEASED = 2

class gpsea.model.Age(days: float, timeline: Timeline)[source]

Bases: object

Representation of an age of an individual.

In GPSEA, we model the age at the resolution of a day. The age is either gestational or postnatal. A gestational age is created from the number of weeks and days since the last menstrual period and the postnatal age is created from years, months and days since birth.

Age overloads the comparison operators and can, thus, be compared or sorted. Gestational age is always before the postnatal age.

Internally, the age is always stored as the number of days.

DAYS_IN_MONTH = 30.4375

DAYS_IN_WEEK = 7

DAYS_IN_YEAR = 365.25

ISO8601PT = re.compile('^P(?P<year>\\d+Y)?(?P<month>\\d+M)?(?P<week>\\d+W)?(?P<day>\\d+D)?(T(\\d+H)?(\\d+M)?(\\d+S)?)?$')

static birth() → Age[source]: Age of an individual at birth.

property days: float

static from_iso8601_period(value: str) → Age[source]

Create Age from ISO8601 duration.

A value with weeks or days is parsed into a gestational age, while a value with years, months or days is parsed into a postnatal age.

An error is raised if a value for weeks and months (or years) is included at the same time or if the value is not a valid ISO8601 string.

Examples

Parse gestational age:

>>> from gpsea.model import Age
>>> gestational = Age.from_iso8601_period("P2W6D")
>>> gestational
Age(days=20.0, timeline=Timeline.GESTATIONAL)

Parse postnatal age:

>>> postnatal = Age.from_iso8601_period("P10Y")
>>> postnatal
Age(days=3652.5, timeline=Timeline.POSTNATAL)

Parameters:: value – a str with the duration (e.g. P22W3D for a gestational age or P10Y4M2D for a postnatal age).

static gestational(weeks: int, days: int = 0) → Age[source]

Create age from the weeks and days on the gestational timeline.

Parameters:

weeks – a non-negative int with the number of completed weeks of gestation.
days – an int in range $[0, 6]$ representing the number of completed gestational days.

static gestational_days(days: int | float) → Age[source]: Create a gestational age corresponding to the number of days since the last menstrual period.

property is_gestational: bool: Return True if the age is on gestational timeline.

property is_postnatal: bool: Return True if the age is on postnatal timeline.

static last_menstrual_period() → Age[source]: Age of an individual at last menstrual period.

static postnatal(years: int, months: int, days: int) → Age[source]

static postnatal_days(days: int | float) → Age[source]: Create a postnatal age corresponding to the number of days since birth.

static postnatal_years(years: int) → Age[source]

property timeline: Timeline

class gpsea.model.Timeline(value)[source]

Bases: Enum

Timeline represents the stage of temporal development of an organism.

There are two stages: gestational and postnatal.

Gestational timeline starts at the last menstrual period and ends at (but does not include) birth. The postnatal timeline starts at birth and ends at death.

The Timeline overloads comparison operators. Gestational timeline is always before postnatal timeline.

GESTATIONAL = 1: Gestational timeline starts at the last menstrual period and ends at birth (excluded).

POSTNATAL = 2: Postnatal timeline starts at birth and ends at death.

class gpsea.model.Phenotype(term_id: TermId, is_observed: bool, onset: Age | None)[source]

Bases: Identified, ObservableFeature, OnsetAware

Phenotype represents a clinical sign or symptom represented as an HPO term.

The phenotype can be either present in the patient or excluded.

static from_raw_parts(term_id: str | TermId, is_observed: bool, onset: Age | None = None) → Phenotype[source]

Create Phenotype from a term ID and observation state.

Parameters:

term_id – a str with CURIE (e.g. HP:0001250) or a TermId.
is_observed – True if the term ID was observed in patient or False if it was explicitly excluded.
onset – the Age when the phenotype was first observed in patient or None if not available.

static from_term(term: MinimalTerm, is_observed: bool)[source]

property identifier: TermId: Get the phenotype ID; an HPO term ID most of the time.

property is_observed: bool: Returns True if the phenotype was present in the respective patient.

property is_present: bool: Return True if the phenotype feature was observed in the subject or False if the feature’s presence was explicitly excluded.

property observed: bool | None

Returns a boolean for whether the phenotype is observed.

Returns:: True if this phenotype was observed in the respective patient.
Return type:: bool

property onset: Age | None: Get the age of onset of the phenotype or None if unknown or otherwise not available.

class gpsea.model.Disease(term_id: TermId, name: str, is_observed: bool, onset: Age | None)[source]

Bases: Identified, ObservableFeature, Named, OnsetAware

Representation of a disease diagnosed (or excluded) in an investigated individual.

static from_raw_parts(term_id: str | TermId, name: str, is_observed: bool, onset: Age | None = None) → Disease[source]

property identifier: TermId: Get the disease ID.

property is_present: bool: Return True if the disease was diagnosed in the individual or False if it was excluded.

property name: Get the disease label (e.g. LEIGH SYNDROME, NUCLEAR; NULS).

property onset: Age | None: Get the age of onset of the disease or None if unknown or otherwise not available.

class gpsea.model.Measurement(test_term_id: TermId, test_name: str, test_result: float, unit: TermId)[source]

Bases: Identified, Named

Representation of a GA4GH Phenopacket Measurement (numerical test result). An intended use case would be to perform a Student’s t test on numerical measurements in individuals with two difference genotype classes.

property identifier: TermId: Get the test ID, e.g. LOINC:2986-8.

property name: Get the test label (e.g. Testosterone [Mass/volume] in Serum or Plasma for LOINC 2986-8).

property test_result: float: Return True if the disease was diagnosed in the individual or False if it was excluded.

property unit: TermId: Return the unit for the test result encoded as a TermId, e.g., UCUM:ng/dL for nanogram per deciliter.

class gpsea.model.OnsetAware[source]

Bases: object

A mixin for classes which may know about their Age or onset.

For instance, the onset of a phenotype or a disease in an individual.

abstract property onset: Age | None: Get the age of onset or None if unknown or otherwise not available.

class gpsea.model.Variant(variant_info: VariantInfo, tx_annotations: Iterable[TranscriptAnnotation], genotypes: Genotypes)[source]

Bases: VariantInfoAware, FunctionalAnnotationAware, Genotyped

Variant includes three lines of information:

the variant data with coordinates or other info available for large imprecise SVs,
results of the functional annotation with respect to relevant transcripts, and
the genotypes for the known samples

static create_variant_from_scratch(variant_coordinates: VariantCoordinates, gene_name: str, trans_id: str, hgvs_cdna: str, is_preferred: bool, consequences: Iterable[VariantEffect], exons_effected: Sequence[int], protein_id: str | None, hgvsp: str | None, protein_effect_start: int | None, protein_effect_end: int | None, genotypes: Genotypes)[source]

property genotypes: Genotypes

property tx_annotations: Sequence[TranscriptAnnotation]

A collection of TranscriptAnnotations that each represent results of the functional annotation of a variant with respect to single transcript of a gene.

Returns:: A sequence of TranscriptAnnotation objects
Return type:: Sequence[TranscriptAnnotation]

property variant_info: VariantInfo: Get the representation of the variant data for sequence and symbolic variants, as well as for large imprecise SVs.

class gpsea.model.VariantClass(value)[source]

Bases: Enum

VariantClass represents a high-level variant category which mostly corresponds to the structural variant categories of the Variant Call Format specification, but includes type for single nucleotide variants (SNV) and multi-nucleotide variant (MNV).

DEL = 0

A deletion - a variant with a net loss of sequence from the alternative allele regardless of its size.

Both a deletion of 1 base pair and a deletion of 1000 base pairs are acceptable.

DUP = 1: Duplication (tandem or interspersed).

INS = 2: Insertion of a novel sequence.

INV = 3: Inversion of a chromosome segment.

MNV = 4: Multi-nucleotide variant (e.g. AG>CT) that is not a duplication, deletion, or insertion. May be called INDEL.

SNV = 5: Single nucleotide variant.

TRANSLOCATION = 6: A chromosomal translocation, which occurs when a chromosome breaks and the (typically two) fragmented pieces re-attach to different chromosomes.

class gpsea.model.VariantCoordinates(region: GenomicRegion, ref: str, alt: str, change_length: int)[source]

Bases: object

A representation of coordinates of sequence and symbolic variants.

Note, the breakend variants are not currently supported.

property alt: str

Get the alternate allele (e.g. “A”, “GG”, “<DEL>”).

The allele may be an empty string for sequence variants. The symbolic alternate allele follow the VCF notation and use the < and > characters (e.g. “<DEL>”, “<INS:ME:SINE>”).

property change_length: int

Get the change of length between the ref and alt alleles due to the variant presence.

See Change length of an allele for more info.

property chrom: str: Get the label of the chromosome/contig where the variant is located.

property end: int: Get the 0-based end coordinate (included) of the last base of the ref allele.

static from_vcf_literal(contig: Contig, pos: int, ref: str, alt: str)[source]

Create VariantCoordinates from a variant in VCF literal notation.

Note, this function must not be used to create a VCF symbolic variant (e.g. <DEL> or translocation). Use from_vcf_symbolic() instead.

Example

Create a variant from a VCF line: ` #CHROM POS ID REF ALT ... chr1 1001 . C T `

We first must decide on the genome build. Most of the time, we should use GRCh38:

>>> from gpsea.model.genome import GRCh38
>>> build = GRCh38

Then, we access the contig for 'chr1':

>>> chr1 = build.contig_by_name('chr1')

Last, we create the variant coordinates:

>>> from gpsea.model import VariantCoordinates
>>> vc = VariantCoordinates.from_vcf_literal(
...     contig=chr1, pos=1001, ref='C', alt='T',
... )

Now can test the properties:

>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length
(1000, 1001, 'C', 'T', 1, 0)

Parameters:

contig – a Contig for the chromosome
pos – a 1-based coordinate of the first base of the reference allele, as described in VCF standard
ref – a str with the REF allele. Should meet the requirements of the VCF standard.
alt – a str with the ALT allele. Should meet the requirements of the VCF standard.

static from_vcf_symbolic(contig: Contig, pos: int, end: int, ref: str, alt: str, svlen: int)[source]

Create VariantCoordinates from a variant in VCF symbolic notation.

Note, this function must not be used to create a VCF sequence/literal variant. Use from_vcf_literal() instead.

Example

Let’s create a symbolic variant from the line:

` #CHROM POS ID REF ALT QUAL FILTER INFO 2 321682 . T <DEL> 6 PASS SVTYPE=DEL;END=321887;SVLEN=-205 `

We first must decide on the genome build. Most of the time, we should use GRCh38:

>>> from gpsea.model.genome import GRCh38
>>> contig = GRCh38.contig_by_name('2')

Now, we create the coordinates as:

>>> vc = VariantCoordinates.from_vcf_symbolic(
...     contig=contig, pos=321682, end=321887,
...     ref='T', alt='<DEL>', svlen=-205,
... )

Now can test the properties:

>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length
(321681, 321887, 'T', '<DEL>', 206, -205)

Parameters:

contig – a Contig for the chromosome
pos – a 1-based coordinate of the first base of the affected reference allele region
end – a 1-based coordinate of the last base of the affected reference allele region
ref – a str with the REF allele. Most of the time, it is one of {‘N’, ‘A’, ‘C’, ‘G’, ‘T’}
alt – a str with the ALT allele, e.g. one of {‘<DEL>’, ‘<DUP>’, ‘<INS>’, ‘<INV>’}
svlen – an int with change length (the difference between ref and alt allele lengths)

is_structural() → bool[source]

Checks if the variant coordinates use structural variant notation as described by Variant Call Format (VCF).

Ane example of structural variant notation:

chr5  101 . N <DEL> .  .  SVTYPE=DEL;END=120;SVLEN=-10

as opposed to the sequence (literal) notation:

chr5  101 . NACGTACGTAC N

Returns:: True if the variant coordinates use structural variant notation.

property ref: str: Get the reference allele (e.g. “A”, “CCT”, “N”). The allele may be an empty string.

property region: GenomicRegion: Get the genomic region spanned by the ref allele.

property start: int: Get the 0-based start coordinate (excluded) of the first base of the ref allele.

property variant_class: VariantClass: Get a VariantClass category.

property variant_key: str

Get a readable representation of the variant’s coordinates.

For instance, X_12345_12345_C_G for a sequence variant or 22_10001_20000_INV for a symbolic variant. If the key is larger than 50 characters, the ‘ref’ and/or ‘alt’ (if over 10 bps) are changed to just show number of bps. Example: X_1000001_1000027_TAAAAAAAAAAAAAAAAAAAAAAAAAA_T -> X_1000001_1000027_--27bp--_T

Note

Both start and end coordinates use 1-based (included) coordinate system.

class gpsea.model.ImpreciseSvInfo(structural_type: TermId, variant_class: VariantClass, gene_id: str, gene_symbol: str)[source]

Bases: object

Data regarding a structural variant (SV) with imprecise breakpoint coordinates.

property gene_id: str: Get a str with gene identifier CURIE (e.g. HGNC:3603) or None if the identifier is not available.

property gene_symbol: str: Get a str with HGVS gene symbol (e.g. FBN1) or None if the symbol is not available.

property structural_type: TermId: Get term ID of the structural type (e.g. SO:1000029 for chromosomal deletion).

property variant_class: VariantClass: Get a VariantClass category.

property variant_key: str: Get a readable representation of the variant.

class gpsea.model.VariantInfo(variant_coordinates: VariantCoordinates | None = None, sv_info: ImpreciseSvInfo | None = None)[source]

Bases: object

VariantInfo consists of either variant coordinates or imprecise SV data.

The class is conceptually similar to Rust enum - only one of the fields can be set at any point in time.

has_sv_info() → bool[source]: Returns True if the variant is a large imprecise SV and the exact coordinates are thus unavailable.

has_variant_coordinates() → bool[source]: Returns True if the variant coordinates are available.

is_structural() → bool[source]

Test if the variant is a structural variant.

This can either be because the variant coordinates use the structural variant notation (see VariantCoordinates.is_structural()) or if the variant is large imprecise SV.

property sv_info: ImpreciseSvInfo | None: Get information about large imprecise SV.

property variant_class: VariantClass: Get a VariantClass category.

property variant_coordinates: VariantCoordinates | None: Get variant coordinates if available.

property variant_key: str: Get a readable representation of the variant’s coordinates or the large SV info.

class gpsea.model.VariantInfoAware[source]

Bases: object

An entity where VariantInfo is available.

abstract property variant_info: VariantInfo: Get the variant data with coordinates or other info available for large imprecise SVs.

class gpsea.model.Genotype(value)[source]

Bases: Enum

Genotype represents state of a variable locus in a diploid genome.

property code: str

NO_CALL = ('.',)

HOMOZYGOUS_REFERENCE = ('0/0',)

HETEROZYGOUS = ('0/1',)

HOMOZYGOUS_ALTERNATE = ('1/1',)

HEMIZYGOUS = ('1',)

class gpsea.model.Genotypes(samples: Iterable[SampleLabels], genotypes: Iterable[Genotype])[source]

Bases: Sized, Iterable

Genotypes is a container for mapping between sample ID and its genotype.

Let’s consider a pair of samples:

>>> a = SampleLabels('A')
>>> b = SampleLabels('B')

We can use one of the static methods to create an instance. Either a single genotype:

>>> gt = Genotypes.single(a, Genotype.HETEROZYGOUS)

or genotypes of several samples:

>>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})

There are 2 genotypes in the container:

>>> len(gts)
2

You can get a genotype for a sample ID:

>>> g = gts.for_sample(a)
>>> g.code
'0/1'

You will get None if the sample is not present:

>>> gts.for_sample(SampleLabels('UNKNOWN'))

You can iterate over sample-genotype pairs:

>>> for sample_id, genotype in gts:
...   print(sample_id, genotype)
A 0/1
B 1/1

static empty() → Genotypes[source]

for_sample(sample_id: SampleLabels) → Genotype | None[source]

Get a genotype for a sample or None if the genotype is not present.

Parameters:: sample_id – a SampleLabels with sample’s identifier.

static from_mapping(mapping: Mapping[SampleLabels, Genotype]) → Genotypes[source]

Create Genotypes from mapping between sample IDs and genotypes.

>>> a = SampleLabels('A')
>>> b = SampleLabels('B')
>>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})

>>> assert len(gts) == 2

static single(sample_id: SampleLabels, genotype: Genotype) → Genotypes[source]

A shortcut for creating Genotypes for a single sample:

>>> a = SampleLabels('A')
>>> gts = Genotypes.single(a, Genotype.HOMOZYGOUS_ALTERNATE)

>>> assert len(gts) == 1
>>> assert gts.for_sample(a) == Genotype.HOMOZYGOUS_ALTERNATE

class gpsea.model.Genotyped[source]

Bases: object

Genotyped entities

genotype_for_sample(sample_id: SampleLabels) → Genotype | None[source]

Get a genotype for a sample or None if the genotype is not present.

Parameters:: sample_id – a SampleLabels with sample’s identifier.

abstract property genotypes: Genotypes

class gpsea.model.TranscriptAnnotation(gene_id: str, tx_id: str, hgvs_cdna: str | None, is_preferred: bool, variant_effects: Iterable[VariantEffect], affected_exons: Iterable[int] | None, protein_id: str | None, hgvsp: str | None, protein_effect_coordinates: Region | None)[source]

Bases: TranscriptInfoAware

TranscriptAnnotation represent a result of the functional annotation of a variant with respect to single transcript of a gene.

property gene_id: str: Get the HGVS symbol of the affected gene (e.g. SURF1).

property hgvs_cdna: str | None: Get the HGVS description of the sequence variant (e.g. NM_123456.7:c.9876G>T) or None if not available.

property hgvsp: str | None: Get the HGVS description of the protein sequence variant (e.g. NP_001027559.1:p.Gln421Ter) or None if not available.

property is_preferred: bool: Return True if the transcript is the preferred transcript of a gene, such as MANE transcript, canonical Ensembl transcript.

property overlapping_exons: Sequence[int] | None: Get a sequence of 1-based exon indices (the index of the 1st exon is 1) that overlap with the variant.

property protein_effect_location: Region | None: Get the Region with start and end amino-acid coordinates affected by the variant.

property protein_id: str | None: Get the ID of the protein encoded by the transcript_id or None if the transcript is not protein-coding.

property transcript_id: str: Get the RefSeq identifier of the transcript (e.g. NM_123456.7).

property variant_effects: Sequence[VariantEffect]: Get a sequence of the predicted functional variant effects.

class gpsea.model.VariantEffect(value)[source]

Bases: Enum

VariantEffect represents consequences of a variant on transcript that are supported by GPSEA.

>>> from gpsea.model import VariantEffect
>>> missense = VariantEffect.MISSENSE_VARIANT
>>> print(missense)
missense_variant

The VariantEffect has a curie attribute that represents the ontology class from Sequence Ontology.

>>> missense.curie
'SO:0001583'

to_display() → str[source]

Get a concise name of the variant effect that is suitable for showing to humans.

Example

>>> from gpsea.model import VariantEffect
>>> VariantEffect.MISSENSE_VARIANT.to_display()
'missense'
>>> VariantEffect.SPLICE_DONOR_5TH_BASE_VARIANT.to_display()
'splice donor 5th base'

Returns:: a str with the name or ‘n/a’ if the variant effect was not assigned a concise name.

static structural_so_id_to_display(so_term: TermId | str) → str[source]

Get a str with a concise name for representing a Sequence Ontology (SO) term identifier.

Example

>>> from gpsea.model import VariantEffect
>>> VariantEffect.structural_so_id_to_display('SO:1000029')
'chromosomal deletion'

Parameters:: so_term – a CURIE str or a TermId with the query SO term.
Returns:: a str with the concise name for the SO term or ‘n/a’ if a name has not been assigned yet.

property curie: str: Get a compact URI (CURIE) of the variant effect (e.g. SO:0001583 for a missense variant).

TRANSCRIPT_AMPLIFICATION = 'SO:0001889'

TRANSCRIPT_ABLATION = 'SO:0001893'

TRANSCRIPT_TRANSLOCATION = 'SO:0001883'

SPLICE_ACCEPTOR_VARIANT = 'SO:0001574'

SPLICE_DONOR_VARIANT = 'SO:0001575'

STOP_GAINED = 'SO:0001587'

FRAMESHIFT_VARIANT = 'SO:0001589'

STOP_LOST = 'SO:0001578'

START_LOST = 'SO:0002012'

INFRAME_INSERTION = 'SO:0001821'

INFRAME_DELETION = 'SO:0001822'

MISSENSE_VARIANT = 'SO:0001583'

PROTEIN_ALTERING_VARIANT = 'SO:0001818'

SPLICE_REGION_VARIANT = 'SO:0001630'

SPLICE_DONOR_5TH_BASE_VARIANT = 'SO:0001787'

SPLICE_DONOR_REGION_VARIANT = 'SO:0002170'

SPLICE_POLYPYRIMIDINE_TRACT_VARIANT = 'SO:0002169'

INCOMPLETE_TERMINAL_CODON_VARIANT = 'SO:0001626'

START_RETAINED_VARIANT = 'SO:0002019'

STOP_RETAINED_VARIANT = 'SO:0001567'

SYNONYMOUS_VARIANT = 'SO:0001819'

CODING_SEQUENCE_VARIANT = 'SO:0001580'

MATURE_MIRNA_VARIANT = 'SO:0001620'

FIVE_PRIME_UTR_VARIANT = 'SO:0001623'

THREE_PRIME_UTR_VARIANT = 'SO:0001624'

NON_CODING_TRANSCRIPT_EXON_VARIANT = 'SO:0001792'

INTRON_VARIANT = 'SO:0001627'

NMD_TRANSCRIPT_VARIANT = 'SO:0001621'

NON_CODING_TRANSCRIPT_VARIANT = 'SO:0001619'

UPSTREAM_GENE_VARIANT = 'SO:0001631'

DOWNSTREAM_GENE_VARIANT = 'SO:0001632'

TFBS_ABLATION = 'SO:0001895'

TFBS_AMPLIFICATION = 'SO:0001892'

TF_BINDING_SITE_VARIANT = 'SO:0001782'

REGULATORY_REGION_ABLATION = 'SO:0001894'

REGULATORY_REGION_AMPLIFICATION = 'SO:0001891'

FEATURE_ELONGATION = 'SO:0001907'

REGULATORY_REGION_VARIANT = 'SO:0001566'

FEATURE_TRUNCATION = 'SO:0001906'

INTERGENIC_VARIANT = 'SO:0001628'

SEQUENCE_VARIANT = 'SO:0001060'

class gpsea.model.TranscriptInfoAware[source]

Bases: object

The implementors know about basic gene/transcript identifiers.

abstract property gene_id: str: Returns: string: The gene symbol (e.g. SURF1)

abstract property transcript_id: str: Returns: string: The transcript RefSeq identifier (e.g. NM_123456.7)

class gpsea.model.FunctionalAnnotationAware[source]

Bases: object

get_hgvs_cdna_by_tx_id(transcript_id: str) → str | None[source]

Given a transcript ID, will return the hgvs cdna string associated with that variant and transcript.

Parameters:: transcript_id (str) – A transcript ID - i.e. ‘NM_170707.4’
Returns:: The hgvs cdna if available - i.e. ‘NM_170707.4:c.1824C>T’
Return type:: str or None

get_preferred_tx_annotation() → TranscriptAnnotation | None[source]

Get the TranscriptAnnotation that represents the result of the functional annotation with respect to the preferred transcript of a gene.

Returns None if transcript annotations is no preferred transcript found.

Returns:

The TranscriptAnnotation with respect: to the preferred transcript or None if the preferred transcript info is not available.

Return type:

Optional[TranscriptAnnotation]

get_tx_anno_by_tx_id(transcript_id: str) → TranscriptAnnotation | None[source]

Given a transcript ID, this will return the TranscriptAnnotation associated with that variant and transcript.

Parameters:: transcript_id (str) – A transcript ID - i.e. ‘NM_170707.4’
Returns:: The Transcript Annotation if available.
Return type:: Optional[TranscriptAnnotation]

abstract property tx_annotations: Sequence[TranscriptAnnotation]

class gpsea.model.TranscriptCoordinates(identifier: str, region: GenomicRegion, exons: Iterable[GenomicRegion], cds_start: int | None, cds_end: int | None, is_preferred: bool | None = None)[source]

Bases: object

TranscriptCoordinates knows about genomic region of the transcript, exonic/intronic regions, as well as the coding and non-coding regions.

If both CDS coordinates are None, then the transcript coordinates are assumed to represent a non-coding transcript.

property cds_end: int | None: Get the 0-based (included) end coordinate of the last base of the termination codon of the transcript or None if the transcript is not coding.

property cds_start: int | None: Get the 0-based (excluded) start coordinate of the first base of the start codon of the transcript or None if the transcript is not coding.

property exons: Sequence[GenomicRegion]: Get the exon regions.

get_cds_regions() → Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to coding regions of the transcript, including BOTH the initiation and termination codons.

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

get_coding_base_count() → int | None[source]: Calculate the number of coding bases present in the transcript. Note, the count does NOT include the termination codon since it does not code for an aminoacid. Returns: an int with the coding base count or None if the transcript is non-coding.

get_codon_count() → int | None[source]

Calculate the count of codons present in the transcript. Note, the count does NOT include the termination codon!

Returns: the number of codons of the transcript or None if the transcript is non-coding.

get_five_prime_utrs() → Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to 5’ untranslated regions of the transcript.

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

get_three_prime_utrs() → Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to 3’ untranslated regions of the transcript.

Note, the termination codon is NOT included in the regions!

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

property identifier: str: Get transcript’s identifier (e.g. ENST00000123456.7, NM_123456.7).

is_coding() → bool[source]: Return True if the transcript is coding/translated.

property is_preferred: bool | None

Check if the transcript belongs among the preferred transcripts of the gene. This usually means that the transcript was chosen by the MANE project or similar.

Returns None if the info is not available.

property region: GenomicRegion: Get the genomic region spanned by the transcript, corresponding to 5’UTR, exonic, intronic, and 3’UTR regions.

class gpsea.model.ProteinMetadata(protein_id: str, label: str, protein_features: Iterable[ProteinFeature], protein_length: int)[source]

Bases: object

An info regarding a protein sequence, including an ID, a label, and location of protein features, such as motifs, domains, or other regions.

The information is usually retrieved from a resource such as UniprotProteinMetadataService, but it can also be created manually using from_feature_frame() function.

Example

Let’s create a protein info with a domain and a region. We must provide protein accession ID, a label, a data frame with protein features, and the number of aminoacids of the protein sequence:

>>> protein_id = 'NP_000129.3'
>>> label = 'fibrillin-1 isoform a preproprotein'
>>> protein_length = 1000

Now let’s prepare a data frame with the protein features. We will prepare a domain and a region:

>>> import pandas as pd
>>> features = [
...     {
...         "region": "Suppresor domain",
...         "category": "domain",
...         "start": 1,
...         "end": 223,
...     },
...     {
...         "region": "IP3 binding",
...         "category": "region",
...         "start": 224,
...         "end": 578,
...     },
... ]
>>> df = pd.DataFrame(features)

last, we can put the protein info together:

>>> from gpsea.model import ProteinMetadata
>>> protein_meta = ProteinMetadata.from_feature_frame(
...     protein_id=protein_id,
...     label=label,
...     features=df,
...     protein_length=protein_length,
... )

and get the expected protein info:

>>> protein_meta.protein_id
'NP_000129.3'
>>> protein_meta.label
'fibrillin-1 isoform a preproprotein'
>>> len(protein_meta.protein_features)
2

domains() → Iterable[ProteinFeature][source]

Returns:: A subgroup of the protein features that correspond to protein domains.
Return type:: Iterable[ProteinFeature]

static from_feature_frame(protein_id: str, label: str, features: DataFrame, protein_length: int) → ProteinMetadata[source]

Create ProteinMetadata from a user-supplied pandas DataFrame. We expect to obtain the gene symbol, protein identifier, and regions

The DataFrame should include the following columns:

region	category	start \| end
Suppresor domain	domain	1	223
IP3 binding	region	224	578

The region column includes the protein feature name. The category is a str representing region’s feature type. Use ‘region’ if no better type fits. Last, start and end denote 1-based start and end coordinates of the aminoacid sequence region described by the feature. For instance, [1, 10] for the first ten aminoacids of the protein.

Parameters:

protein_id – the accession id of the protein, e.g. ‘NP_000129.3’.
label – human-readable label, e.g. ‘fibrillin-1 isoform a preproprotein’.
features – a dataframe with of the protein features.
protein_length – a positive int representing the number of aminoacids included in the protein sequence.

Raises:

ValueError – if case of issues during parsing the provided data.

static from_uniprot_json(protein_id: str, label: str, uniprot_json: IOBase | str, protein_length: int) → ProteinMetadata[source]

Create ProteinMetadata from a json file that has been downloaded from UniProt.

Go to the UniProt page for the protein of interest, then go to the section “Family & Domains”, and the subsection “Features”. Click on the Download symbol. You will be presented with a JSON file for download. From this, we extract information about the gene symbol, protein identifier, and regions. This method is intended to be a backup if the API call to UniProt fails; the same information should be retrieved. See the test file “test_uniprot_json.py” for details about the JSON parsing etc.

Parameters:

protein_id – the accession id of the protein, e.g. ‘NP_000129.3’.
label – human-readable label, e.g. ‘fibrillin-1 isoform a preproprotein’.
uniprot_json – a str with the path or an IO object with the Uniprot JSON data.
protein_length – a positive int representing the number of aminoacids included in the protein sequence.

Raises:

ValueError – if case of issues during parsing the provided data.

get_features_variant_overlaps(region: Region) → Collection[ProteinFeature][source]

Get a collection of protein features that overlap with the region. :param region: the query region.

Returns:: a collection of overlapping protein features.
Return type:: Collection[ProteinFeature]

property label: str: Get the protein label, e.g. fibrillin-1 isoform a preproprotein.

motifs() → Iterable[ProteinFeature][source]

Returns:: A subgroup of the protein features that correspond to motifs.
Return type:: Iterable[ProteinFeature]

property protein_features: Sequence[ProteinFeature]: Get a sequence of protein features.

property protein_id: str: Get the protein’s accession ID, e.g. NP_000129.3.

property protein_length: int: Get the number of aminoacids of the protein sequence.

regions() → Iterable[ProteinFeature][source]

Returns:: A subgroup of the protein features that correspond to generic regions.
Return type:: Iterable[ProteinFeature]

repeats() → Iterable[ProteinFeature][source]

Returns:: A subgroup of the protein features that correspond to repeat regions.
Return type:: Iterable[ProteinFeature]

class gpsea.model.ProteinFeature[source]

Bases: object

static create(info: FeatureInfo, feature_type: str) → ProteinFeature[source]

abstract property feature_type: str

abstract property info: FeatureInfo

to_string() → str[source]

class gpsea.model.FeatureInfo(name: str, region: Region)[source]

Bases: object

FeatureInfo represents a protein feature (e.g. a repeated sequence given the name “ANK 1” in protein “Ankyrin repeat domain-containing protein 11”)

property end: int: Returns: integer: A 0-based (included) end coordinate of the protein feature.

property name: str: Returns: string: the name of the protein feature

overlaps_with(region: Region) → bool[source]: Covenience function to check whether a region overlaps with a protein feature.

property region: Region: Returns: Region: a protein region spanned by the feature.

property start: int: Returns: integer: A 0-based (excluded) start coordinate of the protein feature.

class gpsea.model.FeatureType(value)[source]

Bases: Enum

An enum representing the protein feature types supported in GPSEA.

Use from_string() to parse a feature from Uniprot:

>>> from gpsea.model import FeatureType
>>> val = "Dna Binding"
>>> feature_type = FeatureType.from_string(val)
>>> feature_type.name
'DNA_BINDING'

static from_string(category: str) → FeatureType[source]

static deprecation_warning()[source]

REPEAT = 1: A repeated sequence motif or repeated domain within the protein.

MOTIF = 2: A short (usually not more than 20 amino acids) conserved sequence motif of biological significance.

DOMAIN = 3: A specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.

COILED_COIL = 4: a structural motif in proteins, characterized by two or more α-helices wrapped around each other in a supercoil. This structure is often involved in protein-protein interactions

COMPOSITIONAL_BIAS = 5: Compositional bias refers to a region in a protein where certain amino acids are overrepresented compared to the rest of the protein or compared to typical protein composition. These regions tend to have a non-random distribution of amino acids, often leading to specific structural or functional properties.

REGION = 6: A region of interest that cannot be described in other subsections.

ZINC_FINGER = 7: A zinc finger is a small, functional, independently folded domain that coordinates one or more zinc ions to stabilize its structure through cysteine and/or histidine residues.

TOPOLOGICAL_DOMAIN = 8: non-membrane region of a membrane-spanning protein

TRANSMEMBRANE = 9: Section of a protein that goes through the membrane of the cell or an organelle

DNA_BINDING = 10: A region with a known DNA binding activity.

Subpackages

gpsea.model.genome package