gpsea.model package

The gpsea.model package defines data model classes used in GPSEA. We start with the top-level elements, such as Cohort and Patient, and we follow with data classes for phenotype, genotype, transcript, and protein info.

class gpsea.model.Cohort(members: Iterable[Patient], excluded_member_count: int)[source]

Bases: Sized, Iterable[Patient]

static from_patients(members: Iterable[Patient], include_patients_with_no_HPO: bool = False, include_patients_with_no_variants: bool = False)[source]

Create a cohort from a sequence of patients.

property all_patients: Collection[Patient]

Get a collection of all patients in the cohort.

all_phenotypes() Set[Phenotype][source]

Get a set of all phenotypes (observed or excluded) in the cohort members.

all_diseases() Set[Disease][source]

Get a set of all diseases (observed or excluded) in the cohort members.

all_variants() Set[Variant][source]

Get a set of all variants observed in the cohort members.

property all_transcript_ids: Set[str]

Get a set of all transcript IDs affected by the cohort variants.

property total_patient_count

Get the total number of cohort members.

get_patient_ids() Set[str][source]

Get a set of the patient IDs.

list_present_phenotypes(top: int | None = None) Sequence[Tuple[str, int]][source]

Get a sequence with counts of HPO terms used as direct annotations of the cohort members.

Parameters:

typing.Optional[int] (top) – If not given, lists all present phenotypes. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples, formatted (phenotype CURIE,

number of patients with that phenotype)

Return type:

Sequence[Tuple[str, int]]

list_all_diseases(top=None) Sequence[Tuple[TermId, int]][source]
list_all_variants(top=None) Sequence[Tuple[str, int]][source]
Parameters:

typing.Optional[int] (top) – If not given, lists all variants. Otherwise, lists only the top highest counts

Returns:

A sequence of tuples, formatted (variant key, number of patients with that variant)

Return type:

list

list_all_proteins(top=None) Sequence[Tuple[str, int]][source]
Parameters:

typing.Optional[int] (top) – If not given, lists all proteins. Otherwise, lists only the top highest counts.

Returns:

A list of tuples, formatted (protein ID string, the count of variants that affect the protein)

Return type:

list

variant_effect_count_by_tx(tx_id: str | None = None) Mapping[str, Mapping[str, int]][source]

Count variant effects for all transcripts or for a transcript tx_id of choice.

Parameters:

tx_id – a str with transcript accession (e.g. NM_123456.5) or None if all transcripts should be listed.

Returns:

Each transcript ID references a Counter(), with the variant effect as the key

and the count of variants with that effect on the transcript id.

Return type:

mapping

get_excluded_count() int[source]
get_variant_by_key(variant_key) Variant[source]
class gpsea.model.Patient(labels: SampleLabels, sex: Sex, phenotypes: Iterable[Phenotype], diseases: Iterable[Disease], variants: Iterable[Variant])[source]

Bases: object

Patient represents a single investigated individual.

Note

We strongly recommend using the from_raw_parts() static constructor instead of __init__.

static from_raw_parts(labels: SampleLabels, sex: Sex | None, phenotypes: Iterable[Phenotype], diseases: Iterable[Disease], variants: Iterable[Variant]) Patient[source]

Create Patient from the primary data.

property patient_id: str

Get a unique patient ID.

property labels: SampleLabels

Get the sample identifiers.

property sex: Sex

Get the “phenotype sex” of the sample.

property phenotypes: Sequence[Phenotype]

Get the phenotypes observed and excluded in the patient.

property diseases: Sequence[Disease]

Get the diseases the patient has (not) been diagnosed with.

property variants: Sequence[Variant]

Get a list of variants observed in the patient.

present_phenotypes() Iterator[Phenotype][source]

Get an iterator over the present phenotypes of the patient.

excluded_phenotypes() Iterator[Phenotype][source]

Get an iterator over the excluded phenotypes of the patient.

present_diseases() Iterator[Disease][source]

Get an iterator with diseases the patient was diagnosed with.

excluded_diseases() Iterator[Disease][source]

Get an iterator with diseases whose presence was excluded in the patient.

class gpsea.model.SampleLabels(label: str, meta_label: str | None = None)[source]

Bases: object

A data model for subject identifiers.

The subject has a mandatory label and an optional meta_label.

The identifiers support natural ordering, equality tests, and are hashable.

property label: str
property meta_label: str | None
label_summary() str[source]

Summarize label and meta_label into a str where the sub-parts are inserted as <label>[<meta_label>].

class gpsea.model.Sex(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Sex represents typical “phenotypic sex”, as would be determined by a midwife or physician at birth.

The definition is aligned with Phenopacket Schema

UNKNOWN_SEX = 0

Not assessed or not available. Maps to NCIT:C17998.

FEMALE = 1

Female sex. Maps to NCIT:C46113.

MALE = 2

Male sex. Maps to NCIT:C46112.

is_provided() bool[source]

Return True if the sex is a known value, such as FEMALE or MALE.

is_unknown() bool[source]

Return True if this is an UNKNOWN_SEX.

is_female() bool[source]

Return True if the sex represents a FEMALE.

is_male() bool[source]

Return True if the sex represents a MALE.

class gpsea.model.Phenotype(term_id: TermId, is_observed: bool)[source]

Bases: Identified, ObservableFeature

Phenotype represents a clinical sign or symptom represented as an HPO term.

The phenotype can be either present in the patient or excluded.

static from_term(term: MinimalTerm, is_observed: bool)[source]
static from_raw_parts(term_id: str | TermId, is_observed: bool) Phenotype[source]

Create Phenotype from a term ID and observation state.

Parameters:
  • term_id – a str with CURIE (e.g. HP:0001250) or a TermId.

  • is_observedTrue if the term ID was observed in patient or False if it was explicitly excluded.

property identifier: TermId

Get the phenotype ID; an HPO term ID most of the time.

property is_present: bool

Return True if the phenotype feature was observed in the subject or False if the feature’s presence was explicitly excluded.

property observed: bool | None

Returns a boolean for whether the phenotype is observed.

Returns:

True if this phenotype was observed in the respective patient.

Return type:

boolean

property is_observed: bool

Returns True if the phenotype was present in the respective patient.

class gpsea.model.Disease(term_id: TermId, name: str, is_observed: bool)[source]

Bases: Identified, ObservableFeature, Named

Representation of a disease diagnosed (or excluded) in an investigated individual.

property identifier: TermId

Get the disease ID.

property name

Get the disease label (e.g. LEIGH SYNDROME, NUCLEAR; NULS).

property is_present: bool

Return True if the disease was diagnosed in the individual or False if it was excluded.

class gpsea.model.Variant(variant_info: VariantInfo, tx_annotations: Iterable[TranscriptAnnotation], genotypes: Genotypes)[source]

Bases: VariantInfoAware, FunctionalAnnotationAware, Genotyped

Variant includes three lines of information:
  • the variant data with coordinates or other info available for large imprecise SVs,

  • results of the functional annotation with respect to relevant transcripts, and

  • the genotypes for the known samples

static create_variant_from_scratch(variant_coordinates: VariantCoordinates, gene_name: str, trans_id: str, hgvs_cdna: str, is_preferred: bool, consequences: Iterable[VariantEffect], exons_effected: Sequence[int], protein_id: str | None, hgvsp: str | None, protein_effect_start: int | None, protein_effect_end: int | None, genotypes: Genotypes)[source]
property variant_info: VariantInfo

Get the representation of the variant data for sequence and symbolic variants, as well as for large imprecise SVs.

property tx_annotations: Sequence[TranscriptAnnotation]

A collection of TranscriptAnnotations that each represent results of the functional annotation of a variant with respect to single transcript of a gene.

Returns:

A sequence of TranscriptAnnotation objects

Return type:

Sequence[TranscriptAnnotation]

property genotypes: Genotypes
class gpsea.model.VariantClass(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

VariantClass represents a high-level variant category which mostly corresponds to the structural variant categories of the Variant Call Format specification, but includes type for single nucleotide variants (SNV) and multi-nucleotide variant (MNV).

DEL = 0

A deletion - a variant with a net loss of sequence from the alternative allele regardless of its size.

Both a deletion of 1 base pair and a deletion of 1000 base pairs are acceptable.

DUP = 1

Duplication (tandem or interspersed).

INS = 2

Insertion of a novel sequence.

INV = 3

Inversion of a chromosome segment.

MNV = 4

Multi-nucleotide variant (e.g. AG>CT) that is not a duplication, deletion, or insertion. May be called INDEL.

SNV = 5

Single nucleotide variant.

BND = 6

A breakend.

class gpsea.model.VariantCoordinates(region: GenomicRegion, ref: str, alt: str, change_length: int)[source]

Bases: object

A representation of coordinates of sequence and symbolic variants.

Note, the breakend variants are not currently supported.

static from_vcf_literal(contig: Contig, pos: int, ref: str, alt: str)[source]

Create VariantCoordinates from a variant in VCF literal notation.

Note, this function must not be used to create a VCF symbolic variant (e.g. <DEL> or translocation). Use from_vcf_symbolic() instead.

Example

Create a variant from a VCF line: ` #CHROM  POS     ID  REF ALT ... chr1    1001    .   C   T `

We first must decide on the genome build. Most of the time, we should use GRCh38:

>>> from gpsea.model.genome import GRCh38
>>> build = GRCh38

Then, we access the contig for 'chr1':

>>> chr1 = build.contig_by_name('chr1')

Last, we create the variant coordinates:

>>> from gpsea.model import VariantCoordinates
>>> vc = VariantCoordinates.from_vcf_literal(
...     contig=chr1, pos=1001, ref='C', alt='T',
... )

Now can test the properties:

>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length
(1000, 1001, 'C', 'T', 1, 0)
Parameters:
  • contig – a Contig for the chromosome

  • pos – a 1-based coordinate of the first base of the reference allele, as described in VCF standard

  • ref – a str with the REF allele. Should meet the requirements of the VCF standard.

  • alt – a str with the ALT allele. Should meet the requirements of the VCF standard.

static from_vcf_symbolic(contig: Contig, pos: int, end: int, ref: str, alt: str, svlen: int)[source]

Create VariantCoordinates from a variant in VCF symbolic notation.

Note, this function must not be used to create a VCF sequence/literal variant. Use from_vcf_literal() instead.

Example

Let’s create a symbolic variant from the line:

` #CHROM   POS      ID   REF   ALT     QUAL   FILTER   INFO 2        321682   .    T     <DEL>   6      PASS     SVTYPE=DEL;END=321887;SVLEN=-205 `

We first must decide on the genome build. Most of the time, we should use GRCh38:

>>> from gpsea.model.genome import GRCh38
>>> contig = GRCh38.contig_by_name('2')

Now, we create the coordinates as:

>>> vc = VariantCoordinates.from_vcf_symbolic(
...     contig=contig, pos=321682, end=321887,
...     ref='T', alt='<DEL>', svlen=-205,
... )

Now can test the properties:

>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length
(321681, 321887, 'T', '<DEL>', 206, -205)
Parameters:
  • contig – a Contig for the chromosome

  • pos – a 1-based coordinate of the first base of the affected reference allele region

  • end – a 1-based coordinate of the last base of the affected reference allele region

  • ref – a str with the REF allele. Most of the time, it is one of {‘N’, ‘A’, ‘C’, ‘G’, ‘T’}

  • alt – a str with the ALT allele, e.g. one of {‘<DEL>’, ‘<DUP>’, ‘<INS>’, ‘<INV>’}

  • svlen – an int with change length (the difference between ref and alt allele lengths)

property chrom: str

Get the label of the chromosome/contig where the variant is located.

property start: int

Get the 0-based start coordinate (excluded) of the first base of the ref allele.

property end: int

Get the 0-based end coordinate (included) of the last base of the ref allele.

property region: GenomicRegion

Get the genomic region spanned by the ref allele.

property ref: str

Get the reference allele (e.g. “A”, “CCT”, “N”). The allele may be an empty string.

property alt: str

Get the alternate allele (e.g. “A”, “GG”, “<DEL>”).

The allele may be an empty string for sequence variants. The symbolic alternate allele follow the VCF notation and use the < and > characters (e.g. “<DEL>”, “<INS:ME:SINE>”).

property change_length: int

Get the change of length between the ref and alt alleles due to the variant presence.

See Change length of an allele for more info.

property variant_key: str

Get a readable representation of the variant’s coordinates.

For instance, X_12345_12345_C_G for a sequence variant or 22_10001_20000_INV for a symbolic variant. If the key is larger than 50 characters, the ‘ref’ and/or ‘alt’ (if over 10 bps) are changed to just show number of bps. Example: X_1000001_1000027_TAAAAAAAAAAAAAAAAAAAAAAAAAA_T -> X_1000001_1000027_--27bp--_T

Note

Both start and end coordinates use 1-based (included) coordinate system.

property variant_class: VariantClass

Get a VariantClass category.

is_structural() bool[source]

Checks if the variant coordinates use structural variant notation as described by Variant Call Format (VCF).

Ane example of structural variant notation:

chr5  101 . N <DEL> .  .  SVTYPE=DEL;END=120;SVLEN=-10

as opposed to the sequence (literal) notation:

chr5  101 . NACGTACGTAC N
Returns:

True if the variant coordinates use structural variant notation.

class gpsea.model.ImpreciseSvInfo(structural_type: TermId, variant_class: VariantClass, gene_id: str, gene_symbol: str)[source]

Bases: object

Data regarding a structural variant (SV) with imprecise breakpoint coordinates.

property structural_type: TermId

Get term ID of the structural type (e.g. SO:1000029 for chromosomal deletion).

property variant_class: VariantClass

Get a VariantClass category.

property gene_id: str

Get a str with gene identifier CURIE (e.g. HGNC:3603) or None if the identifier is not available.

property gene_symbol: str

Get a str with HGVS gene symbol (e.g. FBN1) or None if the symbol is not available.

property variant_key: str

Get a readable representation of the variant.

class gpsea.model.VariantInfo(variant_coordinates: VariantCoordinates | None = None, sv_info: ImpreciseSvInfo | None = None)[source]

Bases: object

VariantInfo consists of either variant coordinates or imprecise SV data.

The class is conceptually similar to Rust enum - only one of the fields can be set at any point in time.

property variant_coordinates: VariantCoordinates | None

Get variant coordinates if available.

has_variant_coordinates() bool[source]

Returns True if the variant coordinates are available.

property sv_info: ImpreciseSvInfo | None

Get information about large imprecise SV.

has_sv_info() bool[source]

Returns True if the variant is a large imprecise SV and the exact coordinates are thus unavailable.

property variant_key: str

Get a readable representation of the variant’s coordinates or the large SV info.

property variant_class: VariantClass

Get a VariantClass category.

is_structural() bool[source]

Test if the variant is a structural variant.

This can either be because the variant coordinates use the structural variant notation (see VariantCoordinates.is_structural()) or if the variant is large imprecise SV.

class gpsea.model.VariantInfoAware[source]

Bases: object

An entity where VariantInfo is available.

abstract property variant_info: VariantInfo

Get the variant data with coordinates or other info available for large imprecise SVs.

class gpsea.model.Genotype(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Genotype represents state of a variable locus in a diploid genome.

NO_CALL = ('.',)
HOMOZYGOUS_REFERENCE = ('0/0',)
HETEROZYGOUS = ('0/1',)
HOMOZYGOUS_ALTERNATE = ('1/1',)
HEMIZYGOUS = ('1',)
property code: str
class gpsea.model.Genotypes(samples: Iterable[SampleLabels], genotypes: Iterable[Genotype])[source]

Bases: Sized, Iterable

Genotypes is a container for mapping between sample ID and its genotype.

Let’s consider a pair of samples:

>>> a = SampleLabels('A')
>>> b = SampleLabels('B')

We can use one of the static methods to create an instance. Either a single genotype:

>>> gt = Genotypes.single(a, Genotype.HETEROZYGOUS)

or genotypes of several samples:

>>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})

There are 2 genotypes in the container:

>>> len(gts)
2

You can get a genotype for a sample ID:

>>> g = gts.for_sample(a)
>>> g.code
'0/1'

You will get None if the sample is not present:

>>> gts.for_sample(SampleLabels('UNKNOWN'))

You can iterate over sample-genotype pairs:

>>> for sample_id, genotype in gts:
...   print(sample_id, genotype)
A 0/1
B 1/1
static empty()[source]
static single(sample_id: SampleLabels, genotype: Genotype)[source]

A shortcut for creating Genotypes for a single sample:

>>> a = SampleLabels('A')
>>> gts = Genotypes.single(a, Genotype.HOMOZYGOUS_ALTERNATE)
>>> assert len(gts) == 1
>>> assert gts.for_sample(a) == Genotype.HOMOZYGOUS_ALTERNATE
static from_mapping(mapping: Mapping[SampleLabels, Genotype])[source]

Create Genotypes from mapping between sample IDs and genotypes.

>>> a = SampleLabels('A')
>>> b = SampleLabels('B')
>>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})
>>> assert len(gts) == 2
for_sample(sample_id: SampleLabels) Genotype | None[source]

Get a genotype for a sample or None if the genotype is not present.

Parameters:

sample_id – a SampleLabels with sample’s identifier.

class gpsea.model.Genotyped[source]

Bases: object

Genotyped entities

abstract property genotypes: Genotypes
genotype_for_sample(sample_id: SampleLabels) Genotype | None[source]

Get a genotype for a sample or None if the genotype is not present.

Parameters:

sample_id – a SampleLabels with sample’s identifier.

class gpsea.model.TranscriptAnnotation(gene_id: str, tx_id: str, hgvs_cdna: str | None, is_preferred: bool, variant_effects: Iterable[VariantEffect], affected_exons: Iterable[int] | None, protein_id: str | None, hgvsp: str | None, protein_effect_coordinates: Region | None)[source]

Bases: TranscriptInfoAware

TranscriptAnnotation represent a result of the functional annotation of a variant with respect to single transcript of a gene.

property gene_id: str

Get the HGVS symbol of the affected gene (e.g. SURF1).

property transcript_id: str

Get the RefSeq identifier of the transcript (e.g. NM_123456.7).

property is_preferred: bool

Return True if the transcript is the preferred transcript of a gene, such as MANE transcript, canonical Ensembl transcript.

property hgvs_cdna: str | None

Get the HGVS description of the sequence variant (e.g. NM_123456.7:c.9876G>T) or None if not available.

property variant_effects: Sequence[VariantEffect]

Get a sequence of the predicted functional variant effects.

property overlapping_exons: Sequence[int] | None

Get a sequence of 1-based exon indices (the index of the 1st exon is 1) that overlap with the variant.

property protein_id: str | None

Get the ID of the protein encoded by the transcript_id or None if the transcript is not protein-coding.

property hgvsp: str | None

Get the HGVS description of the protein sequence variant (e.g. NP_001027559.1:p.Gln421Ter) or None if not available.

property protein_effect_location: Region | None

Get the Region with start and end amino-acid coordinates affected by the variant.

class gpsea.model.VariantEffect(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

VariantEffect represents consequences of a variant on transcript that are supported by GPSEA.

>>> from gpsea.model import VariantEffect
>>> missense = VariantEffect.MISSENSE_VARIANT
>>> print(missense)
missense_variant

The VariantEffect has a curie attribute that represents the ontology class from Sequence Ontology.

>>> missense.curie
'SO:0001583'
TRANSCRIPT_ABLATION = 'SO:0001893'
SPLICE_ACCEPTOR_VARIANT = 'SO:0001574'
SPLICE_DONOR_VARIANT = 'SO:0001575'
STOP_GAINED = 'SO:0001587'
FRAMESHIFT_VARIANT = 'SO:0001589'
STOP_LOST = 'SO:0001578'
START_LOST = 'SO:0002012'
TRANSCRIPT_AMPLIFICATION = 'SO:0001889'
INFRAME_INSERTION = 'SO:0001821'
INFRAME_DELETION = 'SO:0001822'
MISSENSE_VARIANT = 'SO:0001583'
PROTEIN_ALTERING_VARIANT = 'SO:0001818'
SPLICE_REGION_VARIANT = 'SO:0001630'
SPLICE_DONOR_5TH_BASE_VARIANT = 'SO:0001787'
SPLICE_DONOR_REGION_VARIANT = 'SO:0002170'
SPLICE_POLYPYRIMIDINE_TRACT_VARIANT = 'SO:0002169'
INCOMPLETE_TERMINAL_CODON_VARIANT = 'SO:0001626'
START_RETAINED_VARIANT = 'SO:0002019'
STOP_RETAINED_VARIANT = 'SO:0001567'
SYNONYMOUS_VARIANT = 'SO:0001819'
CODING_SEQUENCE_VARIANT = 'SO:0001580'
MATURE_MIRNA_VARIANT = 'SO:0001620'
FIVE_PRIME_UTR_VARIANT = 'SO:0001623'
THREE_PRIME_UTR_VARIANT = 'SO:0001624'
NON_CODING_TRANSCRIPT_EXON_VARIANT = 'SO:0001792'
INTRON_VARIANT = 'SO:0001627'
NMD_TRANSCRIPT_VARIANT = 'SO:0001621'
NON_CODING_TRANSCRIPT_VARIANT = 'SO:0001619'
UPSTREAM_GENE_VARIANT = 'SO:0001631'
DOWNSTREAM_GENE_VARIANT = 'SO:0001632'
TFBS_ABLATION = 'SO:0001895'
TFBS_AMPLIFICATION = 'SO:0001892'
TF_BINDING_SITE_VARIANT = 'SO:0001782'
REGULATORY_REGION_ABLATION = 'SO:0001894'
REGULATORY_REGION_AMPLIFICATION = 'SO:0001891'
FEATURE_ELONGATION = 'SO:0001907'
REGULATORY_REGION_VARIANT = 'SO:0001566'
FEATURE_TRUNCATION = 'SO:0001906'
INTERGENIC_VARIANT = 'SO:0001628'
SEQUENCE_VARIANT = 'SO:0001060'
to_display() str[source]

Get a concise name of the variant effect that is suitable for showing to humans.

Example

>>> from gpsea.model import VariantEffect
>>> VariantEffect.MISSENSE_VARIANT.to_display()
'missense'
>>> VariantEffect.SPLICE_DONOR_5TH_BASE_VARIANT.to_display()
'splice donor 5th base'
Returns:

a str with the name or ‘n/a’ if the variant effect was not assigned a concise name.

static structural_so_id_to_display(so_term: TermId | str) str[source]

Get a str with a concise name for representing a Sequence Ontology (SO) term identifier.

Example

>>> from gpsea.model import VariantEffect
>>> VariantEffect.structural_so_id_to_display('SO:1000029')
'chromosomal deletion'
Parameters:

so_term – a CURIE str or a TermId with the query SO term.

Returns:

a str with the concise name for the SO term or ‘n/a’ if a name has not been assigned yet.

property curie: str

Get a compact URI (CURIE) of the variant effect (e.g. SO:0001583 for a missense variant).

class gpsea.model.TranscriptInfoAware[source]

Bases: object

The implementors know about basic gene/transcript identifiers.

abstract property gene_id: str

Returns: string: The gene symbol (e.g. SURF1)

abstract property transcript_id: str

Returns: string: The transcript RefSeq identifier (e.g. NM_123456.7)

class gpsea.model.TranscriptCoordinates(identifier: str, region: GenomicRegion, exons: Iterable[GenomicRegion], cds_start: int | None, cds_end: int | None, is_preferred: bool | None = None)[source]

Bases: object

TranscriptCoordinates knows about genomic region of the transcript, exonic/intronic regions, as well as the coding and non-coding regions.

If both CDS coordinates are None, then the transcript coordinates are assumed to represent a non-coding transcript.

property identifier: str

Get transcript’s identifier (e.g. ENST00000123456.7, NM_123456.7).

property region: GenomicRegion

Get the genomic region spanned by the transcript, corresponding to 5’UTR, exonic, intronic, and 3’UTR regions.

property exons: Sequence[GenomicRegion]

Get the exon regions.

property cds_start: int | None

Get the 0-based (excluded) start coordinate of the first base of the start codon of the transcript or None if the transcript is not coding.

property cds_end: int | None

Get the 0-based (included) end coordinate of the last base of the termination codon of the transcript or None if the transcript is not coding.

is_coding() bool[source]

Return True if the transcript is coding/translated.

get_coding_base_count() int | None[source]

Calculate the number of coding bases present in the transcript. Note, the count does NOT include the termination codon since it does not code for an aminoacid. Returns: an int with the coding base count or None if the transcript is non-coding.

get_codon_count() int | None[source]

Calculate the count of codons present in the transcript. Note, the count does NOT include the termination codon!

Returns: the number of codons of the transcript or None if the transcript is non-coding.

get_five_prime_utrs() Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to 5’ untranslated regions of the transcript.

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

get_three_prime_utrs() Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to 3’ untranslated regions of the transcript.

Note, the termination codon is NOT included in the regions!

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

get_cds_regions() Sequence[GenomicRegion][source]

Get a sequence of genomic regions that correspond to coding regions of the transcript, including BOTH the initiation and termination codons.

Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.

property is_preferred: bool | None

Check if the transcript belongs among the preferred transcripts of the gene. This usually means that the transcript was chosen by the MANE project or similar.

Returns None if the info is not available.

class gpsea.model.ProteinMetadata(protein_id: str, label: str, protein_features: Sequence[ProteinFeature], protein_length: int = 0)[source]

Bases: object

An info regarding a protein sequence, including an ID, a label, and location of protein features, such as motifs, domains, or other regions.

property protein_id: str

Returns: string: A string unique to this protein

property label: str

Returns: string: The full name of the protein

property protein_features: Sequence[ProteinFeature]

Returns: Sequence[ProteinFeature]: A sequence of ProteinFeatures objects

property protein_length: int

Returns: int: length of protein

domains() Iterable[ProteinFeature][source]
Returns:

A subgroup of protein_features, where the ProteinFeature object has a FeatureType equal to “DOMAIN”

Return type:

Iterable[ProteinFeature]

repeats() Iterable[ProteinFeature][source]
Returns:

A subgroup of protein_features, where the ProteinFeature object has a FeatureType equal to “REPEAT”

Return type:

Iterable[ProteinFeature]

regions() Iterable[ProteinFeature][source]
Returns:

A subgroup of protein_features, where the ProteinFeature object has a FeatureType equal to “REGIONS”

Return type:

Iterable[ProteinFeature]

motifs() Iterable[ProteinFeature][source]
Returns:

A subgroup of protein_features, where the ProteinFeature object has a FeatureType equal to “MOTIF”

Return type:

Iterable[ProteinFeature]

get_features_variant_overlaps(region: Region) Collection[ProteinFeature][source]

Get a collection of protein features that overlap with the region. :param region: the query region.

Returns:

a collection of overlapping protein features.

Return type:

Collection[ProteinFeature]

class gpsea.model.ProteinFeature[source]

Bases: object

static create(info: FeatureInfo, feature_type: FeatureType)[source]
abstract property info: FeatureInfo
abstract property feature_type: FeatureType
to_string() str[source]
class gpsea.model.FeatureInfo(name: str, region: Region)[source]

Bases: object

FeatureInfo represents a protein feature (e.g. a repeated sequence given the name “ANK 1” in protein “Ankyrin repeat domain-containing protein 11”)

property name: str

Returns: string: the name of the protein feature

property start: int

Returns: integer: A 0-based (excluded) start coordinate of the protein feature.

property end: int

Returns: integer: A 0-based (included) end coordinate of the protein feature.

property region: Region

Returns: Region: a protein region spanned by the feature.

class gpsea.model.FeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

An enum representing the protein feature types supported in GPSEA.

REPEAT = 1

A repeated sequence motif or repeated domain within the protein.

MOTIF = 2

A short (usually not more than 20 amino acids) conserved sequence motif of biological significance.

DOMAIN = 3

A specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.

REGION = 4

A region of interest that cannot be described in other subsections.

Subpackages