gpsea.model package
The gpsea.model package defines data model classes used in GPSEA.
We start with the top-level elements, such as Cohort
and Patient
,
and we follow with data classes for phenotype, genotype, transcript, and protein info.
- class gpsea.model.Cohort(members: Iterable[Patient], excluded_member_count: int)[source]
Bases:
Sized
,Iterable
[Patient
]Cohort is a collection of individuals that have been preprocessed and are ready for genotype-phenotype association analysis.
- static from_patients(members: Iterable[Patient], include_patients_with_no_HPO: bool = False, include_patients_with_no_variants: bool = False)[source]
Create a cohort from a sequence of patients.
- property all_patients: Collection[Patient]
Get a collection of all patients in the cohort.
- all_phenotypes() Set[Phenotype] [source]
Get a set of all phenotypes (observed or excluded) in the cohort members.
- count_distinct_hpo_terms() int [source]
Get count of distinct HPO terms (either in present or excluded state) seen in the cohort members.
- all_measurements() Set[Measurement] [source]
Get a set of all phenotypes (observed or excluded) in the cohort members.
- count_distinct_measurements() int [source]
Get count of distinct measurements in the cohort members.
- all_diseases() Set[Disease] [source]
Get a set of all diseases (observed or excluded) in the cohort members.
- count_distinct_diseases() int [source]
Get count of distinct disease diagnoses of the cohort members.
- all_variant_infos() Set[VariantInfo] [source]
Get a set of variant infos observed in the cohort members.
- property all_transcript_ids: Set[str]
Get a set of all transcript IDs affected by the cohort variants.
- property total_patient_count
Get the total number of cohort members.
- list_present_phenotypes(top: int | None = None) Sequence[Tuple[str, int]] [source]
Get a sequence with counts of HPO terms used as direct annotations of the cohort members.
- list_measurements(top: int | None = None) Sequence[Tuple[str, int]] [source]
Get a sequence with counts of measurements of the cohort members.
- Parameters:
typing.Optional[int] – If not given, lists all measurements. Otherwise, lists only the top highest counts
- Returns:
- A sequence of tuples,
formatted (
Measurement
, number of individuals with the measurement)
- Return type:
- list_all_diseases(top: int | None = None) Sequence[Tuple[str, int]] [source]
Get a sequence with counts of disease of the cohort members.
- Parameters:
typing.Optional[int] – If not given, lists all diseases. Otherwise, lists only the top highest counts
- Returns:
- A sequence of tuples,
formatted (
Disease
, number of individuals with the diagnosis)
- Return type:
- list_all_variants(top: int | None = None) Sequence[Tuple[str, int]] [source]
Get the number of times a variant has been observed in cohort individuals.
Note, this is not the same as the number of variant alleles in the cohort individuals! This method counts each variant once, regardless of its genotype!
- Parameters:
typing.Optional[int] – If not given, lists all variants. Otherwise, lists only the top highest counts
- Returns:
A sequence of tuples, formatted (variant key, number of patients with that variant)
- Return type:
- list_all_proteins(top=None) Sequence[Tuple[str, int]] [source]
- Parameters:
typing.Optional[int] – If not given, lists all proteins. Otherwise, lists only the top highest counts.
- Returns:
A list of tuples, formatted (protein ID string, the count of variants that affect the protein)
- Return type:
- variant_effect_count_by_tx(tx_id: str | None = None) Mapping[str, Mapping[str, int]] [source]
Count variant effects for all transcripts or for a transcript tx_id of choice.
- Parameters:
tx_id – a str with transcript accession (e.g. NM_123456.5) or None if all transcripts should be listed.
- Returns:
- Each transcript ID references a Counter(),
with the variant effect as the key and the count of variants with that effect on the transcript id.
- Return type:
- count_alive() int [source]
Get the number of individuals reported to be alive at the time of last encounter.
- class gpsea.model.Patient(labels: SampleLabels, sex: Sex, age: Age | None, vital_status: VitalStatus | None, phenotypes: Iterable[Phenotype], measurements: Iterable[Measurement], diseases: Iterable[Disease], variants: Iterable[Variant])[source]
Bases:
object
Patient represents a single investigated individual.
We need to know about the following attributes:
identifier(s) formatted as
SampleLabels
age of last clinical encounter (optional) formatted as
Age
or None if not available.vital status (optional) formatted as
VitalStatus
, which reports if the individual is alive or deceased plus (optional) age of deathHPO terms to represent the phenotype information, each HPO formatted as an instance of
Phenotype
numerical measurements
disease diagnoses formatted as
Disease
genotype information as one or more
Variant
Note
We strongly recommend using the
from_raw_parts()
static constructor instead of __init__.- static from_raw_parts(labels: str | SampleLabels, sex: Sex | None = None, age: Age | None = None, vital_status: VitalStatus | None = None, phenotypes: Iterable[Phenotype] = (), measurements: Iterable[Measurement] = (), diseases: Iterable[Disease] = (), variants: Iterable[Variant] = ()) Patient [source]
Create Patient from the primary data.
- property labels: SampleLabels
Get the sample identifiers.
- property vital_status: VitalStatus | None
Get the vital status information for the individual or None if not available.
- phenotype_by_id(term_id: str | TermId) Phenotype | None [source]
Get a phenotype with an identifier or None if the individual has no such phenotype.
- count_unique_phenotypes() int [source]
Get the count of unique HPO terms (present or excluded) in this individual.
- property measurements: Sequence[Measurement]
Get the measurements in the patient.
- measurement_by_id(term_id: str | TermId) Measurement | None [source]
Get a measurement with an identifier or None if the individual has no such measurement.
- Parameters:
term_id – a str with CURIE or a
TermId
representing the term ID of a measurement (e.g. LOINC:2986-8 for Testosterone[Mass/Vol]).- Returns:
the corresponding
Measurement
or None if not found in the patient.
- disease_by_id(term_id: str | TermId) Disease | None [source]
Get a disease with an identifier or None if the individual has no such disease.
- present_phenotypes() Iterator[Phenotype] [source]
Get an iterator over the present phenotypes of the patient.
- excluded_phenotypes() Iterator[Phenotype] [source]
Get an iterator over the excluded phenotypes of the patient.
- class gpsea.model.SampleLabels(label: str, meta_label: str | None = None)[source]
Bases:
object
A data model for subject identifiers.
The subject has a mandatory
label
and an optionalmeta_label
.The identifiers support natural ordering, equality tests, and are hashable.
- class gpsea.model.Sex(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
Sex represents typical “phenotypic sex”, as would be determined by a midwife or physician at birth.
The definition is aligned with Phenopacket Schema
- UNKNOWN_SEX = 0
Not assessed or not available. Maps to
NCIT:C17998
.
- FEMALE = 1
Female sex. Maps to
NCIT:C46113
.
- MALE = 2
Male sex. Maps to
NCIT:C46112
.
- class gpsea.model.VitalStatus(status: gpsea.model._cohort.Status, age_of_death: gpsea.model._temporal.Age | None)[source]
Bases:
object
- class gpsea.model.Status(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
- UNKNOWN = 0
- ALIVE = 1
- DECEASED = 2
- class gpsea.model.Age(days: float, timeline: Timeline)[source]
Bases:
object
Representation of an age of an individual.
In GPSEA, we model the age at the resolution of a day. The age is either gestational or postnatal. A gestational age is created from the number of weeks and days since the last menstrual period and the postnatal age is created from years, months and days since birth.
Age overloads the comparison operators and can, thus, be compared or sorted. Gestational age is always before the postnatal age.
Internally, the age is always stored as the number of days.
- ISO8601PT = re.compile('^P(?P<year>\\d+Y)?(?P<month>\\d+M)?(?P<week>\\d+W)?(?P<day>\\d+D)?(T(\\d+H)?(\\d+M)?(\\d+S)?)?$')
- DAYS_IN_YEAR = 365.25
- DAYS_IN_MONTH = 30.4375
- DAYS_IN_WEEK = 7
- static gestational(weeks: int, days: int = 0) Age [source]
Create age from the weeks and days on the gestational timeline.
- Parameters:
weeks – a non-negative int with the number of completed weeks of gestation.
days – an int in range \([0, 6]\) representing the number of completed gestational days.
- static gestational_days(days: int | float) Age [source]
Create a gestational age corresponding to the number of days since the last menstrual period.
- static postnatal_days(days: int | float) Age [source]
Create a postnatal age corresponding to the number of days since birth.
- static from_iso8601_period(value: str) Age [source]
Create Age from ISO8601 duration.
A value with weeks or days is parsed into a gestational age, while a value with years, months or days is parsed into a postnatal age. An error is raised if a value for weeks and months (or years) is included at the same time.
Examples
Parse gestational age:
>>> from gpsea.model import Age >>> gestational = Age.from_iso8601_period("P2W6D") >>> gestational Age(days=20.0, timeline=Timeline.GESTATIONAL)
Parse postnatal age:
>>> postnatal = Age.from_iso8601_period("P10Y") >>> postnatal Age(days=3652.5, timeline=Timeline.POSTNATAL)
- Parameters:
value – a str with the duration (e.g. P22W3D for a gestational age or P10Y4M2D for a postnatal age).
- class gpsea.model.Timeline(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
Timeline represents the stage of temporal development of an organism.
There are two stages: gestational and postnatal.
Gestational timeline starts at the last menstrual period and ends at (but does not include) birth. The postnatal timeline starts at birth and ends at death.
The Timeline overloads comparison operators. Gestational timeline is always before postnatal timeline.
- GESTATIONAL = 1
Gestational timeline starts at the last menstrual period and ends at birth (excluded).
- POSTNATAL = 2
Postnatal timeline starts at birth and ends at death.
- class gpsea.model.Phenotype(term_id: TermId, is_observed: bool, onset: Age | None)[source]
Bases:
Identified
,ObservableFeature
,OnsetAware
Phenotype represents a clinical sign or symptom represented as an HPO term.
The phenotype can be either present in the patient or excluded.
- static from_term(term: MinimalTerm, is_observed: bool)[source]
- static from_raw_parts(term_id: str | TermId, is_observed: bool, onset: Age | None = None) Phenotype [source]
Create Phenotype from a term ID and observation state.
- property is_present: bool
Return True if the phenotype feature was observed in the subject or False if the feature’s presence was explicitly excluded.
- property observed: bool | None
Returns a boolean for whether the phenotype is observed.
- Returns:
True if this phenotype was observed in the respective patient.
- Return type:
- class gpsea.model.Disease(term_id: TermId, name: str, is_observed: bool, onset: Age | None)[source]
Bases:
Identified
,ObservableFeature
,Named
,OnsetAware
Representation of a disease diagnosed (or excluded) in an investigated individual.
- static from_raw_parts(term_id: str | TermId, name: str, is_observed: bool, onset: Age | None = None) Disease [source]
- property name
Get the disease label (e.g. LEIGH SYNDROME, NUCLEAR; NULS).
- class gpsea.model.Measurement(test_term_id: TermId, test_name: str, test_result: float, unit: TermId)[source]
Bases:
Identified
,Named
Representation of a GA4GH Phenopacket Measurement (numerical test result). An intended use case would be to perform a Student’s t test on numerical measurements in individuals with two difference genotype classes.
- property name
Get the test label (e.g. Testosterone [Mass/volume] in Serum or Plasma for
LOINC 2986-8
).
- class gpsea.model.OnsetAware[source]
Bases:
object
A mixin for classes which may know about their
Age
or onset.For instance, the onset of a phenotype or a disease in an individual.
- class gpsea.model.Variant(variant_info: VariantInfo, tx_annotations: Iterable[TranscriptAnnotation], genotypes: Genotypes)[source]
Bases:
VariantInfoAware
,FunctionalAnnotationAware
,Genotyped
- Variant includes three lines of information:
the variant data with coordinates or other info available for large imprecise SVs,
results of the functional annotation with respect to relevant transcripts, and
the genotypes for the known samples
- static create_variant_from_scratch(variant_coordinates: VariantCoordinates, gene_name: str, trans_id: str, hgvs_cdna: str, is_preferred: bool, consequences: Iterable[VariantEffect], exons_effected: Sequence[int], protein_id: str | None, hgvsp: str | None, protein_effect_start: int | None, protein_effect_end: int | None, genotypes: Genotypes)[source]
- property variant_info: VariantInfo
Get the representation of the variant data for sequence and symbolic variants, as well as for large imprecise SVs.
- property tx_annotations: Sequence[TranscriptAnnotation]
A collection of TranscriptAnnotations that each represent results of the functional annotation of a variant with respect to single transcript of a gene.
- Returns:
A sequence of TranscriptAnnotation objects
- Return type:
Sequence[TranscriptAnnotation]
- class gpsea.model.VariantClass(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
VariantClass represents a high-level variant category which mostly corresponds to the structural variant categories of the Variant Call Format specification, but includes type for single nucleotide variants (SNV) and multi-nucleotide variant (MNV).
- DEL = 0
A deletion - a variant with a net loss of sequence from the alternative allele regardless of its size.
Both a deletion of 1 base pair and a deletion of 1000 base pairs are acceptable.
- DUP = 1
Duplication (tandem or interspersed).
- INS = 2
Insertion of a novel sequence.
- INV = 3
Inversion of a chromosome segment.
- MNV = 4
Multi-nucleotide variant (e.g. AG>CT) that is not a duplication, deletion, or insertion. May be called INDEL.
- SNV = 5
Single nucleotide variant.
- TRANSLOCATION = 6
A chromosomal translocation, which occurs when a chromosome breaks and the (typically two) fragmented pieces re-attach to different chromosomes.
- class gpsea.model.VariantCoordinates(region: GenomicRegion, ref: str, alt: str, change_length: int)[source]
Bases:
object
A representation of coordinates of sequence and symbolic variants.
Note, the breakend variants are not currently supported.
- static from_vcf_literal(contig: Contig, pos: int, ref: str, alt: str)[source]
Create VariantCoordinates from a variant in VCF literal notation.
Note, this function must not be used to create a VCF symbolic variant (e.g. <DEL> or translocation). Use
from_vcf_symbolic()
instead.Example
Create a variant from a VCF line:
` #CHROM POS ID REF ALT ... chr1 1001 . C T `
We first must decide on the genome build. Most of the time, we should use GRCh38:
>>> from gpsea.model.genome import GRCh38 >>> build = GRCh38
Then, we access the contig for
'chr1'
:>>> chr1 = build.contig_by_name('chr1')
Last, we create the variant coordinates:
>>> from gpsea.model import VariantCoordinates >>> vc = VariantCoordinates.from_vcf_literal( ... contig=chr1, pos=1001, ref='C', alt='T', ... )
Now can test the properties:
>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length (1000, 1001, 'C', 'T', 1, 0)
- Parameters:
contig – a
Contig
for the chromosomepos – a 1-based coordinate of the first base of the reference allele, as described in VCF standard
ref – a str with the REF allele. Should meet the requirements of the VCF standard.
alt – a str with the ALT allele. Should meet the requirements of the VCF standard.
- static from_vcf_symbolic(contig: Contig, pos: int, end: int, ref: str, alt: str, svlen: int)[source]
Create VariantCoordinates from a variant in VCF symbolic notation.
Note, this function must not be used to create a VCF sequence/literal variant. Use
from_vcf_literal()
instead.Example
Let’s create a symbolic variant from the line:
` #CHROM POS ID REF ALT QUAL FILTER INFO 2 321682 . T <DEL> 6 PASS SVTYPE=DEL;END=321887;SVLEN=-205 `
We first must decide on the genome build. Most of the time, we should use GRCh38:
>>> from gpsea.model.genome import GRCh38 >>> contig = GRCh38.contig_by_name('2')
Now, we create the coordinates as:
>>> vc = VariantCoordinates.from_vcf_symbolic( ... contig=contig, pos=321682, end=321887, ... ref='T', alt='<DEL>', svlen=-205, ... )
Now can test the properties:
>>> vc.start, vc.end, vc.ref, vc.alt, len(vc), vc.change_length (321681, 321887, 'T', '<DEL>', 206, -205)
- Parameters:
contig – a
Contig
for the chromosomepos – a 1-based coordinate of the first base of the affected reference allele region
end – a 1-based coordinate of the last base of the affected reference allele region
ref – a str with the REF allele. Most of the time, it is one of {‘N’, ‘A’, ‘C’, ‘G’, ‘T’}
alt – a str with the ALT allele, e.g. one of {‘<DEL>’, ‘<DUP>’, ‘<INS>’, ‘<INV>’}
svlen – an int with change length (the difference between ref and alt allele lengths)
- property start: int
Get the 0-based start coordinate (excluded) of the first base of the
ref
allele.
- property region: GenomicRegion
Get the genomic region spanned by the
ref
allele.
- property ref: str
Get the reference allele (e.g. “A”, “CCT”, “N”). The allele may be an empty string.
- property alt: str
Get the alternate allele (e.g. “A”, “GG”, “<DEL>”).
The allele may be an empty string for sequence variants. The symbolic alternate allele follow the VCF notation and use the < and > characters (e.g. “<DEL>”, “<INS:ME:SINE>”).
- property change_length: int
Get the change of length between the ref and alt alleles due to the variant presence.
See Change length of an allele for more info.
- property variant_key: str
Get a readable representation of the variant’s coordinates.
For instance,
X_12345_12345_C_G
for a sequence variant or22_10001_20000_INV
for a symbolic variant. If the key is larger than 50 characters, the ‘ref’ and/or ‘alt’ (if over 10 bps) are changed to just show number of bps. Example:X_1000001_1000027_TAAAAAAAAAAAAAAAAAAAAAAAAAA_T
->X_1000001_1000027_--27bp--_T
Note
Both start and end coordinates use 1-based (included) coordinate system.
- property variant_class: VariantClass
Get a
VariantClass
category.
- is_structural() bool [source]
Checks if the variant coordinates use structural variant notation as described by Variant Call Format (VCF).
Ane example of structural variant notation:
chr5 101 . N <DEL> . . SVTYPE=DEL;END=120;SVLEN=-10
as opposed to the sequence (literal) notation:
chr5 101 . NACGTACGTAC N
- Returns:
True if the variant coordinates use structural variant notation.
- class gpsea.model.ImpreciseSvInfo(structural_type: TermId, variant_class: VariantClass, gene_id: str, gene_symbol: str)[source]
Bases:
object
Data regarding a structural variant (SV) with imprecise breakpoint coordinates.
- property structural_type: TermId
Get term ID of the structural type (e.g.
SO:1000029
for chromosomal deletion).
- property variant_class: VariantClass
Get a
VariantClass
category.
- property gene_id: str
Get a str with gene identifier CURIE (e.g.
HGNC:3603
) or None if the identifier is not available.
- class gpsea.model.VariantInfo(variant_coordinates: VariantCoordinates | None = None, sv_info: ImpreciseSvInfo | None = None)[source]
Bases:
object
VariantInfo consists of either variant coordinates or imprecise SV data.
The class is conceptually similar to Rust enum - only one of the fields can be set at any point in time.
- property variant_coordinates: VariantCoordinates | None
Get variant coordinates if available.
- property sv_info: ImpreciseSvInfo | None
Get information about large imprecise SV.
- has_sv_info() bool [source]
Returns True if the variant is a large imprecise SV and the exact coordinates are thus unavailable.
- property variant_key: str
Get a readable representation of the variant’s coordinates or the large SV info.
- property variant_class: VariantClass
Get a
VariantClass
category.
- is_structural() bool [source]
Test if the variant is a structural variant.
This can either be because the variant coordinates use the structural variant notation (see
VariantCoordinates.is_structural()
) or if the variant is large imprecise SV.
- class gpsea.model.VariantInfoAware[source]
Bases:
object
An entity where
VariantInfo
is available.- abstract property variant_info: VariantInfo
Get the variant data with coordinates or other info available for large imprecise SVs.
- class gpsea.model.Genotype(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
Genotype represents state of a variable locus in a diploid genome.
- NO_CALL = ('.',)
- HOMOZYGOUS_REFERENCE = ('0/0',)
- HETEROZYGOUS = ('0/1',)
- HOMOZYGOUS_ALTERNATE = ('1/1',)
- HEMIZYGOUS = ('1',)
- class gpsea.model.Genotypes(samples: Iterable[SampleLabels], genotypes: Iterable[Genotype])[source]
-
Genotypes is a container for mapping between sample ID and its genotype.
Let’s consider a pair of samples:
>>> a = SampleLabels('A') >>> b = SampleLabels('B')
We can use one of the static methods to create an instance. Either a single genotype:
>>> gt = Genotypes.single(a, Genotype.HETEROZYGOUS)
or genotypes of several samples:
>>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})
There are 2 genotypes in the container:
>>> len(gts) 2
You can get a genotype for a sample ID:
>>> g = gts.for_sample(a) >>> g.code '0/1'
You will get None if the sample is not present:
>>> gts.for_sample(SampleLabels('UNKNOWN'))
You can iterate over sample-genotype pairs:
>>> for sample_id, genotype in gts: ... print(sample_id, genotype) A 0/1 B 1/1
- static single(sample_id: SampleLabels, genotype: Genotype) Genotypes [source]
A shortcut for creating Genotypes for a single sample:
>>> a = SampleLabels('A') >>> gts = Genotypes.single(a, Genotype.HOMOZYGOUS_ALTERNATE)
>>> assert len(gts) == 1 >>> assert gts.for_sample(a) == Genotype.HOMOZYGOUS_ALTERNATE
- static from_mapping(mapping: Mapping[SampleLabels, Genotype]) Genotypes [source]
Create Genotypes from mapping between sample IDs and genotypes.
>>> a = SampleLabels('A') >>> b = SampleLabels('B') >>> gts = Genotypes.from_mapping({a: Genotype.HETEROZYGOUS, b: Genotype.HOMOZYGOUS_ALTERNATE})
>>> assert len(gts) == 2
- for_sample(sample_id: SampleLabels) Genotype | None [source]
Get a genotype for a sample or None if the genotype is not present.
- Parameters:
sample_id – a
SampleLabels
with sample’s identifier.
- class gpsea.model.Genotyped[source]
Bases:
object
Genotyped entities
- genotype_for_sample(sample_id: SampleLabels) Genotype | None [source]
Get a genotype for a sample or None if the genotype is not present.
- Parameters:
sample_id – a
SampleLabels
with sample’s identifier.
- class gpsea.model.TranscriptAnnotation(gene_id: str, tx_id: str, hgvs_cdna: str | None, is_preferred: bool, variant_effects: Iterable[VariantEffect], affected_exons: Iterable[int] | None, protein_id: str | None, hgvsp: str | None, protein_effect_coordinates: Region | None)[source]
Bases:
TranscriptInfoAware
TranscriptAnnotation represent a result of the functional annotation of a variant with respect to single transcript of a gene.
- property is_preferred: bool
Return True if the transcript is the preferred transcript of a gene, such as MANE transcript, canonical Ensembl transcript.
- property hgvs_cdna: str | None
Get the HGVS description of the sequence variant (e.g.
NM_123456.7:c.9876G>T
) or None if not available.
- property variant_effects: Sequence[VariantEffect]
Get a sequence of the predicted functional variant effects.
- property overlapping_exons: Sequence[int] | None
Get a sequence of 1-based exon indices (the index of the 1st exon is 1) that overlap with the variant.
- property protein_id: str | None
Get the ID of the protein encoded by the
transcript_id
or None if the transcript is not protein-coding.
- class gpsea.model.VariantEffect(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
VariantEffect represents consequences of a variant on transcript that are supported by GPSEA.
>>> from gpsea.model import VariantEffect >>> missense = VariantEffect.MISSENSE_VARIANT >>> print(missense) missense_variant
The VariantEffect has a
curie
attribute that represents the ontology class from Sequence Ontology.>>> missense.curie 'SO:0001583'
- TRANSCRIPT_ABLATION = 'SO:0001893'
- SPLICE_ACCEPTOR_VARIANT = 'SO:0001574'
- SPLICE_DONOR_VARIANT = 'SO:0001575'
- STOP_GAINED = 'SO:0001587'
- FRAMESHIFT_VARIANT = 'SO:0001589'
- STOP_LOST = 'SO:0001578'
- START_LOST = 'SO:0002012'
- TRANSCRIPT_AMPLIFICATION = 'SO:0001889'
- INFRAME_INSERTION = 'SO:0001821'
- INFRAME_DELETION = 'SO:0001822'
- MISSENSE_VARIANT = 'SO:0001583'
- PROTEIN_ALTERING_VARIANT = 'SO:0001818'
- SPLICE_REGION_VARIANT = 'SO:0001630'
- SPLICE_DONOR_5TH_BASE_VARIANT = 'SO:0001787'
- SPLICE_DONOR_REGION_VARIANT = 'SO:0002170'
- SPLICE_POLYPYRIMIDINE_TRACT_VARIANT = 'SO:0002169'
- INCOMPLETE_TERMINAL_CODON_VARIANT = 'SO:0001626'
- START_RETAINED_VARIANT = 'SO:0002019'
- STOP_RETAINED_VARIANT = 'SO:0001567'
- SYNONYMOUS_VARIANT = 'SO:0001819'
- CODING_SEQUENCE_VARIANT = 'SO:0001580'
- MATURE_MIRNA_VARIANT = 'SO:0001620'
- FIVE_PRIME_UTR_VARIANT = 'SO:0001623'
- THREE_PRIME_UTR_VARIANT = 'SO:0001624'
- NON_CODING_TRANSCRIPT_EXON_VARIANT = 'SO:0001792'
- INTRON_VARIANT = 'SO:0001627'
- NMD_TRANSCRIPT_VARIANT = 'SO:0001621'
- NON_CODING_TRANSCRIPT_VARIANT = 'SO:0001619'
- UPSTREAM_GENE_VARIANT = 'SO:0001631'
- DOWNSTREAM_GENE_VARIANT = 'SO:0001632'
- TFBS_ABLATION = 'SO:0001895'
- TFBS_AMPLIFICATION = 'SO:0001892'
- TF_BINDING_SITE_VARIANT = 'SO:0001782'
- REGULATORY_REGION_ABLATION = 'SO:0001894'
- REGULATORY_REGION_AMPLIFICATION = 'SO:0001891'
- FEATURE_ELONGATION = 'SO:0001907'
- REGULATORY_REGION_VARIANT = 'SO:0001566'
- FEATURE_TRUNCATION = 'SO:0001906'
- INTERGENIC_VARIANT = 'SO:0001628'
- SEQUENCE_VARIANT = 'SO:0001060'
- to_display() str [source]
Get a concise name of the variant effect that is suitable for showing to humans.
Example
>>> from gpsea.model import VariantEffect >>> VariantEffect.MISSENSE_VARIANT.to_display() 'missense' >>> VariantEffect.SPLICE_DONOR_5TH_BASE_VARIANT.to_display() 'splice donor 5th base'
- Returns:
a str with the name or ‘n/a’ if the variant effect was not assigned a concise name.
- static structural_so_id_to_display(so_term: TermId | str) str [source]
Get a str with a concise name for representing a Sequence Ontology (SO) term identifier.
Example
>>> from gpsea.model import VariantEffect >>> VariantEffect.structural_so_id_to_display('SO:1000029') 'chromosomal deletion'
- Parameters:
so_term – a CURIE str or a
TermId
with the query SO term.- Returns:
a str with the concise name for the SO term or ‘n/a’ if a name has not been assigned yet.
- class gpsea.model.TranscriptInfoAware[source]
Bases:
object
The implementors know about basic gene/transcript identifiers.
- class gpsea.model.FunctionalAnnotationAware[source]
Bases:
object
- abstract property tx_annotations: Sequence[TranscriptAnnotation]
- get_tx_anno_by_tx_id(transcript_id: str) TranscriptAnnotation | None [source]
Given a transcript ID, this will return the TranscriptAnnotation associated with that variant and transcript.
- Parameters:
transcript_id (str) – A transcript ID - i.e. ‘NM_170707.4’
- Returns:
The Transcript Annotation if available.
- Return type:
- get_hgvs_cdna_by_tx_id(transcript_id: str) str | None [source]
Given a transcript ID, will return the hgvs cdna string associated with that variant and transcript.
- get_preferred_tx_annotation() TranscriptAnnotation | None [source]
Get the TranscriptAnnotation that represents the result of the functional annotation with respect to the preferred transcript of a gene.
Returns None if transcript annotations is no preferred transcript found.
- Returns:
- The TranscriptAnnotation with respect
to the preferred transcript or None if the preferred transcript info is not available.
- Return type:
- class gpsea.model.TranscriptCoordinates(identifier: str, region: GenomicRegion, exons: Iterable[GenomicRegion], cds_start: int | None, cds_end: int | None, is_preferred: bool | None = None)[source]
Bases:
object
TranscriptCoordinates knows about genomic region of the transcript, exonic/intronic regions, as well as the coding and non-coding regions.
If both CDS coordinates are None, then the transcript coordinates are assumed to represent a non-coding transcript.
- property region: GenomicRegion
Get the genomic region spanned by the transcript, corresponding to 5’UTR, exonic, intronic, and 3’UTR regions.
- property exons: Sequence[GenomicRegion]
Get the exon regions.
- property cds_start: int | None
Get the 0-based (excluded) start coordinate of the first base of the start codon of the transcript or None if the transcript is not coding.
- property cds_end: int | None
Get the 0-based (included) end coordinate of the last base of the termination codon of the transcript or None if the transcript is not coding.
- get_coding_base_count() int | None [source]
Calculate the number of coding bases present in the transcript. Note, the count does NOT include the termination codon since it does not code for an aminoacid. Returns: an int with the coding base count or None if the transcript is non-coding.
- get_codon_count() int | None [source]
Calculate the count of codons present in the transcript. Note, the count does NOT include the termination codon!
Returns: the number of codons of the transcript or None if the transcript is non-coding.
- get_five_prime_utrs() Sequence[GenomicRegion] [source]
Get a sequence of genomic regions that correspond to 5’ untranslated regions of the transcript.
Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.
- get_three_prime_utrs() Sequence[GenomicRegion] [source]
Get a sequence of genomic regions that correspond to 3’ untranslated regions of the transcript.
Note, the termination codon is NOT included in the regions!
Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.
- get_cds_regions() Sequence[GenomicRegion] [source]
Get a sequence of genomic regions that correspond to coding regions of the transcript, including BOTH the initiation and termination codons.
Returns: a sequence of genomic regions, an empty sequence if the transcript is non-coding.
- class gpsea.model.ProteinMetadata(protein_id: str, label: str, protein_features: Iterable[ProteinFeature], protein_length: int)[source]
Bases:
object
An info regarding a protein sequence, including an ID, a label, and location of protein features, such as motifs, domains, or other regions.
The information is usually retrieved from a resource such as
UniprotProteinMetadataService
, but it can also be created manually usingfrom_feature_frame()
function.Example
Let’s create a protein info with a domain and a region. We must provide protein accession ID, a label, a data frame with protein features, and the number of aminoacids of the protein sequence:
>>> protein_id = 'NP_000129.3' >>> label = 'fibrillin-1 isoform a preproprotein' >>> protein_length = 1000
Now let’s prepare a data frame with the protein features. We will prepare a domain and a region:
>>> import pandas as pd >>> features = [ ... { ... "region": "Suppresor domain", ... "category": "domain", ... "start": 1, ... "end": 223, ... }, ... { ... "region": "IP3 binding", ... "category": "region", ... "start": 224, ... "end": 578, ... }, ... ] >>> df = pd.DataFrame(features)
last, we can put the protein info together:
>>> from gpsea.model import ProteinMetadata >>> protein_meta = ProteinMetadata.from_feature_frame( ... protein_id=protein_id, ... label=label, ... features=df, ... protein_length=protein_length, ... )
and get the expected protein info:
>>> protein_meta.protein_id 'NP_000129.3' >>> protein_meta.label 'fibrillin-1 isoform a preproprotein' >>> len(protein_meta.protein_features) 2
- static from_feature_frame(protein_id: str, label: str, features: DataFrame, protein_length: int) ProteinMetadata [source]
Create ProteinMetadata from a user-supplied pandas DataFrame. We expect to obtain the gene symbol, protein identifier, and regions
The DataFrame should include the following columns:
region
category
start | end
Suppresor domain
domain
1
223
IP3 binding
region
224
578
The region column includes the protein feature name. The category is a str representing region’s feature type. Use ‘region’ if no better type fits. Last, start and end denote 1-based start and end coordinates of the aminoacid sequence region described by the feature. For instance, [1, 10] for the first ten aminoacids of the protein.
- Parameters:
protein_id – the accession id of the protein, e.g. ‘NP_000129.3’.
label – human-readable label, e.g. ‘fibrillin-1 isoform a preproprotein’.
features – a dataframe with of the protein features.
protein_length – a positive int representing the number of aminoacids included in the protein sequence.
- Raises:
ValueError – if case of issues during parsing the provided data.
- static from_uniprot_json(protein_id: str, label: str, uniprot_json: IOBase | str, protein_length: int) ProteinMetadata [source]
Create ProteinMetadata from a json file that has been downloaded from UniProt.
Go to the UniProt page for the protein of interest, then go to the section “Family & Domains”, and the subsection “Features”. Click on the Download symbol. You will be presented with a JSON file for download. From this, we extract information about the gene symbol, protein identifier, and regions. This method is intended to be a backup if the API call to UniProt fails; the same information should be retrieved. See the test file “test_uniprot_json.py” for details about the JSON parsing etc.
- Parameters:
protein_id – the accession id of the protein, e.g. ‘NP_000129.3’.
label – human-readable label, e.g. ‘fibrillin-1 isoform a preproprotein’.
uniprot_json – a str with the path or an IO object with the Uniprot JSON data.
protein_length – a positive int representing the number of aminoacids included in the protein sequence.
- Raises:
ValueError – if case of issues during parsing the provided data.
- property protein_features: Sequence[ProteinFeature]
Get a sequence of protein features.
- domains() Iterable[ProteinFeature] [source]
- Returns:
A subgroup of the protein features that correspond to protein domains.
- Return type:
Iterable[ProteinFeature]
- repeats() Iterable[ProteinFeature] [source]
- Returns:
A subgroup of the protein features that correspond to repeat regions.
- Return type:
Iterable[ProteinFeature]
- regions() Iterable[ProteinFeature] [source]
- Returns:
A subgroup of the protein features that correspond to generic regions.
- Return type:
Iterable[ProteinFeature]
- motifs() Iterable[ProteinFeature] [source]
- Returns:
A subgroup of the protein features that correspond to motifs.
- Return type:
Iterable[ProteinFeature]
- get_features_variant_overlaps(region: Region) Collection[ProteinFeature] [source]
Get a collection of protein features that overlap with the region. :param region: the query region.
- Returns:
a collection of overlapping protein features.
- Return type:
Collection[ProteinFeature]
- class gpsea.model.ProteinFeature[source]
Bases:
object
- static create(info: FeatureInfo, feature_type: str) ProteinFeature [source]
- abstract property info: FeatureInfo
- class gpsea.model.FeatureInfo(name: str, region: Region)[source]
Bases:
object
FeatureInfo represents a protein feature (e.g. a repeated sequence given the name “ANK 1” in protein “Ankyrin repeat domain-containing protein 11”)
- class gpsea.model.FeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
An enum representing the protein feature types supported in GPSEA.
Use
from_string()
to parse a feature from Uniprot:>>> from gpsea.model import FeatureType >>> val = "Dna Binding" >>> feature_type = FeatureType.from_string(val) >>> feature_type.name 'DNA_BINDING'
- REPEAT = 1
A repeated sequence motif or repeated domain within the protein.
- MOTIF = 2
A short (usually not more than 20 amino acids) conserved sequence motif of biological significance.
- DOMAIN = 3
A specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.
- COILED_COIL = 4
a structural motif in proteins, characterized by two or more α-helices wrapped around each other in a supercoil. This structure is often involved in protein-protein interactions
- COMPOSITIONAL_BIAS = 5
Compositional bias refers to a region in a protein where certain amino acids are overrepresented compared to the rest of the protein or compared to typical protein composition. These regions tend to have a non-random distribution of amino acids, often leading to specific structural or functional properties.
- REGION = 6
A region of interest that cannot be described in other subsections.
- ZINC_FINGER = 7
A zinc finger is a small, functional, independently folded domain that coordinates one or more zinc ions to stabilize its structure through cysteine and/or histidine residues.
- TOPOLOGICAL_DOMAIN = 8
non-membrane region of a membrane-spanning protein
- TRANSMEMBRANE = 9
Section of a protein that goes through the membrane of the cell or an organelle
- DNA_BINDING = 10
A region with a known DNA binding activity.
- static from_string(category: str) FeatureType [source]