gpsea.model.genome package
gpsea.model.genome provides data classes to model GenomeBuild, Contig, and (genomic) regions (Region, GenomicRegion).
The classes support basic region arithmetic, such as finding intersections, overlaps, and distances between regions. Genomic regions are transposable: they can flip their coordinates between DNA strands.
The module provides GRCh37.p13 and GRCh38.p13, the two most commonly used human genome builds.
The classes are largely a port of the Svart library.
- class gpsea.model.genome.GenomeBuild(identifier: GenomeBuildIdentifier, contigs: Iterable[Contig])[source]
Bases:
object
GenomeBuild is a container for the genome_build_id and the contigs of the build.
The build supports retrieving a contig by various identifiers:
>>> from gpsea.model.genome import GRCh38
>>> chr1 = GRCh38.contig_by_name('1') # by sequence name
>>> assert chr1 == GRCh38.contig_by_name('CM000663.2') # by GenBank identifier
>>> assert chr1 == GRCh38.contig_by_name('NC_000001.11') # by RefSeq accession
>>> assert chr1 == GRCh38.contig_by_name('chr1') # by UCSC name
- property genome_build_id: GenomeBuildIdentifier
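A lookup by several identifiers can be pictured as one dictionary keyed by every known alias of a contig. The block below is a minimal, self-contained sketch of that idea and is not GPSEA's implementation; `ContigSketch`, `index_contigs`, and `by_alias` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ContigSketch:
    # Hypothetical stand-in for a contig; NOT the gpsea class.
    name: str
    genbank_acc: str
    refseq_name: str
    ucsc_name: str
    length: int


def index_contigs(contigs: Iterable[ContigSketch]) -> Dict[str, ContigSketch]:
    """Map every alias of each contig to the same contig object."""
    index: Dict[str, ContigSketch] = {}
    for contig in contigs:
        aliases = (contig.name, contig.genbank_acc, contig.refseq_name, contig.ucsc_name)
        for alias in aliases:
            index[alias] = contig
    return index


# Chromosome 1 of GRCh38, using the identifiers from the example above.
chr1 = ContigSketch('1', 'CM000663.2', 'NC_000001.11', 'chr1', 248_956_422)
by_alias = index_contigs([chr1])
```

With such an index, `by_alias['1']`, `by_alias['CM000663.2']`, `by_alias['NC_000001.11']`, and `by_alias['chr1']` all resolve to the same object.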
- class gpsea.model.genome.Contig(name: str, gb_acc: str, refseq_name: str, ucsc_name: str, length: int)[source]
Bases:
Sized
Contig represents the identifiers and length of a contiguous sequence of a genome assembly.
The identifiers include:
name, e.g. 1
genbank_acc, e.g. CM000663.2
refseq_name, e.g. NC_000001.11
ucsc_name, e.g. chr1
The length of a Contig represents the number of bases of the contig sequence.
You should not try to create a Contig on your own, but always get it from a GenomeBuild.
- class gpsea.model.genome.GenomeBuildIdentifier(major_assembly: str, patch: str)[source]
Bases:
object
Identifier of the genome build consisting of major_assembly (e.g. GRCh38) and patch (e.g. p13).
- Parameters:
major_assembly – major assembly str
patch – assembly patch str
- property identifier
Get genome build identifier consisting of major assembly + patch, e.g. GRCh38.p13
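The composition of the identifier can be sketched as follows. This is an illustration that assumes the two parts are joined by a dot, as the GRCh38.p13 example suggests; `BuildIdSketch` is a hypothetical name and not the gpsea class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildIdSketch:
    # Hypothetical stand-in for a genome build identifier.
    major_assembly: str
    patch: str

    @property
    def identifier(self) -> str:
        # Assumption: major assembly and patch are joined by a dot.
        return f"{self.major_assembly}.{self.patch}"


build_id = BuildIdSketch("GRCh38", "p13")
```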
- class gpsea.model.genome.Region(start: int, end: int)[source]
Bases:
Sized
Region represents a contiguous region/slice of a biological sequence, such as DNA, RNA, or protein.
The start and end represent 0-based coordinates of the region. The region has a length that corresponds to the number of spanned bases/amino acids.
- Parameters:
start – 0-based (excluded) start coordinate of the region.
end – 0-based (included) end coordinate of the region.
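To make the coordinate convention concrete: with an excluded start and an included end, the bases at 1-based positions 101 to 110 are described by start=100 and end=110, and the length is end - start. The helpers below are a hypothetical sketch of this arithmetic and do not come from gpsea.

```python
from typing import Tuple


def to_zero_based(first_pos: int, last_pos: int) -> Tuple[int, int]:
    """Convert a 1-based, fully closed span to a (start, end) pair
    with an excluded start and an included end."""
    return first_pos - 1, last_pos


def region_length(start: int, end: int) -> int:
    """Number of bases/amino acids spanned by the (start, end) pair."""
    return end - start


# Ten bases at 1-based positions 101..110.
start, end = to_zero_based(101, 110)
```

A region with start equal to end spans no bases and is therefore empty.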
- overlaps_with(other: Region) -> bool [source]
Test if this Region overlaps with the other Region.
- Parameters:
other – another Region
- contains(other: Region) -> bool [source]
Test if this Region contains the other region.
An empty interval other is considered to be contained in self if it lies on either boundary of self.
- Parameters:
other – another Region
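The containment rule, including the boundary case for empty intervals, can be expressed with two comparisons. This is a sketch of the documented behavior on plain (start, end) pairs, not the gpsea source; the function name `contains_sketch` is hypothetical.

```python
def contains_sketch(start: int, end: int, other_start: int, other_end: int) -> bool:
    """True if the interval (other_start, other_end) lies within (start, end).

    Non-strict comparisons mean an empty interval sitting on either
    boundary of the outer interval counts as contained.
    """
    return start <= other_start and other_end <= end


inner = contains_sketch(10, 20, 12, 18)           # fully inside
left_boundary = contains_sketch(10, 20, 10, 10)   # empty, on the left boundary
right_boundary = contains_sketch(10, 20, 20, 20)  # empty, on the right boundary
partial = contains_sketch(10, 20, 5, 15)          # sticks out on the left
```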
- contains_pos(pos: int) -> bool [source]
Test if this Region contains the base or amino acid located at pos. Note that pos is a 1-based coordinate.
No bounds checking is done here: False is returned for a position that is, e.g., out of bounds of a contig, or for a negative pos. For Stranded entities, the position is assumed to be located on the strand of the GenomicRegion.
An empty region contains no positions.
- Parameters:
pos – an int with the 1-based position to check.
Returns: True if the Region contains the base/amino acid located at pos.
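Because the region uses an excluded start and an included end while pos is 1-based, the check reduces to start < pos <= end. The function below is a hypothetical sketch of that comparison on plain integers and is not taken from gpsea.

```python
def contains_pos_sketch(start: int, end: int, pos: int) -> bool:
    """True if the 1-based position `pos` falls into the region whose
    0-based start is excluded and whose end is included."""
    return start < pos <= end


# Region covering the 1-based positions 101..110.
first = contains_pos_sketch(100, 110, 101)
last = contains_pos_sketch(100, 110, 110)
before = contains_pos_sketch(100, 110, 100)
after = contains_pos_sketch(100, 110, 111)
# An empty region contains no positions.
in_empty = contains_pos_sketch(5, 5, 5)
```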
- distance_to(other: Region) -> int [source]
Calculate the number of bases present between this and the other region.
The distance is zero if the regions are adjacent or if they overlap. The distance is positive if this is upstream (left) of other and negative if this is located downstream (right) of other.
- Parameters:
other – other Region
Returns: an int with the distance.
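The sign convention of the distance can be illustrated on plain (start, end) pairs. The sketch below follows the description above and is not the gpsea source; `distance_sketch` is a hypothetical name.

```python
def distance_sketch(start: int, end: int, other_start: int, other_end: int) -> int:
    """Signed number of bases between two regions given as 0-based pairs."""
    if end <= other_start:
        # This region is upstream (left) of the other; adjacency yields 0.
        return other_start - end
    if other_end <= start:
        # This region is downstream (right) of the other.
        return other_end - start
    # The regions overlap.
    return 0


upstream = distance_sketch(0, 5, 8, 10)     # three bases lie in between
downstream = distance_sketch(8, 10, 0, 5)   # same gap, opposite direction
adjacent = distance_sketch(0, 5, 5, 10)
overlapping = distance_sketch(0, 8, 5, 10)
```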
- class gpsea.model.genome.GenomicRegion(contig: Contig, start: int, end: int, strand: Strand)[source]
Bases:
Transposable, Region
GenomicRegion represents a region located on a strand of a DNA contig.
- Parameters:
contig – the Contig that the region is located on, e.g. the contig named 15 or chrX.
start – 0-based (excluded) start coordinate of the region.
end – 0-based (included) end coordinate of the region.
strand – the Strand of the genomic region, Strand.POSITIVE for the forward strand or Strand.NEGATIVE for the reverse strand.
- overlaps_with(other: GenomicRegion) -> bool [source]
Test if this GenomicRegion overlaps with the other GenomicRegion.
Empty intervals are NOT considered as overlapping if they are at the boundaries of the other interval. However, two empty intervals with the same start and end coordinates are considered as overlapping. This method is symmetric, such that overlaps_with(a, b) = overlaps_with(b, a). Genomic regions located on different contigs do not overlap.
- Parameters:
other – other GenomicRegion
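The overlap rules for empty intervals and for different contigs can be sketched as below. This is an illustration of the documented behavior under the assumption that both regions are already on the same strand; `RegionSketch` and `overlaps_sketch` are hypothetical names, not gpsea classes or functions.

```python
from typing import NamedTuple


class RegionSketch(NamedTuple):
    # Hypothetical stand-in for a genomic region on a single strand.
    contig: str
    start: int
    end: int


def overlaps_sketch(a: RegionSketch, b: RegionSketch) -> bool:
    if a.contig != b.contig:
        # Regions on different contigs never overlap.
        return False
    a_empty = a.start == a.end
    b_empty = b.start == b.end
    if a_empty and b_empty:
        # Two empty intervals overlap only if they sit at the same coordinate.
        return a.start == b.start
    # Strict comparisons exclude empty intervals lying on a boundary.
    return a.start < b.end and b.start < a.end


region = RegionSketch('1', 5, 10)
on_boundary = RegionSketch('1', 5, 5)
inside = RegionSketch('1', 7, 7)
other_contig = RegionSketch('2', 5, 10)
```

Swapping the arguments of `overlaps_sketch` gives the same result in every case.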
- contains(other: GenomicRegion) -> bool [source]
Check if this GenomicRegion contains the other genomic region.
An empty interval other is considered to be contained in self if it lies on either boundary of self. A genomic region located on a different contig is never contained.
- Parameters:
other – other GenomicRegion.
- distance_to(other: GenomicRegion) -> int [source]
Calculate the number of bases present between this GenomicRegion and the other genomic region.
The distance is zero if the regions are adjacent or if they overlap. The distance is positive if this is upstream (left) of other and negative if this is located downstream (right) of other.
- Parameters:
other – other GenomicRegion
Returns: an int with the distance.
Raises: ValueError if the other region is on a different contig.
- class gpsea.model.genome.Strand(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
Strand is an enum to model positive and negative strands of double-stranded sequences, such as DNA.
- POSITIVE = ('+',)
The positive strand of a double-stranded sequence.
- NEGATIVE = ('-',)
The negative strand of a double-stranded sequence.
- class gpsea.model.genome.Stranded[source]
Bases:
object
Mixin for classes that are on double-stranded sequences.
- class gpsea.model.genome.Transposable[source]
Bases:
Stranded
Transposable elements know how to flip themselves to an arbitrary Strand of a sequence.
- gpsea.model.genome.transpose_coordinate(contig: Contig, coordinate: int) -> int [source]
Transpose a 0-based coordinate to the other strand of the contig.
- Parameters:
contig – contig to transpose the coordinate on.
coordinate – the coordinate to transpose.
Returns: an int with the transposed coordinate.
Raises: ValueError if the coordinate is out of contig bounds.
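The arithmetic behind transposition can be sketched on plain integers. A 0-based coordinate marks a boundary between bases, so the boundary that lies k bases from one end of the contig lies length - k bases from the other end. The block assumes this formula, which is not confirmed against the gpsea source; `transpose_coordinate_sketch` and `transpose_region_sketch` are hypothetical names.

```python
from typing import Tuple


def transpose_coordinate_sketch(contig_length: int, coordinate: int) -> int:
    """Flip a 0-based coordinate to the opposite strand (assumed formula)."""
    if not 0 <= coordinate <= contig_length:
        raise ValueError(
            f"Coordinate {coordinate} is out of bounds [0, {contig_length}]"
        )
    return contig_length - coordinate


def transpose_region_sketch(contig_length: int, start: int, end: int) -> Tuple[int, int]:
    """Flip a region: the transposed end becomes the new start and vice versa."""
    new_start = transpose_coordinate_sketch(contig_length, end)
    new_end = transpose_coordinate_sketch(contig_length, start)
    return new_start, new_end


# A 20-base region on a toy contig of 100 bases.
flipped = transpose_region_sketch(100, 10, 30)
restored = transpose_region_sketch(100, *flipped)
```

Under this formula the length of the region is preserved, and transposing twice returns the original coordinates.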