gpsea.model.genome package

gpsea.model.genome provides data classes to model GenomeBuild, Contig, and (genomic) regions (Region, GenomicRegion).

The classes support basic region arithmetic, such as finding intersections, overlaps, and distances between regions. Genomic regions are transposable: they can flip their coordinates between DNA strands.

The module provides GRCh37.p13 and GRCh38.p13, the two most commonly used human genome builds.

The classes are largely a port of the Svart library.

class gpsea.model.genome.GenomeBuild(identifier: GenomeBuildIdentifier, contigs: Iterable[Contig])[source]

Bases: object

GenomeBuild is a container for the genome_build_id and the contigs of the build.

The build supports retrieving a contig by various identifiers:


>>> from gpsea.model.genome import GRCh38
>>> chr1 = GRCh38.contig_by_name('1')  # by sequence name
>>> assert chr1 == GRCh38.contig_by_name('CM000663.2')    # by GenBank identifier
>>> assert chr1 == GRCh38.contig_by_name('NC_000001.11')  # by RefSeq accession
>>> assert chr1 == GRCh38.contig_by_name('chr1')    # by UCSC name
property identifier: str
property genome_build_id: GenomeBuildIdentifier
property contigs: Sequence[Contig]
contig_by_name(name: str) → Contig | None[source]

Get the contig with the given name.

The name can come in various formats:

  • sequence name, e.g. 1

  • GenBank accession, e.g. CM000663.2

  • RefSeq accession, e.g. NC_000001.11

  • UCSC name, e.g. chr1

Parameters:

name – a str with contig name.
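
The lookup can be pictured as one index that maps every known identifier to the same contig. The following is a minimal sketch and not GPSEA's implementation: `ToyContig` and `ToyBuild` are hypothetical stand-ins, and returning `None` for an unknown name is an assumption based on the `Contig | None` return type.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ToyContig:
    # Hypothetical stand-in for Contig; field names follow the documented constructor.
    name: str
    gb_acc: str
    refseq_name: str
    ucsc_name: str
    length: int


class ToyBuild:
    # Hypothetical stand-in for GenomeBuild.
    def __init__(self, contigs: Iterable[ToyContig]):
        self._by_name: Dict[str, ToyContig] = {}
        for contig in contigs:
            # Register the contig under each of its four identifiers.
            for key in (contig.name, contig.gb_acc, contig.refseq_name, contig.ucsc_name):
                self._by_name[key] = contig

    def contig_by_name(self, name: str) -> Optional[ToyContig]:
        # Assumption: an unknown name yields None instead of raising.
        return self._by_name.get(name)


chr1 = ToyContig("1", "CM000663.2", "NC_000001.11", "chr1", 248_956_422)
build = ToyBuild([chr1])
same = build.contig_by_name("chr1") is build.contig_by_name("NC_000001.11")
missing = build.contig_by_name("no-such-contig")
```

Here `same` is `True` because both names resolve to one object, and `missing` is `None`.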

class gpsea.model.genome.Contig(name: str, gb_acc: str, refseq_name: str, ucsc_name: str, length: int)[source]

Bases: Sized

Contig represents the identifiers and the length of a contiguous sequence of a genome assembly.

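The identifiers include:

  • name, e.g. 1

  • genbank_acc, e.g. CM000663.2

  • refseq_name, e.g. NC_000001.11

  • ucsc_name, e.g. chr1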

The length of a Contig represents the number of bases of the contig sequence.

You should not try to create a Contig on your own, but always get it from a GenomeBuild.

property name: str
property genbank_acc: str
property refseq_name: str
property ucsc_name: str
class gpsea.model.genome.GenomeBuildIdentifier(major_assembly: str, patch: str)[source]

Bases: object

Identifier of the genome build consisting of major_assembly (e.g. GRCh38) and patch (e.g. p13).

Parameters:
  • major_assembly – major assembly str

  • patch – assembly patch str

property major_assembly: str

Get major assembly, e.g. GRCh38.

property patch: str

Get assembly patch, e.g. p13.

property identifier

Get genome build identifier consisting of major assembly + patch, e.g. GRCh38.p13
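
A sketch of how the three properties relate, using a hypothetical `ToyBuildIdentifier`. Joining the parts with a dot is inferred from the `GRCh38.p13` example and not taken from the GPSEA source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyBuildIdentifier:
    # Hypothetical stand-in for GenomeBuildIdentifier.
    major_assembly: str
    patch: str

    @property
    def identifier(self) -> str:
        # Assumption: major assembly and patch are joined with a dot.
        return f"{self.major_assembly}.{self.patch}"


build_id = ToyBuildIdentifier(major_assembly="GRCh38", patch="p13")
print(build_id.identifier)
```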

class gpsea.model.genome.Region(start: int, end: int)[source]

Bases: Sized

Region represents a contiguous region/slice of a biological sequence, such as DNA, RNA, or protein.

The start and end represent 0-based coordinates of the region. The region has a length that corresponds to the number of spanned bases/amino acids.

Parameters:
  • start – 0-based (excluded) start coordinate of the region.

  • end – 0-based (included) end coordinate of the region.

property start: int

Get 0-based (excluded) start coordinate of the region.

property end: int

Get 0-based (included) end coordinate of the region.
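
One way to read these coordinates: with an excluded start and an included end, a region with start=10 and end=15 covers the 1-based positions 11 through 15, so its length is end - start. The helpers below are hypothetical and only illustrate that arithmetic.

```python
def spanned_positions(start: int, end: int) -> range:
    """1-based positions covered by a region with excluded start and included end."""
    return range(start + 1, end + 1)


def region_length(start: int, end: int) -> int:
    """Number of spanned units (bases or amino acids)."""
    return end - start


positions = list(spanned_positions(10, 15))
length = region_length(10, 15)
```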

overlaps_with(other: Region) → bool[source]

Test if this Region overlaps with the other Region.

Parameters:

other – another Region

contains(other: Region) → bool[source]

Test if this Region contains the other region.

An empty other is considered to be contained in self if it lies on either boundary of self.

Parameters:

other – another Region
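
The containment rule, including the boundary case for empty regions, can be sketched with two comparisons. `ToyRegion` is a hypothetical stand-in and the comparison is an assumption consistent with the description above, not GPSEA's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRegion:
    # Hypothetical stand-in for Region.
    start: int
    end: int

    def contains(self, other: "ToyRegion") -> bool:
        # Inclusive comparisons let an empty region on either boundary count as contained.
        return self.start <= other.start and other.end <= self.end


exon = ToyRegion(100, 200)
inner = exon.contains(ToyRegion(120, 150))
empty_on_left_boundary = exon.contains(ToyRegion(100, 100))
empty_on_right_boundary = exon.contains(ToyRegion(200, 200))
partial = exon.contains(ToyRegion(150, 250))
```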

contains_pos(pos: int) → bool[source]

Test if this Region contains the base or amino acid located at pos. Note that pos is a 1-based coordinate.

No bounds checking is done here: False is returned for a position that is, e.g., out of bounds of a contig, or for a negative pos. For Stranded entities, the position is assumed to be located on the strand of the GenomicRegion.

An empty region contains no positions.

Parameters:

pos – an int with 1-based position to check.

Returns: True if the Region contains the base/amino acid located at pos.
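
Because pos is 1-based while the region uses an excluded start and an included end, the check reduces to a single chained comparison. The free function below is a hypothetical sketch of that rule, not the library method.

```python
def toy_contains_pos(start: int, end: int, pos: int) -> bool:
    """Check a 1-based position against a region with excluded start and included end."""
    # An empty region (start == end) can never satisfy start < pos <= end.
    return start < pos <= end


first_base = toy_contains_pos(10, 15, 11)
before_region = toy_contains_pos(10, 15, 10)
last_base = toy_contains_pos(10, 15, 15)
in_empty_region = toy_contains_pos(10, 10, 10)
negative_pos = toy_contains_pos(0, 5, -1)
```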

distance_to(other: Region) → int[source]

Calculate the number of bases present between this and the other region.

The distance is zero if the regions are adjacent or if they overlap. The distance is positive if this is upstream (left) of other and negative if this is located downstream (right) of other.

Parameters:

other – other Region

Returns: an int with the distance.
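
The sign convention can be sketched as follows. `toy_distance` is a hypothetical helper that works on raw coordinates; it reproduces the documented behaviour for the cases shown and is not the library's implementation.

```python
def toy_distance(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Signed number of units between region a and region b."""
    if a_end <= b_start:
        # a is upstream (left) of b: positive, or zero when adjacent.
        return b_start - a_end
    if b_end <= a_start:
        # a is downstream (right) of b: negative, or zero when adjacent.
        return b_end - a_start
    # The regions overlap.
    return 0


upstream = toy_distance(0, 10, 15, 20)
downstream = toy_distance(15, 20, 0, 10)
adjacent = toy_distance(0, 10, 10, 20)
overlapping = toy_distance(0, 10, 5, 20)
```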

is_empty() bool[source]

Return True if the region is empty, i.e. it spans 0 units/bases/amino acids…

class gpsea.model.genome.GenomicRegion(contig: Contig, start: int, end: int, strand: Strand)[source]

Bases: Transposable, Region

GenomicRegion represents a region located on a strand of a DNA contig.

Parameters:
  • contig – the Contig on which the region is located, e.g. the contig named 15 or chrX.

  • start – 0-based (excluded) start coordinate of the region.

  • end – 0-based (included) end coordinate of the region.

  • strand – the Strand of the genomic region, Strand.POSITIVE for the forward strand or Strand.NEGATIVE for the reverse strand.

property contig: Contig
start_on_strand(other: Strand) → int[source]
end_on_strand(other: Strand) → int[source]
property strand: Strand
with_strand(other: Strand)[source]
overlaps_with(other: GenomicRegion) → bool[source]

Test if this GenomicRegion overlaps with the other GenomicRegion.

Empty intervals are NOT considered as overlapping if they are at the boundaries of the other interval. However, two empty intervals with the same start and end coordinates are considered as overlapping. This method is symmetric, such that overlaps_with(a, b) = overlaps_with(b, a). Genomic regions located on different contigs do not overlap.

Parameters:

other – other GenomicRegion
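
The rules above can be sketched with a hypothetical `ToyGenomicRegion`. Two simplifications are assumed: the contig is a plain name instead of a Contig, and both regions are taken to be on the same strand. The comparison is consistent with the documented cases but is not GPSEA's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyGenomicRegion:
    contig: str  # simplification: a contig name instead of a Contig object
    start: int
    end: int

    def overlaps_with(self, other: "ToyGenomicRegion") -> bool:
        if self.contig != other.contig:
            # Regions on different contigs never overlap.
            return False
        if (self.start, self.end) == (other.start, other.end):
            # Identical coordinates overlap, including two identical empty regions.
            return True
        # Strict comparisons exclude adjacent regions and empty regions on a boundary.
        return self.start < other.end and other.start < self.end


a = ToyGenomicRegion("1", 100, 200)
b = ToyGenomicRegion("1", 150, 250)
empty_at_boundary = ToyGenomicRegion("1", 100, 100)
other_contig = ToyGenomicRegion("2", 100, 200)
symmetric = a.overlaps_with(b) == b.overlaps_with(a)
```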

contains(other: GenomicRegion) → bool[source]

Check if this GenomicRegion contains the other genomic region.

An empty other is considered to be contained in self if it lies on either boundary of self. A genomic region located on a different contig is never contained.

Parameters:

other – other GenomicRegion.

distance_to(other: GenomicRegion) → int[source]

Calculate the number of bases present between this GenomicRegion and the other genomic region.

The distance is zero if the regions are adjacent or if they overlap. The distance is positive if this is upstream (left) of other and negative if this is located downstream (right) of other.

Parameters:

other – other GenomicRegion

Returns: an int with the distance.

Raises: ValueError if the other region is on a different contig.

class gpsea.model.genome.Strand(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Strand is an enum to model positive and negative strands of double-stranded sequences, such as DNA.

POSITIVE = ('+',)

The positive strand of a double-stranded sequence.

NEGATIVE = ('-',)

The negative strand of a double-stranded sequence.

property symbol: str

Get a str with the strand’s sign.

is_positive()[source]

Return True if this is the positive strand.

is_negative()[source]

Return True if this is the negative strand.

opposite()[source]

Get the opposite strand of the current strand.
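
An enum with this surface can be sketched as below. `ToyStrand` is a hypothetical stand-in that mirrors the documented members and methods; how GPSEA stores the symbol internally is an assumption.

```python
import enum


class ToyStrand(enum.Enum):
    # The tuple values mirror the documented members; Enum unpacks them into __init__.
    POSITIVE = ("+",)
    NEGATIVE = ("-",)

    def __init__(self, symbol: str):
        self._symbol = symbol

    @property
    def symbol(self) -> str:
        return self._symbol

    def is_positive(self) -> bool:
        return self is ToyStrand.POSITIVE

    def is_negative(self) -> bool:
        return self is ToyStrand.NEGATIVE

    def opposite(self) -> "ToyStrand":
        return ToyStrand.NEGATIVE if self.is_positive() else ToyStrand.POSITIVE


strand = ToyStrand.POSITIVE
flipped = strand.opposite()
```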

class gpsea.model.genome.Stranded[source]

Bases: object

Mixin for classes that are on double-stranded sequences.

abstract property strand: Strand
class gpsea.model.genome.Transposable[source]

Bases: Stranded

Transposable elements know how to flip themselves to an arbitrary Strand of a sequence.

abstract with_strand(other: Strand)[source]
to_opposite_strand()[source]
to_positive_strand()[source]
to_negative_strand()[source]
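
A plausible reading of this mixin is that only with_strand is abstract and the three convenience methods delegate to it. The sketch below makes that assumption explicit; `ToyTransposable` and `ToyStrandedRegion` are hypothetical, strands are simplified to the strings "+" and "-", and the contig is reduced to its length.

```python
import abc


class ToyTransposable(abc.ABC):
    @property
    @abc.abstractmethod
    def strand(self) -> str:
        ...

    @abc.abstractmethod
    def with_strand(self, other: str):
        ...

    # Assumption: the convenience methods are thin wrappers around with_strand.
    def to_opposite_strand(self):
        return self.with_strand("-" if self.strand == "+" else "+")

    def to_positive_strand(self):
        return self.with_strand("+")

    def to_negative_strand(self):
        return self.with_strand("-")


class ToyStrandedRegion(ToyTransposable):
    def __init__(self, contig_length: int, start: int, end: int, strand: str):
        self.contig_length = contig_length
        self.start = start
        self.end = end
        self._strand = strand

    @property
    def strand(self) -> str:
        return self._strand

    def with_strand(self, other: str) -> "ToyStrandedRegion":
        if other == self._strand:
            return self
        # Flipping swaps the roles of start and end, measured from the other contig end.
        n = self.contig_length
        return ToyStrandedRegion(n, n - self.end, n - self.start, other)


region = ToyStrandedRegion(contig_length=1000, start=100, end=200, strand="+")
on_negative = region.to_negative_strand()
round_trip = on_negative.to_positive_strand()
```
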
gpsea.model.genome.transpose_coordinate(contig: Contig, coordinate: int) → int[source]

Transpose a 0-based coordinate to the other strand of the contig.

Parameters:
  • contig – contig to transpose the coordinate on.

  • coordinate – the coordinate to transpose.

Returns: an int with the transposed coordinate.

Raises: ValueError if the coordinate is out of contig bounds.
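
For a 0-based coordinate, transposition amounts to measuring the same point from the other end of the contig. The sketch below uses a hypothetical `toy_transpose_coordinate` that takes the contig length directly. Treating the valid range as 0 through the contig length, inclusive, is an assumption.

```python
def toy_transpose_coordinate(contig_length: int, coordinate: int) -> int:
    """Transpose a 0-based coordinate to the opposite strand of a contig."""
    # Assumption: valid 0-based coordinates run from 0 to contig_length, inclusive.
    if not 0 <= coordinate <= contig_length:
        raise ValueError(
            f"Coordinate {coordinate} is out of bounds [0, {contig_length}]"
        )
    return contig_length - coordinate


transposed = toy_transpose_coordinate(1000, 200)
back_again = toy_transpose_coordinate(1000, transposed)
```

Applying the function twice returns the original coordinate, which is a quick sanity check for any implementation of this operation.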