gpsea.preprocessing package

gpsea.preprocessing.configure_caching_patient_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, cache_dir: str | None = None, variant_fallback: str = 'VEP', timeout: float = 30.0) PhenopacketPatientCreator[source]

A convenience function for configuring a caching PhenopacketPatientCreator.

To create the patient creator, we need hpo-toolkit’s representation of HPO. Other options are optional.

Parameters:
  • hpo – an HPO instance.

  • genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.

  • validation_runner – an instance of the validation runner.

  • cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the cache location should be determined as described in get_cache_dir_path(). In any case, the directory will be created if it does not exist (including non-existing parents).

  • variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).

  • timeout – timeout in seconds for the REST APIs

gpsea.preprocessing.configure_patient_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, variant_fallback: str = 'VEP', validation: str = 'lenient', timeout: float = 30.0) PhenopacketPatientCreator[source]

A convenience function for configuring a non-caching PhenopacketPatientCreator.

To create the patient creator, we need hpo-toolkit’s representation of HPO. Other options are optional.

Parameters:
  • hpo – an HPO instance.

  • genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.

  • validation_runner – an instance of the validation runner.

  • variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).

  • timeout – timeout in seconds for the REST APIs

gpsea.preprocessing.configure_caching_cohort_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, cache_dir: str | None = None, variant_fallback: str = 'VEP', timeout: float = 30.0) CohortCreator[Phenopacket][source]

A convenience function for configuring a caching CohortCreator for phenopackets.

To create the cohort creator, we need hpo-toolkit’s representation of HPO. Other options are optional.

Parameters:
  • hpo – an HPO instance.

  • genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.

  • validation_runner – an instance of the validation runner.

  • cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the cache location should be determined as described in get_cache_dir_path(). In any case, the directory will be created if it does not exist (including non-existing parents).

  • variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).

  • timeout – timeout in seconds for the REST APIs
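Configuring a caching cohort creator is usually followed by loading a folder of phenopackets. The sketch below combines the two steps using only the signatures documented on this page. The helper name `load_cohort` and the choice of the `'lenient'` validation policy are illustrative assumptions, and `hpo` is assumed to be a MinimalOntology loaded beforehand with hpo-toolkit. The gpsea import is deferred into the function body, so defining the helper needs neither the package nor network access.

```python
def load_cohort(hpo, pp_directory, cache_dir=None):
    """Hypothetical helper: configure a caching cohort creator and load a folder.

    `hpo` is assumed to be a MinimalOntology obtained separately.
    """
    # Imported lazily; calling the helper requires gpsea to be installed.
    from gpsea.preprocessing import (
        configure_caching_cohort_creator,
        load_phenopacket_folder,
    )

    cohort_creator = configure_caching_cohort_creator(
        hpo,
        genome_build="GRCh38.p13",
        cache_dir=cache_dir,  # None lets the library choose the cache location
    )
    # The loader returns a tuple with the cohort and the validation result.
    cohort, validation = load_phenopacket_folder(
        pp_directory,
        cohort_creator,
        validation_policy="lenient",
    )
    return cohort, validation
```

Running the helper contacts the remote APIs on cache misses, so the first call on a new cohort can be slow.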

gpsea.preprocessing.configure_cohort_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, variant_fallback: str = 'VEP', timeout: float = 30.0) CohortCreator[Phenopacket][source]

A convenience function for configuring a non-caching CohortCreator for phenopackets.

To create the cohort creator, we need hpo-toolkit’s representation of HPO. Other options are optional.

Parameters:
  • hpo – an HPO instance.

  • genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.

  • validation_runner – an instance of the validation runner.

  • variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).

  • timeout – timeout in seconds for the VEP API

gpsea.preprocessing.configure_default_protein_metadata_service(protein_source: Literal['UNIPROT'] = 'UNIPROT', cache_dir: str | None = None, timeout: float = 30.0) ProteinMetadataService[source]

Create the default protein metadata service. The service caches the protein metadata in the current working directory under .gpsea_cache/protein_cache and reaches out to the UniProt REST API if a cache entry is missing.

Parameters:
  • protein_source – a str with the code of the protein data source (currently accepting just UNIPROT).

  • cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the data should be cached as described by get_cache_dir_path() function. In any case, the directory will be created if it does not exist (including any non-existing parents).

  • timeout – timeout in seconds for the REST APIs.

gpsea.preprocessing.configure_protein_metadata_service(cache_dir: str | None = None, timeout: float = 30.0) ProteinMetadataService[source]

Configure default protein metadata service.

The service will cache the responses in cache_dir and reach out to UNIPROT API for cache misses.

Parameters:
  • cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the data should be cached in .gpsea_cache folder in the current working directory. In any case, the directory will be created if it does not exist (including any non-existing parents).

  • timeout – timeout in seconds for the REST APIs.
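The cache_dir behavior described above (fall back to a .gpsea_cache folder when None is given, and create the folder together with any missing parents) can be sketched with the standard library alone. This is not the library's code: `resolve_cache_dir` and its `base` argument are hypothetical names, with `base` standing in for the current working directory.

```python
import os
import tempfile
from typing import Optional


def resolve_cache_dir(cache_dir: Optional[str] = None, base: Optional[str] = None) -> str:
    """Return the cache folder, creating it (and missing parents) if needed.

    Illustrative only: `base` stands in for the current working directory.
    """
    if cache_dir is None:
        base = os.getcwd() if base is None else base
        cache_dir = os.path.join(base, ".gpsea_cache")
    # exist_ok=True makes the call a no-op when the folder already exists.
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


with tempfile.TemporaryDirectory() as tmp:
    default_dir = resolve_cache_dir(base=tmp)
    nested_dir = resolve_cache_dir(os.path.join(tmp, "a", "b", "cache"))
    default_exists = os.path.isdir(default_dir)
    nested_exists = os.path.isdir(nested_dir)
    default_name = os.path.basename(default_dir)
```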

class gpsea.preprocessing.VariantCoordinateFinder[source]

Bases: Generic[T]

abstract find_coordinates(item: T) VariantCoordinates | None[source]

Try to find VariantCoordinates from an item of some sort.

The variant coordinates may not be available all the time, and None may be returned.

Raises:

ValueError – if there is an error of any kind.

class gpsea.preprocessing.FunctionalAnnotator[source]

Bases: object

abstract annotate(variant_coordinates: VariantCoordinates) Sequence[TranscriptAnnotation][source]

Compute functional annotations for the variant coordinates. The annotations can be empty.

Returns:

a sequence of transcript annotations

Raises:

ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.

class gpsea.preprocessing.ImpreciseSvFunctionalAnnotator[source]

Bases: object

Annotator for large SVs that lack the exact breakpoint coordinates.

abstract annotate(item: ImpreciseSvInfo) Sequence[TranscriptAnnotation][source]

Compute functional annotations for a large SV.

Returns:

a sequence of transcript annotations

Raises:

ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.

class gpsea.preprocessing.ProteinMetadataService[source]

Bases: object

A service for obtaining annotations for a given protein accession ID.

The annotations include elements of the ProteinMetadata class.

abstract annotate(protein_id: str) ProteinMetadata[source]

Prepare ProteinMetadata for a protein with given protein_id accession ID.

Parameters:

protein_id (string) – An accession ID str (e.g. NP_001027558.1)

Returns:

a ProteinMetadata container with the protein metadata

Return type:

ProteinMetadata

Raises:

ValueError

class gpsea.preprocessing.PatientCreator[source]

Bases: Generic[T], Auditor[T, Patient]

PatientCreator can create a Patient from some input T.

PatientCreator is an Auditor, hence the input is sanitized and any errors are reported to the caller.

class gpsea.preprocessing.CohortCreator(patient_creator: PatientCreator[T])[source]

Bases: Generic[T], Auditor[Iterable[T], Cohort]

CohortCreator creates a cohort from an iterable of some T where T represents a cohort member.

process(inputs: Iterable[T], notepad: Notepad) Cohort[source]

Audit and sanitize the data, record the issues to the notepad and return the sanitized data.

class gpsea.preprocessing.PhenopacketVariantCoordinateFinder(build: GenomeBuild, hgvs_coordinate_finder: VariantCoordinateFinder[str])[source]

Bases: VariantCoordinateFinder[GenomicInterpretation]

PhenopacketVariantCoordinateFinder figures out VariantCoordinates and Genotype from GenomicInterpretation element of Phenopacket Schema.

Parameters:
  • build – genome build to use in VariantCoordinates

  • hgvs_coordinate_finder – the coordinate finder to use for parsing HGVS expressions

find_coordinates(item: GenomicInterpretation) VariantCoordinates | None[source]

Tries to extract the variant coordinates from the GenomicInterpretation.

Parameters:

item (GenomicInterpretation) – a genomic interpretation element from Phenopacket Schema

Returns:

variant coordinates

Return type:

Optional[VariantCoordinates]

class gpsea.preprocessing.PhenopacketPatientCreator(build: GenomeBuild, phenotype_creator: PhenotypeCreator, functional_annotator: FunctionalAnnotator, imprecise_sv_functional_annotator: ImpreciseSvFunctionalAnnotator, hgvs_coordinate_finder: VariantCoordinateFinder[str], assume_karyotypic_sex: bool = True)[source]

Bases: PatientCreator[Phenopacket]

PhenopacketPatientCreator transforms Phenopacket into Patient.

process(pp: Phenopacket, notepad: Notepad) Patient[source]

Creates a Patient from the data in a given Phenopacket.

Parameters:
  • pp (Phenopacket) – A Phenopacket object

  • notepad (Notepad) – notepad to write down the issues

Returns:

A Patient object

Return type:

Patient

gpsea.preprocessing.load_phenopacket_folder(pp_directory: str, cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['none', 'lenient', 'strict'] = 'none') Tuple[Cohort, PreprocessingValidationResult][source]

Load phenopacket JSON files from a directory, validate the patient data, and assemble the patients into a cohort.

A file with .json suffix is considered to be a JSON file and all JSON files are assumed to be phenopackets. Non-JSON files are ignored.

Parameters:
  • pp_directory – path to a folder with phenopacket JSON files. An error is raised if the path does not point to a directory with at least one phenopacket.

  • cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.

  • validation_policy – a str with the validation policy. The value must be one of {‘none’, ‘lenient’, ‘strict’}

Returns:

a tuple with the cohort and the validation result.
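The file selection rule above (only files with the .json suffix are considered, everything else is ignored, and an empty result is an error) can be sketched as follows. This is a standard-library illustration, not the library's implementation. The function name `find_phenopacket_paths` is hypothetical, and ValueError is an assumed choice because the text does not name the exception type.

```python
import os
import tempfile
from typing import List


def find_phenopacket_paths(pp_directory: str) -> List[str]:
    """List files with the .json suffix; everything else is ignored."""
    if not os.path.isdir(pp_directory):
        raise ValueError(f"{pp_directory} is not a directory")
    paths = sorted(
        os.path.join(pp_directory, name)
        for name in os.listdir(pp_directory)
        if name.endswith(".json") and os.path.isfile(os.path.join(pp_directory, name))
    )
    if not paths:
        # Assumed error type; the documentation only says an error is raised.
        raise ValueError(f"No phenopacket JSON files found in {pp_directory}")
    return paths


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.json", "b.json", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("{}")
    found = [os.path.basename(p) for p in find_phenopacket_paths(tmp)]
```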

gpsea.preprocessing.load_phenopacket_files(pp_files: Iterator[str], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['none', 'lenient', 'strict'] = 'none') Tuple[Cohort, PreprocessingValidationResult][source]

Load phenopacket JSON files, validate the data, and assemble into a Cohort.

Phenopackets are validated, assembled into a cohort, and the validation results are reported back.

Parameters:
  • pp_files – an iterator with paths to phenopacket JSON files.

  • cohort_creator – cohort creator for turning a phenopacket collection into a Cohort.

  • validation_policy – a str with the validation policy. The value must be one of {‘none’, ‘lenient’, ‘strict’}

Returns:

a tuple with the cohort and the validation result.

gpsea.preprocessing.load_phenopackets(phenopackets: Iterator[Phenopacket], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['none', 'lenient', 'strict'] = 'none') Tuple[Cohort, PreprocessingValidationResult][source]

Validate the phenopackets and assemble into a Cohort.

The results of the validation are reported back.

Parameters:
  • phenopackets – an iterator over the Phenopacket objects to process.

  • cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.

  • validation_policy – a str with the validation policy. The value must be one of {‘none’, ‘lenient’, ‘strict’}

Returns:

a tuple with the cohort and the validation result.

class gpsea.preprocessing.TranscriptCoordinateService[source]

Bases: object

TranscriptCoordinateService gets transcript (tx) coordinates for a given transcript ID.

abstract fetch(tx: str | TranscriptInfoAware) TranscriptCoordinates[source]

Get tx coordinates for a tx ID or an entity that knows about the tx ID.

The method will raise an exception in case of an issue.

Parameters:

tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (e.g. TranscriptAnnotation).

Returns: the transcript coordinates.

class gpsea.preprocessing.GeneCoordinateService[source]

Bases: object

GeneCoordinateService gets transcript (Tx) coordinates for a gene ID.

abstract fetch_for_gene(gene: str) Sequence[TranscriptCoordinates][source]

Get Tx coordinates for a gene ID.

The method will raise an exception in case of an issue.

Parameters:

gene – a str with gene ID (e.g. HGNC:3603)

Returns:

a sequence of transcript coordinates for the gene.

Return type:

Sequence[TranscriptCoordinates]

class gpsea.preprocessing.PhenotypeCreator(hpo: MinimalOntology, validator: ValidationRunner)[source]

Bases: Auditor[Iterable[Tuple[str, bool]], Sequence[Phenotype]]

PhenotypeCreator validates the input phenotype features and prepares them for the downstream analysis.

The creator expects an iterable with tuples that contain a CURIE and status. The CURIE must correspond to a HPO term identifier and status must be a bool.

The creator prunes CURIEs with simple errors such as malformed CURIE or non-HPO terms and validates the rest with HPO toolkit’s validator.

process(inputs: Iterable[Tuple[str, bool]], notepad: Notepad) Sequence[Phenotype][source]

Map CURIEs and observation states into phenotypes and validate the requirements.

Parameters:
  • inputs (Iterable[Tuple[str, bool]]) – 2-element tuples with a CURIE str and observation state as bool (True if phenotype was observed).

  • notepad – Notepad

Returns:

A sequence of phenotypes
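The pruning step described above, which drops malformed CURIEs and non-HPO terms before the remaining terms are validated, can be sketched with two regular expressions. This is a simplified stand-in, not the PhenotypeCreator implementation: the patterns, the `prune_curies` name, and the issue messages are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Assumed shapes: a generic CURIE is `prefix:local_id`, and an HPO CURIE
# is the `HP` prefix followed by a colon and seven digits.
CURIE = re.compile(r"^[A-Za-z_][\w.]*:\S+$")
HPO_CURIE = re.compile(r"^HP:\d{7}$")


def prune_curies(
    inputs: Iterable[Tuple[str, bool]],
) -> Tuple[List[Tuple[str, bool]], List[str]]:
    """Split the inputs into usable (CURIE, status) pairs and issue messages."""
    kept: List[Tuple[str, bool]] = []
    issues: List[str] = []
    for curie, is_observed in inputs:
        if not CURIE.match(curie):
            issues.append(f"Malformed CURIE: {curie!r}")
        elif not HPO_CURIE.match(curie):
            issues.append(f"Not an HPO term: {curie!r}")
        else:
            kept.append((curie, is_observed))
    return kept, issues


kept, issues = prune_curies(
    [("HP:0001250", True), ("OMIM:154700", False), ("HP0001250", True)]
)
```

In the real class the issues would go to the Notepad rather than into a list of strings.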

class gpsea.preprocessing.ProteinAnnotationCache(datadir: str)[source]

Bases: object

A class that stores or retrieves ProteinMetadata objects using pickle format.

get_annotations(protein_id: str): Searches a given data directory for a pickle file with given ID and returns ProteinMetadata

store_annotations(protein_id: str, annotation: ProteinMetadata): Creates a pickle file with given ID and stores the given ProteinMetadata into that file

get_annotations(protein_id: str) ProteinMetadata | None[source]

Searches a given data directory for a pickle file with given ID and returns ProteinMetadata from file. Returns None if no file is found.

Parameters:

protein_id (string) – The protein_id associated with the desired ProteinMetadata

store_annotations(protein_id: str, annotation: ProteinMetadata)[source]

Creates a pickle file with the given protein id in the file name and writes the given ProteinMetadata into the file for storage.

Parameters:
  • protein_id (string) – The protein_id associated with the ProteinMetadata

  • annotation (ProteinMetadata) – The ProteinMetadata object that will be stored under the given protein id
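The get/store contract described above (one pickle file per identifier, None on a cache miss) can be sketched in a few lines. The `PickleCache` class, its method names, and the `.pickle` file suffix are illustrative assumptions; the library's file naming may differ.

```python
import os
import pickle
import tempfile
from typing import Any, Optional


class PickleCache:
    """Toy file cache: one pickle file per identifier inside `datadir`."""

    def __init__(self, datadir: str):
        self._datadir = datadir

    def _path_for(self, key: str) -> str:
        # Assumed naming scheme: the identifier plus a `.pickle` suffix.
        return os.path.join(self._datadir, f"{key}.pickle")

    def get(self, key: str) -> Optional[Any]:
        path = self._path_for(key)
        if not os.path.isfile(path):
            return None  # cache miss
        with open(path, "rb") as fh:
            return pickle.load(fh)

    def store(self, key: str, value: Any) -> None:
        with open(self._path_for(key), "wb") as fh:
            pickle.dump(value, fh)


with tempfile.TemporaryDirectory() as tmp:
    cache = PickleCache(tmp)
    miss = cache.get("NP_001027558.1")
    cache.store("NP_001027558.1", {"label": "example"})
    hit = cache.get("NP_001027558.1")
```

Unpickling executes code embedded in the file, so a cache folder of this kind should only hold files written by a trusted process.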

class gpsea.preprocessing.ProtCachingMetadataService(cache: ProteinAnnotationCache, fallback: ProteinMetadataService)[source]

Bases: ProteinMetadataService

A class that retrieves ProteinMetadata from the cache if it exists or runs the fallback ProteinMetadataService if it does not exist.

annotate(protein_id: str): Gets metadata and returns ProteinMetadata for given protein ID

annotate(protein_id: str) ProteinMetadata[source]

Gets metadata for given protein ID

Parameters:

protein_id (string) – A protein ID

Returns:

A ProteinMetadata object

Return type:

ProteinMetadata
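The cache-then-fallback behavior can be sketched with an in-memory dictionary standing in for the cache and a plain function standing in for the fallback service. The names `CachingLookup` and `slow_lookup` are hypothetical, and storing the fallback result in the cache after a miss is an assumption about the behavior rather than something stated above.

```python
from typing import Callable, Dict, List


class CachingLookup:
    """Return a cached value when present, otherwise ask the fallback."""

    def __init__(self, fallback: Callable[[str], str]):
        self._cache: Dict[str, str] = {}
        self._fallback = fallback

    def annotate(self, key: str) -> str:
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        value = self._fallback(key)
        # Assumed behavior: remember the result so the next call is a cache hit.
        self._cache[key] = value
        return value


calls: List[str] = []


def slow_lookup(protein_id: str) -> str:
    """Stand-in for a remote service; records every invocation."""
    calls.append(protein_id)
    return f"metadata for {protein_id}"


service = CachingLookup(slow_lookup)
first = service.annotate("NP_037407.4")
second = service.annotate("NP_037407.4")
```

The second call is answered from the cache, so the fallback runs only once.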

class gpsea.preprocessing.UniprotProteinMetadataService(timeout: float = 30.0)[source]

Bases: ProteinMetadataService

A class that creates ProteinMetadata objects from data found with the UniProt REST API. More info on the UniProt REST API is in the Programmatic access section.

static parse_uniprot_json(payload: Mapping[str, Any], protein_id: str) ProteinMetadata[source]

Try to extract ProteinMetadata corresponding to protein_id from the Uniprot JSON payload.

Parameters:
  • payload – a JSON object corresponding to Uniprot response

  • protein_id – a str with the accession of the protein of interest

Returns:

a complete instance of ProteinMetadata

Raises:

ValueError

annotate(protein_id: str) ProteinMetadata[source]

Get metadata for given protein ID. This class specifically only works with a RefSeq database ID (e.g. NP_037407.4).

Parameters:

protein_id (string) – A protein ID

Returns:

A ProteinMetadata object with the protein metadata.

Return type:

ProteinMetadata

class gpsea.preprocessing.VepFunctionalAnnotator(include_computational_txs: bool = False, timeout: float = 10.0)[source]

Bases: FunctionalAnnotator

A FunctionalAnnotator that uses Variant Effect Predictor (VEP) REST API to do functional variant annotation.

Parameters:
  • include_computational_txs (bool) – Include computational transcripts, such as RefSeq XM_ transcripts.

  • timeout (float) – Timeout in seconds

NONCODING_EFFECTS = {VariantEffect.DOWNSTREAM_GENE_VARIANT, VariantEffect.FIVE_PRIME_UTR_VARIANT, VariantEffect.INTERGENIC_VARIANT, VariantEffect.INTRON_VARIANT, VariantEffect.NON_CODING_TRANSCRIPT_EXON_VARIANT, VariantEffect.NON_CODING_TRANSCRIPT_VARIANT, VariantEffect.SPLICE_ACCEPTOR_VARIANT, VariantEffect.SPLICE_DONOR_5TH_BASE_VARIANT, VariantEffect.SPLICE_DONOR_VARIANT, VariantEffect.SPLICE_POLYPYRIMIDINE_TRACT_VARIANT, VariantEffect.THREE_PRIME_UTR_VARIANT, VariantEffect.UPSTREAM_GENE_VARIANT}

Non-coding variant effects where we do not complain if the functional annotation lacks the protein effects.

annotate(variant_coordinates: VariantCoordinates) Sequence[TranscriptAnnotation][source]

Perform functional annotation using Variant Effect Predictor (VEP) REST API.

Parameters:

variant_coordinates (VariantCoordinates) – A VariantCoordinates object

Returns:

A sequence of transcript annotations for the variant coordinates

Return type:

Sequence[TranscriptAnnotation]

Raises:

ValueError – if VEP times out or does not return a response or if the response is not formatted as we expect.

process_response(variant_key: str, response: Mapping[str, Any]) Sequence[TranscriptAnnotation][source]

fetch_response(variant_coordinates: VariantCoordinates) Mapping[str, Any][source]

Get a dict with the response from the VEP REST API.

Parameters:

variant_coordinates – a query VariantCoordinates.

static format_coordinates_for_vep_query(vc: VariantCoordinates) str[source]

Converts the 0-based VariantCoordinates to ones that will be interpreted correctly by VEP.

Example - an insertion/duplication of G after the given G at coordinate 3:

  1 2 3 4 5
  A C G T A

  0-based: 2 3 G GG
  1-based: 3 G GG
  VEP: 4 3 - G

Parameters:

vc (VariantCoordinates) – A VariantCoordinates object

Returns:

The variant coordinates formatted to work with VEP

Return type:

string
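The worked example above can be reproduced with a small conversion routine. This is a sketch, not the method's actual code: the `to_vep_region` name, its flat argument list (instead of a VariantCoordinates object), and the `chrom:start-end/allele` output layout are assumptions. It relies on the VEP convention that an insertion is written with start = end + 1 and that an empty allele is written as `-`.

```python
def to_vep_region(chrom: str, start: int, end: int, ref: str, alt: str) -> str:
    """Format 0-based, half-open coordinates as `chrom:start-end/allele`.

    Hypothetical helper; the output string layout is an assumption.
    """
    # Count the bases shared by the start of REF and ALT (e.g. the anchor base).
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1
    # Convert the 0-based start to 1-based, then skip the shared bases.
    # For an insertion this yields start = end + 1.
    vep_start = start + 1 + shared
    # An empty ALT remainder (a deletion) is written as "-".
    allele = alt[shared:] or "-"
    return f"{chrom}:{vep_start}-{end}/{allele}"


insertion = to_vep_region("1", 2, 3, "G", "GG")
snv = to_vep_region("1", 2, 3, "G", "T")
deletion = to_vep_region("1", 2, 4, "GT", "G")
```

The insertion case matches the example: the 0-based pair (2, 3) with G>GG becomes start 4, end 3, and inserted allele G.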

class gpsea.preprocessing.VariantAnnotationCache(datadir: str)[source]

Bases: object

A class that stores or retrieves the transcript annotations of a variant using pickle format.

get_annotations(variant_coordinates: VariantCoordinates): Searches a given data directory for a pickle file with variant coordinates and returns the stored annotations

store_annotations(variant_coordinates: VariantCoordinates, annotations: Sequence[TranscriptAnnotation]): Creates a pickle file with variant coordinates and stores the given annotations into that file

get_annotations(variant_coordinates: VariantCoordinates) Sequence[TranscriptAnnotation] | None[source]

Searches a given data directory for a pickle file with given variant coordinates and returns the transcript annotations from the file. Returns None if no file is found.

Parameters:

variant_coordinates (VariantCoordinates) – The variant_coordinates associated with the desired annotations

store_annotations(variant_coordinates: VariantCoordinates, annotations: Sequence[TranscriptAnnotation])[source]

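Creates a pickle file with the given variant coordinates in the file name and writes the given transcript annotations into the file for storage.

Parameters:
  • variant_coordinates (VariantCoordinates) – The variant_coordinates associated with the annotations

  • annotations (Sequence[TranscriptAnnotation]) – The transcript annotations that will be stored under the given variant coordinates

class gpsea.preprocessing.VarCachingFunctionalAnnotator(cache: VariantAnnotationCache, fallback: FunctionalAnnotator)[source]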

Bases: FunctionalAnnotator

A class that retrieves the transcript annotations from the cache if they exist or runs the fallback FunctionalAnnotator if they do not.

annotate(variant_coordinates: VariantCoordinates): Gets data and returns the transcript annotations for given variant coordinates

static with_cache_folder(fpath_cache_dir: str, fallback: FunctionalAnnotator)[source]

Create caching functional annotator that will store the data in fpath_cache_dir and use fallback to annotate the missing variants.

annotate(variant_coordinates: VariantCoordinates) Sequence[TranscriptAnnotation][source]

Gets the transcript annotations for given variant coordinates

Parameters:

variant_coordinates (VariantCoordinates) – A VariantCoordinates object

Returns:

A sequence of transcript annotations

Return type:

Sequence[TranscriptAnnotation]

class gpsea.preprocessing.VVHgvsVariantCoordinateFinder(genome_build: GenomeBuild, timeout: int = 30)[source]

Bases: VariantCoordinateFinder[str]

VVHgvsVariantCoordinateFinder uses Variant Validator’s REST API to build VariantCoordinates from an HGVS string.

The finder takes an HGVS str (e.g. NM_005912.3:c.253A>G) and extracts the variant coordinates from the response.

Parameters:
  • genome_build – the genome build to use to construct VariantCoordinates

  • timeout – the REST API request timeout

find_coordinates(item: str) VariantCoordinates | None[source]

Extracts variant coordinates from an HGVS string using Variant Validator’s REST API.

Parameters:

item – an HGVS string

Returns:

variant coordinates

class gpsea.preprocessing.VVMultiCoordinateService(genome_build: GenomeBuild, timeout: float = 30.0)[source]

Bases: TranscriptCoordinateService, GeneCoordinateService

VVMultiCoordinateService uses the Variant Validator REST API to fetch transcript coordinates for both a gene ID and a specific transcript ID.

Parameters:
  • genome_build – the genome build for constructing the transcript coordinates.

  • timeout – a positive float with the REST API timeout in seconds.

fetch(tx: str | TranscriptInfoAware) TranscriptCoordinates[source]

Get tx coordinates for a tx ID or an entity that knows about the tx ID.

The method will raise an exception in case of an issue.

Parameters:

tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (e.g. TranscriptAnnotation).

Returns: the transcript coordinates.

fetch_for_gene(gene: str) Sequence[TranscriptCoordinates][source]

Get Tx coordinates for a gene ID.

The method will raise an exception in case of an issue.

Parameters:

gene – a str with gene ID (e.g. HGNC:3603)

Returns:

a sequence of transcript coordinates for the gene.

Return type:

Sequence[TranscriptCoordinates]

get_response(tx_id: str)[source]

parse_response(tx_id: str, response) TranscriptCoordinates[source]

parse_multiple(response: Mapping[str, Any]) Sequence[TranscriptCoordinates][source]

class gpsea.preprocessing.Auditor[source]

Bases: Generic[IN, OUT]

Auditor checks the inputs for sanity issues and relates the issues with sanitized inputs as SanitationResults.

The auditor may sanitize the input as a matter of discretion and returns the input as OUT.

static prepare_notepad(label: str) NotepadTree[source]

Prepare a Notepad for recording issues and errors.

Parameters:

label – a str with the top-level section label.

Returns:

an instance of NotepadTree.

Return type:

NotepadTree

abstract process(data: IN, notepad: Notepad) OUT[source]

Audit and sanitize the data, record the issues to the notepad and return the sanitized data.

class gpsea.preprocessing.DataSanityIssue(level: Level, message: str, solution: str | None = None)[source]

Bases: object

DataSanityIssue summarizes an issue found in the input data.

The issue has a level, a message with human-friendly description, and an optional solution for removing the issue.

property level: Level

property message: str

property solution: str | None

class gpsea.preprocessing.Level(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

An enum to represent severity of the DataSanityIssue.

WARN = 1

Warning is an issue when something is not entirely right. However, unlike Level.ERROR, the analysis should complete, albeit with sub-optimal results 😧.

ERROR = 2

Error is a serious issue in the input data and the downstream analysis may not complete or the analysis results may be malarkey 😱.

class gpsea.preprocessing.Notepad(label: str)[source]

Bases: object

Record issues encountered during parsing/validation of a hierarchical data structure.

The issues can be organized in sections. Notepad keeps track of issues in one section and the subsections can be created by calling add_subsection(). The function returns an instance responsible for issues of a subsection.

A collection of the issues from the current section is available via the issues property, and the convenience functions provide iterators over errors and warnings.

abstract add_subsection(label: str) Notepad[source]

Add a labeled subsection.

Returns:

a notepad for recording issues within the subsection.

Return type:

Notepad

property label: str

Get a str with the section label.

property issues: Sequence[DataSanityIssue]

Get an iterable with the issues of the current section.

add_issue(level: Level, message: str, solution: str | None = None)[source]

Add an issue with certain level, message, and an optional solution.

add_error(message: str, solution: str | None = None)[source]

A convenience function for adding an error with a message and an optional solution.

add_warning(message: str, solution: str | None = None)[source]

A convenience function for adding a warning with a message and an optional solution.

errors() Iterator[DataSanityIssue][source]

Iterate over the errors of the current section.

warnings() Iterator[DataSanityIssue][source]

Iterate over the warnings of the current section.

error_count() int[source]

Returns:

count of errors found in this section.

Return type:

int

warning_count() int[source]

Returns:

count of warnings found in this section.

Return type:

int

class gpsea.preprocessing.NotepadTree(label: str, level: int)[source]

Bases: Notepad

NotepadTree implements Notepad using a tree where each tree node corresponds to a (sub)section. The node can have 0..n children.

Each node has a label, a collection of issues, and children with subsections. For convenience, the node has level to correspond to the depth of the node within the tree (the level of the root node is 0).

The nodes can be accessed via children property or through convenience methods for tree traversal, either using the visitor pattern (visit()) or by iterating over the nodes via iterate_nodes(). In both cases, the traversal is done in the depth-first fashion.

property children

property level: int

add_subsection(identifier: str)[source]

Add a labeled subsection.

Returns:

a notepad for recording issues within the subsection.

Return type:

Notepad

visit(visitor)[source]

Perform a depth-first search on the tree and call visitor with all nodes.

Parameters:

visitor – a callable that takes the current node as a single argument.

iterate_nodes()[source]

Iterate over nodes in the depth-first fashion.

Returns: a depth-first node iterator.

has_warnings(include_subsections: bool = False) bool[source]

Returns:

True if one or more warnings were found in the current section or its subsections.

Return type:

bool

has_errors(include_subsections: bool = False) bool[source]

Returns:

True if one or more errors were found in the current section or its subsections.

Return type:

bool

has_errors_or_warnings(include_subsections: bool = False) bool[source]

Returns:

True if one or more errors or warnings were found in the current section or its subsections.

Return type:

bool
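The tree structure behind NotepadTree (labeled nodes with a level, child subsections, depth-first iteration, and checks that optionally include subsections) can be sketched with a small dataclass. The `Node` class below is a hypothetical stand-in that stores errors as plain strings. It visits a node before its children (pre-order), which is an assumption, since the text only states that traversal is depth-first.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Toy notepad node: a label, a depth level, errors, and subsections."""

    label: str
    level: int = 0
    errors: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

    def add_subsection(self, label: str) -> "Node":
        # A subsection sits one level deeper than its parent.
        child = Node(label, self.level + 1)
        self.children.append(child)
        return child

    def iterate_nodes(self) -> Iterator["Node"]:
        # Pre-order depth-first traversal: the node first, then each subtree.
        yield self
        for child in self.children:
            yield from child.iterate_nodes()

    def has_errors(self, include_subsections: bool = False) -> bool:
        nodes = self.iterate_nodes() if include_subsections else [self]
        return any(node.errors for node in nodes)


root = Node("cohort")
patient = root.add_subsection("patient #1")
variants = patient.add_subsection("variants")
variants.errors.append("Missing genotype")
root.add_subsection("patient #2")

order = [(node.level, node.label) for node in root.iterate_nodes()]
own_errors = root.has_errors()
any_errors = root.has_errors(include_subsections=True)
```

The root has no errors of its own, so the error recorded two levels down is found only when subsections are included.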

class gpsea.preprocessing.DefaultImpreciseSvFunctionalAnnotator(gene_coordinate_service: GeneCoordinateService)[source]

Bases: ImpreciseSvFunctionalAnnotator

annotate(item: ImpreciseSvInfo) Sequence[TranscriptAnnotation][source]

Compute functional annotations for a large SV.

Returns:

a sequence of transcript annotations

Raises:

ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.