gpsea.preprocessing package

gpsea.preprocessing.configure_caching_cohort_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, cache_dir: str | None = None, include_ontology_class_onsets: bool = True, variant_fallback: str = 'VEP', timeout: float | int = 30.0) → CohortCreator[Phenopacket][source]

A convenience function for configuring a caching PhenopacketPatientCreator.

To create the patient creator, we need hpo-toolkit’s representation of HPO. All other parameters are optional.

Parameters:

hpo – an HPO instance.
genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.
validation_runner – an instance of the validation runner.
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the cache location should be determined as described in get_cache_dir_path(). In any case, the directory will be created if it does not exist (including non-existing parents).
include_ontology_class_onsets – True if onsets in the ontology class format (e.g. HP:0003621 for Juvenile onset) should be included (default True).
variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).
timeout – timeout in seconds for the REST APIs
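A minimal usage sketch follows. It assumes that hpo-toolkit is installed as `hpotk` and that HPO is loaded with `hpotk.load_minimal_ontology`. The helper name `build_cohort_creator` and its `hpo_path` argument are hypothetical. The imports are deferred into the function body so that the snippet can be loaded without the packages or network access.

```python
from typing import Optional


def build_cohort_creator(hpo_path: str, cache_dir: Optional[str] = None):
    """Build a caching cohort creator (hypothetical helper, not part of GPSEA)."""
    # Deferred imports: both packages are assumed to be installed.
    import hpotk
    from gpsea.preprocessing import configure_caching_cohort_creator

    # Assumption: `hpo_path` points to an HPO release in Obographs JSON format.
    hpo = hpotk.load_minimal_ontology(hpo_path)

    return configure_caching_cohort_creator(
        hpo=hpo,
        genome_build="GRCh38.p13",
        cache_dir=cache_dir,  # None -> location given by get_cache_dir_path()
        variant_fallback="VEP",
        timeout=30.0,
    )
```

Only `hpo` is required; the remaining keyword arguments repeat the documented defaults and are spelled out here for readability.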

gpsea.preprocessing.configure_cohort_creator(hpo: MinimalOntology, genome_build: Literal['GRCh37.p13', 'GRCh38.p13'] = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, include_ontology_class_onsets: bool = True, variant_fallback: str = 'VEP', timeout: float | int = 30.0) → CohortCreator[Phenopacket][source]

A convenience function for configuring a non-caching PhenopacketPatientCreator.

To create the patient creator, we need hpo-toolkit’s representation of HPO. All other parameters are optional.

Parameters:

hpo – an HPO instance.
genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.
validation_runner – an instance of the validation runner.
include_ontology_class_onsets – True if onsets in the ontology class format (e.g. HP:0003621 for Juvenile onset) should be included (default True).
variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).
timeout – timeout in seconds for the VEP API

gpsea.preprocessing.configure_default_tx_coordinate_service(tx_source: Literal['VV'] = 'VV', genome_build: GenomeBuild | Literal['hg19', 'hg38'] = 'hg38', cache_dir: str | None = None, timeout: float | int = 30.0) → TranscriptCoordinateService[source]

gpsea.preprocessing.configure_default_functional_annotator(ann_source: Literal['VEP'] = 'VEP', cache_dir: str | None = None, timeout: float | int = 30.0) → FunctionalAnnotator[source]

gpsea.preprocessing.configure_default_protein_metadata_service(protein_source: Literal['UNIPROT'] = 'UNIPROT', cache_dir: str | None = None, timeout: float | int = 30.0) → ProteinMetadataService[source]

Create the default protein metadata service, which caches the protein metadata in the current working directory under .gpsea_cache/protein_cache and reaches out to the UNIPROT REST API if a cache entry is missing.

Parameters:

protein_source – a str with the code of the protein data source (currently accepting just UNIPROT).
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the data should be cached as described by get_cache_dir_path() function. In any case, the directory will be created if it does not exist (including any non-existing parents).
timeout – a float or an int for the timeout in seconds for the REST APIs.

gpsea.preprocessing.configure_protein_metadata_service(cache_dir: str | None = None, timeout: float = 30.0) → ProteinMetadataService[source]

Configure default protein metadata service.

The service will cache the responses in cache_dir and reach out to UNIPROT API for cache misses.

Parameters:

cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the data should be cached in .gpsea_cache folder in the current working directory. In any case, the directory will be created if it does not exist (including any non-existing parents).
timeout – timeout in seconds for the REST APIs.
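The cache-then-remote behaviour described for the protein metadata services can be illustrated with a small, self-contained sketch. This is not GPSEA's implementation: `cached_fetch`, `fake_remote`, and the one-JSON-file-per-key layout are hypothetical and only show the general pattern.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def cached_fetch(
    cache_dir: Path,
    key: str,
    fetch_remote: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the cached payload for `key`, calling `fetch_remote` on a miss."""
    # Create the cache folder, including any missing parents.
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{key}.json"

    if cache_file.is_file():
        # Cache hit: no remote call is needed.
        with cache_file.open() as fh:
            return json.load(fh)

    # Cache miss: query the remote resource and store the response.
    payload = fetch_remote(key)
    with cache_file.open("w") as fh:
        json.dump(payload, fh)
    return payload


calls: List[str] = []


def fake_remote(protein_id: str) -> Dict[str, Any]:
    # Stand-in for a REST call; records each invocation.
    calls.append(protein_id)
    return {"protein_id": protein_id}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / ".gpsea_cache" / "protein_cache"
    first = cached_fetch(cache, "NP_037407.4", fake_remote)
    second = cached_fetch(cache, "NP_037407.4", fake_remote)

print(first == second, len(calls))  # True 1
```

The second lookup is served from disk, so the stand-in remote function is called only once.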

class gpsea.preprocessing.VariantCoordinateFinder[source]

Bases: Generic[T]

abstractmethod find_coordinates(item: T) → VariantCoordinates | None[source]

Try to find VariantCoordinates from an item of some sort.

The variant coordinates may not be available all the time, and None may be returned.

Raises: ValueError – if there is an error of any kind.

class gpsea.preprocessing.FunctionalAnnotator[source]

Bases: object

abstractmethod annotate(variant_coordinates: VariantCoordinates) → Sequence[TranscriptAnnotation][source]

Compute functional annotations for the variant coordinates. The annotations can be empty.

Returns: a sequence of transcript annotations.
Raises: ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.

class gpsea.preprocessing.ImpreciseSvFunctionalAnnotator[source]

Bases: object

Annotator for large SVs that lack the exact breakpoint coordinates.

abstractmethod annotate(item: ImpreciseSvInfo) → Sequence[TranscriptAnnotation][source]

Compute functional annotations for a large SV.

Returns: a sequence of transcript annotations.
Raises: ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.

class gpsea.preprocessing.ProteinMetadataService[source]

Bases: object

A service for obtaining annotations for a given protein accession ID.

The annotations include elements of the ProteinMetadata class.

abstractmethod annotate(protein_id: str) → ProteinMetadata[source]

Prepare ProteinMetadata for a protein with given protein_id accession ID.

Parameters: protein_id (str) – an accession ID str (e.g. NP_001027558.1)
Returns: a ProteinMetadata container with the protein metadata
Return type: ProteinMetadata
Raises: ValueError

class gpsea.preprocessing.PatientCreator[source]

Bases: Generic[T]

PatientCreator can create a Patient from some input T.

abstractmethod process(item: T, notepad: Notepad) → Patient | None[source]

class gpsea.preprocessing.CohortCreator(patient_creator: PatientCreator[T])[source]

Bases: Generic[T]

CohortCreator creates a cohort from an iterable of some T where T represents a cohort member.

process(inputs: Iterable[T], notepad: Notepad) → Cohort[source]

class gpsea.preprocessing.PhenopacketVariantCoordinateFinder(build: GenomeBuild, hgvs_coordinate_finder: VariantCoordinateFinder[str])[source]

Bases: VariantCoordinateFinder[GenomicInterpretation]

PhenopacketVariantCoordinateFinder figures out VariantCoordinates and Genotype from GenomicInterpretation element of Phenopacket Schema.

Parameters:

build – genome build to use in VariantCoordinates
hgvs_coordinate_finder – the coordinate finder to use for parsing HGVS expressions

find_coordinates(item: GenomicInterpretation) → VariantCoordinates | None[source]

Tries to extract the variant coordinates from the GenomicInterpretation.

Parameters: item (GenomicInterpretation) – a genomic interpretation element from Phenopacket Schema
Returns: variant coordinates
Return type: Optional[VariantCoordinates]

class gpsea.preprocessing.PhenopacketPatientCreator(hpo: MinimalOntology, validator: ValidationRunner, build: GenomeBuild, functional_annotator: FunctionalAnnotator, imprecise_sv_functional_annotator: ImpreciseSvFunctionalAnnotator, hgvs_coordinate_finder: VariantCoordinateFinder[str], term_onset_parser: PhenopacketOntologyTermOnsetParser | None = None)[source]

Bases: PatientCreator[Phenopacket]

PhenopacketPatientCreator transforms Phenopacket into Patient.

Parameters:

hpo – HPO as MinimalOntology.
validator – validation runner to check HPO terms.
build – the genome build to use for variants.
phenotype_creator – a phenotype creator for creating phenotypes.
functional_annotator – for computing functional annotations.
imprecise_sv_functional_annotator – for getting info about imprecise variants.
hgvs_coordinate_finder – for finding chromosomal coordinates for HGVS variant descriptions.

process(pp: Phenopacket, notepad: Notepad) → Patient | None[source]

Creates a Patient from the data in a given Phenopacket.

Parameters:

pp (Phenopacket) – A Phenopacket object
notepad (Notepad) – notepad to write down the issues

Returns:

A Patient object

Return type:

Patient

class gpsea.preprocessing.PhenopacketOntologyTermOnsetParser(term_id_to_age: Mapping[str, Age])[source]

Bases: object

Parser for mapping an onset formatted as an ontology class to the corresponding Age.

Each HPO onset includes the start and end bounds of the onset range (e.g. 29th day to 16th year for Pediatric onset), and the onset is mapped to the midpoint of the range.

Use default_parser to create the parser for parsing current HPO or provide the curie -> Age mapping via __init__.

static default_parser() → PhenopacketOntologyTermOnsetParser[source]

process(ontology_class: OntologyClass, notepad: Notepad) → Age | None[source]
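The midpoint mapping described for this parser can be worked through with plain arithmetic. The sketch below is illustrative only: `onset_midpoint_days` is a hypothetical helper, and the year length of 365.25 days is an assumption rather than a value confirmed by this documentation.

```python
DAYS_IN_YEAR = 365.25  # assumption: average year length used for the conversion


def onset_midpoint_days(start_days: float, end_days: float) -> float:
    """Map an onset range to its midpoint, expressed in days."""
    if start_days > end_days:
        raise ValueError("start must not be after end")
    return (start_days + end_days) / 2


# Pediatric onset: 29th day to 16th year, as in the description above.
pediatric = onset_midpoint_days(29, 16 * DAYS_IN_YEAR)
print(pediatric)  # 2936.5
```

Under these assumptions, Pediatric onset maps to roughly 8 years of age.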

gpsea.preprocessing.load_phenopacket_folder(pp_directory: str, cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') → Tuple[Cohort, PreprocessingValidationResult][source]

Load phenopacket JSON files from a directory, validate the patient data, and assemble the patients into a cohort.

A file with .json suffix is considered to be a JSON file and all JSON files are assumed to be phenopackets. Non-JSON files are ignored.

Parameters:

pp_directory – path to a folder with phenopacket JSON files. An error is raised if the path does not point to a directory with at least one phenopacket.
cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}

Returns:

a tuple with the cohort and the validation result.
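The file selection rule described above (files with the .json suffix are kept, everything else is ignored) can be sketched with the standard library. `find_phenopacket_files` is a hypothetical helper, and raising `ValueError` for a missing directory or an empty result is an assumption about the error type.

```python
import os
import tempfile
from typing import List


def find_phenopacket_files(pp_directory: str) -> List[str]:
    """Collect paths of files with the `.json` suffix in `pp_directory`."""
    if not os.path.isdir(pp_directory):
        raise ValueError(f"{pp_directory} is not a directory")
    paths = sorted(
        os.path.join(pp_directory, name)
        for name in os.listdir(pp_directory)
        if name.endswith(".json")
        and os.path.isfile(os.path.join(pp_directory, name))
    )
    if not paths:
        raise ValueError(f"No JSON files found in {pp_directory}")
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # Two JSON files and one file that should be ignored.
    for name in ("a.json", "b.json", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in find_phenopacket_files(tmp)]

print(found)  # ['a.json', 'b.json']
```

A list of paths gathered this way could also be wrapped in `iter(...)` and passed to load_phenopacket_files when the files need to be selected by other criteria.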

gpsea.preprocessing.load_phenopacket_files(pp_files: Iterator[str], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') → Tuple[Cohort, PreprocessingValidationResult][source]

Load phenopacket JSON files, validate the data, and assemble into a Cohort.

Phenopackets are validated, assembled into a cohort, and the validation results are reported back.

Parameters:

pp_files – an iterator with paths to phenopacket JSON files.
cohort_creator – cohort creator for turning a phenopacket collection into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}

Returns:

a tuple with the cohort and the validation result.

gpsea.preprocessing.load_phenopackets(phenopackets: Iterable[Phenopacket], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') → Tuple[Cohort, PreprocessingValidationResult][source]

Validate the phenopackets and assemble into a Cohort.

The results of the validation are reported back.

Parameters:

phenopackets – an iterable of Phenopacket objects.
cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}

Returns:

a tuple with the cohort and the validation result.

class gpsea.preprocessing.PreprocessingValidationResult(policy: str, notepad: Notepad)[source]

Bases: object

The result of input validation of patient data.

is_ok() → bool[source]

Test if the result is OK considering the used validation policy.

If False is returned, use summarize() method to print out the results.

Returns: True if the analysis can proceed or False if errors/warnings were found.

property policy: str: Get the used validation policy.

summarize(file: TextIO = sys.stdout, indent: int = 2)[source]

Summarize the validation results into the provided file.

Parameters:

file – where to write the validation summary (e.g. io.StringIO)
indent – a non-negative int for the indentation of the output
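A common pattern is to check is_ok() first and only then summarize. The sketch below captures the summary into an io.StringIO buffer. `validation_report` is a hypothetical helper, and `StubResult` is a stand-in that mimics the documented methods so that the snippet runs without GPSEA.

```python
import io


def validation_report(result) -> str:
    """Return an empty string if `result` is OK, else the validation summary."""
    if result.is_ok():
        return ""
    buffer = io.StringIO()
    result.summarize(file=buffer, indent=2)
    return buffer.getvalue()


class StubResult:
    """Hypothetical stand-in that mimics the documented methods."""

    def __init__(self, ok: bool):
        self._ok = ok

    def is_ok(self) -> bool:
        return self._ok

    def summarize(self, file, indent: int = 2):
        file.write(" " * indent + "1 error found\n")


print(repr(validation_report(StubResult(ok=True))))   # ''
print(repr(validation_report(StubResult(ok=False))))  # '  1 error found\n'
```

With a real PreprocessingValidationResult, the same helper would return whatever text summarize() writes.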

class gpsea.preprocessing.TranscriptCoordinateService[source]

Bases: object

TranscriptCoordinateService gets transcript (tx) coordinates for a given transcript ID.

abstractmethod fetch(tx: str | TranscriptInfoAware) → TranscriptCoordinates[source]

Get tx coordinates for a tx ID or an entity that knows about the tx ID.

The method will raise an exception in case of an issue.

Parameters: tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (e.g. TranscriptAnnotation).

Returns: the transcript coordinates.

class gpsea.preprocessing.GeneCoordinateService[source]

Bases: object

GeneCoordinateService gets transcript (Tx) coordinates for a gene ID.

abstractmethod fetch_for_gene(gene: str) → Sequence[TranscriptCoordinates][source]

Get Tx coordinates for a gene ID.

The method will raise an exception in case of an issue.

Parameters: gene – a str with gene ID (e.g. HGNC:3603)
Returns: a sequence of transcript coordinates for the gene.
Return type: Sequence[TranscriptCoordinates]

class gpsea.preprocessing.UniprotProteinMetadataService(timeout: float = 30.0)[source]

Bases: ProteinMetadataService

A class that creates ProteinMetadata objects from data found with the Uniprot REST API. More info on the Uniprot REST API is in the Programmatic access section.

annotate(protein_id: str) → ProteinMetadata[source]

Get metadata for given protein ID. This class specifically only works with a RefSeq database ID (e.g. NP_037407.4).

Parameters: protein_id (str) – A protein ID
Returns: A ProteinMetadata corresponding to the input protein_id.
Return type: ProteinMetadata
Raises: ValueError – in case of issues with protein_id, I/O issues, or parsing the REST response.
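Since annotate may raise ValueError for several unrelated reasons, callers may want to handle the failure in one place. The sketch below takes the service as an argument, so it does not depend on GPSEA being installed. `annotate_or_none` and `OfflineService` are hypothetical names used for illustration.

```python
from typing import Any, Optional


def annotate_or_none(service: Any, protein_id: str) -> Optional[Any]:
    """Call `service.annotate`, returning None when a ValueError is raised."""
    try:
        return service.annotate(protein_id)
    except ValueError as error:
        print(f"Could not fetch metadata for {protein_id}: {error}")
        return None


class OfflineService:
    """Hypothetical stub that fails the way the documented method may fail."""

    def annotate(self, protein_id: str):
        raise ValueError("remote resource is offline")


result = annotate_or_none(OfflineService(), "NP_037407.4")
print(result)  # None
```

Whether returning None is appropriate depends on the analysis; re-raising with extra context may be the better choice when a missing protein should stop the run.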

static parse_uniprot_json(payload: Mapping[str, Any], protein_id: str) → ProteinMetadata[source]

Try to extract ProteinMetadata corresponding to protein_id from the Uniprot JSON payload.

Parameters:

payload – a JSON object corresponding to Uniprot response
protein_id – a str with the accession of the protein of interest

Returns:

a complete instance of ProteinMetadata

Raises:

ValueError –

class gpsea.preprocessing.VepFunctionalAnnotator(include_computational_txs: bool = False, timeout: float = 10.0)[source]

Bases: FunctionalAnnotator

VepFunctionalAnnotator uses the Variant Effect Predictor (VEP) REST API to perform functional variant annotation.

NONCODING_EFFECTS = {VariantEffect.DOWNSTREAM_GENE_VARIANT, VariantEffect.FIVE_PRIME_UTR_VARIANT, VariantEffect.INTERGENIC_VARIANT, VariantEffect.INTRON_VARIANT, VariantEffect.NON_CODING_TRANSCRIPT_EXON_VARIANT, VariantEffect.NON_CODING_TRANSCRIPT_VARIANT, VariantEffect.SPLICE_ACCEPTOR_VARIANT, VariantEffect.SPLICE_DONOR_5TH_BASE_VARIANT, VariantEffect.SPLICE_DONOR_VARIANT, VariantEffect.SPLICE_POLYPYRIMIDINE_TRACT_VARIANT, VariantEffect.THREE_PRIME_UTR_VARIANT, VariantEffect.UPSTREAM_GENE_VARIANT}: Non-coding variant effects where we do not complain if the functional annotation lacks the protein effects.

annotate(variant_coordinates: VariantCoordinates) → Sequence[TranscriptAnnotation][source]

Compute functional annotations for the variant coordinates. The annotations can be empty.

Returns: a sequence of transcript annotations.
Raises: ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.

fetch_response(variant_coordinates: VariantCoordinates) → Mapping[str, Any][source]

Get a dict with the response from the VEP REST API.

Parameters: variant_coordinates – a query VariantCoordinates.

static format_coordinates_for_vep_query(vc: VariantCoordinates) → str[source]

Converts the 0-based VariantCoordinates to ones that will be interpreted correctly by VEP.

Example - an insertion/duplication of G after the given G at coordinate 3:

1 2 3 4 5
A C G T A

0-based: 2 3 G GG
1-based: 3 G GG
VEP: 4 3 - G

Parameters: vc (VariantCoordinates) – A VariantCoordinates object
Returns: The variant coordinates formatted to work with VEP
Return type: str
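The insertion example above can be reproduced with a short sketch. It covers only insertions anchored on a single reference base and assumes that the 0-based coordinates form a half-open interval. `insertion_to_vep` is a hypothetical helper that returns the four fields as a tuple; it does not reproduce the query string that the library builds.

```python
from typing import Tuple


def insertion_to_vep(start: int, end: int, ref: str, alt: str) -> Tuple[int, int, str, str]:
    """Convert a 0-based insertion with a one-base anchor to VEP-style fields."""
    if len(ref) != 1 or not alt.startswith(ref) or len(alt) < 2:
        raise ValueError("expected an insertion anchored on a single ref base")
    # The 0-based `start` corresponds to 1-based position `start + 1`;
    # dropping the anchor base moves the start one more position to the right.
    vep_start = start + 2
    # The 0-based half-open `end` equals the 1-based inclusive end, so the
    # VEP end of an insertion is one less than the VEP start.
    vep_end = end
    return vep_start, vep_end, "-", alt[1:]


print(insertion_to_vep(2, 3, "G", "GG"))  # (4, 3, '-', 'G')
```

The result matches the "VEP: 4 3 - G" row of the example: the start is greater than the end, which is how an insertion between two bases is expressed.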

process_response(variant_key: str, response: Mapping[str, Any]) → Sequence[TranscriptAnnotation][source]

class gpsea.preprocessing.VVHgvsVariantCoordinateFinder(genome_build: GenomeBuild, timeout: int = 30)[source]

Bases: VariantCoordinateFinder[str]

VVHgvsVariantCoordinateFinder uses Variant Validator’s REST API to build VariantCoordinates from an HGVS string.

The finder takes an HGVS str (e.g. NM_005912.3:c.253A>G) and extracts the variant coordinates from the response.

Parameters:

genome_build – the genome build to use to construct VariantCoordinates
timeout – the REST API request timeout

find_coordinates(item: str) → VariantCoordinates | None[source]

Extracts variant coordinates from an HGVS string using Variant Validator’s REST API.

Parameters: item – an HGVS string
Returns: variant coordinates

class gpsea.preprocessing.VVMultiCoordinateService(genome_build: GenomeBuild, timeout: float = 30.0)[source]

Bases: TranscriptCoordinateService, GeneCoordinateService

VVMultiCoordinateService uses the Variant Validator REST API to fetch transcript coordinates for both a gene ID and a specific transcript ID.

Parameters:

genome_build – the genome build for constructing the transcript coordinates.
timeout – a positive float with the REST API timeout in seconds.

fetch(tx: str | TranscriptInfoAware) → TranscriptCoordinates[source]

Get tx coordinates for a tx ID or an entity that knows about the tx ID.

The method will raise an exception in case of an issue.

Parameters: tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (e.g. TranscriptAnnotation).

Returns: the transcript coordinates.

fetch_for_gene(gene: str) → Sequence[TranscriptCoordinates][source]

Get Tx coordinates for a gene ID.

The method will raise an exception in case of an issue.

Parameters: gene – a str with gene ID (e.g. HGNC:3603)
Returns: a sequence of transcript coordinates for the gene.
Return type: Sequence[TranscriptCoordinates]

get_response(tx_id: str)[source]

parse_multiple(response: Mapping[str, Any]) → Sequence[TranscriptCoordinates][source]

Parses the response in JSON format into TranscriptCoordinates.

May ignore a transcript if encountering an error during parse.

Raises: VariantValidatorDecodeException – if an error is encountered during response parsing.

parse_response(tx_id: str, response) → TranscriptCoordinates[source]

Parses the response in JSON format into TranscriptCoordinates.

Raises: VariantValidatorDecodeException – if an error is encountered during response parsing.

exception gpsea.preprocessing.VariantValidatorDecodeException[source]

Bases: Exception

An exception raised on failure of parsing of the Variant Validator response.

class gpsea.preprocessing.DefaultImpreciseSvFunctionalAnnotator(gene_coordinate_service: GeneCoordinateService)[source]

Bases: ImpreciseSvFunctionalAnnotator

annotate(item: ImpreciseSvInfo) → Sequence[TranscriptAnnotation][source]

Compute functional annotations for a large SV.

Returns: a sequence of transcript annotations.
Raises: ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.