gpsea.preprocessing package
- gpsea.preprocessing.configure_caching_cohort_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, cache_dir: str | None = None, include_ontology_class_onsets: bool = True, variant_fallback: str = 'VEP', timeout: float | int = 30.0) CohortCreator[Phenopacket] [source]
A convenience function for configuring a caching CohortCreator.
To create the patient creator, we need hpo-toolkit’s representation of HPO. All other parameters are optional.
- Parameters:
hpo – an HPO instance.
genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.
validation_runner – an instance of the validation runner.
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the default cache location should be used. In any case, the directory will be created if it does not exist (including non-existing parents).
include_ontology_class_onsets – True if onsets in the ontology class format (e.g. HP:0003621 for Juvenile onset) should be included (default True).
variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {‘VEP’} (just one fallback implementation is available at the moment).
timeout – timeout in seconds for the REST APIs.
- gpsea.preprocessing.configure_cohort_creator(hpo: MinimalOntology, genome_build: Literal['GRCh37.p13', 'GRCh38.p13'] = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, include_ontology_class_onsets: bool = True, variant_fallback: str = 'VEP', timeout: float | int = 30.0) CohortCreator[Phenopacket] [source]
A convenience function for configuring a non-caching CohortCreator.
To create the patient creator, we need hpo-toolkit’s representation of HPO. All other parameters are optional.
- Parameters:
hpo – an HPO instance.
genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.
validation_runner – an instance of the validation runner.
include_ontology_class_onsets – True if onsets in the ontology class format (e.g. HP:0003621 for Juvenile onset) should be included (default True).
variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {‘VEP’} (just one fallback implementation is available at the moment).
timeout – timeout in seconds for the VEP API.
- gpsea.preprocessing.configure_default_tx_coordinate_service(tx_source: Literal['VV'] = 'VV', genome_build: GenomeBuild | Literal['hg19', 'hg38'] = 'hg38', cache_dir: str | None = None, timeout: float | int = 30.0) TranscriptCoordinateService [source]
- gpsea.preprocessing.configure_default_functional_annotator(ann_source: Literal['VEP'] = 'VEP', cache_dir: str | None = None, timeout: float | int = 30.0) FunctionalAnnotator [source]
- gpsea.preprocessing.configure_default_protein_metadata_service(protein_source: Literal['UNIPROT'] = 'UNIPROT', cache_dir: str | None = None, timeout: float | int = 30.0) ProteinMetadataService [source]
Create the default protein metadata service, which caches the protein metadata in the current working directory under .gpsea_cache/protein_cache and reaches out to the UNIPROT REST API if a cache entry is missing.
- Parameters:
protein_source – a str with the code of the protein data sources (currently accepting just UNIPROT).
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the default cache location should be used. In any case, the directory will be created if it does not exist (including any non-existing parents).
timeout – a float or an int for the timeout in seconds for the REST APIs.
- gpsea.preprocessing.configure_protein_metadata_service(cache_dir: str | None = None, timeout: float = 30.0) ProteinMetadataService [source]
Configure default protein metadata service.
The service will cache the responses in cache_dir and reach out to UNIPROT API for cache misses.
- Parameters:
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the data should be cached in .gpsea_cache folder in the current working directory. In any case, the directory will be created if it does not exist (including any non-existing parents).
timeout – timeout in seconds for the REST APIs.
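The cache-then-fetch behaviour described above can be illustrated with a minimal, library-independent sketch. It is not GPSEA's implementation: the function name `cached_fetch`, the one-JSON-file-per-key layout, and the `fetch` callback are all assumptions made for illustration. The sketch only shows the two documented properties, namely that the cache directory is created together with any missing parents and that the remote source is contacted only on a cache miss.

```python
import json
import os
from typing import Any, Callable, Dict


def cached_fetch(
    key: str,
    cache_dir: str,
    fetch: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the payload for `key`, calling `fetch` only on a cache miss.

    Hypothetical helper: the file layout (one JSON file per key) is an
    assumption and does not mirror GPSEA's cache format.
    """
    # Create the cache folder, including any non-existing parents.
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.json")

    # Cache hit: read the stored payload and skip the remote call.
    if os.path.isfile(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)

    # Cache miss: ask the (possibly remote) source and store the result.
    payload = fetch(key)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return payload
```

In a real service the `fetch` callback would wrap the REST call and its timeout; here it is left abstract so the sketch runs offline.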
- class gpsea.preprocessing.VariantCoordinateFinder[source]
- abstractmethod find_coordinates(item: T) VariantCoordinates | None [source]
Try to find VariantCoordinates from an item of some sort.
The variant coordinates may not be available all the time, and None may be returned.
- Raises:
ValueError – if there is an error of any kind.
- class gpsea.preprocessing.FunctionalAnnotator[source]
- abstractmethod annotate(variant_coordinates: VariantCoordinates) Sequence[TranscriptAnnotation] [source]
Compute functional annotations for the variant coordinates. The annotations can be empty.
- Returns:
a sequence of transcript annotations
- Raises:
ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.
- class gpsea.preprocessing.ImpreciseSvFunctionalAnnotator[source]
Annotator for large SVs that lack the exact breakpoint coordinates.
- abstractmethod annotate(item: ImpreciseSvInfo) Sequence[TranscriptAnnotation] [source]
Compute functional annotations for a large SV.
- Returns:
a sequence of transcript annotations
- Raises:
ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.
- class gpsea.preprocessing.ProteinMetadataService[source]
A service for obtaining annotations for a given protein accession ID.
The annotations include elements of the ProteinMetadata class.
- abstractmethod annotate(protein_id: str) ProteinMetadata [source]
Prepare ProteinMetadata for a protein with given protein_id accession ID.
- Parameters:
protein_id (str) – an accession ID str (e.g. NP_001027558.1)
- Returns:
a ProteinMetadata container with the protein metadata
- Return type:
ProteinMetadata
- Raises:
- class gpsea.preprocessing.PatientCreator[source]
PatientCreator can create a Patient from some input T.
- class gpsea.preprocessing.CohortCreator(patient_creator: PatientCreator[T])[source]
CohortCreator creates a cohort from an iterable of some T where T represents a cohort member.
- class gpsea.preprocessing.PhenopacketVariantCoordinateFinder(build: GenomeBuild, hgvs_coordinate_finder: VariantCoordinateFinder[str])[source]
PhenopacketVariantCoordinateFinder figures out VariantCoordinates from the GenomicInterpretation element of Phenopacket Schema.
- Parameters:
build – genome build to use in VariantCoordinates
hgvs_coordinate_finder – the coordinate finder to use for parsing HGVS expressions
- find_coordinates(item: GenomicInterpretation) VariantCoordinates | None [source]
Tries to extract the variant coordinates from the GenomicInterpretation.
- Parameters:
item (GenomicInterpretation) – a genomic interpretation element from Phenopacket Schema
- Returns:
variant coordinates
- Return type:
VariantCoordinates | None
- class gpsea.preprocessing.PhenopacketPatientCreator(hpo: MinimalOntology, validator: ValidationRunner, build: GenomeBuild, functional_annotator: FunctionalAnnotator, imprecise_sv_functional_annotator: ImpreciseSvFunctionalAnnotator, hgvs_coordinate_finder: VariantCoordinateFinder[str], term_onset_parser: PhenopacketOntologyTermOnsetParser | None = None)[source]
PhenopacketPatientCreator transforms a Phenopacket into a Patient.
- Parameters:
hpo – HPO as MinimalOntology.
validator – validation runner to check HPO terms.
build – the genome build to use for variants.
phenotype_creator – a phenotype creator for creating phenotypes.
functional_annotator – for computing functional annotations.
imprecise_sv_functional_annotator – for getting info about imprecise variants.
hgvs_coordinate_finder – for finding chromosomal coordinates for HGVS variant descriptions.
- class gpsea.preprocessing.PhenopacketOntologyTermOnsetParser(term_id_to_age: Mapping[str, Age])[source]
Parser for mapping an onset formatted as an ontology class to the corresponding Age.
Each HPO onset includes the start and end bounds of the onset range (e.g. 29th day to 16th year for Pediatric onset), and the onset is mapped to the midpoint of the range.
Use default_parser to create the parser for parsing current HPO or provide the curie -> Age mapping via __init__.
- static default_parser() PhenopacketOntologyTermOnsetParser [source]
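The midpoint rule described for the onset parser can be worked through with a small sketch. This is an illustration only, not the parser's code: the helper `midpoint_days`, the use of days as the unit, the 365.25-day year, and keying the table by label instead of by CURIE are all assumptions. Only the bounds of Pediatric onset (29th day to 16th year) come from the text above.

```python
from typing import Dict, Tuple

DAYS_IN_YEAR = 365.25  # assumption: average length of a year in days


def midpoint_days(start_days: float, end_days: float) -> float:
    """Return the midpoint of an onset range expressed in days."""
    if end_days < start_days:
        raise ValueError("The end bound must not precede the start bound")
    return (start_days + end_days) / 2


# Hypothetical table of onset bounds in days. A real mapping would be keyed
# by the ontology CURIE and would hold Age objects, not plain floats.
ONSET_BOUNDS: Dict[str, Tuple[float, float]] = {
    "Pediatric onset": (29.0, 16 * DAYS_IN_YEAR),
}

# Map each onset to the midpoint of its range.
ONSET_MIDPOINTS: Dict[str, float] = {
    label: midpoint_days(start, end)
    for label, (start, end) in ONSET_BOUNDS.items()
}
```

Under these assumptions, Pediatric onset maps to (29 + 5844) / 2 = 2936.5 days.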
- gpsea.preprocessing.load_phenopacket_folder(pp_directory: str, cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') Tuple[Cohort, PreprocessingValidationResult] [source]
Load phenopacket JSON files from a directory, validate the patient data, and assemble the patients into a cohort.
A file with .json suffix is considered to be a JSON file and all JSON files are assumed to be phenopackets. Non-JSON files are ignored.
- Parameters:
pp_directory – path to a folder with phenopacket JSON files. An error is raised if the path does not point to a directory with at least one phenopacket.
cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}
- Returns:
a tuple with the cohort and the validation result.
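The file-selection rule described for this function (only files with the .json suffix are considered, everything else is ignored, and an empty result is an error) can be sketched with the standard library alone. The helper name `find_phenopacket_files` is hypothetical, and the use of ValueError is an assumption, because the reference does not state which exception type is raised.

```python
from pathlib import Path
from typing import List


def find_phenopacket_files(pp_directory: str) -> List[str]:
    """List the paths of `.json` files found directly in `pp_directory`.

    Hypothetical helper; the exception type is an assumption.
    """
    directory = Path(pp_directory)
    if not directory.is_dir():
        raise ValueError(f"{pp_directory} does not point to a directory")

    # Keep regular files with the `.json` suffix, ignore everything else.
    files = sorted(
        str(path)
        for path in directory.iterdir()
        if path.is_file() and path.suffix == ".json"
    )
    if not files:
        raise ValueError(f"No phenopacket JSON files found in {pp_directory}")
    return files
```

The resulting list of paths has the shape expected by the pp_files parameter of load_phenopacket_files once wrapped in an iterator, e.g. `iter(find_phenopacket_files("path/to/folder"))`.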
- gpsea.preprocessing.load_phenopacket_files(pp_files: Iterator[str], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') Tuple[Cohort, PreprocessingValidationResult] [source]
Load phenopacket JSON files, validate the data, and assemble into a Cohort.
Phenopackets are validated, assembled into a cohort, and the validation results are reported back.
- Parameters:
pp_files – an iterator with paths to phenopacket JSON files.
cohort_creator – cohort creator for turning a phenopacket collection into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}
- Returns:
a tuple with the cohort and the validation result.
- gpsea.preprocessing.load_phenopackets(phenopackets: Iterable[Phenopacket], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') Tuple[Cohort, PreprocessingValidationResult] [source]
Validate the phenopackets and assemble into a Cohort.
The results of the validation are reported back.
- Parameters:
phenopackets – an iterable with the Phenopacket instances to load.
cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}
- Returns:
a tuple with the cohort and the validation result.
- class gpsea.preprocessing.PreprocessingValidationResult(policy: str, notepad: Notepad)[source]
The result of input validation of patient data.
- is_ok() bool [source]
Test if the result is OK considering the used validation policy.
If False is returned, use the summarize() method to print out the results.
- Returns:
True if the analysis can proceed or False if errors/warnings were found.
- summarize(file: TextIO = sys.stdout, indent: int = 2)[source]
Summarize the validation results into the provided file.
- Parameters:
file – where to write the validation summary (e.g. sys.stdout)
indent – a non-negative int for indenting the output
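How a validation policy might turn a set of findings into the boolean returned by is_ok() can be sketched as below. The reference names the policies but does not define them, so the semantics used here are an assumption: ‘permissive’ always passes, ‘lenient’ fails on errors only, and ‘strict’ fails on errors or warnings. The function `policy_is_ok` and its integer counters are hypothetical stand-ins for the class and its Notepad.

```python
def policy_is_ok(policy: str, n_errors: int, n_warnings: int) -> bool:
    """Decide whether an analysis can proceed under the given policy.

    Assumed semantics (not stated in the reference):
    - 'permissive': always OK,
    - 'lenient': OK unless errors were found,
    - 'strict': OK only if neither errors nor warnings were found.
    """
    if policy == "permissive":
        return True
    if policy == "lenient":
        return n_errors == 0
    if policy == "strict":
        return n_errors == 0 and n_warnings == 0
    raise ValueError(f"Unexpected validation policy: {policy!r}")
```

Under these assumed semantics, a cohort with warnings but no errors would pass with ‘permissive’ and ‘lenient’ and fail with ‘strict’.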
- class gpsea.preprocessing.TranscriptCoordinateService[source]
TranscriptCoordinateService gets transcript (tx) coordinates for a given transcript ID.
- abstractmethod fetch(tx: str | TranscriptInfoAware) TranscriptCoordinates [source]
Get tx coordinates for a tx ID or an entity that knows about the tx ID.
The method will raise an exception in case of an issue.
- Parameters:
tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (i.e. a TranscriptInfoAware).
- Returns:
the transcript coordinates.
- class gpsea.preprocessing.GeneCoordinateService[source]
GeneCoordinateService gets transcript (Tx) coordinates for a gene ID.
- abstractmethod fetch_for_gene(gene: str) Sequence[TranscriptCoordinates] [source]
Get Tx coordinates for a gene ID.
The method will raise an exception in case of an issue.
- Parameters:
gene – a str with gene ID (e.g. HGNC:3603)
- Returns:
a sequence of transcript coordinates for the gene.
- Return type:
Sequence[TranscriptCoordinates]
- class gpsea.preprocessing.UniprotProteinMetadataService(timeout: float = 30.0)[source]
A class that creates ProteinMetadata objects from data found with the Uniprot REST API. More information on the Uniprot REST API is available in the Programmatic access section.
- annotate(protein_id: str) ProteinMetadata [source]
Get metadata for given protein ID. This class specifically only works with a RefSeq database ID (e.g. NP_037407.4).
- Parameters:
protein_id (str) – A protein ID
- Returns:
ProteinMetadata corresponding to the input protein_id.
- Return type:
ProteinMetadata
- Raises:
ValueError – in case of issues with protein_id, I/O issues, or parsing the REST response.
- static parse_uniprot_json(payload: Mapping[str, Any], protein_id: str) ProteinMetadata [source]
Try to extract ProteinMetadata corresponding to protein_id from the Uniprot JSON payload.
- Parameters:
payload – a JSON object corresponding to Uniprot response
protein_id – a str with the accession of the protein of interest
- Returns:
a complete instance of ProteinMetadata
- Raises:
- class gpsea.preprocessing.VepFunctionalAnnotator(include_computational_txs: bool = False, timeout: float = 10.0)[source]
VepFunctionalAnnotator uses the Variant Effect Predictor (VEP) REST API to perform functional variant annotation.
Non-coding variant effects where we do not complain if the functional annotation lacks the protein effects.
- annotate(variant_coordinates: VariantCoordinates) Sequence[TranscriptAnnotation] [source]
Compute functional annotations for the variant coordinates. The annotations can be empty.
- Returns:
a sequence of transcript annotations
- Raises:
ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.
- fetch_response(variant_coordinates: VariantCoordinates) Mapping[str, Any] [source]
Get a dict with the response from the VEP REST API.
- Parameters:
variant_coordinates – a query
- static format_coordinates_for_vep_query(vc: VariantCoordinates) str [source]
Converts the 0-based VariantCoordinates to ones that will be interpreted correctly by VEP.
Example - an insertion/duplication of G after the given G at coordinate 3:
1 2 3 4 5
A C G T A
0-based: 2 3 G GG
1-based: 3 G GG
VEP: 4 3 - G
- Parameters:
vc (VariantCoordinates) – A VariantCoordinates object
- Returns:
The variant coordinates formatted to work with VEP
- Return type:
str
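The coordinate conversion in the example above can be reproduced with a short sketch. This is one way to arrive at the numbers in the example, not the method's actual code: the helper `to_vep_style` is hypothetical, it takes plain integers and strings instead of a VariantCoordinates object, and it returns a tuple of (start, end, ref, alt) instead of the formatted query string. The approach assumed here is to trim the bases shared at the start of the reference and alternate alleles and then shift the 0-based start to 1-based, so that an insertion ends up with a start one greater than its end.

```python
from typing import Tuple


def to_vep_style(start: int, end: int, ref: str, alt: str) -> Tuple[int, int, str, str]:
    """Convert 0-based, end-exclusive coordinates to VEP-style values.

    Hypothetical helper. `start` and `end` delimit the reference allele
    in 0-based coordinates; the result uses 1-based, inclusive positions.
    """
    # Trim the leading bases shared by both alleles (e.g. G/GG -> ''/G).
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1

    # 0-based start -> 1-based start; the end-exclusive 0-based end equals
    # the inclusive 1-based end, so it stays unchanged.
    vep_start = start + 1
    vep_end = end

    # An empty allele is written as '-'.
    return vep_start, vep_end, ref or "-", alt or "-"
```

For the documented insertion, `to_vep_style(2, 3, "G", "GG")` yields `(4, 3, "-", "G")`, which matches the VEP row of the example. A single-nucleotide change such as `to_vep_style(2, 3, "G", "T")` keeps its alleles and yields `(3, 3, "G", "T")`.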
- class gpsea.preprocessing.VVHgvsVariantCoordinateFinder(genome_build: GenomeBuild, timeout: int = 30)[source]
VVHgvsVariantCoordinateFinder uses Variant Validator’s REST API to build VariantCoordinates from an HGVS string.
The finder takes an HGVS str (e.g. NM_005912.3:c.253A>G) and extracts the variant coordinates from the response.
- Parameters:
genome_build – the genome build to use to construct VariantCoordinates
timeout – the REST API request timeout
- find_coordinates(item: str) VariantCoordinates | None [source]
Extracts variant coordinates from an HGVS string using Variant Validator’s REST API.
- Parameters:
item – an HGVS string
- Returns:
variant coordinates
- class gpsea.preprocessing.VVMultiCoordinateService(genome_build: GenomeBuild, timeout: float = 30.0)[source]
VVMultiCoordinateService uses the Variant Validator REST API to fetch transcript coordinates for both a gene ID and a specific transcript ID.
- Parameters:
genome_build – the genome build for constructing the transcript coordinates.
timeout – a positive float with the REST API timeout in seconds.
- fetch(tx: str | TranscriptInfoAware) TranscriptCoordinates [source]
Get tx coordinates for a tx ID or an entity that knows about the tx ID.
The method will raise an exception in case of an issue.
- Parameters:
tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (i.e. a TranscriptInfoAware).
- Returns:
the transcript coordinates.
- fetch_for_gene(gene: str) Sequence[TranscriptCoordinates] [source]
Get Tx coordinates for a gene ID.
The method will raise an exception in case of an issue.
- Parameters:
gene – a str with gene ID (e.g. HGNC:3603)
- Returns:
a sequence of transcript coordinates for the gene.
- Return type:
Sequence[TranscriptCoordinates]
- parse_multiple(response: Mapping[str, Any]) Sequence[TranscriptCoordinates] [source]
Parses the response in JSON format into TranscriptCoordinates.
May ignore a transcript if encountering an error during parse.
- Raises:
VariantValidatorDecodeException – if an error is encountered during response parsing.
- parse_response(tx_id: str, response) TranscriptCoordinates [source]
Parses the response in JSON format into TranscriptCoordinates.
- Raises:
VariantValidatorDecodeException – if an error is encountered during response parsing.
- exception gpsea.preprocessing.VariantValidatorDecodeException[source]
An exception raised on failure of parsing of the Variant Validator response.
- class gpsea.preprocessing.DefaultImpreciseSvFunctionalAnnotator(gene_coordinate_service: GeneCoordinateService)[source]
- annotate(item: ImpreciseSvInfo) Sequence[TranscriptAnnotation] [source]
Compute functional annotations for a large SV.
- Returns:
a sequence of transcript annotations
- Raises:
ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.