gpsea.preprocessing package
- gpsea.preprocessing.configure_caching_cohort_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, cache_dir: str | None = None, include_ontology_class_onsets: bool = True, variant_fallback: str = 'VEP', timeout: float = 30.0) → CohortCreator[Phenopacket]
A convenience function for configuring a caching PhenopacketPatientCreator.
To create the patient creator, we need hpo-toolkit’s representation of HPO. All other parameters are optional.
- Parameters:
hpo – an HPO instance.
genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.
validation_runner – an instance of the validation runner.
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the cache location should be determined as described in get_cache_dir_path(). In any case, the directory will be created if it does not exist (including non-existing parents).
include_ontology_class_onsets – True if onsets in the ontology class format (e.g. HP:0003621 for Juvenile onset) should be included (default True).
variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).
timeout – timeout in seconds for the REST APIs.
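The cache_dir behavior described above (use the given path, otherwise fall back to a default location, and create any missing parents) can be sketched with the standard library. This is only an illustration: resolve_cache_dir and the .example_cache default are hypothetical names, and the sketch does not reproduce the logic of get_cache_dir_path().

```python
import os
import tempfile
from typing import Optional


def resolve_cache_dir(cache_dir: Optional[str], default_dir: str) -> str:
    """Return a cache directory, creating it (and missing parents) if needed.

    `default_dir` stands in for whatever location the library would choose
    when `cache_dir` is None; it is an assumption of this sketch.
    """
    path = default_dir if cache_dir is None else cache_dir
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    explicit = resolve_cache_dir(os.path.join(tmp, "a", "b", "cache"), default_dir=tmp)
    fallback = resolve_cache_dir(None, default_dir=os.path.join(tmp, ".example_cache"))
    explicit_exists = os.path.isdir(explicit)
    fallback_exists = os.path.isdir(fallback)
```

Passing exist_ok=True makes the call safe to repeat, which matches the "created if it does not exist" wording of the parameter description.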
- gpsea.preprocessing.configure_cohort_creator(hpo: MinimalOntology, genome_build: str = 'GRCh38.p13', validation_runner: ValidationRunner | None = None, include_ontology_class_onsets: bool = True, variant_fallback: str = 'VEP', timeout: float = 30.0) → CohortCreator[Phenopacket]
A convenience function for configuring a non-caching PhenopacketPatientCreator.
To create the patient creator, we need hpo-toolkit’s representation of HPO. All other parameters are optional.
- Parameters:
hpo – an HPO instance.
genome_build – name of the genome build to use, choose from {‘GRCh37.p13’, ‘GRCh38.p13’}.
validation_runner – an instance of the validation runner.
include_ontology_class_onsets – True if onsets in the ontology class format (e.g. HP:0003621 for Juvenile onset) should be included (default True).
variant_fallback – the fallback variant annotator to use if we cannot find the annotation locally. Choose from {'VEP'} (just one fallback implementation is available at the moment).
timeout – timeout in seconds for the VEP API.
- gpsea.preprocessing.configure_default_protein_metadata_service(protein_source: Literal['UNIPROT'] = 'UNIPROT', cache_dir: str | None = None, timeout: float = 30.0) → ProteinMetadataService
Create default protein metadata service that will cache the protein metadata in current working directory under .gpsea_cache/protein_cache and reach out to UNIPROT REST API if a cache entry is missing.
- Parameters:
protein_source – a str with the code of the protein data source (currently accepting just UNIPROT).
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the data should be cached as described by the get_cache_dir_path() function. In any case, the directory will be created if it does not exist (including any non-existing parents).
timeout – timeout in seconds for the REST APIs.
- gpsea.preprocessing.configure_protein_metadata_service(cache_dir: str | None = None, timeout: float = 30.0) → ProteinMetadataService
Configure default protein metadata service.
The service will cache the responses in cache_dir and reach out to UNIPROT API for cache misses.
- Parameters:
cache_dir – path to the folder where we will cache the results fetched from the remote APIs or None if the data should be cached in .gpsea_cache folder in the current working directory. In any case, the directory will be created if it does not exist (including any non-existing parents).
timeout – timeout in seconds for the REST APIs.
- class gpsea.preprocessing.VariantCoordinateFinder
Bases: Generic[T]
- abstract find_coordinates(item: T) → VariantCoordinates | None
Try to find VariantCoordinates from an item of some sort.
The variant coordinates may not be available all the time, and None may be returned.
- Raises:
ValueError – if there is an error of any kind.
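The contract above (a generic finder that returns coordinates, returns None when nothing is available, and raises ValueError on errors) can be illustrated with a small standalone sketch. CoordinateFinder, LookupFinder, and the Coordinates tuple are hypothetical stand-ins written for this example; they are not gpsea classes, and the toy keys and coordinates are made up.

```python
import abc
from typing import Dict, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for VariantCoordinates: (contig, start, end, ref, alt).
Coordinates = Tuple[str, int, int, str, str]


class CoordinateFinder(Generic[T], metaclass=abc.ABCMeta):
    """Abstract finder parametrized by the type of the input item."""

    @abc.abstractmethod
    def find_coordinates(self, item: T) -> Optional[Coordinates]:
        """Return coordinates, None if unavailable, or raise ValueError."""


class LookupFinder(CoordinateFinder[str]):
    """Toy finder backed by an in-memory dictionary."""

    def __init__(self, known: Dict[str, Coordinates]):
        self._known = known

    def find_coordinates(self, item: str) -> Optional[Coordinates]:
        if not isinstance(item, str) or not item:
            # Malformed input is an error, not a "not found" case.
            raise ValueError(f"Expected a non-empty str but got {item!r}")
        return self._known.get(item)  # None signals "no coordinates available"


finder = LookupFinder({"toy:1": ("chr1", 100, 101, "A", "G")})
found = finder.find_coordinates("toy:1")
missing = finder.find_coordinates("toy:2")
```

Callers therefore need to handle two distinct outcomes: a None result, which is expected from time to time, and a ValueError, which signals a real problem.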
- class gpsea.preprocessing.FunctionalAnnotator
Bases: object
- abstract annotate(variant_coordinates: VariantCoordinates) → Sequence[TranscriptAnnotation]
Compute functional annotations for the variant coordinates. The annotations can be empty.
- Returns:
a sequence of transcript annotations
- Raises:
ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.
- class gpsea.preprocessing.ImpreciseSvFunctionalAnnotator
Bases: object
Annotator for large SVs that lack the exact breakpoint coordinates.
- abstract annotate(item: ImpreciseSvInfo) → Sequence[TranscriptAnnotation]
Compute functional annotations for a large SV.
- Returns:
a sequence of transcript annotations
- Raises:
ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.
- class gpsea.preprocessing.ProteinMetadataService
Bases: object
A service for obtaining annotations for a given protein accession ID.
The annotations include elements of the ProteinMetadata class.
- abstract annotate(protein_id: str) → ProteinMetadata
Prepare ProteinMetadata for a protein with given protein_id accession ID.
- Parameters:
protein_id (str) – an accession ID str (e.g. NP_001027558.1)
- Returns:
a ProteinMetadata container with the protein metadata
- Return type:
ProteinMetadata
- Raises:
- class gpsea.preprocessing.PatientCreator
Bases: Generic[T]
PatientCreator can create a Patient from some input T.
- class gpsea.preprocessing.CohortCreator(patient_creator: PatientCreator[T])
Bases: Generic[T]
CohortCreator creates a cohort from an iterable of some T where T represents a cohort member.
- class gpsea.preprocessing.PhenopacketVariantCoordinateFinder(build: GenomeBuild, hgvs_coordinate_finder: VariantCoordinateFinder[str])
Bases: VariantCoordinateFinder[GenomicInterpretation]
PhenopacketVariantCoordinateFinder figures out VariantCoordinates and Genotype from the GenomicInterpretation element of Phenopacket Schema.
- Parameters:
build – genome build to use in VariantCoordinates
hgvs_coordinate_finder – the coordinate finder to use for parsing HGVS expressions
- find_coordinates(item: GenomicInterpretation) → VariantCoordinates | None
Tries to extract the variant coordinates from the GenomicInterpretation.
- Parameters:
item (GenomicInterpretation) – a genomic interpretation element from Phenopacket Schema
- Returns:
variant coordinates
- Return type:
VariantCoordinates | None
- class gpsea.preprocessing.PhenopacketPatientCreator(hpo: MinimalOntology, validator: ValidationRunner, build: GenomeBuild, functional_annotator: FunctionalAnnotator, imprecise_sv_functional_annotator: ImpreciseSvFunctionalAnnotator, hgvs_coordinate_finder: VariantCoordinateFinder[str], term_onset_parser: PhenopacketOntologyTermOnsetParser | None = None)
Bases: PatientCreator[Phenopacket]
PhenopacketPatientCreator transforms Phenopacket into Patient.
- Parameters:
hpo – HPO as MinimalOntology.
validator – validation runner to check HPO terms.
build – the genome build to use for variants.
functional_annotator – for computing functional annotations.
imprecise_sv_functional_annotator – for getting info about imprecise variants.
hgvs_coordinate_finder – for finding chromosomal coordinates for HGVS variant descriptions.
term_onset_parser – an optional PhenopacketOntologyTermOnsetParser for mapping onsets formatted as ontology classes to Age.
- class gpsea.preprocessing.PhenopacketOntologyTermOnsetParser(term_id_to_age: Mapping[str, Age])
Bases: object
Parser for mapping an onset formatted as an ontology class to the corresponding Age.
Each HPO onset includes start and end bounds of the onset range (e.g. 29th day to 16th year for Pediatric onset), and the onset is mapped to the midpoint of the range.
Use default_parser to create the parser for parsing current HPO or provide the CURIE to Age mapping via __init__.
- static default_parser() → PhenopacketOntologyTermOnsetParser
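The midpoint mapping described for PhenopacketOntologyTermOnsetParser can be worked through with plain arithmetic. The sketch below is not the parser's implementation: midpoint_days, DAYS_IN_YEAR, and the decision to read the Pediatric onset bounds as day 29 and 16 full years are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Assumption of this sketch: an average year length in days.
DAYS_IN_YEAR = 365.25


def midpoint_days(start_days: float, end_days: float) -> float:
    """Return the midpoint of an onset range expressed in days."""
    if start_days < 0 or end_days < start_days:
        raise ValueError("Bounds must satisfy 0 <= start_days <= end_days")
    return (start_days + end_days) / 2


# Bounds quoted in the text above: 29th day to 16th year for Pediatric onset.
onset_bounds: Dict[str, Tuple[float, float]] = {
    "Pediatric onset": (29, 16 * DAYS_IN_YEAR),
}

# Map each onset label to a single representative age (in days).
onset_to_age = {label: midpoint_days(*bounds) for label, bounds in onset_bounds.items()}
pediatric_midpoint = onset_to_age["Pediatric onset"]  # 2936.5 days, roughly 8 years
```

Collapsing a range into one value loses the width of the interval, which is worth keeping in mind when onsets encoded as ontology classes are compared with exact ages.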
- gpsea.preprocessing.load_phenopacket_folder(pp_directory: str, cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') → Tuple[Cohort, PreprocessingValidationResult]
Load phenopacket JSON files from a directory, validate the patient data, and assemble the patients into a cohort.
A file with .json suffix is considered to be a JSON file and all JSON files are assumed to be phenopackets. Non-JSON files are ignored.
- Parameters:
pp_directory – path to a folder with phenopacket JSON files. An error is raised if the path does not point to a directory with at least one phenopacket.
cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}
- Returns:
a tuple with the cohort and the validation result.
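The file selection rules stated for load_phenopacket_folder (only files with the .json suffix are read, other files are ignored, and an empty or missing folder is an error) can be sketched with the standard library. The helper read_json_folder is hypothetical and returns plain dictionaries; it does not parse phenopackets, run validation, or build a Cohort.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_json_folder(directory: str) -> List[Dict[str, Any]]:
    """Read every `*.json` file in `directory`, ignoring all other files."""
    if not os.path.isdir(directory):
        raise ValueError(f"{directory} is not a directory")
    payloads = []
    for name in sorted(os.listdir(directory)):  # sorted for a stable order
        path = os.path.join(directory, name)
        if name.endswith(".json") and os.path.isfile(path):
            with open(path, encoding="utf-8") as fh:
                payloads.append(json.load(fh))
    if not payloads:
        raise ValueError(f"No JSON files were found in {directory}")
    return payloads


# Usage with a temporary folder holding one JSON file and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "patient-1.json"), "w", encoding="utf-8") as fh:
        json.dump({"id": "patient-1"}, fh)
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as fh:
        fh.write("not a phenopacket")
    loaded = read_json_folder(tmp)
```

Because the suffix alone decides whether a file is read, any non-phenopacket JSON file left in the folder would be picked up too, so it is worth keeping the folder clean.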
- gpsea.preprocessing.load_phenopacket_files(pp_files: Iterator[str], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') → Tuple[Cohort, PreprocessingValidationResult]
Load phenopacket JSON files, validate the data, and assemble into a Cohort.
Phenopackets are validated, assembled into a cohort, and the validation results are reported back.
- Parameters:
pp_files – an iterator with paths to phenopacket JSON files.
cohort_creator – cohort creator for turning a phenopacket collection into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}
- Returns:
a tuple with the cohort and the validation result.
- gpsea.preprocessing.load_phenopackets(phenopackets: Iterable[Phenopacket], cohort_creator: CohortCreator[Phenopacket], validation_policy: Literal['permissive', 'lenient', 'strict'] = 'permissive') → Tuple[Cohort, PreprocessingValidationResult]
Validate the phenopackets and assemble into a Cohort.
The results of the validation are reported back.
- Parameters:
phenopackets – an iterable of Phenopacket objects.
cohort_creator – cohort creator for turning a sequence of phenopackets into a Cohort.
validation_policy – a str with the validation policy. The value must be one of {‘permissive’, ‘lenient’, ‘strict’}
- Returns:
a tuple with the cohort and the validation result.
- class gpsea.preprocessing.PreprocessingValidationResult(policy: str, notepad: Notepad)
Bases: object
The result of input validation of patient data.
- is_ok() → bool
Test if the result is OK considering the used validation policy.
If False is returned, use the summarize() method to print out the results.
- Returns:
True if the analysis can proceed or False if errors/warnings were found.
- summarize(file: TextIO = sys.stdout, indent: int = 2)
Summarize the validation results into the provided file.
- Parameters:
file – where to write the validation summary (e.g. io.StringIO)
indent – a non-negative int for the indentation of the output
- class gpsea.preprocessing.TranscriptCoordinateService
Bases: object
TranscriptCoordinateService gets transcript (tx) coordinates for a given transcript ID.
- abstract fetch(tx: str | TranscriptInfoAware) → TranscriptCoordinates
Get tx coordinates for a tx ID or an entity that knows about the tx ID.
The method will raise an exception in case of an issue.
- Parameters:
tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (e.g. TranscriptAnnotation).
- Returns:
the transcript coordinates.
- class gpsea.preprocessing.GeneCoordinateService
Bases: object
GeneCoordinateService gets transcript (tx) coordinates for a gene ID.
- abstract fetch_for_gene(gene: str) → Sequence[TranscriptCoordinates]
Get tx coordinates for a gene ID.
The method will raise an exception in case of an issue.
- Parameters:
gene – a str with gene ID (e.g. HGNC:3603)
- Returns:
a sequence of transcript coordinates for the gene.
- Return type:
Sequence[TranscriptCoordinates]
- class gpsea.preprocessing.ProteinAnnotationCache(datadir: str)
Bases: object
A class that stores or retrieves ProteinMetadata objects using pickle format.
- get_annotations(protein_id: str): Searches a given data directory for a pickle file with given ID and returns ProteinMetadata
- store_annotations(protein_id: str, annotation: ProteinMetadata): Creates a pickle file with given ID and stores the given ProteinMetadata into that file
- get_annotations(protein_id: str) → ProteinMetadata | None
Searches a given data directory for a pickle file with given ID and returns ProteinMetadata from file. Returns None if no file is found.
- Parameters:
protein_id (str) – the protein_id associated with the desired ProteinMetadata
- store_annotations(protein_id: str, annotation: ProteinMetadata)
Creates a pickle file with the given protein ID in the file name and writes the given ProteinMetadata into the file for storage.
- Parameters:
protein_id (str) – the protein_id associated with the ProteinMetadata
annotation (ProteinMetadata) – the ProteinMetadata that will be stored under the given protein ID
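The store and retrieve behavior described for ProteinAnnotationCache (one pickle file per ID, None when the file is missing) can be sketched as follows. PickleCache, its method names, and the `<key>.pickle` file naming are assumptions of this example and may differ from what gpsea does; a plain dictionary stands in for ProteinMetadata.

```python
import os
import pickle
import tempfile
from typing import Any, Optional


class PickleCache:
    """Toy file-based cache that keeps one pickle file per key."""

    def __init__(self, datadir: str):
        if not os.path.isdir(datadir):
            raise ValueError(f"datadir {datadir} must be an existing directory")
        self._datadir = datadir

    def _path_for(self, key: str) -> str:
        # Hypothetical naming scheme: the key becomes the file name.
        return os.path.join(self._datadir, f"{key}.pickle")

    def get(self, key: str) -> Optional[Any]:
        """Return the cached object or None if there is no file for `key`."""
        path = self._path_for(key)
        if not os.path.isfile(path):
            return None
        with open(path, "rb") as fh:
            return pickle.load(fh)

    def store(self, key: str, value: Any) -> None:
        """Write `value` into the pickle file that belongs to `key`."""
        with open(self._path_for(key), "wb") as fh:
            pickle.dump(value, fh)


with tempfile.TemporaryDirectory() as tmp:
    cache = PickleCache(tmp)
    before = cache.get("NP_001027558.1")  # nothing stored yet
    cache.store("NP_001027558.1", {"label": "toy protein"})
    after = cache.get("NP_001027558.1")
```

Pickle files should only be loaded from locations you control, since unpickling untrusted data can execute arbitrary code.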
- class gpsea.preprocessing.ProtCachingMetadataService(cache: ProteinAnnotationCache, fallback: ProteinMetadataService)
Bases: ProteinMetadataService
A class that retrieves ProteinMetadata from the cache if it exists, or runs the fallback ProteinMetadataService if it does not.
- annotate(protein_id: str): Gets metadata and returns ProteinMetadata for given protein ID
- annotate(protein_id: str) → ProteinMetadata
Gets metadata for given protein ID.
- Parameters:
protein_id (str) – a protein ID
- Returns:
a ProteinMetadata object
- Return type:
ProteinMetadata
- class gpsea.preprocessing.UniprotProteinMetadataService(timeout: float = 30.0)
Bases: ProteinMetadataService
A class that creates ProteinMetadata objects from data found with the Uniprot REST API. More information on the Uniprot REST API is in the Programmatic access section.
- static parse_uniprot_json(payload: Mapping[str, Any], protein_id: str) → ProteinMetadata
Try to extract ProteinMetadata corresponding to protein_id from the Uniprot JSON payload.
- Parameters:
payload – a JSON object corresponding to Uniprot response
protein_id – a str with the accession of the protein of interest
- Returns:
a complete instance of ProteinMetadata
- Raises:
- annotate(protein_id: str) → ProteinMetadata
Get metadata for given protein ID. This class only works with a RefSeq database ID (e.g. NP_037407.4).
- Parameters:
protein_id (str) – a protein ID
- Returns:
a ProteinMetadata corresponding to the input protein_id.
- Return type:
ProteinMetadata
- Raises:
ValueError – in case of issues with protein_id, I/O issues, or parsing the REST response.
- class gpsea.preprocessing.VepFunctionalAnnotator(include_computational_txs: bool = False, timeout: float = 10.0)
Bases: FunctionalAnnotator
A FunctionalAnnotator that uses Variant Effect Predictor (VEP) REST API to do functional variant annotation.
- Parameters:
- NONCODING_EFFECTS = {VariantEffect.DOWNSTREAM_GENE_VARIANT, VariantEffect.FIVE_PRIME_UTR_VARIANT, VariantEffect.INTERGENIC_VARIANT, VariantEffect.INTRON_VARIANT, VariantEffect.NON_CODING_TRANSCRIPT_EXON_VARIANT, VariantEffect.NON_CODING_TRANSCRIPT_VARIANT, VariantEffect.SPLICE_ACCEPTOR_VARIANT, VariantEffect.SPLICE_DONOR_5TH_BASE_VARIANT, VariantEffect.SPLICE_DONOR_VARIANT, VariantEffect.SPLICE_POLYPYRIMIDINE_TRACT_VARIANT, VariantEffect.THREE_PRIME_UTR_VARIANT, VariantEffect.UPSTREAM_GENE_VARIANT}
Non-coding variant effects where we do not complain if the functional annotation lacks the protein effects.
- annotate(variant_coordinates: VariantCoordinates) → Sequence[TranscriptAnnotation]
Perform functional annotation using Variant Effect Predictor (VEP) REST API.
- Parameters:
variant_coordinates (VariantCoordinates) – a VariantCoordinates object
- Returns:
a sequence of transcript annotations for the variant coordinates
- Return type:
Sequence[TranscriptAnnotation]
- Raises:
ValueError – if VEP times out or does not return a response or if the response is not formatted as we expect.
- process_response(variant_key: str, response: Mapping[str, Any]) → Sequence[TranscriptAnnotation]
- fetch_response(variant_coordinates: VariantCoordinates) → Mapping[str, Any]
Get a dict with the response from the VEP REST API.
- Parameters:
variant_coordinates – a query VariantCoordinates.
- static format_coordinates_for_vep_query(vc: VariantCoordinates) → str
Converts the 0-based VariantCoordinates to ones that will be interpreted correctly by VEP.
Example - an insertion/duplication of G after the given G at coordinate 3:
1 2 3 4 5
A C G T A
0-based: 2 3 G GG
1-based: 3 G GG
VEP: 4 3 - G
- Parameters:
vc (VariantCoordinates) – a VariantCoordinates object
- Returns:
the variant coordinates formatted to work with VEP
- Return type:
str
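The conversion in the example above can be reproduced with a short standalone function. This is a sketch under stated assumptions, not the body of format_coordinates_for_vep_query: to_vep_fields is a hypothetical helper, the input is taken to be a 0-based half-open interval, and the output is a tuple of fields instead of the query string that the real method returns.

```python
from typing import Tuple


def to_vep_fields(start: int, end: int, ref: str, alt: str) -> Tuple[int, int, str, str]:
    """Convert a 0-based, half-open variant into 1-based VEP-style fields."""
    # Drop the bases shared at the beginning of both alleles,
    # e.g. the anchor base G of the insertion G>GG.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    # A 0-based half-open [start, end) equals the 1-based closed [start + 1, end].
    # For a pure insertion nothing of the reference is left, so start > end.
    return start + 1, end, ref or "-", alt or "-"


insertion = to_vep_fields(2, 3, "G", "GG")   # the example from the text
snv = to_vep_fields(2, 3, "G", "T")          # a single base substitution
deletion = to_vep_fields(1, 3, "CG", "C")    # deletion of the G at position 3
```

The start being greater than the end for an insertion is intentional here: it is how the example in the text marks the position between two reference bases.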
- class gpsea.preprocessing.VariantAnnotationCache(datadir: str)
Bases: object
A class that stores or retrieves variant annotations using pickle format.
- get_annotations(variant_coordinates: VariantCoordinates): Searches a given data directory for a pickle file with variant coordinates and returns the stored annotations
- store_annotations(variant_coordinates: VariantCoordinates, annotations: Sequence[TranscriptAnnotation]): Creates a pickle file with variant coordinates and stores the given annotations into that file
- get_annotations(variant_coordinates: VariantCoordinates) → Sequence[TranscriptAnnotation] | None
Searches a given data directory for a pickle file with given variant coordinates and returns the annotations from the file. Returns None if no file is found.
- Parameters:
variant_coordinates (VariantCoordinates) – the variant coordinates associated with the desired annotations
- store_annotations(variant_coordinates: VariantCoordinates, annotations: Sequence[TranscriptAnnotation])
Creates a pickle file with the given variant coordinates in the file name and writes the given annotations into the file for storage.
- Parameters:
variant_coordinates (VariantCoordinates) – the variant coordinates associated with the annotations
annotations (Sequence[TranscriptAnnotation]) – annotations that will be stored under the given variant coordinates
- class gpsea.preprocessing.VarCachingFunctionalAnnotator(cache: VariantAnnotationCache, fallback: FunctionalAnnotator)
Bases: FunctionalAnnotator
A class that retrieves the annotations from the cache if they exist, or runs the fallback FunctionalAnnotator if they do not.
- annotate(variant_coordinates: VariantCoordinates): Gets data and returns the annotations for given variant coordinates
- static with_cache_folder(fpath_cache_dir: str, fallback: FunctionalAnnotator)
Create caching functional annotator that will store the data in fpath_cache_dir and use fallback to annotate the missing variants.
- annotate(variant_coordinates: VariantCoordinates) → Sequence[TranscriptAnnotation]
Gets the annotations for given variant coordinates.
- Parameters:
variant_coordinates (VariantCoordinates) – a VariantCoordinates object
- Returns:
a sequence of transcript annotations
- Return type:
Sequence[TranscriptAnnotation]
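The cache-then-fallback flow used by the caching annotator can be illustrated with a minimal sketch. CachingAnnotator is a hypothetical class for this example: it keeps results in an in-memory dictionary instead of pickle files, takes string keys instead of VariantCoordinates, and returns lists of strings instead of TranscriptAnnotation objects.

```python
from typing import Callable, Dict, List, Sequence


class CachingAnnotator:
    """Look up the cache first and call the fallback only on a miss."""

    def __init__(self, fallback: Callable[[str], Sequence[str]]):
        self._cache: Dict[str, Sequence[str]] = {}
        self._fallback = fallback

    def annotate(self, key: str) -> Sequence[str]:
        cached = self._cache.get(key)
        if cached is not None:
            return cached  # cache hit, the fallback is not called
        annotations = self._fallback(key)  # cache miss, ask the fallback
        self._cache[key] = annotations     # remember the result for next time
        return annotations


calls: List[str] = []


def slow_annotator(key: str) -> Sequence[str]:
    """Stand-in for a remote service; records every call it receives."""
    calls.append(key)
    return [f"annotation for {key}"]


annotator = CachingAnnotator(slow_annotator)
first = annotator.annotate("1_100_101_A_G")
second = annotator.annotate("1_100_101_A_G")  # served from the cache
```

Note that an empty result is cached as well, so a variant without annotations does not trigger a repeated call to the fallback.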
- class gpsea.preprocessing.VVHgvsVariantCoordinateFinder(genome_build: GenomeBuild, timeout: int = 30)
Bases: VariantCoordinateFinder[str]
VVHgvsVariantCoordinateFinder uses Variant Validator’s REST API to build VariantCoordinates from an HGVS string.
The finder takes an HGVS str (e.g. NM_005912.3:c.253A>G) and extracts the variant coordinates from the response.
- Parameters:
genome_build – the genome build to use to construct VariantCoordinates
timeout – the REST API request timeout
- find_coordinates(item: str) → VariantCoordinates | None
Extracts variant coordinates from an HGVS string using Variant Validator’s REST API.
- Parameters:
item – an HGVS string
- Returns:
variant coordinates
- class gpsea.preprocessing.VVMultiCoordinateService(genome_build: GenomeBuild, timeout: float = 30.0)
Bases: TranscriptCoordinateService, GeneCoordinateService
VVMultiCoordinateService uses the Variant Validator REST API to fetch transcript coordinates for both a gene ID and a specific transcript ID.
- Parameters:
genome_build – the genome build for constructing the transcript coordinates.
timeout – a positive float with the REST API timeout in seconds.
- fetch(tx: str | TranscriptInfoAware) → TranscriptCoordinates
Get tx coordinates for a tx ID or an entity that knows about the tx ID.
The method will raise an exception in case of an issue.
- Parameters:
tx – a str with tx ID (e.g. NM_002834.5) or an entity that knows about the transcript ID (e.g. TranscriptAnnotation).
- Returns:
the transcript coordinates.
- fetch_for_gene(gene: str) → Sequence[TranscriptCoordinates]
Get tx coordinates for a gene ID.
The method will raise an exception in case of an issue.
- Parameters:
gene – a str with gene ID (e.g. HGNC:3603)
- Returns:
a sequence of transcript coordinates for the gene.
- Return type:
Sequence[TranscriptCoordinates]
- parse_response(tx_id: str, response) → TranscriptCoordinates
- class gpsea.preprocessing.DefaultImpreciseSvFunctionalAnnotator(gene_coordinate_service: GeneCoordinateService)
Bases: ImpreciseSvFunctionalAnnotator
- annotate(item: ImpreciseSvInfo) → Sequence[TranscriptAnnotation]
Compute functional annotations for a large SV.
- Returns:
a sequence of transcript annotations
- Raises:
ValueError – if the annotation cannot proceed due to the remote resource being offline, etc.