Skip to content

monarch-gene-mapping

Details

GitHub monarch-initiative/monarch-gene-mapping
Language Python
Description Code for mapping source namespaces to preffered namespacing

Dependencies

External Dependencies

Package Version
python ^3.9
pandas ^2.1.3
kghub-downloader ^0.3.4
sssom ^0.4
typer ^0.7
prefixmaps 0.1.7

Documentation

monarch-gene-mapping

Code for mapping source namespaces to preferred namespacing

Strategy

This repository creates SSSOM mappings between gene identifiers for use in the Monarch Knowledge Graph. Gene naming authorities (HGNC, Model Organism Databases) are the preferred identifiers, with NCBIGene as a fallback. We prefer the naming sources as the first choice source for mappings. When the naming authority doesn't provide a mapping for an identifier we need to map from, we will use the source of that identifier as a fallback. Finally, a third party gene mapping may be used as a last resort.

Installation

poetry install

Usage

python -m monarch_gene_mapping.main --help

is a simple UI for processing the mapping data.

Special Data Considerations

The UniProtKB ID mappings file is huge: about an eleven (11) gigabyte gzip compressed archive (as of November 2022). The Monarch Initiative only targets a subset of species in this file. The standard procedure is to 'pre-filter' the data after downloading but before continued processing. This is the default 'generate' process, but the monarch_gene_mapping.main script allows for discrete processing of this step.