PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System
Panther Gene Orthology
Gene orthology analyses generate testable hypotheses about gene function and biological processes by drawing on experimental results from other species (especially highly studied, so-called 'model' species), using protein-level (and sometimes simply nucleic-acid-level) alignments of genomic sequences. The source of gene orthology data for this ingest is the PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System. Panther was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. Proteins have been classified according to:

- Family and subfamily: families are groups of evolutionarily related proteins; subfamilies are related proteins that also have the same function.
- Molecular function: the function of the protein by itself or with directly interacting proteins at a biochemical level, e.g. a protein kinase.
- Biological process: the function of the protein in the context of a larger network of proteins that interact to accomplish a process at the level of the cell or organism, e.g. mitosis.
- Pathway: similar to biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.
The PANTHER Classifications are the result of human curation as well as sophisticated bioinformatics algorithms. Details of the methods can be found in Mi et al. NAR 2013; Thomas et al., Genome Research 2003.
This ingest uses data derived from the current version (release 16.0) of the Panther Hidden Markov Model (HMM).
There are various cross-sections of the Panther database which remain to be covered by this ingest (Note: T.B.D. means "To Be Done").
Status of Panther Ingest
The first iteration of this dataset (committed March 2022) focuses on Reference Genome Gene-to-Gene Orthology Relationships. Additional Panther associations (protein (sub)family pathways, sequences, etc., as generally described below) may be added at a later date.
Reference Genome Gene-to-Gene Orthology Relationships
Contains the Reference Genomes' Gene-to-Gene Ortholog mappings from Panther analyses.
- Source File: AllOrthologs.tar.gz.
The source file is huge, containing data from all species, many of which are not currently of direct interest to Monarch. For this reason, a Python function filter_panther_orthologs_file was coded within orthology_utils.
    ALL_ORTHOLOGS_FILE = "AllOrthologs"
    TARGET_SPECIES_ORTHOLOGS = "TargetOrthologs"


    def filter_panther_orthologs_file(
        directory: str = '.',
        source_filename: str = ALL_ORTHOLOGS_FILE,
        target_filename: str = TARGET_SPECIES_ORTHOLOGS,
        number_of_lines: int = 0
    ) -> bool:
        """
        Filters a tar.gz Panther input file against the target list of species.

        :param directory: str, location of source data file
        :param source_filename: str, source data file name
        :param target_filename: str, target data file name
        :param number_of_lines: int, number of lines parsed; 'all' lines parsed if omitted or set to zero
        :return: bool, True if filtering was successful; False if unsuccessful
        """
        ...
which could be called with default parameter values in the following manner (if invoked from within the Panther data directory):

    filter_panther_orthologs_file()

to generate a pruned-down TargetOrthologs.tar.gz file containing only the target species (as hard-coded in the catalog of species in the ortholog_utils module).
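
The filtering step described above can be sketched roughly as follows. This is only an illustration, not the actual body of filter_panther_orthologs_file: the names TARGET_SPECIES, keep_line, filter_lines and iter_tar_gz_lines are hypothetical, the species set is a made-up subset, and the rule that both the gene and its ortholog must belong to a target species is an assumption.

```python
import io
import tarfile

# Hypothetical subset of Panther species labels; the real catalog is
# hard-coded in the utility module and is not reproduced here.
TARGET_SPECIES = {"HUMAN", "MOUSE"}


def keep_line(line: str, targets=TARGET_SPECIES) -> bool:
    """Return True when both gene columns belong to a target species (assumed rule)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return False
    # Each gene column looks like 'species|DB=id|protdb=pdbid'.
    species1 = fields[0].split("|", 1)[0]
    species2 = fields[1].split("|", 1)[0]
    return species1 in targets and species2 in targets


def filter_lines(lines, targets=TARGET_SPECIES, number_of_lines: int = 0):
    """Yield retained lines; read only the first number_of_lines inputs when non-zero."""
    for n, line in enumerate(lines, start=1):
        if number_of_lines and n > number_of_lines:
            break
        if keep_line(line, targets):
            yield line


def iter_tar_gz_lines(path: str):
    """Yield text lines from every regular file inside a .tar.gz archive."""
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with io.TextIOWrapper(handle, encoding="utf-8") as text:
                yield from text
```

Under these assumptions, filter_lines(iter_tar_gz_lines("AllOrthologs.tar.gz")) would stream the retained records, which could then be written into a new archive.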
Panther Data Model of Panther Orthologs
Data Field | Content |
---|---|
Gene | species1\|DB=id1\|protdb=pdbid1 |
Ortholog | species2\|DB=id2\|protdb=pdbid2 |
Type of ortholog | [LDO, O, P, X, LDX]; see README. |
Common ancestor for the orthologs | taxon name of common ancestor |
Panther Ortholog ID | Panther (sub)family identifier |
The DB=id# fields, where DB is the database namespace and id# is the object identifier, are translated directly into gene CURIEs by an internal namespace mapping.
The species# fields are abridged labels that are currently filtered and mapped onto NCBI Taxon identifiers using a hard-coded dictionary.
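
As a minimal sketch of the field translation just described, the following shows how one gene column might be turned into a gene CURIE and a taxon CURIE. The lookup tables SPECIES_TO_TAXON and DB_TO_PREFIX and the function parse_gene_field are hypothetical stand-ins for the ingest's internal mappings, and the sample identifiers are illustrative only.

```python
from typing import Optional, Tuple

# Hypothetical lookup tables; the ingest's own hard-coded dictionaries may differ.
SPECIES_TO_TAXON = {
    "HUMAN": "NCBITaxon:9606",
    "MOUSE": "NCBITaxon:10090",
    "RAT": "NCBITaxon:10116",
}
DB_TO_PREFIX = {"HGNC": "HGNC", "MGI": "MGI", "RGD": "RGD"}


def parse_gene_field(field: str) -> Optional[Tuple[str, str]]:
    """Map 'species|DB=id|protdb=pdbid' to (gene CURIE, taxon CURIE).

    Returns None when the field is malformed or uses an unmapped
    species label or database namespace.
    """
    parts = field.split("|")
    if len(parts) != 3 or "=" not in parts[1]:
        return None
    species, gene_part, _protein_part = parts
    # Split on the first '=' only, so the identifier itself is kept intact.
    db, local_id = gene_part.split("=", 1)
    if species not in SPECIES_TO_TAXON or db not in DB_TO_PREFIX:
        return None
    return f"{DB_TO_PREFIX[db]}:{local_id}", SPECIES_TO_TAXON[species]


# Usage with an illustrative gene column:
print(parse_gene_field("HUMAN|HGNC=11477|UniProtKB=Q6GZX4"))
# ('HGNC:11477', 'NCBITaxon:9606')
```

Returning None for unmapped species or namespaces lets the caller skip records outside the target set instead of raising an error; that behaviour is a design choice of this sketch, not something the ingest is known to do.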
Biolink classes and properties captured
- biolink:Gene
  - id (NCBIGene Entrez ID)

Note that the Gene source is currently given as Panther, although the real source of a Gene identifier is given by its CURIE namespace.

- biolink:GeneToGeneHomologyAssociation
  - id (random uuid)
  - subject (gene.id)
  - predicate (orthologous to)
  - object (gene.id)
  - aggregating_knowledge_source (["infores:monarchinitiative"])
  - primary_knowledge_source (infores:panther)
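
To make the association properties listed above concrete, here is a hedged sketch that assembles one orthology record as a plain dictionary. The function make_orthology_association, the "uuid:" identifier prefix and the dictionary layout are assumptions made for illustration; the actual ingest may build Biolink model objects instead.

```python
import uuid


def make_orthology_association(subject_gene: str, object_gene: str) -> dict:
    """Build a GeneToGeneHomologyAssociation-like record as a plain dict (illustrative)."""
    return {
        # Random identifier; the 'uuid:' prefix is an assumption of this sketch.
        "id": f"uuid:{uuid.uuid4()}",
        "category": ["biolink:GeneToGeneHomologyAssociation"],
        "subject": subject_gene,
        "predicate": "biolink:orthologous_to",
        "object": object_gene,
        "aggregating_knowledge_source": ["infores:monarchinitiative"],
        "primary_knowledge_source": "infores:panther",
    }


# Usage with illustrative gene CURIEs:
association = make_orthology_association("HGNC:11477", "RGD:1564893")
print(association["subject"], association["predicate"], association["object"])
```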
Protein Family and Subfamily Classifications - T.B.D.
Contains the PANTHER 16.0 family/subfamily name, with molecular function, biological process, and pathway classifications for every PANTHER protein family and subfamily in the current PANTHER HMM library.
- Source File: http://data.pantherdb.org/ftp/hmm_classifications/current_release/PANTHER16.0_HMM_classifications
- Biolink classes and properties captured:
  - biolink:GeneFamily
    - id (PANTHER.FAMILY ID)
    - source (infores:panther)
  - biolink:MolecularActivity
    - id (GO ID)
    - source (go)
  - biolink:BiologicalProcess
    - id (GO ID)
    - source (go)
  - biolink:Pathway
    - id (PANTHER.PATHWAY)
    - source (infores:panther)
  - biolink:GeneFamilyToMolecularFunctionAssociation
    - id (random uuid)
    - subject (gene_family.id)
    - predicate (enables)
    - object (go_term.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
  - biolink:GeneFamilyToBiologicalProcessAssociation
    - id (random uuid)
    - subject (gene_family.id)
    - predicate (involved_in)
    - object (go_term.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
  - biolink:GeneFamilyToPathwayAssociation
    - id (random uuid)
    - subject (gene_family.id)
    - predicate (involved_in)
    - object (pathway.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
Pathways - T.B.D.
Contains regulatory and metabolic pathways, each with subfamilies and protein sequences mapped to individual pathway components.
- Source File: http://data.pantherdb.org/ftp/pathway/current_release/SequenceAssociationPathway3.6.5.txt (local_name: data/orthology/pathways.tsv)
- Biolink classes and properties captured:
  - biolink:GeneFamily
    - id (PANTHER.FAMILY ID)
    - source (infores:panther)
  - biolink:Gene
    - id (NCBIGene Entrez ID)
    - in taxon (NCBITaxon ID)
    - source (infores:entrez)
  - biolink:Pathway
    - id (PANTHER.PATHWAY)
    - source (infores:panther)
  - biolink:GeneToPathwayAssociation
    - id (random uuid)
    - subject (gene.id)
    - predicate (involved_in)
    - object (pathway.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
  - biolink:GeneFamilyToPathwayAssociation
    - id (random uuid)
    - subject (gene_family.id)
    - predicate (involved_in)
    - object (pathway.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
Sequence Classifications - T.B.D.
Sequence Classifications files contain the PANTHER family, subfamily, molecular function, biological process, and pathway classifications for the complete proteomes derived from the various genomes, indexed by species (one source file per species). Refer to the Sequence Classification README for details.
Only a subset of the available species will be ingested into Monarch at this time, currently: human, mouse, rat, zebrafish, fruit fly, nematode, fission yeast and budding ("baker's") yeast.
- Source File Directory: http://data.pantherdb.org/ftp/sequence_classifications/current_release/PANTHER_Sequence_Classification_files/
- Biolink classes and properties captured:
  - biolink:Gene
    - id (PANTHER.FAMILY ID)
    - source (infores:panther)
  - biolink:GeneFamily
    - id (PANTHER.FAMILY ID)
    - source (infores:panther)
  - biolink:MolecularActivity
    - id (GO ID)
    - source (go)
  - biolink:BiologicalProcess
    - id (GO ID)
    - source (go)
  - biolink:Pathway
    - id (PANTHER.PATHWAY)
    - source (infores:panther)
  - biolink:GeneToGeneFamilyAssociation
    - id (random uuid)
    - subject (gene.id)
    - predicate (member_of)
    - object (gene_family.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
  - biolink:MacromolecularMachineToMolecularActivityAssociation
    - id (random uuid)
    - subject (gene.id)
    - predicate (enables)
    - object (go_term.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
  - biolink:MacromolecularMachineToBiologicalProcessAssociation
    - id (random uuid)
    - subject (gene.id)
    - predicate (involved_in)
    - object (go_term.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
  - biolink:GeneToPathwayAssociation
    - id (random uuid)
    - subject (gene.id)
    - predicate (involved_in)
    - object (pathway.id)
    - aggregating_knowledge_source (["infores:monarchinitiative"])
    - primary_knowledge_source (infores:panther)
Citation
Paul D. Thomas, Dustin Ebert, Anushya Muruganujan, Tremayne Mushayahama, Laurent-Philippe Albou and Huaiyu Mi. PANTHER: Making genome-scale phylogenetics accessible to all. Protein Science. 2022;31(1):8-22. doi:10.1002/pro.4218