Engine Preferences

Engines in monarchr provide an abstraction over a Knowledge Graph (KG), supporting operations such as fetching nodes, expanding graphs, and producing summaries. When instantiating an engine, we can optionally provide a preferences list to adjust its default behaviors.

To see the default preferences for an engine, we can simply inspect its preferences entry:

library(monarchr)

e <- monarch_engine()
str(e$preferences)
## List of 4
##  $ category_priority     : chr [1:36] "biolink:LifeStage" "biolink:MolecularEntity" "biolink:OrganismTaxon" "biolink:Cell" ...
##  $ node_property_priority: chr [1:8] "id" "pcategory" "name" "symbol" ...
##  $ edge_property_priority: chr [1:4] "subject" "predicate" "object" "primary_knowledge_source"
##  $ monarch_api_url       : chr "https://api.monarchinitiative.org/v3/api"

Category Priority

The most important entry here is category_priority. KGX-formatted KGs must label nodes with a multi-valued category. In some KGs this may contain a single value; a gene, for example, may be labeled with only c("biolink:Gene"). In other KGs, including the Monarch KG, entities may be labeled with multiple categories, for example c("biolink:GenomicEntity", "biolink:Entity", "biolink:Gene", "biolink:NamedThing"). KGX does not specify an order for these values, even though Biolink categories form a hierarchy: a biolink:Gene is a type of biolink:GenomicEntity, which is a type of biolink:Entity, and so on.

In practice, however, a single category is typically most relevant. For genes we usually care about the biolink:Gene category, for diseases biolink:Disease, and so on. In monarchr, this “primary category” is represented by the nodes’ pcategory column for convenience:

library(dplyr)

data(eds_marfan_kg)

g <- eds_marfan_kg |> fetch_nodes(query_ids = "HP:0001788") |> 
  expand(categories = "biolink:Disease") |>
  expand(categories = "biolink:Gene")

nodes(g) |> select(id, name, pcategory, category)
## # A tibble: 10 × 4
##    id            name                                          pcategory                 category  
##    <chr>         <chr>                                         <chr>                     <list>    
##  1 HP:0001788    Premature rupture of membranes                biolink:PhenotypicFeature <chr [6]> 
##  2 MONDO:0016002 Ehlers-Danlos syndrome, kyphoscoliotic type 1 biolink:Disease           <chr [6]> 
##  3 MONDO:0009161 Ehlers-Danlos syndrome, dermatosparaxis type  biolink:Disease           <chr [6]> 
##  4 MONDO:0007522 Ehlers-Danlos syndrome, classic type          biolink:Disease           <chr [6]> 
##  5 HGNC:2197     COL1A1                                        biolink:Gene              <chr [12]>
##  6 HGNC:2209     COL5A1                                        biolink:Gene              <chr [12]>
##  7 HGNC:9081     PLOD1                                         biolink:Gene              <chr [12]>
##  8 HGNC:2210     COL5A2                                        biolink:Gene              <chr [12]>
##  9 HGNC:218      ADAMTS2                                       biolink:Gene              <chr [12]>
## 10 HGNC:14631    ADAMTSL2                                      biolink:Gene              <chr [12]>

Which entry of category is chosen for pcategory is determined by the engine’s $preferences$category_priority. For each node, the first entry of category_priority that is present in the node's category is used; if none are present, the first entry of category is used instead.
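The selection rule described above can be sketched in a few lines of base R. This is only an illustration of the rule as stated here, not monarchr's internal code; the helper name pick_pcategory is hypothetical.

```r
# Hypothetical helper (not part of monarchr): choose a primary category
# for one node, given its categories and a priority list.
pick_pcategory <- function(categories, priority) {
  # Keep priority entries that the node actually has, in priority order
  hits <- priority[priority %in% categories]
  # First match wins; otherwise fall back to the node's first category
  if (length(hits) > 0) hits[1] else categories[1]
}

gene_categories <- c("biolink:GenomicEntity", "biolink:Entity",
                     "biolink:Gene", "biolink:NamedThing")

# A priority list that ranks biolink:Gene first selects it
p1 <- pick_pcategory(gene_categories, c("biolink:Gene", "biolink:Disease"))

# No priority entry matches, so the first category is used
p2 <- pick_pcategory(gene_categories, c("biolink:Disease"))

print(p1)
## [1] "biolink:Gene"
print(p2)
## [1] "biolink:GenomicEntity"
```

Note that the fallback depends on the order in which the KG happens to list a node's categories, which KGX leaves unspecified.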

We can adjust this when initializing the engine. To do so in this example we need to load the KG from file with file_engine().

filename <- system.file("extdata", "eds_marfan_kg.tar.gz", package = "monarchr")

# Override only category_priority
eds_marfan_kg <- file_engine(filename,
                             preferences = list(category_priority = c("biolink:GenomicEntity",
                                                                      "biolink:DiseaseOrPhenotypicFeature")))

g <- eds_marfan_kg |> fetch_nodes(query_ids = "HP:0001788") |> 
  expand(categories = "biolink:Disease") |>
  expand(categories = "biolink:Gene")

nodes(g) |> select(id, name, pcategory, category)
## # A tibble: 10 × 4
##    id            name                                          pcategory                          category  
##    <chr>         <chr>                                         <chr>                              <list>    
##  1 HP:0001788    Premature rupture of membranes                biolink:DiseaseOrPhenotypicFeature <chr [6]> 
##  2 MONDO:0016002 Ehlers-Danlos syndrome, kyphoscoliotic type 1 biolink:DiseaseOrPhenotypicFeature <chr [6]> 
##  3 MONDO:0009161 Ehlers-Danlos syndrome, dermatosparaxis type  biolink:DiseaseOrPhenotypicFeature <chr [6]> 
##  4 MONDO:0007522 Ehlers-Danlos syndrome, classic type          biolink:DiseaseOrPhenotypicFeature <chr [6]> 
##  5 HGNC:2197     COL1A1                                        biolink:GenomicEntity              <chr [12]>
##  6 HGNC:2209     COL5A1                                        biolink:GenomicEntity              <chr [12]>
##  7 HGNC:9081     PLOD1                                         biolink:GenomicEntity              <chr [12]>
##  8 HGNC:2210     COL5A2                                        biolink:GenomicEntity              <chr [12]>
##  9 HGNC:218      ADAMTS2                                       biolink:GenomicEntity              <chr [12]>
## 10 HGNC:14631    ADAMTSL2                                      biolink:GenomicEntity              <chr [12]>

The default category_priority list is designed to preferentially assign most-specific categories to pcategory and should work well in most uses.

Other Preferences

Other preferences common to both file_engines and neo4j_engines include:

  • node_property_priority - defines a subset of column names to include first in node data frames
  • edge_property_priority - defines a subset of column names to include first in edge data frames
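To make "include first" concrete, here is a minimal base R sketch of this kind of column reordering. It assumes that priority columns missing from the data are simply skipped and that the remaining columns keep their original order; the helper reorder_columns is hypothetical and is not monarchr's implementation.

```r
# Hypothetical helper (not part of monarchr): move priority columns to the
# front of a data frame, leaving the remaining columns in their original order.
reorder_columns <- function(df, priority) {
  # Priority names that exist in df, in priority order
  first <- intersect(priority, names(df))
  # Everything else, in original order
  rest <- setdiff(names(df), first)
  df[, c(first, rest), drop = FALSE]
}

nodes_df <- data.frame(category = "biolink:Gene",
                       name = "COL1A1",
                       id = "HGNC:2197",
                       stringsAsFactors = FALSE)

# "pcategory" is absent from nodes_df, so it is skipped
reordered <- reorder_columns(nodes_df, c("id", "pcategory", "name"))
print(names(reordered))
## [1] "id"       "name"     "category"
```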

Query Caching for Neo4j Engines

By default, neo4j_engines and monarch_engines cache queries for the duration of the R session, which speeds up exploratory analyses. Caching can be disabled, but it is not (currently) controlled by preferences; it is instead a parameter to the engine constructor:

monarch <- monarch_engine(cache = FALSE)

g1 <- monarch |> fetch_nodes(query_ids = "HP:0001788") |>
  expand()

g2 <- monarch |> fetch_nodes(query_ids = "HP:0001788") |>
  expand()

In the above example, because caching is disabled, the fetch and expansion are re-run in computing g2.
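As a rough illustration of what session-level caching buys, the following base R sketch memoizes a function by its argument. It is a generic pattern under assumed names (make_cached, slow_query) and makes no claim about how monarchr implements its cache.

```r
# Hypothetical illustration only; not monarchr's internal implementation.
# Wrap a one-argument function so repeated calls with the same key reuse
# the stored result instead of recomputing it.
make_cached <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Stand-in for an expensive remote query; counts how often it really runs
calls <- 0
slow_query <- function(id) {
  calls <<- calls + 1
  paste("result for", id)
}

cached_query <- make_cached(slow_query)
r1 <- cached_query("HP:0001788")
r2 <- cached_query("HP:0001788")  # served from the cache

print(calls)
## [1] 1
```

Without the wrapper, both calls would run slow_query, which mirrors the cache = FALSE behavior described above.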