vignettes/examples/engine_preferences.Rmd
Engines in monarchr provide an abstraction over a Knowledge Graph (KG), supporting node fetching, expansion, and summaries. When instantiating an engine, we can optionally provide a preferences list to adjust default behaviors.
To see the default preferences for an engine, we can inspect its preferences entry:
library(monarchr)
e <- monarch_engine()
str(e$preferences)
## List of 4
## $ category_priority : chr [1:36] "biolink:LifeStage" "biolink:MolecularEntity" "biolink:OrganismTaxon" "biolink:Cell" ...
## $ node_property_priority: chr [1:8] "id" "pcategory" "name" "symbol" ...
## $ edge_property_priority: chr [1:4] "subject" "predicate" "object" "primary_knowledge_source"
## $ monarch_api_url : chr "https://api.monarchinitiative.org/v3/api"
The most important entry here is category_priority.
KGX-formatted KGs must label nodes with a multi-valued category. In some KGs this may contain a single value; a gene, for example, may be labeled with only c("biolink:Gene"). In other KGs, including the Monarch KG, entities may be labeled with multiple categories, for example c("biolink:GenomicEntity", "biolink:Entity", "biolink:Gene", "biolink:NamedThing").
KGX does not specify an order for these values, even though the Biolink categories themselves form a hierarchy: a biolink:Gene is a type of biolink:GenomicEntity, which is a type of biolink:Entity, and so on.
In practice, however, a single category is typically most relevant. For genes we usually care about their biolink:Gene category, for diseases their biolink:Disease category, and so on. In monarchr, this "primary category" is stored in the nodes' pcategory column for convenience:
library(dplyr)
data(eds_marfan_kg)
g <- eds_marfan_kg |> fetch_nodes(query_ids = "HP:0001788") |>
expand(categories = "biolink:Disease") |>
expand(categories = "biolink:Gene")
nodes(g) |> select(id, name, pcategory, category)
## # A tibble: 10 × 4
## id name pcategory category
## <chr> <chr> <chr> <list>
## 1 HP:0001788 Premature rupture of membranes biolink:PhenotypicFeature <chr [6]>
## 2 MONDO:0016002 Ehlers-Danlos syndrome, kyphoscoliotic type 1 biolink:Disease <chr [6]>
## 3 MONDO:0009161 Ehlers-Danlos syndrome, dermatosparaxis type biolink:Disease <chr [6]>
## 4 MONDO:0007522 Ehlers-Danlos syndrome, classic type biolink:Disease <chr [6]>
## 5 HGNC:2197 COL1A1 biolink:Gene <chr [12]>
## 6 HGNC:2209 COL5A1 biolink:Gene <chr [12]>
## 7 HGNC:9081 PLOD1 biolink:Gene <chr [12]>
## 8 HGNC:2210 COL5A2 biolink:Gene <chr [12]>
## 9 HGNC:218 ADAMTS2 biolink:Gene <chr [12]>
## 10 HGNC:14631 ADAMTSL2 biolink:Gene <chr [12]>
Which entry of category is chosen for pcategory is determined by the engine's $preferences$category_priority. For each node, the first entry of category_priority that is present in the node's category is used; if none are present, the first entry of category is used instead.
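The selection rule can be sketched in a few lines of base R. This is only an illustration of the rule as described here: pick_pcategory() is a hypothetical helper, not a monarchr function, and the package's internal implementation may differ.

```r
# Hypothetical helper illustrating the selection rule; not monarchr's own code.
pick_pcategory <- function(categories, priority) {
  # Priority entries that this node actually carries, kept in priority order
  matches <- priority[priority %in% categories]
  if (length(matches) > 0) {
    matches[1]
  } else {
    # No priority entry matches: fall back to the node's first category
    categories[1]
  }
}

gene_categories <- c("biolink:GenomicEntity", "biolink:Entity",
                     "biolink:Gene", "biolink:NamedThing")

# "biolink:Gene" is the first priority entry found among the node's categories
with_match <- pick_pcategory(gene_categories,
                             priority = c("biolink:Gene", "biolink:Disease"))

# No priority entry is present, so the first category is used
without_match <- pick_pcategory(gene_categories, priority = "biolink:Disease")
```

Here with_match is "biolink:Gene" and without_match is "biolink:GenomicEntity", which shows why the order of category_priority matters more than the order of a node's own category values.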
We can adjust this when initializing the engine. To do so in this example, we need to load the KG from file with file_engine().
filename <- system.file("extdata", "eds_marfan_kg.tar.gz", package = "monarchr")
eds_marfan_kg <- file_engine(filename,
preferences = list(category_priority = c("biolink:GenomicEntity",
"biolink:DiseaseOrPhenotypicFeature")))
g <- eds_marfan_kg |> fetch_nodes(query_ids = "HP:0001788") |>
expand(categories = "biolink:Disease") |>
expand(categories = "biolink:Gene")
nodes(g) |> select(id, name, pcategory, category)
## # A tibble: 10 × 4
## id name pcategory category
## <chr> <chr> <chr> <list>
## 1 HP:0001788 Premature rupture of membranes biolink:DiseaseOrPhenotypicFeature <chr [6]>
## 2 MONDO:0016002 Ehlers-Danlos syndrome, kyphoscoliotic type 1 biolink:DiseaseOrPhenotypicFeature <chr [6]>
## 3 MONDO:0009161 Ehlers-Danlos syndrome, dermatosparaxis type biolink:DiseaseOrPhenotypicFeature <chr [6]>
## 4 MONDO:0007522 Ehlers-Danlos syndrome, classic type biolink:DiseaseOrPhenotypicFeature <chr [6]>
## 5 HGNC:2197 COL1A1 biolink:GenomicEntity <chr [12]>
## 6 HGNC:2209 COL5A1 biolink:GenomicEntity <chr [12]>
## 7 HGNC:9081 PLOD1 biolink:GenomicEntity <chr [12]>
## 8 HGNC:2210 COL5A2 biolink:GenomicEntity <chr [12]>
## 9 HGNC:218 ADAMTS2 biolink:GenomicEntity <chr [12]>
## 10 HGNC:14631 ADAMTSL2 biolink:GenomicEntity <chr [12]>
The default category_priority list is designed to preferentially assign the most specific categories to pcategory and should work well in most uses.
Other preferences common to both file_engines and neo4j_engines include:

- node_property_priority: defines a subset of column names to include first in node data frames
- edge_property_priority: defines a subset of column names to include first in edge data frames

By default, neo4j_engines and monarch_engines cache queries for the duration of the R session, speeding up exploratory analyses. This can be disabled, but it is not (currently) controlled by preferences; it is instead a parameter to the engine constructor:
monarch <- monarch_engine(cache = FALSE)
g1 <- monarch |> fetch_nodes(query_ids = "HP:0001788") |>
expand()
g2 <- monarch |> fetch_nodes(query_ids = "HP:0001788") |>
expand()
In the above example, because caching is disabled, the fetch and expansion are re-run when computing g2.
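To make the effect of session-level caching concrete, here is a minimal base R sketch of the general idea. It is not how monarchr implements its cache: slow_query() and with_session_cache() are hypothetical names, and the "query" is a stand-in that only counts how often it runs.

```r
# Hypothetical stand-in for a slow KG query; counts how often it actually runs
query_count <- 0
slow_query <- function(id) {
  query_count <<- query_count + 1
  paste("result for", id)
}

# Wrap a single-argument function with a cache held in a private environment,
# so results persist for as long as the wrapper exists in the session
with_session_cache <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

cached_query <- with_session_cache(slow_query)
r1 <- cached_query("HP:0001788")  # runs slow_query()
r2 <- cached_query("HP:0001788")  # served from the cache
```

After these two calls query_count is 1, because the second call never reaches slow_query(). Calling slow_query() directly twice would run it twice, which mirrors the cache = FALSE behavior described above.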