OntoGPT Functions
All OntoGPT functions are run from the command line.
Precede all commands with ontogpt
.
To see the full list of available commands, run this:
ontogpt --help
Basic Parameters
To see verbose output, run:
ontogpt -v
The options -vv
and -vvv
will enable progressively more verbose output.
cache-db
Use the option --cache-db
to specify a path to a sqlite database to cache the prompt-completion results.
skip-annotator
Use the option --skip-annotator
to skip one or more annotators (e.g. --skip-annotator gilda
).
Common Parameters
The following options are available for most functions unless stated otherwise.
inputfile
Use the option --inputfile
to specify a path to a file containing input text.
For the extract
command, this may be a single file or a directory of files.
In the latter case, all .txt files will be assumed to be input, and the path will not be parsed recursively.
template
Use the option --template
to specify a template to use. This is a required parameter.
Only the name is required, without any filename suffix.
To use the gocam
template, for example, the parameter will be --template gocam
This may be one of the templates included with OntoGPT or a custom template, but in the latter case, the schema, generated Pydantic classes, and any imported schemas should be present in the same location.
target-class
Use the option --target-class
to specify a class in a schema to treat as the root.
If a schema does not already specify a root class, this is required.
Alternatively, the target class can be specified as part of the --template
option, like so: --template mendelian_disease.MendelianDisease
model
Use the option model
to specify the name of a large language model to be used.
For example, this may be --model gpt-4
.
Consult the full list of available models with:
ontogpt list-models
recurse
Use the option recurse
to specify whether recursion should be used when parsing the schema.
Recursion is on by default.
Disable it with --no-recurse
.
use-textract
Use the option use-textract
to specify whether to use the textract package to extract text from the input document. Textract supports retrieving raw text from PDFs, images, audio, and a variety of other formats.
Textract extraction is off by default.
Enable it with --use-textract
.
output
Use the option output
to provide a path to write an output file to.
If this path is not provided, OntoGPT will write to stdout.
output-format
Use the option output-format
to specify the desired output file format.
This may be one of:
- html
- json
- jsonl
- md
- owl
- pickle
- turtle
- yaml
auto-prefix
Use the option auto-prefix
to define a prefix to use for entities without a matching namespace.
When OntoGPT's extract functions find an entity matching the input schema but cannot ground it, the entity will still be included in the output.
By default, these entities will be assigned identifiers like AUTO:tangerine
. If you ground this term to the Food Ontology, however, the entity may be FOODON:00003488
instead.
show-prompt
Use the option show-prompt
to show all prompts constructed and sent to the model. Otherwise, only the final prompt will be shown.
Showing the full prompt is off by default.
Enable it by using --show-prompt
and setting the verbosity level to high (-vvv
).
Functions
categorize-mappings
Categorize a collection of mappings in the Simple Standard for Sharing Ontological Mappings (SSSOM) format.
Mappings in this format may not include their specific mapping types (e.g., broad or close mappings).
This function will attempt to apply more specific mappings wherever possible.
Example:
Using an example SSSOM mapping collection
ontogpt categorize-mappings -i mp-hp-exact-0.0.1.sssom.tsv
Note that OntoGPT will attempt to retrieve the resources specified in the mapping (in the above example, that will include HP and MP). If it cannot find a corresponding resource it will raise a HTTP 404 error.
clinical-notes
Create mock clinical notes.
Options:
-d
,--description TEXT
- a text description of the contents of the generated notes.--sections TEXT
- sections to include in the generated notes, for example, medications, vital signs. Use multiple times for multiple sections, e.g.,--sections medications --sections "vital signs"
Example:
ontogpt clinical-notes -d "middle-aged female patient with syncope and recent travel to the Amazon rainforest"
complete
Prompt completion.
Given the path to a file containing text to continue, this command will simply pass it to the model as a completion task.
Example:
The file example2.txt
contains the text "Here's a good joke about high blood pressure:"
ontogpt complete example2.txt
We take no responsibility for joke quality or lack thereof.
convert
Convert output format.
Rather than a direct format translation, this function performs a full SPIRES extraction on the input file and writes the output in the specified format.
Example:
ontogpt convert -o outputfile.md -O md inputfile.yaml
convert-examples
Convert training examples from YAML.
This can be necessary for performing evaluations.
Given the path to a YAML-format input file containing training examples in a format like this:
---
examples:
- prompt: <text prompt>
completion: <text of completion of the prompt>
- prompt: <another text prompt>
completion: <text of completion of another prompt>
the function will convert it to equivalent JSON.
Example:
ontogpt convert-examples inputfile.yaml
convert-geneset
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
create-gene-set
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
diagnose
Diagnose a clinical case represented as one or more Phenopackets.
This function takes one or more file paths as arguments, where each must contain a phenopacket in JSON format.
Example inputs may be found at the Phenomics Exchange repository.
Example:
ontogpt diagnose case1.json case2.json
dump-completions
Dump cached completions.
OntoGPT saves queries and successful text completions to an sqlite database.
Caching is not currently supported for local models.
Use this function to retrieve the contents of this database.
See also: the cache-db
parameter described above.
Options:
-m TEXT
- Match string to use for filtering.-D TEXT
- Path to sqlite database.
Example:
ontogpt dump-completions -m "soup"
embed
Embed text.
This function will return an embedding vector for the input text or texts.
Embedding retrieval is not currently supported for local models.
Options:
-C
,--context TEXT
- domain e.g. anatomy, industry, health-related
Example:
ontogpt embed "obstreperous muskrat"
For OpenAI's "text-embedding-ada-002" model, the output will be a vector of length 1536, like so:
[-0.015013165771961212, -0.013102399185299873, -0.005333086010068655, ...]
enrichment
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
entity-similarity
Determine similarity between ontology entities by comparing their embeddings.
Options:
-r
,--ontology TEXT
- name of the ontology to use. This should be an OAK adapter name such as "sqlite:obo:hp".--definitions
/--no-definitions
- Include text definitions in the text to embed. Defaults to True.--parents
/--no-parents
- Include is-a parent terms in the text to embed. Defaults to True.--ancestors
/--no-ancestors
- Include all ancestors in the text to embed. Defaults to True.--logical-definitions
/--no-logical-definitions
- Include logical definitions in the text to embed. Defaults to True.--autolabel
/--no-autolabel
- Add labels to each subject and object identifier. Defaults to True.--synonyms
/--no-synonyms
- Include synonyms in the text to embed. Defaults to True.
Example:
ontogpt entity-similarity -r sqlite:obo:hp HP:0012228 HP:0000629
In this case, the output will look like this:
subject_id subject_label object_id object_label embedding_cosine_similarity object_rank_for_subject
HP:0012228 Tension-type headache HP:0012228 Tension-type headache 0.9999999999999999 0
HP:0012228 Tension-type headache HP:0000629 Periorbital fullness 0.7755551231762359 1
HP:0000629 Periorbital fullness HP:0000629 Periorbital fullness 1.0000000000000002 0
HP:0000629 Periorbital fullness HP:0012228 Tension-type headache 0.7755551231762359 1
eval
Evaluate an extractor.
See the Evaluations section for more details.
Options:
--num-tests INTEGER
- number of test iterations to cycle through. Defaults to 5.--chunking
/--no-chunking
- If set, chunk input text, then prepare a separate prompt for each chunk. Otherwise the full input text is passed. Defaults to False.
Example:
ontogpt eval --num-tests 1 EvalCTD
eval-enrichment
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
extract
Extract knowledge from text guided by a schema.
This is OntoGPT's implementation of SPIRES.
Output includes the input text (or a truncated part), the raw completion output, the prompt (specifically, the last iteration of the prompts used), and an extracted object containing all parts identified in the input text, as well as a list of named entities and their labels.
Options:
-S
,--set-slot-value TEXT
- Set slot value manually, e.g.,--set-slot-value has_participant=protein
Examples:
ontogpt extract -t gocam.GoCamAnnotations -i tests/input/cases/gocam-33246504.txt
In this case, you will an extracted object in the output like:
extracted_object:
genes:
- HGNC:5992
- AUTO:F4/80
- HGNC:16400
- HGNC:1499
- HGNC:5992
- HGNC:5993
organisms:
- NCBITaxon:10088
- AUTO:bone%20marrow-derived%20macrophages
- AUTO:astrocytes
- AUTO:bipolar%20cells
- AUTO:vascular%20cells
- AUTO:perivascular%20MPs
gene_organisms:
- gene: HGNC:5992
organism: AUTO:mononuclear%20phagocytes
- gene: HGNC:16400
organism: AUTO:F4/80%2B%20mononuclear%20phagocytes
- gene: HGNC:1499
organism: AUTO:F4/80%2B%20mononuclear%20phagocytes
- gene: HGNC:5992
organism: AUTO:perivascular%20macrophages
- gene: HGNC:5993
organism: AUTO:None
activities:
- GO:0006954
- AUTO:photoreceptor%20death
- AUTO:retinal%20function
gene_functions:
- gene: HGNC:5992
molecular_activity: GO:0006954
- gene: AUTO:F4/80
molecular_activity: AUTO:mononuclear%20phagocyte%20recruitment
- gene: HGNC:1499
molecular_activity: GO:0006954
- gene: HGNC:5992
molecular_activity: AUTO:immune-specific%20expression
- gene: HGNC:5993
molecular_activity: AUTO:IL-1%CE%B2%20receptor
- gene: AUTO:rytvela
molecular_activity: AUTO:IL-1R%20modulation
- gene: AUTO:Kineret
molecular_activity: AUTO:IL-1R%20antagonism
cellular_processes:
- AUTO:macrophage-induced%20photoreceptor%20death
gene_localizations:
- gene: HGNC:5992
location: AUTO:subretinal%20space
Or, we can extract information about a drug and specify which model to use:
ontogpt extract -t drug -i tests/input/cases/drug-DB00316-moa.txt --auto-prefix UNKNOWN -m gpt-4
The ontology_class
schema may be used to perform more domain-agnostic entity recognition, though this is generally incompatible with grounding.
ontogpt extract -t ontology_class -i tests/input/cases/human_urban_green_space.txt
fill
Fill in missing values.
Requires the path to a file containing a data object to be passed (as an argument) and a set of examples as an input file.
Options:
-E
,--examples FILENAME
- Path to a file of example objects.
generate-extract
Generate text and then extract knowledge from it.
This command runs two operations:
- Generate a natural language description of something
- Parse the generated description using SPIRES
For example, given a cell type such as Acinar Cell Of Salivary Gland, generate a description using GPT describing many aspects of the cell type, from its marker genes through to its function and diseases it is implicated in.
After that, use the cell-type schema to extract this into structured form.
As an optional next step use linkml-owl to generate OWL TBox axioms.
See also: iteratively-generate-extract
below.
Example:
ontogpt generate-extract -t cell_type CL:0002623
iteratively-generate-extract
Iterate through generate-extract.
This runs the generate-extract
command in iterative mode. It will traverse the extracted subtypes with each iteration, gradually building up an ontology that is entirely generated from the "latent knowledge" in the LLM.
Currently each iteration is independent so the method remains unaware as to whether it has already made a concept. Ungrounded concepts may indicate gaps in available knowledgebases.
Unlike the generate-extract
command, this command requires some additional parameters to be specified.
Please specify the input ontology and the output path.
Options:
-r
,--ontology TEXT
- Ontology to use. Use the OAK selector format, e.g., "sqlite:obo:cl"-M
,--max-iterations INTEGER
- Maximum number of iterations.-I
,--iteration-slot TEXT
- Slots to iterate over.-D
,--db TEXT
- Path to the output, in YAML format.--clear
/--no-clear
- If set, clear the output database before starting. Defaults to False.
Example:
ontogpt iteratively-generate-extract -t cell_type -r sqlite:obo:cl -D cells.yaml CL:0002623
list-models
List all available models.
Example:
ontogpt list-models
list-templates
List the templates.
Alternatively, run make list_templates
.
Example:
ontogpt list-templates
pubmed-annotate
Retrieve a collection of PubMed IDs for a given search, then perform extraction on them with SPIRES.
The search argument will accept all parameters known to PubMed search, such as filtering by publication year.
Works for single publications, too - set the --limit
parameter to 1 and specify a PubMed ID as the search argument.
Options:
--limit INTEGER
- Total number of citation records to return. Limited by the NCBI API.--get-pmc
/--no-get-pmc
- Attempt to parse PubMed Central full text(s) rather than abstract(s) alone.
Examples:
ontogpt pubmed-annotate -t phenotype "Takotsubo Cardiomyopathy: A Brief Review" --get-pmc --model gpt-3.5-turbo-16k --limit 3
ontogpt pubmed-annotate -t environmental_sample "33126925" --limit 1
ontogpt pubmed-annotate -t composite_disease "(earplugs) AND (("1950"[Date - Publication] : "1990"[Date - Publication]))" --limit 4
pubmed-extract
Extract knowledge from a single PubMed ID.
DEPRECATED - use pubmed-annotate
instead.
recipe-extract
Extract from a recipe on the web.
This uses the recipe
template and the recipe_scrapers package. The latter supports many different recipe web sites, so give your favorite a try.
Pass a URL as the argument, or use the -R option to specify the path to a file containing one URL per line.
Options:
-R
,--recipes-urls-file TEXT
- File with URLs to recipes to use for extraction.
Example:
ontogpt recipe-extract https://www.allrecipes.com/recipe/17445/grilled-asparagus/
In this case, expect an extracted object like the following:
extracted_object:
url: https://www.allrecipes.com/recipe/17445/grilled-asparagus/
label: Grilled Asparagus
description: Grilled asparagus with olive oil, salt, and pepper.
categories:
- AUTO:None
ingredients:
- food_item:
food: FOODON:03311349
state: fresh, spears
amount:
value: '1'
unit: UO:0010034
- food_item:
food: FOODON:03301826
amount:
value: '1'
unit: UO:0010042
- food_item:
food: AUTO:salt
state: and pepper
amount:
value: N/A
unit: AUTO:N/A
steps:
- action: AUTO:Preheat
inputs:
- food: AUTO:outdoor%20grill
state: None
outputs:
- food: AUTO:None
state: None
utensils:
- AUTO:None
- action: dbpediaont:season
inputs:
- food: FOODON:00003458
state: coated
- food: AUTO:salt
state: None
- food: FOODON:00003520
outputs:
- food: FOODON:00003458
state: seasoned
utensils:
- AUTO:None
- action: AUTO:cook
inputs:
- food: FOODON:03311349
state: None
outputs:
- food: FOODON:03311349
state: cooked
utensils:
- AUTO:grill
suggest-template
Suggest a template for a stated topic or purpose.
This uses an LLM, so all options supported by the extract
or complete
functions are usable here as well.
Example:
ontogpt suggest-template "Takotsubo Cardiomyopathy"
In this case, expect a response like this:
The most appropriate template for the topic of Takotsubo Cardiomyopathy would be the composite_disease template. This template is specifically designed for representing composite disease concepts, which fits well with the complex nature of Takotsubo Cardiomyopathy.
synonyms
Extract synonyms, based on embeddings.
The context parameter is required.
Options:
-C
,--context TEXT
- domain, e.g., anatomy, industry, health-related
Example:
ontogpt synonyms --context astronomy star
text-distance
Embed text and calculate euclidian distance between the embeddings.
The terms must be separated by an @
character.
Options:
-C
,--context TEXT
- domain, e.g., anatomy, industry, health-related
Example:
ontogpt text-distance pancakes @ syrup
text-similarity
Like text-distance
, this command compares the embeddings of input terms.
This command returns the cosine similarity of the embedding vectors.
Options:
-C
,--context TEXT
- domain, e.g., anatomy, industry, health-related
Example:
ontogpt text-similarity basketball @ basket-weaving
web-extract
Extract knowledge from web page.
Pass a URL as an argument and OntoGPT will use the SPIRES method to extract information based on the specified template.
Because this depends upon scraping a page, results may vary depending on a site's complexity and structure.
Even relatively short pages may exceed a model's context size, so larger context models may be necessary.
Example:
ontogpt web-extract -t reaction.Reaction -m gpt-3.5-turbo-16k https://www.scienceofcooking.com/maillard_reaction.htm
wikipedia-extract
Extract knowledge from a Wikipedia page.
Pass an article title as an argument and OntoGPT will use the SPIRES method to extract information based on the specified template.
Even relatively short pages may exceed a model's context size, so larger context models may be necessary.
Example:
ontogpt wikipedia-extract -t mendelian_disease.MendelianDisease -m gpt-3.5-turbo-16k "Cartilage–hair hypoplasia"
wikipedia-search
Extract knowledge from Wikipedia pages based on a search.
Pass a search phrase as an argument and OntoGPT will use the SPIRES method to extract information based on the specified template.
Even relatively short pages may exceed a model's context size, so larger context models may be necessary.
Example:
ontogpt wikipedia-search -t biological_process -m gpt-3.5-turbo-16k "digestion"