OntoGPT Functions
All OntoGPT functions are run from the command line.
Precede all commands with ontogpt.
To see the full list of available commands, run this:
ontogpt --help
Basic Parameters
To see verbose output, run:
ontogpt -v
The options -vv and -vvv will enable progressively more verbose output.
cache-db
Use the option --cache-db to specify a path to a SQLite database to cache the prompt-completion results.
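Example (a hypothetical invocation; the cache path, template, and input path are placeholders, and the option is given before the subcommand, as with the verbosity flags):
ontogpt --cache-db my_cache.db extract -t gocam -i input.txt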
skip-annotator
Use the option --skip-annotator to skip one or more annotators (e.g., --skip-annotator gilda).
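Example (a hypothetical invocation; the template and input path are placeholders):
ontogpt --skip-annotator gilda extract -t gocam -i input.txt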
Common Parameters
The following options are available for most functions unless stated otherwise.
inputfile
Use the option --inputfile to specify a path to a file containing input text.
For the extract command, this may be a single file or a directory of files.
In the latter case, all files in the following formats will be assumed to be input:
".csv", ".tsv", ".txt", ".od", ".odf", ".ods", ".pdf", ".xls", ".xlsx"
The path will not be parsed recursively.
When parsing PDF files, use the --use-pdf option as described below.
When parsing tabular files like tsv or xlsx, you may specify exact columns to load with the --selectcols option as described below.
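Example (a hypothetical invocation extracting from every supported file in a directory; the template and directory are placeholders):
ontogpt extract -t food -i inputs/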
template
Use the option --template to specify a template to use. This is a required parameter.
Only the name is required, without any filename suffix, unless you are using a custom schema for the first time. In that case, provide the path to the schema, including the .yaml file extension.
To use the gocam template, for example, the parameter will be --template gocam. Or, for a custom template, the parameter may be --template custom.yaml.
target-class
Use the option --target-class to specify a class in a schema to treat as the root.
If a schema does not already specify a root class, this is required.
Alternatively, the target class can be specified as part of the --template option, like so: --template mendelian_disease.MendelianDisease
model
Use the option --model to specify the name of a large language model to be used.
For example, this may be --model gpt-4.
Consult the full list of available models with:
ontogpt list-models
Note that this will list all models available across a variety of sources; API keys will still be required to access many of them.
The name in the first column is the model name to be used with the --model option, e.g., --model mistral/mistral-medium.
recurse
Use the option --recurse to specify whether recursion should be used when parsing the schema.
Recursion is on by default. Disable it with --no-recurse.
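Example (a hypothetical invocation; the template and input path are placeholders):
ontogpt extract -t gocam -i input.txt --no-recurse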
use-pdf
Use the option --use-pdf to specify whether to extract text from a PDF.
This is done through the pymupdf package.
Example:
ontogpt extract -i temp/test1.pdf -m gpt-4o --use-pdf -t composite_disease --output-format json -o test.json
output
Use the option --output to provide a path to write an output file to.
If this path is not provided, OntoGPT will write to stdout.
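Example (a hypothetical invocation; the paths and template are placeholders):
ontogpt extract -t gocam -i input.txt -o output.yaml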
output-format
Use the option --output-format to specify the desired output file format.
This may be one of:
- html
- json
- jsonl
- md
- owl
- pickle
- turtle
- yaml
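Example (a hypothetical invocation writing Turtle output; the paths and template are placeholders):
ontogpt extract -t gocam -i input.txt --output-format turtle -o output.ttl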
auto-prefix
Use the option --auto-prefix to define a prefix to use for entities without a matching namespace.
When OntoGPT's extract functions find an entity matching the input schema but cannot ground it, the entity will still be included in the output.
By default, these entities will be assigned identifiers like AUTO:tangerine. If you ground this term to the Food Ontology, however, the entity may be FOODON:00003488 instead.
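Example (a hypothetical invocation assigning the prefix UNKNOWN to ungrounded entities; the template and input path are placeholders):
ontogpt extract -t drug -i input.txt --auto-prefix UNKNOWN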
show-prompt
Use the option --show-prompt to show all prompts constructed and sent to the model; otherwise, only the final prompt will be shown.
Showing the full prompts is off by default. Enable it by using --show-prompt and setting the verbosity level to high (-vvv).
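Example (a hypothetical invocation; the template and input path are placeholders):
ontogpt -vvv extract -t gocam -i input.txt --show-prompt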
api-base
Use the option --api-base to specify a base URL to use for your LLM API, e.g., for the Azure OpenAI API.
Note this may also be set through the runoak set-apikey command (e.g., runoak set-apikey -e azure-base).
api-version
Use the option --api-version to specify an API version to use for your LLM API, e.g., for the Azure OpenAI API.
Note this may also be set through the runoak set-apikey command (e.g., runoak set-apikey -e azure-version).
model-provider
Use the option --model-provider to specify a provider if one is not specified in the model name.
For example, if using a proxy that follows the OpenAI API format, this should be set to 'openai'.
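Example (a hypothetical invocation, assuming a proxy-hosted model named my-proxy-model; the input path is a placeholder):
ontogpt complete -m my-proxy-model --model-provider openai -i input.txt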
temperature
Use the option --temperature to specify the temperature the model should use when generating output.
The range may vary per model, but for the OpenAI API this value must range from 0 to 2, with a default of 1.0.
Higher temperatures generally correspond to greater randomness, which may be desirable.
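Example (a hypothetical invocation reducing randomness during extraction; the template and input path are placeholders):
ontogpt extract -t gocam -i input.txt --temperature 0.2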
cut-input-text
Use the option --cut-input-text to truncate all input_text in the output to 1000 characters each.
This can be useful when processing many inputs and/or full texts, as without this option the full input will be included.
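Example (a hypothetical invocation; the template and input directory are placeholders):
ontogpt extract -t gocam -i inputs/ --cut-input-text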
system-message
Use the option --system-message to pass a system-level message to the LLM.
For example, with an input file named greeting.txt containing "How are you today":
$ ontogpt complete -m llama-3 -i greeting.txt
I'm just a language model, I don't have emotions or feelings like humans do, so I don't have good or bad days. I'm always here and ready to chat with you, 24/7! How can I assist you today?
$ ontogpt complete -m llama-3 -i greeting.txt --system-message "You are very grumpy today"
*grumble grumble* I'm terrible, thanks for asking. Everything is just so... annoying. The sun is shining too brightly, the birds are singing too loudly, and the air is filled with the scent of... *sigh*... everything. Just, ugh. Can't anyone just leave me alone for once? What's it to you, anyway? *mutter mutter*
Including an instruction like the following anecdotally helps to avoid parsing failures due to the LLM getting creative with result formatting:
--system-message "You are going to extract information from text in the specified format. You will not deviate from the format; do not provide results in JSON format."
selectcols
Use the option --selectcols to specify exact columns to use when parsing tabular files as input.
Example:
ontogpt extract -t food -i inputs/myfile.tsv -o output.yaml --selectcols cheeses,grapes,flavors
Functions
categorize-mappings
Categorize a collection of mappings in the Simple Standard for Sharing Ontological Mappings (SSSOM) format.
Mappings in this format may not include their specific mapping types (e.g., broad or close mappings).
This function will attempt to apply more specific mappings wherever possible.
Example:
Using an example SSSOM mapping collection:
ontogpt categorize-mappings -i mp-hp-exact-0.0.1.sssom.tsv
Note that OntoGPT will attempt to retrieve the resources specified in the mapping (in the above example, these include HP and MP). If it cannot find a corresponding resource, it will raise an HTTP 404 error.
clinical-notes
Create mock clinical notes.
Options:
- -d, --description TEXT - a text description of the contents of the generated notes.
- --sections TEXT - sections to include in the generated notes, for example, medications, vital signs. Use multiple times for multiple sections, e.g., --sections medications --sections "vital signs"
Example:
ontogpt clinical-notes -d "middle-aged female patient with syncope and recent travel to the Amazon rainforest"
complete
Prompt completion.
Given the path to a file containing text to continue, this command will simply pass it to the model as a completion task.
Example:
The file example2.txt contains the text "Here's a good joke about high blood pressure:"
ontogpt complete example2.txt
We take no responsibility for joke quality or lack thereof.
convert
Convert output format.
Rather than a direct format translation, this function performs a full SPIRES extraction on the input file and writes the output in the specified format.
Example:
ontogpt convert -o outputfile.md -O md inputfile.yaml
convert-examples
Convert training examples from YAML.
This can be necessary for performing evaluations.
Given the path to a YAML-format input file containing training examples in a format like this:
---
examples:
- prompt: <text prompt>
completion: <text of completion of the prompt>
- prompt: <another text prompt>
completion: <text of completion of another prompt>
the function will convert it to equivalent JSON.
Example:
ontogpt convert-examples inputfile.yaml
convert-geneset
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
create-gene-set
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
diagnose
Diagnose a clinical case represented as one or more Phenopackets.
This function takes one or more file paths as arguments, where each must contain a phenopacket in JSON format.
Example inputs may be found at the Phenomics Exchange repository.
Example:
ontogpt diagnose case1.json case2.json
dump-completions
Dump cached completions.
OntoGPT saves queries and successful text completions to a SQLite database.
Caching is not currently supported for local models.
Use this function to retrieve the contents of this database.
See also: the --cache-db parameter described above.
Options:
- -m TEXT - Match string to use for filtering.
- -D TEXT - Path to the SQLite database.
Example:
ontogpt dump-completions -m "soup"
embed
Embed text.
This function will return an embedding vector for the input text or texts.
Embedding retrieval is not currently supported for local models.
Options:
- -C, --context TEXT - domain, e.g., anatomy, industry, health-related
Example:
ontogpt embed "obstreperous muskrat"
For OpenAI's "text-embedding-ada-002" model, the output will be a vector of length 1536, like so:
[-0.015013165771961212, -0.013102399185299873, -0.005333086010068655, ...]
enrichment
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
entity-similarity
Determine similarity between ontology entities by comparing their embeddings.
Options:
- -r, --ontology TEXT - name of the ontology to use. This should be an OAK adapter name such as "sqlite:obo:hp".
- --definitions/--no-definitions - Include text definitions in the text to embed. Defaults to True.
- --parents/--no-parents - Include is-a parent terms in the text to embed. Defaults to True.
- --ancestors/--no-ancestors - Include all ancestors in the text to embed. Defaults to True.
- --logical-definitions/--no-logical-definitions - Include logical definitions in the text to embed. Defaults to True.
- --autolabel/--no-autolabel - Add labels to each subject and object identifier. Defaults to True.
- --synonyms/--no-synonyms - Include synonyms in the text to embed. Defaults to True.
Example:
ontogpt entity-similarity -r sqlite:obo:hp HP:0012228 HP:0000629
In this case, the output will look like this:
subject_id subject_label object_id object_label embedding_cosine_similarity object_rank_for_subject
HP:0012228 Tension-type headache HP:0012228 Tension-type headache 0.9999999999999999 0
HP:0012228 Tension-type headache HP:0000629 Periorbital fullness 0.7755551231762359 1
HP:0000629 Periorbital fullness HP:0000629 Periorbital fullness 1.0000000000000002 0
HP:0000629 Periorbital fullness HP:0012228 Tension-type headache 0.7755551231762359 1
eval
Evaluate an extractor.
See the Evaluations section for more details.
Options:
- --num-tests INTEGER - number of test iterations to cycle through. Defaults to 5.
- --chunking/--no-chunking - If set, chunk input text, then prepare a separate prompt for each chunk. Otherwise the full input text is passed. Defaults to False.
Example:
ontogpt eval --num-tests 1 EvalCTD
eval-enrichment
This command has been deprecated. It is now available through the TALISMAN package at: https://github.com/monarch-initiative/talisman
extract
Extract knowledge from text guided by a schema.
This is OntoGPT's implementation of SPIRES.
Output includes the input text (or a truncated part), the raw completion output, the prompt (specifically, the last iteration of the prompts used), and an extracted object containing all parts identified in the input text, as well as a list of named entities and their labels.
Options:
- -S, --set-slot-value TEXT - Set slot value manually, e.g., --set-slot-value has_participant=protein
Examples:
ontogpt extract -t gocam.GoCamAnnotations -i tests/input/cases/gocam-33246504.txt
In this case, you will see an extracted object like this in the output:
extracted_object:
genes:
- HGNC:5992
- AUTO:F4/80
- HGNC:16400
- HGNC:1499
- HGNC:5992
- HGNC:5993
organisms:
- NCBITaxon:10088
- AUTO:bone%20marrow-derived%20macrophages
- AUTO:astrocytes
- AUTO:bipolar%20cells
- AUTO:vascular%20cells
- AUTO:perivascular%20MPs
gene_organisms:
- gene: HGNC:5992
organism: AUTO:mononuclear%20phagocytes
- gene: HGNC:16400
organism: AUTO:F4/80%2B%20mononuclear%20phagocytes
- gene: HGNC:1499
organism: AUTO:F4/80%2B%20mononuclear%20phagocytes
- gene: HGNC:5992
organism: AUTO:perivascular%20macrophages
- gene: HGNC:5993
organism: AUTO:None
activities:
- GO:0006954
- AUTO:photoreceptor%20death
- AUTO:retinal%20function
gene_functions:
- gene: HGNC:5992
molecular_activity: GO:0006954
- gene: AUTO:F4/80
molecular_activity: AUTO:mononuclear%20phagocyte%20recruitment
- gene: HGNC:1499
molecular_activity: GO:0006954
- gene: HGNC:5992
molecular_activity: AUTO:immune-specific%20expression
- gene: HGNC:5993
molecular_activity: AUTO:IL-1%CE%B2%20receptor
- gene: AUTO:rytvela
molecular_activity: AUTO:IL-1R%20modulation
- gene: AUTO:Kineret
molecular_activity: AUTO:IL-1R%20antagonism
cellular_processes:
- AUTO:macrophage-induced%20photoreceptor%20death
gene_localizations:
- gene: HGNC:5992
location: AUTO:subretinal%20space
Or, we can extract information about a drug and specify which model to use:
ontogpt extract -t drug -i tests/input/cases/drug-DB00316-moa.txt --auto-prefix UNKNOWN -m gpt-4
The ontology_class schema may be used to perform more domain-agnostic entity recognition, though this is generally incompatible with grounding.
ontogpt extract -t ontology_class -i tests/input/cases/human_urban_green_space.txt
fill
Fill in missing values.
It requires the path to a file containing a data object (passed as an argument) and a set of examples as an input file.
Options:
- -E, --examples FILENAME - Path to a file of example objects.
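Example (a hypothetical invocation, assuming the template is given with -t as for extract; the file names are placeholders):
ontogpt fill -t recipe -E examples.yaml object.yaml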
generate-extract
Generate text and then extract knowledge from it.
This command runs two operations:
- Generate a natural language description of something
- Parse the generated description using SPIRES
For example, given a cell type such as Acinar Cell Of Salivary Gland, generate a description using GPT describing many aspects of the cell type, from its marker genes through to its function and diseases it is implicated in.
After that, use the cell-type schema to extract this into structured form.
As an optional next step, use linkml-owl to generate OWL TBox axioms.
See also: iteratively-generate-extract below.
Example:
ontogpt generate-extract -t cell_type CL:0002623
iteratively-generate-extract
Iterate through generate-extract.
This runs the generate-extract command in iterative mode. It will traverse the extracted subtypes with each iteration, gradually building up an ontology that is entirely generated from the "latent knowledge" in the LLM.
Currently, each iteration is independent, so the method remains unaware of whether it has already created a concept. Ungrounded concepts may indicate gaps in available knowledgebases.
Unlike the generate-extract command, this command requires some additional parameters: specify the input ontology and the output path.
Options:
- -r, --ontology TEXT - Ontology to use. Use the OAK selector format, e.g., "sqlite:obo:cl".
- -M, --max-iterations INTEGER - Maximum number of iterations.
- -I, --iteration-slot TEXT - Slots to iterate over.
- -D, --db TEXT - Path to the output, in YAML format.
- --clear/--no-clear - If set, clear the output database before starting. Defaults to False.
Example:
ontogpt iteratively-generate-extract -t cell_type -r sqlite:obo:cl -D cells.yaml CL:0002623
list-models
List all available models. These will include all models litellm knows about, but may not include all local models present on your system (use ollama list to see those).
Example:
ontogpt list-models
list-templates
List the templates.
Alternatively, run make list_templates.
Example:
ontogpt list-templates
pubmed-annotate
Retrieve a collection of PubMed IDs for a given search, then perform extraction on them with SPIRES.
The search argument will accept all parameters known to PubMed search, such as filtering by publication year.
Works for single publications, too - set the --limit parameter to 1 and specify a PubMed ID as the search argument.
Options:
- --limit INTEGER - Total number of citation records to return. Limited by the NCBI API.
- --get-pmc/--no-get-pmc - Attempt to parse PubMed Central full text(s) rather than abstract(s) alone.
Examples:
ontogpt pubmed-annotate -t phenotype "Takotsubo Cardiomyopathy: A Brief Review" --get-pmc --model gpt-3.5-turbo-16k --limit 3
ontogpt pubmed-annotate -t environmental_sample "33126925" --limit 1
ontogpt pubmed-annotate -t composite_disease "(earplugs) AND (("1950"[Date - Publication] : "1990"[Date - Publication]))" --limit 4
pubmed-extract
Extract knowledge from a single PubMed ID.
DEPRECATED - use pubmed-annotate instead.
recipe-extract
Extract from a recipe on the web.
This uses the recipe template and the recipe_scrapers package. The latter supports many different recipe web sites, so give your favorite a try.
Pass a URL as the argument, or use the -R option to specify the path to a file containing one URL per line.
Options:
- -R, --recipes-urls-file TEXT - File with URLs to recipes to use for extraction.
Example:
ontogpt recipe-extract https://www.allrecipes.com/recipe/17445/grilled-asparagus/
In this case, expect an extracted object like the following:
extracted_object:
url: https://www.allrecipes.com/recipe/17445/grilled-asparagus/
label: Grilled Asparagus
description: Grilled asparagus with olive oil, salt, and pepper.
categories:
- AUTO:None
ingredients:
- food_item:
food: FOODON:03311349
state: fresh, spears
amount:
value: '1'
unit: UO:0010034
- food_item:
food: FOODON:03301826
amount:
value: '1'
unit: UO:0010042
- food_item:
food: AUTO:salt
state: and pepper
amount:
value: N/A
unit: AUTO:N/A
steps:
- action: AUTO:Preheat
inputs:
- food: AUTO:outdoor%20grill
state: None
outputs:
- food: AUTO:None
state: None
utensils:
- AUTO:None
- action: dbpediaont:season
inputs:
- food: FOODON:00003458
state: coated
- food: AUTO:salt
state: None
- food: FOODON:00003520
outputs:
- food: FOODON:00003458
state: seasoned
utensils:
- AUTO:None
- action: AUTO:cook
inputs:
- food: FOODON:03311349
state: None
outputs:
- food: FOODON:03311349
state: cooked
utensils:
- AUTO:grill
suggest-template
Suggest a template for a stated topic or purpose.
This uses an LLM, so all options supported by the extract or complete functions are usable here as well.
Example:
ontogpt suggest-template "Takotsubo Cardiomyopathy"
In this case, expect a response like this:
The most appropriate template for the topic of Takotsubo Cardiomyopathy would be the composite_disease template. This template is specifically designed for representing composite disease concepts, which fits well with the complex nature of Takotsubo Cardiomyopathy.
synonyms
Extract synonyms, based on embeddings.
The context parameter is required.
Options:
- -C, --context TEXT - domain, e.g., anatomy, industry, health-related
Example:
ontogpt synonyms --context astronomy star
text-distance
Embed text and calculate the Euclidean distance between the embeddings.
The terms must be separated by an @ character.
Options:
- -C, --context TEXT - domain, e.g., anatomy, industry, health-related
Example:
ontogpt text-distance pancakes @ syrup
text-similarity
Like text-distance, this command compares the embeddings of input terms, but it returns the cosine similarity of the embedding vectors.
Options:
- -C, --context TEXT - domain, e.g., anatomy, industry, health-related
Example:
ontogpt text-similarity basketball @ basket-weaving
web-extract
Extract knowledge from a web page.
Pass a URL as an argument and OntoGPT will use the SPIRES method to extract information based on the specified template.
Because this depends upon scraping a page, results may vary depending on a site's complexity and structure.
Even relatively short pages may exceed a model's context size, so larger context models may be necessary.
Example:
ontogpt web-extract -t reaction.Reaction -m gpt-3.5-turbo-16k https://www.scienceofcooking.com/maillard_reaction.htm
wikipedia-extract
Extract knowledge from a Wikipedia page.
Pass an article title as an argument and OntoGPT will use the SPIRES method to extract information based on the specified template.
Even relatively short pages may exceed a model's context size, so larger context models may be necessary.
Example:
ontogpt wikipedia-extract -t mendelian_disease.MendelianDisease -m gpt-3.5-turbo-16k "Cartilage–hair hypoplasia"
wikipedia-search
Extract knowledge from Wikipedia pages based on a search.
Pass a search phrase as an argument and OntoGPT will use the SPIRES method to extract information based on the specified template.
Even relatively short pages may exceed a model's context size, so larger context models may be necessary.
Example:
ontogpt wikipedia-search -t biological_process -m gpt-3.5-turbo-16k "digestion"