ontoRunNER
This is a wrapper project around the following named entity recognition (NER) tools:
OGER.
Setup
To set up ontoRunNER:
For users
Activate your virtual environment (poetry, conda, venv, etc.)
pip install ontorunner
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz
python -m spacy download en_core_web_sm
Note: If you’re using poetry outside of poetry shell, precede all CLI commands with poetry run.
Ontology to KGX TSV
Generate nodes.tsv and edges.tsv files from your OBO JSON ontology file:
CLI
onto-util json2tsv -i ontology.json -o output
Python
from ontorunner.pre.util import json2tsv
json2tsv('ontology.json', 'output.tsv')
Preparing term-list
Generate a term list from the output_nodes.tsv file produced in the previous step.
CLI
The conversion can be done as follows:
onto-util prepare-termlist -i output_nodes.tsv -o termlist.tsv
Python
from ontorunner.pre.util import prepare_termlist
prepare_termlist('output_nodes.tsv', 'termlist.tsv')
Running OGER
Note: Make sure the output directory data/output is empty before every run.
You can run OGER against a text document as follows:
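One way to empty the output directory before a run is sketched below. The helper name clear_directory is hypothetical and is not part of ontoRunNER; it only uses the Python standard library and assumes that everything inside the directory can safely be deleted.

```python
import shutil
from pathlib import Path


def clear_directory(directory: str) -> int:
    """Delete every entry inside `directory` and return how many were removed.

    Hypothetical helper (not part of ontoRunNER). The directory itself is
    kept, and it is created first if it does not exist yet.
    """
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    removed = 0
    for entry in path.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a sub-directory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed


if __name__ == "__main__":
    # data/output is the output directory named in the note above.
    print(clear_directory("data/output"), "entries removed")
```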
CLI
ontoger run -c abstract.txt -t termlist.tsv -o out.json -f bioc_json
Note: This command only demonstrates how to run OGER. For more use cases, refer to the OGER documentation.
Running OGER using a ‘settings.ini’ file
You can run OGER using a ‘settings.ini’ file as follows:
CLI
ontoger run -s settings.ini
Python
from ontorunner import oger_module
settingsFile = 'settings.ini'  # path to your settings file (assumed to be passed as a plain path)
oger_module.run_oger(settings=settingsFile)
The settings.ini file provides all relevant arguments to OGER. More information on the parameter list can be found in the OGER GitHub repository.
Two output TSV files are generated:
A file with the same name as the input file (say docs.tsv). This is the unmodified output from OGER.
Another file named docs_ontoRunNER.tsv, which contains more results because it is the outcome of some postprocessing.
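To see how much the postprocessing step adds, the row counts of the two files can be compared. The sketch below uses only the standard library; count_rows is a hypothetical helper, and the file locations in the usage comment assume that both files are written to data/output. Column layouts are not assumed, so a header row (if present) is counted like any other row.

```python
import csv


def count_rows(tsv_path: str) -> int:
    """Count the non-empty rows of a TSV file (hypothetical helper).

    Quote characters are treated as literal text so that quotes inside
    matched terms cannot merge several lines into one row.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return sum(1 for row in reader if row)


# Usage (paths are assumptions, adjust them to your own run):
# raw = count_rows("data/output/docs.tsv")
# post = count_rows("data/output/docs_ontoRunNER.tsv")
# print(f"OGER rows: {raw}, postprocessed rows: {post}")
```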
Running spaCy
For now, spaCy (within ontoRunNER) can only process documents prepared as one or more TSV files with two columns:
id
text
By default, these files are expected to be in the data/input directory. If not, then the user can provide the path of the data directory using the -d or --data-dir parameter.
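A minimal sketch of preparing such an input file with the standard library follows. The helper name write_input_tsv and the file name docs.tsv are hypothetical, and writing id and text as a header row is an assumption about the expected layout; check it against your own working input files.

```python
import csv
from pathlib import Path


def write_input_tsv(documents: dict, path: str) -> None:
    """Write {id: text} pairs to a two-column TSV file (hypothetical helper).

    Assumption: the file starts with an `id`/`text` header row.
    """
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["id", "text"])
        for doc_id, text in documents.items():
            # Collapse all whitespace runs (tabs and newlines included) to
            # single spaces so that each document stays on one row.
            writer.writerow([doc_id, " ".join(text.split())])


if __name__ == "__main__":
    write_input_tsv(
        {"doc1": "A bacterial isolate, designated strain SZ."},
        "data/input/docs.tsv",
    )
```

Collapsing whitespace keeps stray tabs in the source text from being read as extra columns.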
The settings.ini file used in OGER above is also used by spaCy for some of its parameters.
CLI
ontospacy run
Python
from ontorunner import spacy_module
spacy_module.run_spacy()
Two output TSV files are generated:
ontology_ontoRunNER.tsv: This file is the output with the ontology termlists (generated above) as the dictionary for entity recognition.
umls_ontoRunNER.tsv: This file is the output derived by using scispaCy’s EntityLinker. By default the linker is umls, but you can provide others as listed in the scispaCy documentation.
Visualization using spaCy.displaCy
SpaCy visualizers are also available through ontoRunNER! There are two types of visualizers offered by displaCy:
Displays dependencies
Highlights entities
Both are rendered using one command - run-viz.
CLI
ontospacy viz -t "some text"
Python
from ontorunner import spacy_module
text = """A bacterial isolate, designated \
strain SZ, was obtained from noncontaminated creek \
sediment microcosms based on its ability to derive \
energy from acetate oxidation coupled to tetrachloroethene."""
spacy_module.run_viz(text)