ontoRunNER
This is a wrapper project around the following named entity recognition (NER) tools:
OGER.
Setup
To set up ontoRunNER:
For users
Activate your virtual environment (poetry, conda, venv, etc.).
pip install ontorunner
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz
python -m spacy download en_core_web_sm
Note: If you're using poetry outside of poetry shell, precede all CLI commands with poetry run.
Ontology to KGX TSV
Generate nodes.tsv and edges.tsv files from your OBO JSON ontology file:
CLI
onto-util json2tsv -i ontology.json -o output
Python
from ontorunner.pre.util import json2tsv
json2tsv('ontology.json', 'output.tsv')
Preparing term-list
Generate the term list from the output_nodes.tsv file produced in the previous step.
CLI
The conversion can be done as follows:
onto-util prepare-termlist -i output_nodes.tsv -o termlist.tsv
Python
from ontorunner.pre.util import prepare_termlist
prepare_termlist('output_nodes.tsv', 'termlist.tsv')
Running OGER.
Note: Make sure the output directory data/output is empty before every run.
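One way to satisfy the note above is to empty the directory from Python before each run. This is a minimal sketch, not part of ontoRunNER: the helper name clear_output_dir is hypothetical, and the path data/output is taken from the note, so adjust it if your settings point elsewhere.

```python
import shutil
from pathlib import Path


def clear_output_dir(output_dir: Path) -> None:
    """Delete everything inside output_dir, creating the directory if it is missing.

    Hypothetical helper for illustration; it is not provided by ontoRunNER.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    for entry in output_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove subdirectories recursively
        else:
            entry.unlink()  # remove regular files and symlinks


if __name__ == "__main__":
    # Path taken from the note above; change it if your output goes elsewhere.
    clear_output_dir(Path("data/output"))
```

The directory itself is kept, so later runs can still write into it.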
You can run OGER against a text document as follows:
CLI
ontoger run -c abstract.txt -t termlist.tsv -o out.json -f bioc_json
Note: This command only demonstrates how to run OGER. For more use cases, refer to the OGER documentation.
Running OGER using a ‘settings.ini’ file
You can run OGER using a settings file as follows:
CLI
ontoger run -s settings.ini
Python
from ontorunner import oger_module
oger_module.run_oger(settings=settingsFile)
The settings.ini file provides all relevant arguments to OGER. More information on the parameter list can be found in the OGER GitHub repository.
There will be two output tsv files generated:
An output file whose name matches the input filename (say docs.tsv). This is the pure output from OGER.
Another file named docs_ontoRunNER.tsv, which contains more results because it is the outcome of some postprocessing.
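To check the two outputs side by side, a small script can derive the post-processed filename from the raw one and compare row counts. This is only a sketch: the helper names are hypothetical, the files are assumed to be tab-separated, and the data/output location is an assumption based on the note earlier in this section.

```python
import csv
from pathlib import Path


def postprocessed_path(raw_output: Path) -> Path:
    """Derive the post-processed filename, e.g. docs.tsv -> docs_ontoRunNER.tsv."""
    return raw_output.with_name(f"{raw_output.stem}_ontoRunNER{raw_output.suffix}")


def count_rows(path: Path) -> int:
    """Count non-empty rows in a tab-separated file (any header row is counted too)."""
    with path.open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as plain text, so stray quotes
        # in recognized terms cannot merge several rows into one.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return sum(1 for row in reader if row)


if __name__ == "__main__":
    raw = Path("data/output/docs.tsv")  # assumed location and filename
    post = postprocessed_path(raw)
    if raw.exists() and post.exists():
        print(f"{raw.name}: {count_rows(raw)} rows")
        print(f"{post.name}: {count_rows(post)} rows")
```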
Running spaCy.
For now, spaCy (within ontoRunNER) can only process documents prepared as one or more tsv files with two columns:
id
text
By default, these files are expected to be in the data/input directory. If not, the user can provide the path of the data directory using the -d or --data-dir parameter.
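An input file in the two-column layout described above could be prepared with Python's csv module. This is a hedged sketch: write_input_tsv is a hypothetical helper, the header row id / text is assumed from the column names given above, and docs.tsv is just an example filename.

```python
import csv
from pathlib import Path
from typing import Mapping


def write_input_tsv(docs: Mapping[str, str], path: Path) -> None:
    """Write documents to a tab-separated file with the columns id and text.

    Hypothetical helper for illustration; it is not provided by ontoRunNER.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["id", "text"])  # assumed header row
        for doc_id, text in docs.items():
            # Collapse tabs and newlines so each document stays on one row.
            writer.writerow([doc_id, " ".join(text.split())])


if __name__ == "__main__":
    sample = {"doc1": "A bacterial isolate, designated strain SZ, was obtained from creek sediment."}
    # data/input is the default directory mentioned above.
    write_input_tsv(sample, Path("data/input/docs.tsv"))
```

Collapsing whitespace keeps embedded tabs or line breaks from splitting one document across several columns or rows.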
The settings.ini file used in OGER above is also used by spaCy for some of its parameters.
CLI
ontospacy run
Python
from ontorunner import spacy_module
spacy_module.run_spacy()
There will be two output tsv files generated:
ontology_ontoRunNER.tsv: This file is the output with the ontology termlists (generated above) as the dictionary for entity recognition.
umls_ontoRunNER.tsv: This file is the output derived by using scispaCy's EntityLinker. By default the linker is umls, but you can provide any other linker that scispaCy supports.
Visualization using spaCy.displaCy.
SpaCy visualizers are also available through ontoRunNER! There are two types of visualizers offered by displaCy:
Displays dependencies
Highlights entities
Both are rendered with a single command (viz on the CLI, run_viz in Python).
CLI
ontospacy viz -t "some text"
Python
from ontorunner import spacy_module
# Sample text to visualize; the variable name must match the one passed to run_viz.
text = """A bacterial isolate, designated \
strain SZ, was obtained from noncontaminated creek \
sediment microcosms based on its ability to derive \
energy from acetate oxidation coupled to tetrachloroethene."""
spacy_module.run_viz(text)