ontoRunNER

This is a wrapper project around the following named entity recognition (NER) tools:

OGER
spaCy
- uses the sciSpaCy pipeline; the model trained on the CRAFT corpus (en_ner_craft_md) is the default. Other sciSpaCy models can be used as well.

Setup

To set up ontoRunNER, follow the instructions that match your use case.

For developers

After cloning the repository:

poetry install

For users

Activate your virtual environment (poetry, conda, venv, etc.), then run:

pip install ontorunner

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz

python -m spacy download en_core_web_sm

Note: If you are using poetry outside of a poetry shell, precede all CLI commands with poetry run.
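To confirm that the install steps above took effect, a minimal sketch like the following can be run inside the activated environment. It assumes that the three packages installed above are importable under the names ontorunner, en_ner_craft_md and en_core_web_sm; the helper missing_packages is hypothetical and not part of ontoRunNER.

```python
from importlib.util import find_spec

# Package names taken from the install commands above (assumed import names).
REQUIRED = ("ontorunner", "en_ner_craft_md", "en_core_web_sm")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Prints the packages that still need to be installed, if any.
print("Missing:", missing_packages() or "nothing")
```

An empty result suggests the environment is ready; any listed name points at the install command that still needs to be run.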

Ontology to KGX TSV

Generate nodes and edges TSV files from your OBO JSON ontology file:

CLI

onto-util json2tsv -i ontology.json -o output

Python

from ontorunner.pre.util import json2tsv
json2tsv('ontology.json', 'output.tsv')

Preparing term-list

Generate a term list from the output_nodes.tsv file produced in the previous step.

CLI

The conversion can be done as follows:

onto-util prepare-termlist -i output_nodes.tsv -o termlist.tsv

Python

from ontorunner.pre.util import prepare_termlist
prepare_termlist('output_nodes.tsv', 'termlist.tsv')

Running OGER.

Note: Make sure the output directory data/output is empty before every run.
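One way to satisfy the note above is to clear the directory from Python before each run. This is only a sketch: the helper empty_directory is hypothetical and not part of ontoRunNER, and it permanently deletes everything inside the directory it is given.

```python
import shutil
from pathlib import Path


def empty_directory(directory):
    """Delete everything inside `directory`, creating it if it is missing."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    for child in path.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)  # remove sub-directories recursively
        else:
            child.unlink()  # remove files and symlinks
    return path


# Example (destructive): empty_directory("data/output")
```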

You can run OGER against a text document as follows:

CLI

ontoger run -c abstract.txt -t termlist.tsv -o out.json -f bioc_json

Note: This command only demonstrates how to run OGER. For more use cases, refer to the OGER documentation.

Running OGER using a ‘settings.ini’ file

You can run OGER using a ‘settings’ file as follows:

CLI

ontoger run -s settings.ini

Python

from ontorunner import oger_module
settings_file = 'settings.ini'  # path to your settings file
oger_module.run_oger(settings=settings_file)

The settings.ini file provides all relevant arguments to OGER. More information on the parameter list can be found in the OGER GitHub repository.
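Before a run it can help to check what a settings file actually contains. The sketch below uses only Python's standard configparser and makes no assumptions about which sections or keys OGER expects; load_settings is a hypothetical helper, not part of ontoRunNER or OGER.

```python
import configparser


def load_settings(path):
    """Read an INI file into a plain {section: {key: value}} dictionary."""
    # Interpolation is disabled so values containing '%' are returned verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}


# Example: print every section and key found in a settings file.
# for section, values in load_settings("settings.ini").items():
#     print(section, values)
```

Note that configparser lower-cases key names by default, so the printed keys may differ in case from the file.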

There will be two output tsv files generated:

An output file with the same name as the input file (say docs.tsv)
- This is the raw output from OGER.
Another file named docs_ontoRunNER.tsv, which contains additional results produced by postprocessing.
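To see how the post-processed file differs from the raw OGER output, the two files can be compared with the standard csv module. This sketch assumes that both files are tab-separated and start with a header row, which may not hold for every export configuration; read_tsv and compare_outputs are hypothetical helpers.

```python
import csv


def read_tsv(path):
    """Return (header, rows) for a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # assumed: first line is a header
        return header, list(reader)


def compare_outputs(raw_path, processed_path):
    """Summarise how the post-processed file differs from the raw output."""
    raw_header, raw_rows = read_tsv(raw_path)
    new_header, new_rows = read_tsv(processed_path)
    return {
        "raw_rows": len(raw_rows),
        "processed_rows": len(new_rows),
        "added_columns": [c for c in new_header if c not in raw_header],
    }


# Example: compare_outputs("data/output/docs.tsv", "data/output/docs_ontoRunNER.tsv")
```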

Running spaCy.

For now, spaCy (within ontoRunNER) can only process documents prepared as one or more TSV files with two columns:

id
text

By default, these files are expected to be in the data/input directory. If not, then the user can provide the path of the data directory using the -d or --data-dir parameter.
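The two-column input described above can be produced with the standard csv module. This is a minimal sketch, assuming the file starts with a header row naming the id and text columns; write_input_tsv is a hypothetical helper, not part of ontoRunNER.

```python
import csv
from pathlib import Path


def write_input_tsv(documents, path):
    """Write (id, text) pairs to a two-column, tab-separated file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["id", "text"])  # assumed header row
        for doc_id, text in documents:
            # Collapse tabs and newlines so each document stays on one line.
            writer.writerow([doc_id, " ".join(text.split())])
    return path


# Example: write_input_tsv([("doc1", "Some abstract text.")], "data/input/docs.tsv")
```

If the files are written somewhere other than data/input, pass that location with -d or --data-dir as described above.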

The settings.ini file used in OGER above is also used by spaCy for some of its parameters.

CLI

ontospacy run

Python

from ontorunner import spacy_module
spacy_module.run_spacy()

There will be two output tsv files generated:

ontology_ontoRunNER.tsv: This file is the output produced with the ontology term lists (generated above) as the dictionary for entity recognition.
umls_ontoRunNER.tsv: This file is the output produced by sciSpaCy’s EntityLinker. By default the linker is umls, but other sciSpaCy linkers can be used instead.

Visualization using `spaCy.displaCy`.

spaCy visualizers are also available through ontoRunNER. displaCy offers two types of visualizers:

Displays dependencies
Highlights entities

Both are rendered using one command - run-viz.

CLI

ontospacy viz -t "some text"

Python

from ontorunner import spacy_module

text = """A bacterial isolate, designated \
strain SZ,was obtained from noncontaminated creek \
sediment microcosms based on its ability to derive \
energy from acetate oxidation coupled to tetrachloroethene."""

spacy_module.run_viz(text)

Dependency display using displaCy.

Sentence Dependency


Entity display using displaCy.

A [bacterial: TAXON] isolate, designated [strain: PATO:0001034 (strain)] SZ,was obtained from noncontaminated [creek: ENVO:00000023 (stream)] [sediment: ENVO:00002007 (sediment)] [microcosms: TAXON] based on its ability to derive [energy: PATO:0001021 (energy)] from [acetate: CHEBI] [oxidation: MOP:0000568 (oxidation)] coupled to [tetrachloroethene: CHEBI:17300 (tetrachloroethene)].
