ontoRunNER

This is a wrapper project around the following named entity recognition (NER) tools:

  • OGER

  • spaCy

    • uses a sciSpaCy pipeline; the model trained on the CRAFT corpus (en_ner_craft_md) is the default. Other models listed in the sciSpaCy documentation can be used instead.

Setup

To set up ontoRunNER:

For developers

After cloning the repository:

poetry install

For users

Activate your virtual environment (poetry, conda, venv, etc.), then run:

pip install ontorunner

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz

python -m spacy download en_core_web_sm

Note: If you’re using poetry outside of a poetry shell, prefix all CLI commands with poetry run.

Ontology to KGX TSV

Generate nodes.tsv and edges.tsv files from your OBO JSON ontology file.

CLI

onto-util json2tsv -i ontology.json -o output

Python

from ontorunner.pre.util import json2tsv
json2tsv('ontology.json', 'output.tsv')
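To make the shape of this conversion concrete, here is a minimal sketch that uses only the Python standard library. It is not ontoRunNER's implementation. It assumes the input follows the OBO Graphs JSON layout (a top-level graphs list whose entries hold nodes with id/lbl and edges with sub/pred/obj), and the function name obo_json_to_tsv and the column headers are illustrative choices, not the columns that json2tsv actually writes.

```python
import csv
import json
from pathlib import Path


def obo_json_to_tsv(json_path, output_prefix):
    """Write <prefix>_nodes.tsv and <prefix>_edges.tsv from an OBO Graphs JSON file.

    Hypothetical helper for illustration; column names are assumptions.
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    nodes_path = Path(f"{output_prefix}_nodes.tsv")
    edges_path = Path(f"{output_prefix}_edges.tsv")

    with nodes_path.open("w", newline="", encoding="utf-8") as nodes_file, \
            edges_path.open("w", newline="", encoding="utf-8") as edges_file:
        node_writer = csv.writer(nodes_file, delimiter="\t")
        edge_writer = csv.writer(edges_file, delimiter="\t")
        node_writer.writerow(["id", "name"])
        edge_writer.writerow(["subject", "predicate", "object"])

        for graph in data.get("graphs", []):
            for node in graph.get("nodes", []):
                # "lbl" is optional in OBO Graphs JSON, so fall back to an empty name
                node_writer.writerow([node["id"], node.get("lbl", "")])
            for edge in graph.get("edges", []):
                edge_writer.writerow([edge["sub"], edge["pred"], edge["obj"]])

    return nodes_path, edges_path
```

Calling obo_json_to_tsv('ontology.json', 'output') would produce output_nodes.tsv and output_edges.tsv, following the same prefix convention as the CLI example.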

Preparing term-list

Generate a term list from the output_nodes.tsv file produced in the previous step.

CLI

The conversion can be done as follows:

onto-util prepare-termlist -i output_nodes.tsv -o termlist.tsv

Python

from ontorunner.pre.util import prepare_termlist
prepare_termlist('output_nodes.tsv', 'termlist.tsv')
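For readers curious what a term list might look like, the sketch below builds one from a nodes file. It rests on several assumptions that should be checked against the OGER documentation: that the term list is a six-column TSV (concept ID, resource, original ID, term, preferred term, entity type), that CUI-less is an acceptable placeholder for the concept ID, and that the nodes file has id and name columns. The function nodes_to_termlist is hypothetical and is not what prepare_termlist does internally.

```python
import csv


def nodes_to_termlist(nodes_tsv, termlist_tsv, resource="ontology"):
    """Write one six-column term-list row per labelled node; return the row count.

    Hypothetical helper: the column layout and the "CUI-less" placeholder are assumptions.
    """
    count = 0
    with open(nodes_tsv, newline="", encoding="utf-8") as src, \
            open(termlist_tsv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            name = (row.get("name") or "").strip()
            if not name:
                continue  # a node without a label cannot be matched in text
            # Use the CURIE prefix (e.g. ENVO) as a rough entity type
            entity_type = row["id"].split(":")[0]
            writer.writerow(["CUI-less", resource, row["id"], name, name, entity_type])
            count += 1
    return count
```

Nodes without a label are skipped, since a dictionary-based recognizer has nothing to match for them.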

Running OGER.

Note: Make sure the output directory data/output is empty before every run.
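One way to empty the output directory before a run is sketched below. This helper is not part of ontoRunNER; clear_directory is a hypothetical name, and it permanently deletes everything inside the directory it is given, so double-check the path before using it.

```python
import shutil
from pathlib import Path


def clear_directory(directory):
    """Delete everything inside `directory` but keep the directory itself.

    Creates the directory if it does not exist. Returns the number of
    top-level entries that were removed.
    """
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    removed = 0
    for child in path.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)  # remove a sub-directory and its contents
        else:
            child.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed
```

For example, calling clear_directory('data/output') before each run leaves an empty data/output directory in place.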

You can run OGER against a text document as follows:

CLI

ontoger run -c abstract.txt -t termlist.tsv -o out.json -f bioc_json

Note: This command only demonstrates how to run OGER. For more use cases, refer to the OGER documentation.

Running OGER using a ‘settings.ini’ file

You can run OGER using a ‘settings’ file as follows:

CLI

ontoger run -s settings.ini

Python

from ontorunner import oger_module

settings_file = 'settings.ini'  # path to your OGER settings file
oger_module.run_oger(settings=settings_file)

The settings.ini file provides all relevant arguments to OGER. More information on the parameters can be found in the OGER GitHub repository.
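Before starting a run, it can be useful to confirm which directories a settings file points at. The sketch below reads an INI file with Python's configparser. The section name Main and the keys input-directory and output-directory are assumptions made for illustration; check your own settings.ini and the OGER documentation for the actual names.

```python
import configparser


def read_run_directories(settings_path, section="Main"):
    """Return (input_directory, output_directory) from an INI settings file.

    The section and key names are assumptions; missing values come back as None.
    """
    # Disable interpolation so values containing "%" are read verbatim
    parser = configparser.ConfigParser(interpolation=None)
    with open(settings_path, encoding="utf-8") as handle:
        parser.read_file(handle)
    input_dir = parser.get(section, "input-directory", fallback=None)
    output_dir = parser.get(section, "output-directory", fallback=None)
    return input_dir, output_dir
```

A None in the returned pair signals that the section or key was not found under the assumed name.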

Two output TSV files are generated:

  • A file with the same name as the input file (say, docs.tsv)

    • This is the raw output from OGER.

  • Another file named docs_ontoRunNER.tsv, which contains additional results produced by postprocessing.

Running spaCy.

For now, spaCy (within ontoRunNER) can only process documents prepared as one or more TSV files with two columns:

  • id

  • text

By default, these files are expected to be in the data/input directory. If not, then the user can provide the path of the data directory using the -d or --data-dir parameter.
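A two-column input file of this kind can be prepared with the standard library, as in the sketch below. The helper write_input_tsv is hypothetical, and writing a header row named id and text is an assumption based on the column names listed above.

```python
import csv
from pathlib import Path


def write_input_tsv(documents, path):
    """Write (id, text) pairs to a two-column TSV with an id/text header row."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["id", "text"])
        for doc_id, text in documents:
            # Collapse tabs and newlines so each document stays on one row
            writer.writerow([doc_id, " ".join(text.split())])
    return path
```

For instance, write_input_tsv([('doc1', 'some text')], 'data/input/docs.tsv') places the file in the default input directory.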

The settings.ini file used in OGER above is also used by spaCy for some of its parameters.

CLI

ontospacy run

Python

from ontorunner import spacy_module
spacy_module.run_spacy()

Two output TSV files are generated:

  • ontology_ontoRunNER.tsv: This file is the output with the ontology termlists (generated above) as the dictionary for entity recognition.

  • umls_ontoRunNER.tsv: This file is the output derived by using sciSpaCy’s EntityLinker. The default linker is umls, but other linkers listed in the sciSpaCy documentation can be provided instead.

Visualization using spaCy.displaCy.

spaCy visualizers are also available through ontoRunNER. displaCy offers two types of visualizers:

  • Displays dependencies

  • Highlights entities

Both are rendered using one command: run-viz.

CLI

ontospacy viz -t "some text"

Python

from ontorunner import spacy_module

text = """A bacterial isolate, designated \
strain SZ,was obtained from noncontaminated creek \
sediment microcosms based on its ability to derive \
energy from acetate oxidation coupled to tetrachloroethene."""

spacy_module.run_viz(text)

Dependency display using displaCy.

[Figure: sentence dependency parse rendered by displaCy]

Entity display using displaCy.

A bacterial [TAXON] isolate, designated strain [PATO:0001034, strain] SZ,was obtained from noncontaminated creek [ENVO:00000023, stream] sediment [ENVO:00002007, sediment] microcosms [TAXON] based on its ability to derive energy [PATO:0001021, energy] from acetate [CHEBI] oxidation [MOP:0000568, oxidation] coupled to tetrachloroethene [CHEBI:17300, tetrachloroethene].