# ontoRunNER This is a wrapper project around the following named entity recognition (NER) tools: - [OGER](https://github.com/OntoGene/OGER). - [spaCy](https://spacy.io) - using [sciSpaCy](https://scispacy.apps.allenai.org) pipeline namely the CRAFT corpus (`en_ner_craft_md`) used by default. Others can be used as listed [here](https://github.com/allenai/scispacy#available-models) ## Setup To setup `ontoRunNER`, ### For developers After cloning the repository: ``` poetry install ``` ### For users Activate your virtual environment (`poetry` or `conda` or `venv` etc.) ``` pip install ontorunner pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz python -m spacy download en_core_web_sm ``` > Note: If you're using poetry outside of `poetry shell`, precede all CLI commands with a `poetry run`. ## Ontology to [KGX](https://github.com/biolink/kgx) TSV Generate `nodes.tsv` and `edges.tsv` files from your OBO JSON ontology file, ### CLI ``` onto-util json2tsv -i ontology.json -o output ``` ### Python ``` from ontorunner.pre.util import json2tsv json2tsv('ontology.json', 'output.tsv') ``` ## Preparing term-list Generate termlist from the `output_nodes.tsv` generated in the previous step. ### CLI The conversion can be done as follows, ``` onto-util prepare-termlist -i output_nodes.tsv -o termlist.tsv ``` ### Python ``` from ontorunner.pre.util import prepare_termlist prepare_termlist('output_nodes.tsv', 'termlist.tsv') ``` ## Running OGER. > **Note:** Make sure the output directory [`data/output`](https://github.com/monarch-initiative/ontorunner/tree/master/data/output) is empty before every run. You can run OGER against a text document as follows, ### CLI ``` ontoger run -c abstract.txt -t termlist.tsv -o out.json -f bioc_json ``` > **Note:** This command is just to demonstrate how to run OGER. > For more use cases, [here](https://github.com/OntoGene/OGER/wiki/run) > is the reference to the OGER documentation. ### Running OGER using a 'settings.ini' file You can run `OGER` using a 'settings' file as follows, ### CLI ``` ontoger run -s settings.ini ``` ### Python ``` from ontorunner import oger_module oger_module.run_oger(settings=settingsFile) ``` > The [settings.ini](https://github.com/monarch-initiative/ontorunner/blob/master/ontorunner/settings.ini) file provides all relevant arguments to OGER. More information on the parameter list could be found at the [OGER GitHub](https://github.com/OntoGene/OGER/wiki/run#parameter-index) There will be two output tsv files generated: - An output whose filename is exactly similar to the input filename (say `docs.tsv`) - This is the pure output from `OGER` - Another file named `docs_ontoRunNER.tsv` which contains more results because it is the outcome of some postprocessing. ## Running spaCy. For now, `spaCy` (within `ontoRunNER`) can only process documents prepared as a tsv (or multiple tsv) file(s) with two columns: - id - text By default, these files are expected to be in the [`data/input`](https://github.com/monarch-initiative/ontorunner/tree/master/data/input) directory. If not, then the user can provide the path of the data directory using the `-d` or `--data-dir` parameter. The `settings.ini` file used in `OGER` above is also used by `spaCy` for some of its parameters. ### CLI ``` ontospacy run ``` ### Python ``` from ontorunner import spacy_module spacy_module.run_spacy() ``` There will be two output tsv files generated: - `ontology_ontoRunNER.tsv`: This file is the output with the ontology termlists (generated above) as the dictionary for entity recognition. - `umls_ontoRunNER.tsv`: This file is the output derived by using `sciSpaCY`'s `EntityLinker`. By default the linker is `umls` but you can provide others as listed [here](https://github.com/allenai/scispacy#entitylinker). ## Visualization using `spaCy.displaCy`. SpaCy visualizers are also available through ontoRunNER! There are two types of visualizers offered by displaCy: - Displays dependencies - Highlights entities Both are rendered using one command - `run-viz`. ### CLI ``` ontospacy viz -t **some text** ``` ### Python ``` from ontorunner import spacy_module test = """A bacterial isolate, designated \ strain SZ,was obtained from noncontaminated creek \ sediment microcosms based on its ability to derive \ energy from acetate oxidation coupled to tetrachloroethene.""" spacy_module.run_viz(text) ``` ### Dependency display using displaCy. ![Sentence Dependency](../../data/images/example_dependency.svg) [Click here for larger view](https://raw.githubusercontent.com/monarch-initiative/ontorunner/master/data/images/example_dependency.svg) ### Entity display using displaCy. displaCy
AbacterialTAXON isolate, designated strainPATO:0001034 [ strain ] SZ,was obtained from noncontaminated creekENVO:00000023 [ stream ] sedimentENVO:00002007 [ sediment ] microcosmsTAXON based on its ability to derive energyPATO:0001021 [ energy ] from acetateCHEBI oxidationMOP:0000568 [ oxidation ] coupled to tetrachloroetheneCHEBI:17300 [ tetrachloroethene ].