AutoMAxO Tutorial

Welcome to the AutoMAxO tutorial. This guide will walk you through running the AutoMAxO project step by step.

Note:

There are two options to run the whole AutoMAxO pipeline depending on your usage: 1. You can run the entire project with just one command if you would like to retrieve a certain number of articles that discuss therapeutics and extract medical actions.

python main.py --disease_name "YourDiseaseName" --max_articles_to_save 100

Replace "YourDiseaseName" with the name of the disease you want to process and adjust 100 to the desired number of articles to save.

You can run each file separately to easily modify and integrate AutoMAxO into different applications to accomplish various tasks. In this tutorial, we will be using "sickle cell" as a sample disease, but you can change it to any name you prefer. Follow the step-by-step instructions below:

Step 1: Extract MeSH Sets from Target MeSH IDs

The script, mesh_importer.py, reads a list of targeted MeSH IDs and their labels to create a formatted file with MeSH sets required for the AutoMAxO project.

Option	Meaning
--input-file	Path to the .tsv file containing MeSH target IDs
--output-file	Path to the output .tsv file with MeSH sets

For example:

python mesh_importer.py --input-file path/to/mesh_target_ids.tsv --output-file path/to/mesh_sets.tsv

The default format for mesh_target_ids in the project is as follows:

mesh.id	label
D013812	Therapeutics

Sample output for mesh_sets:

label	mesh.id	mesh set
Therapeutics	D013812	061645;D000075162;D000161;D000203;D019050

Step 2: Retrieve MeSH IDs

The script, pubmed_article_fetcher.py,start by retrieving raw data and MeSH IDs related to the treatment of your specified disease. In this stage, the script first checks whether there are already existing articles in the directory to avoid duplicate extraction. It ensures that the new articles being extracted are not the same as the ones already existing in the directory, and attempts to extract up to the maximum number specified by the user.

Option	Meaning
-d	Disease name
-m	Path to the .tsv file with MeSH sets created in Step 1
-o	Directory where retrieved articles will be saved in the form of .json files
-j	Path to the .json file to save MeSH IDs related to retrieved articles
-n	Maximum number of articles to retrieve

For example:

python pubmed_article_fetcher.py -d "sickle cell" -m ../../data/mesh_sets.tsv -o ../../data/sickle_cell/pubtator3_json/ -j ../../data/sickle_cell/selected_pmid_mesh_info.json -n 2

Note: * The disease name is not case-sensitive. * The maximum number of articles will include articles both about a specific disease and having at least one of the MeSH IDs in the MeSH sets from Step 1, meaning articles about therapeutics of a specific disease, for example, in our use case.

Step 3: Pre-process Extracted Data from JSON Files

The script, article_data_extractor.py, extracts data from JSON files and saves the text where each row represents the title and abstract of an article.

Option	Meaning
-i	Directory containing JSON files produced in Step 2
-n	Path to the .tsv file for pre-processed text data

For example:

python article_data_extractor.py -i ../../data/sickle_cell/pubtator3_json/ -n ../../data/sickle_cell/sickle_cell_no_replaced.tsv

If you would like to extract annotations from custom texts, including full PubMed Central texts, websites, or other text collections, you need to format your text into a TSV (Tab-Separated Values) file. This file must contain three columns: PMID, Title, and Abstract. The PMID column can be left empty or populated with unique identifiers if available. The format should match the output format required by the processing function.

PMID	Title	Abstract
28669521	Management of delayed hemolytic transfusion reaction ...	Transfusion remains a key treatment of sickle cell disease...

Step 4: Integrate OntoGPT

The script, ontogpt_article_processor.py, processes the text data from the pre-processed .tsv file using OntoGPT. Each row in the .tsv file is treated as an input text for OntoGPT. For more information about OntoGPT, please refer to this repo.

Option	Meaning
-i	Path to the .tsv file for pre-processed text produced in Step 3
-o	Directory containing YAML files produced by LLMs (OntoGPT)
-template	Name of the template for OntoGPT (default = 'maxo')

For example:

python ontogpt_article_processor.py -i ../../data/sickle_cell/sickle_cell_no_replaced.tsv -o ../../data/sickle_cell/ontoGPT_yaml/

Step 5: Post-process LLM Results

The script, ontology_validation.py, Post-process LLM results from separate YAML files into a single JSON file. This includes further grounding of terms to the existing ontologies and ranking extracted triplets by frequency of appearance.

Option	Meaning
-i	Directory containing YAML files produced in Step 4
-s	Path to the .json file to save MeSH IDs related to retrieved articles produced in Step 2
-n	Path to the .tsv file for pre-processed text produced in Step 3
-o	Path to the .json file to save post-processed LLM results in one file

For example:

python triplet_ranking_and_mesh_combiner.py -i ../../data/sickle_cell/ontoGPT_yaml/ -s ../../data/sickle_cell/selected_pmid_mesh_info.json -n ../../data/sickle_cell/sickle_cell_no_replaced.tsv -o ../../data/sickle_cell/detailed_post_ontoGPT.json

Step 6: Validate Annotations

The script, ontology_validation.py, updates ontology labels in a JSON file by validating and augmenting MAXO, HPO, and MONDO IDs with their corresponding labels. It processes the JSON data, generates ontology term files using runoak, and integrates these terms into the JSON file to enhance the dataset with accurate ontology information.

Option	Meaning
-json_file_path	Path to the .json file of post-processed LLM results produced in Step 5
-output_file_path	Path to the .json file, final result of validated ontologies produced by automaxo

For example:

python ontology_validation.py ../../data/sickle_cell/detailed_post_ontoGPT.json ../../data/sickle_cell/final_automaxo_results.json

Running the Script

You can run the script using the following command:

python main.py --disease_name "YourDiseaseName" --max_articles_to_save 100

Replace "YourDiseaseName" with the name of the disease you want to process and adjust 100 to the desired number of articles to save.