🔷 Denotes team, use case, or location-specific details.

⚠️ Denotes warnings and potential pitfalls; pay extra attention to these.

🧠 Background Knowledge

Papers of Note

Ongoing Meetings and Webinar Series

Tutorials and Guides

A Brief Introduction to Natural Language Processing

This section will be useful to you if you are planning to use LLMs or their output directly. If you are planning to use LLMs in an agentic setting (e.g., as a code assistant), you may find the Agentic Coding section more helpful.

  • What is natural language processing and how does it relate to LLMs?
  • Natural Language Processing (NLP) is a subfield of artificial intelligence that deals with the interaction between computers and humans in natural language, encompassing tasks such as sentiment analysis, language translation, and text summarization.
  • LLMs have become a key technology in NLP. They can accomplish many tasks previously handled by purpose-built computational pipelines, without requiring the training of new models or the assembly of carefully curated dictionaries.
  • What is a token?
  • In natural language terms, a token may be a word, part of a word, or a multi-word phrase, but in all cases it is a chunk of a larger sequence.
  • The size of a text may be measured in terms of its total length in tokens.
  • For OpenAI models, tokens are determined by the tiktoken library.
  • Andrej Karpathy has a good lecture about tokenization here: Let's build the GPT Tokenizer
    • The accompanying Colab notebook is here: Tokenization.ipynb
    • In brief: much of the weirdness we experience in working with LLMs is due to tokenization.
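To make the idea of tokens concrete, here is a toy greedy longest-match subword tokenizer. The vocabulary below is a made-up illustration; real tokenizers such as tiktoken use byte-pair encoding over vocabularies learned from data, so actual token boundaries will differ:

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword tokenization (an illustrative sketch,
    not the byte-pair encoding that production tokenizers actually use)."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Take the longest vocabulary entry starting at position i,
            # falling back to a single character if nothing matches.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens

toy_vocab = {"token", "ization", "norm", "alization"}
print(tokenize("tokenization normalization", toy_vocab))
# -> ['token', 'ization', 'norm', 'alization']
```

Note that "tokenization" is two tokens here, not one word: token counts rarely match word counts, which is why context limits and API pricing are expressed in tokens.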
  • What is NER?
  • Named Entity Recognition, or the task of finding named entities in unstructured text.
  • Sometimes known as entity extraction or entity identification.
  • Named entities don’t have a consistent definition, as they depend on the use case, but may include names, locations, concepts, events, time expressions, and more.
  • In our domains, they often include names of diseases, phenotypes, proteins, etc.
  • The task of isolating named entities from text may be considered separate from categorizing them.
  • Related tasks include:
  • What is normalization?
  • The task of replacing a named entity with a more consistent representation.
  • Sometimes known as grounding or concept replacement.
  • This often includes some lexical processing, such as:
    • lemmatization, or reducing a word to its canonical form. The word caring becomes care; the lemma of codes is code.
    • stemming, or reducing a word to a procedurally consistent root. Depending on the stemming algorithm used, the word caring may become car, and codes may become code.
  • In practice, this task requires linking a named entity to a unique identifier, and potentially replacing each instance of the entity in the text with this identifier.
    • For example, replacing every instance of the phrase “fuzzy kiwifruit” with its Food Ontology identifier, FOODON:00003578.
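A minimal sketch of that replacement step, using the "fuzzy kiwifruit" example above. The grounding table here is a hand-written illustration; in practice such lexicons are derived from ontologies like FOODON:

```python
import re

# Toy grounding table mapping entity mentions to ontology identifiers.
# Real lexicons are built from ontologies (e.g., FOODON, MONDO).
GROUNDINGS = {
    "fuzzy kiwifruit": "FOODON:00003578",
}

def normalize(text: str, groundings: dict[str, str]) -> str:
    """Replace each known entity mention with its ontology identifier."""
    for phrase, curie in groundings.items():
        text = re.sub(re.escape(phrase), curie, text, flags=re.IGNORECASE)
    return text

print(normalize("The fuzzy kiwifruit is ripe.", GROUNDINGS))
# -> The FOODON:00003578 is ripe.
```

Real normalization pipelines also have to handle lemmatization, spelling variants, and ambiguous mentions that this exact-phrase lookup ignores.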
  • What is Relation Extraction?
  • The task of extracting connections or relationships between named entities from unstructured text.
  • Much like NER, this may or may not include categorization of extracted relationships.
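The NER and relation extraction tasks above can both be illustrated with a naive dictionary-based sketch. The lexicon and the sentence co-occurrence heuristic are illustrative assumptions; production systems use trained models, but the input/output shape is the same:

```python
# Toy lexicon of named entities and their categories (illustrative only).
LEXICON = {
    "marfan syndrome": "disease",
    "fibrillin-1": "protein",
}

def extract_entities(text: str, lexicon: dict[str, str]) -> list[tuple[str, str]]:
    """NER by dictionary lookup: return (mention, category) pairs."""
    lowered = text.lower()
    return [(term, cat) for term, cat in lexicon.items() if term in lowered]

def extract_relations(text: str, lexicon: dict[str, str]) -> list[tuple[str, ...]]:
    """Toy relation extraction: entities co-occurring in one sentence are
    assumed related (no categorization of the relationship)."""
    relations = []
    for sentence in text.split("."):
        ents = extract_entities(sentence, lexicon)
        if len(ents) >= 2:
            relations.append(tuple(mention for mention, _ in ents))
    return relations

text = "Marfan syndrome is caused by variants affecting fibrillin-1."
print(extract_entities(text, LEXICON))
# -> [('marfan syndrome', 'disease'), ('fibrillin-1', 'protein')]
print(extract_relations(text, LEXICON))
# -> [('marfan syndrome', 'fibrillin-1')]
```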

A Taxonomy of LLMs

General Purpose LLMs

These models are trained on broad datasets and can perform a wide range of tasks, such as text generation, language translation, and question answering. Most LLMs currently available fit into this category, though some general-purpose models also incorporate aspects of the categories below.

Embedding Models

These models are designed to generate dense vector representations (embeddings) of text, which can be used for tasks such as text classification, clustering, and information retrieval. Embedding models are often used as a preprocessing step for other LLMs or machine learning models, and can be fine-tuned for specific tasks or domains.
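Embeddings are typically compared with a similarity measure, most commonly cosine similarity: texts with similar meanings should have vectors that point in similar directions. A minimal sketch, using made-up three-dimensional vectors in place of real model output (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three texts.
emb_disease = [0.9, 0.1, 0.2]
emb_phenotype = [0.8, 0.2, 0.3]
emb_recipe = [0.1, 0.9, 0.1]

print(cosine_similarity(emb_disease, emb_phenotype))  # high: related texts
print(cosine_similarity(emb_disease, emb_recipe))     # low: unrelated texts
```

This comparison is the core operation behind embedding-based clustering and retrieval (e.g., finding the documents most relevant to a query).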

Reasoning Models

These models are designed to perform complex reasoning tasks, such as logical inference, problem-solving, and decision-making. They often incorporate additional training data or techniques, such as reinforcement learning, to enhance their reasoning capabilities. OpenAI’s o1 model and its subsequent versions (e.g., o3, o4-mini) are reasoning models. Some models (for example, Google’s Gemini 2.5 Flash model) are designed to support reasoning as needed.

Deep Research Models

These models are specifically designed for in-depth research and analysis, often incorporating domain-specific knowledge and expertise. They may be trained on specialized datasets and can perform tasks such as scientific literature analysis, hypothesis generation, and research summarization. Deep research can also be approached as an agentic feature in which a general purpose or reasoning model synthesizes material from a broad collection of sources, in much the same way a human may research a topic or try to find evidence for a specific claim. If you are wondering if this is a computationally intensive process, wonder no further: it is.

Multimodal Models

These models are trained on multiple forms of input data, such as text, images, audio, and video. They can generate output that combines multiple modalities, enabling applications such as image captioning, visual question answering, and multimodal dialogue systems.

Evaluating LLMs and their Results

Evaluating LLMs is crucial to ensure they produce accurate and reliable outputs, as their performance can vary greatly depending on the task, domain, and data. Thorough evaluation helps identify potential errors, inconsistencies, or limitations. This is particularly important for biomedical knowledge resources like Monarch, as we want the community to be able to trust our data and methods.

NLP methods and language models have traditionally been evaluated through statistical methods. These metrics, like BLEU and ROUGE, were popular for determining accuracy in tasks like natural language translation and summarization. They are also fairly straightforward to interpret and run. If a model is expected to produce a specific sequence (e.g., an exact word), then metrics like Levenshtein distance, Shannon entropy, or even simple exact string matching may be appropriate and informative. Statistical evaluators may be insufficient for many of the tasks we use LLMs for, however: they don’t account for reasoning or semantics, nor can they properly evaluate complex outputs.
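Of the sequence-level metrics mentioned above, Levenshtein distance is simple enough to compute directly: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb), # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
print(levenshtein("code", "code"))       # -> 0 (exact match)
```

A distance of 0 is equivalent to exact string matching, so this single metric covers both cases described above.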

Development of new LLM evaluation metrics is an active field. The most appropriate approach will likely depend upon your specific tasks and use cases. Conveniently, there are tools and frameworks to help:

  • llm-matrix
  • This Monarch tool allows us to run, evaluate, and compare different LLMs across a matrix of hyperparameters.
  • For example, it can run a set of test cases against a matrix of different models and temperature parameters.
  • LangChain
  • The LangSmith framework (part of LangChain) has evaluation features. See more details here: Evaluation concepts | 🦜️🛠️ LangSmith
  • DeepEval
  • This evaluation framework is described as “similar to Pytest but specialized for unit testing LLM outputs”.

See the list of evaluation tools for more details.