🔷 Denotes team, use case, or location-specific details.

⚠️ Denotes warnings and potential pitfalls; pay extra attention to these.

🧠 Background Knowledge

Papers of Note

Ongoing Meetings and Webinar Series

Tutorials and Guides

A Brief Introduction to Natural Language Processing

This section will be useful to you if you are planning to use LLMs or their output directly. If you are planning to use LLMs in an agentic setting (e.g., as a code assistant), you may find the Agentic Coding section more helpful.

  • What is natural language processing and how does it relate to LLMs?
  • Natural Language Processing (NLP) is a subfield of artificial intelligence that deals with the interaction between computers and humans in natural language, encompassing tasks such as sentiment analysis, language translation, and text summarization.
  • LLMs have become a key technology in NLP. They can accomplish many tasks previously handled by purpose-built computational pipelines, without requiring the training of new models or the assembly of carefully curated dictionaries.
  • What is a token?
  • In natural language terms, a token may be a word, part of a word, or a multi-word phrase, but in all cases it is a chunk of a larger sequence.
  • The size of a text may be measured in terms of its total length in tokens.
  • For OpenAI models, tokens are determined by the tiktoken library.
  • Andrej Karpathy has a good lecture about tokenization here: Let's build the GPT Tokenizer
    • The accompanying Colab notebook is here: Tokenization.ipynb
    • In brief: much of the weirdness we experience in working with LLMs is due to tokenization.
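To make the idea of tokens concrete, here is a toy greedy longest-match subword tokenizer. The vocabulary below is a made-up illustration; real tokenizers such as tiktoken use byte-pair encoding over vocabularies learned from data, so actual token boundaries will differ:

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword tokenization (an illustrative sketch,
    not the byte-pair encoding that production tokenizers actually use)."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Take the longest vocabulary entry starting at position i,
            # falling back to a single character if nothing matches.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens

toy_vocab = {"token", "ization", "norm", "alization"}
print(tokenize("tokenization normalization", toy_vocab))
# -> ['token', 'ization', 'norm', 'alization']
```

Note that "tokenization" is two tokens here, not one word: token counts rarely match word counts, which is why context limits and API pricing are expressed in tokens.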
  • What is NER?
  • Named Entity Recognition, or the task of finding named entities in unstructured text.
  • Sometimes known as entity extraction or entity identification.
  • Named entities don’t have a consistent definition, as they depend on the use case, but may include names, locations, concepts, events, time expressions, and more.
  • In our domains, they often include names of diseases, phenotypes, proteins, etc.
  • The task of isolating named entities from text may be considered separate from categorizing them.
  • Related tasks include:
  • What is normalization?
  • The task of replacing a named entity with a more consistent representation.
  • Sometimes known as grounding or concept replacement.
  • This often includes some lexical processing, such as:
    • lemmatization, or reducing a word to its canonical form. The word caring becomes care; the lemma of codes is code.
    • stemming, or reducing a word to a procedurally consistent root. Depending on the stemming algorithm used, the word caring may become car, and codes may become code.
  • In practice, this task requires linking a named entity to a unique identifier, and potentially replacing each instance of the entity in the text with this identifier.
    • For example, replacing every instance of the phrase “fuzzy kiwifruit” with its Food Ontology identifier, FOODON:00003578.
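A minimal sketch of that replacement step, using the "fuzzy kiwifruit" example above. The grounding table here is a hand-written illustration; in practice such lexicons are derived from ontologies like FOODON:

```python
import re

# Toy grounding table mapping entity mentions to ontology identifiers.
# Real lexicons are built from ontologies (e.g., FOODON, MONDO).
GROUNDINGS = {
    "fuzzy kiwifruit": "FOODON:00003578",
}

def normalize(text: str, groundings: dict[str, str]) -> str:
    """Replace each known entity mention with its ontology identifier."""
    for phrase, curie in groundings.items():
        text = re.sub(re.escape(phrase), curie, text, flags=re.IGNORECASE)
    return text

print(normalize("The fuzzy kiwifruit is ripe.", GROUNDINGS))
# -> The FOODON:00003578 is ripe.
```

Real normalization pipelines also have to handle lemmatization, spelling variants, and ambiguous mentions that this exact-phrase lookup ignores.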
  • What is Relation Extraction?
  • The task of extracting connections or relationships between named entities from unstructured text.
  • Much like NER, this may or may not include categorization of extracted relationships.
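The NER and relation extraction tasks above can both be illustrated with a naive dictionary-based sketch. The lexicon and the sentence co-occurrence heuristic are illustrative assumptions; production systems use trained models, but the input/output shape is the same:

```python
# Toy lexicon of named entities and their categories (illustrative only).
LEXICON = {
    "marfan syndrome": "disease",
    "fibrillin-1": "protein",
}

def extract_entities(text: str, lexicon: dict[str, str]) -> list[tuple[str, str]]:
    """NER by dictionary lookup: return (mention, category) pairs."""
    lowered = text.lower()
    return [(term, cat) for term, cat in lexicon.items() if term in lowered]

def extract_relations(text: str, lexicon: dict[str, str]) -> list[tuple[str, ...]]:
    """Toy relation extraction: entities co-occurring in one sentence are
    assumed related (no categorization of the relationship)."""
    relations = []
    for sentence in text.split("."):
        ents = extract_entities(sentence, lexicon)
        if len(ents) >= 2:
            relations.append(tuple(mention for mention, _ in ents))
    return relations

text = "Marfan syndrome is caused by variants affecting fibrillin-1."
print(extract_entities(text, LEXICON))
# -> [('marfan syndrome', 'disease'), ('fibrillin-1', 'protein')]
print(extract_relations(text, LEXICON))
# -> [('marfan syndrome', 'fibrillin-1')]
```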

A Taxonomy of LLMs

General Purpose LLMs

These models are trained on broad datasets and can perform a wide range of tasks, such as text generation, language translation, and question answering. Most LLMs currently available fit into this category, though some general-purpose models also incorporate aspects of the categories below.

Embedding Models

These models are designed to generate dense vector representations (embeddings) of text, which can be used for tasks such as text classification, clustering, and information retrieval. Embedding models are often used as a preprocessing step for other LLMs or machine learning models, and can be fine-tuned for specific tasks or domains.
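Embeddings are typically compared with a similarity measure, most commonly cosine similarity: texts with similar meanings should have vectors that point in similar directions. A minimal sketch, using made-up three-dimensional vectors in place of real model output (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three texts.
emb_disease = [0.9, 0.1, 0.2]
emb_phenotype = [0.8, 0.2, 0.3]
emb_recipe = [0.1, 0.9, 0.1]

print(cosine_similarity(emb_disease, emb_phenotype))  # high: related texts
print(cosine_similarity(emb_disease, emb_recipe))     # low: unrelated texts
```

This comparison is the core operation behind embedding-based clustering and retrieval (e.g., finding the documents most relevant to a query).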

Reasoning Models

These models are designed to perform complex reasoning tasks, such as logical inference, problem-solving, and decision-making. They often incorporate additional training data or techniques, such as reinforcement learning, to enhance their reasoning capabilities. OpenAI’s o1 model and its subsequent versions (e.g., o3, o4-mini) are reasoning models. Some models (for example, Google’s Gemini 2.5 Flash model) are designed to support reasoning as needed.

Deep Research Models

These models are specifically designed for in-depth research and analysis, often incorporating domain-specific knowledge and expertise. They may be trained on specialized datasets and can perform tasks such as scientific literature analysis, hypothesis generation, and research summarization. Deep research can also be approached as an agentic feature in which a general purpose or reasoning model synthesizes material from a broad collection of sources, in much the same way a human may research a topic or try to find evidence for a specific claim. If you are wondering if this is a computationally intensive process, wonder no further: it is.

Multimodal Models

These models are trained on multiple forms of input data, such as text, images, audio, and video. They can generate output that combines multiple modalities, enabling applications such as image captioning, visual question answering, and multimodal dialogue systems.

Evaluating LLMs and their Results

Evaluating LLMs is crucial to ensure they produce accurate and reliable outputs, as their performance can vary greatly depending on the task, domain, and data. Thorough evaluation helps identify potential errors, inconsistencies, or limitations. This is particularly important for biomedical knowledge resources like Monarch, as we want the community to be able to trust our data and methods.

NLP methods and language models have traditionally been evaluated through statistical methods. These metrics, like BLEU and ROUGE, were popular for determining accuracy in tasks like natural language translation and summarization. They are also fairly straightforward to interpret and run. If a model is expected to produce a specific sequence (e.g., an exact word), then metrics like Levenshtein distance, Shannon entropy, or even simple exact string matching may be appropriate and informative. Statistical evaluators may be insufficient for many of the tasks we use LLMs for, however: they don’t account for reasoning or semantics, nor can they properly evaluate complex outputs.
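Of the sequence-level metrics mentioned above, Levenshtein distance is simple enough to compute directly: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (ca != cb), # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
print(levenshtein("code", "code"))       # -> 0 (exact match)
```

A distance of 0 is equivalent to exact string matching, so this single metric covers both cases described above.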

Development of new LLM evaluation metrics is an active field. The most appropriate approach will likely depend upon your specific tasks and use cases. Conveniently, there are tools and frameworks to help:

  • llm-matrix
  • This Monarch tool allows us to run, evaluate, and compare different LLMs across a matrix of hyperparameters.
  • For example, it can run a set of test cases against a matrix of different models and temperature parameters.
  • LangChain
  • The LangSmith framework (part of LangChain) has evaluation features. See more details here: Evaluation concepts | 🦜️🛠️ LangSmith
  • DeepEval
  • This evaluation framework is described as “similar to Pytest but specialized for unit testing LLM outputs”.

See the list of evaluation tools for more details.