curategpt.agents package

Submodules

curategpt.agents.agent_utils module

Agent Utilities.

curategpt.agents.agent_utils.select_from_options_prompt(kb_results, model, obj_type='Reference', query=None, prompt_template=None, id_field=None)

Prompt user to select from a list of options.

Parameters:
  • kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]) – ordered from most relevant

  • model (Model)

  • obj_type (str)

  • query (Optional[str])

  • prompt_template (Optional[str])

Return type:

Tuple[str, Dict[str, str], Dict]

Returns:

curategpt.agents.base_agent module

Base Agent.

class curategpt.agents.base_agent.BaseAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)

Bases: ABC

Base class for agents.

An agent is capable of composing together different actions to achieve a goal.

An agent typically has a knowledge source that it uses to search for information. An agent also has access to a model through an extractor.

extractor: Extractor = None

Engine performing LLM operations, including extracting from prompt responses

knowledge_source: Union[DBAdapter, BaseWrapper] = None

A searchable source of information

knowledge_source_collection: str = None
search()

curategpt.agents.bootstrap_agent module

class curategpt.agents.bootstrap_agent.BootstrapAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)

Bases: BaseAgent

bootstrap_data(specification=None, schema=None)

Bootstrap data for a knowledge base.

Parameters:
  • specification (Optional[KnowledgeBaseSpecification]) – Specification for the knowledge base.

  • schema (Optional[Dict]) – Schema for the knowledge base.

Return type:

str

Returns:

bootstrap_schema(specification)

Bootstrap a schema for a knowledge base.

Parameters:

specification (KnowledgeBaseSpecification) – Specification for the knowledge base.

Return type:

AnnotatedObject

Returns:

class curategpt.agents.bootstrap_agent.KnowledgeBaseSpecification(**data)

Bases: BaseModel

attributes: str
description: str
kb_name: str
main_class: str
model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

curategpt.agents.chat_agent module

Chat with a KB.

class curategpt.agents.chat_agent.ChatAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=0.5, conversation_id=None)

Bases: BaseAgent

An agent that allows chat to a knowledge source.

This implements a standard knowledge base retrieval-augmented generation pattern. The knowledge_source is queried for relevant objects (the source can be a local database or a remote source such as PubMed). The objects are then provided as context to an LLM query.
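
The retrieve-then-prompt pattern described above can be sketched without any CurateGPT imports. This is a minimal illustration, not the library's implementation: `toy_search`, `build_rag_prompt`, the sample documents, and the prompt wording are all hypothetical stand-ins for the knowledge source and the LLM call.

```python
from typing import Callable, Dict, List, Tuple


def toy_search(query: str, limit: int) -> List[str]:
    """Hypothetical knowledge source: rank documents by words shared with the query."""
    docs = [
        "BRCA1 is involved in DNA repair.",
        "Aspirin inhibits cyclooxygenase.",
        "BRCA2 mutations increase breast cancer risk.",
    ]
    words = set(query.lower().split())
    scored = [(len(words & set(d.lower().rstrip(".").split())), d) for d in docs]
    # Keep only documents that share at least one word, best match first.
    ranked = [d for score, d in sorted(scored, key=lambda p: -p[0]) if score > 0]
    return ranked[:limit]


def build_rag_prompt(
    query: str,
    search: Callable[[str, int], List[str]],
    limit: int = 10,
) -> Tuple[str, Dict[str, str]]:
    """Retrieve context, number it, and assemble a prompt that asks for citations."""
    hits = search(query, limit)
    references = {str(i): text for i, text in enumerate(hits, start=1)}
    context = "\n".join(f"[{key}] {text}" for key, text in references.items())
    prompt = (
        "Answer the question using only the numbered references, "
        "citing them as [n].\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return prompt, references


prompt, references = build_rag_prompt("What does BRCA1 do in DNA repair", toy_search)
print(prompt)
```

Numbering the retrieved objects up front is what makes it possible to trace citations in the response back to their sources afterwards.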

chat(query, conversation=None, limit=10, collection=None, expand=True, **kwargs)

Answer a query using objects retrieved from the knowledge source as context.

Parameters:
  • query

  • kwargs

Return type:

ChatResponse

Returns:

conversation_id: Optional[str] = None
relevance_factor: float = 0.5

Relevance factor for diversifying search results using MMR.
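
As an illustration of what a relevance factor does in MMR (maximal marginal relevance), the sketch below re-ranks candidates by trading query similarity against similarity to results already picked. How CurateGPT passes `relevance_factor` to its search backend is not shown in this reference; the sketch assumes it plays the role of the usual MMR trade-off weight, and `cosine` and `mmr_select` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
    relevance_factor: float = 0.5,
) -> List[int]:
    """Greedy MMR: return indices of up to k documents.

    relevance_factor=1.0 ranks purely by similarity to the query;
    lower values penalize documents similar to ones already selected.
    """
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return relevance_factor * relevance - (1 - relevance_factor) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # doc 1 nearly duplicates doc 0
print(mmr_select([1.0, 0.0], docs, k=2, relevance_factor=1.0))
print(mmr_select([1.0, 0.0], docs, k=2, relevance_factor=0.3))
```

With a factor of 1.0 the near-duplicate is kept because it is the second most relevant; with a lower factor the more distinct document is preferred.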

class curategpt.agents.chat_agent.ChatResponse(**data)

Bases: BaseModel

Response from chat engine.

TODO: Rename class to indicate that it is provenance-enabled chat

body: str

Text of response.

formatted_body: str

Body formatted with markdown links to references.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: str

Prompt used to generate response.

references: Optional[Dict[str, Any]]

References for citations detected in response.

uncited_references: Optional[Dict[str, Any]]

Potential references for which there was no detected citation.

Replace references with links.
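
The distinction between cited and uncited references, and the replacement of citations with links, can be illustrated with a small standalone sketch. It assumes citations appear in the body as bracketed keys such as `[1]`; that convention, the function names, and the example URLs are assumptions made for illustration, not the library's actual format.

```python
import re
from typing import Any, Dict, Tuple

CITATION = re.compile(r"\[(\w+)\](?!\()")  # [key] not already followed by a link


def split_references(
    body: str, candidates: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate candidate references into those cited in the body and the rest."""
    cited_keys = set(CITATION.findall(body))
    cited = {k: v for k, v in candidates.items() if k in cited_keys}
    uncited = {k: v for k, v in candidates.items() if k not in cited_keys}
    return cited, uncited


def link_citations(body: str, urls: Dict[str, str]) -> str:
    """Turn [key] into a markdown link [key](url) when a URL is known."""

    def repl(match: "re.Match[str]") -> str:
        key = match.group(1)
        return f"[{key}]({urls[key]})" if key in urls else match.group(0)

    return CITATION.sub(repl, body)


body = "Gene X is linked to Y [1]. See also [3]."
candidates = {"1": {"title": "Paper A"}, "2": {"title": "Paper B"}}
urls = {"1": "https://example.org/1", "2": "https://example.org/2"}
cited, uncited = split_references(body, candidates)
print(cited, uncited)
print(link_citations(body, urls))
```

Citations with no matching reference, such as `[3]` here, are left untouched rather than dropped, so a reader can still see that the model cited something that could not be resolved.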

curategpt.agents.concept_recognition_agent module

Annotation (Concept Recognition) in texts.

class curategpt.agents.concept_recognition_agent.AnnotatedText(**data)

Bases: BaseModel

An input text annotated with concept instances.

annotated_text: Optional[str]

Text with concepts annotated (not all methods produce this).

concepts: Optional[Dict[str, str]]

Dictionary of concepts found in the text. TODO: change to list of spans.

input_text: str

Text that is supplied for annotation.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: Optional[str]

Prompt used to generate the annotated text.

spans: Optional[List[Span]]
summary: Optional[str]

Summary of the results.

class curategpt.agents.concept_recognition_agent.AnnotationMethod(value)

Bases: str, Enum

Strategy or algorithm used for concept recognition (CR).

CONCEPT_LIST = 'concept_list'

LLM creates a list of concepts

INLINE = 'inline'

LLM creates an annotated document

TWO_PASS = 'two_pass'

LLM annotates a document using NER and then grounds the concepts
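
Because the enumeration derives from both str and Enum, its members compare equal to their plain string values, which is convenient when a method name arrives from a command line or config file. The sketch below uses a local copy of the enum so that it runs standalone; the dispatch table mapping each member to an `annotate_*` method name is an assumption inferred from the method names, not confirmed library behavior.

```python
from enum import Enum
from typing import Union


class AnnotationMethod(str, Enum):
    """Local copy of the documented enum, for illustration only."""

    INLINE = "inline"
    CONCEPT_LIST = "concept_list"
    TWO_PASS = "two_pass"


# Hypothetical dispatch table: which agent method would handle each strategy.
HANDLERS = {
    AnnotationMethod.INLINE: "annotate_inline",
    AnnotationMethod.CONCEPT_LIST: "annotate_concept_list",
    AnnotationMethod.TWO_PASS: "annotate_two_pass",
}


def handler_name(method: Union[str, AnnotationMethod]) -> str:
    """Accept either a member or its string value; raise ValueError if unknown."""
    return HANDLERS[AnnotationMethod(method)]


print(AnnotationMethod.INLINE == "inline")
print(handler_name("two_pass"))
```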

class curategpt.agents.concept_recognition_agent.ConceptRecognitionAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, identifier_field=None, label_field=None, split_input_text=None, relevance_factor=0.8, prefixes=None)

Bases: BaseAgent

annotate(text, collection=None, method=AnnotationMethod.INLINE, **kwargs)
Return type:

AnnotatedText

annotate_concept_list(text, collection=None, categories=None, **kwargs)
Return type:

AnnotatedText

annotate_inline(text, collection=None, categories=None, **kwargs)
Return type:

AnnotatedText

annotate_two_pass(text, collection=None, categories=None, **kwargs)
Return type:

AnnotatedText

ground_concept(text, collection=None, categories=None, include_category_in_search=True, context=None, **kwargs)
Return type:

GroundingResult

identifier_field: str = None

Field to use as identifier for objects.

label_field: str = None

Field to use as label for objects.

prefixes: List[str] = None

List of prefixes to use for concept IDs.

relevance_factor: float = 0.8

Relevance factor for diversifying search results using MMR.

split_input_text: bool = None
class curategpt.agents.concept_recognition_agent.GroundingResult(**data)

Bases: BaseModel

Result of grounding text.

input_text: str

Text that is supplied for grounding, assumed to contain a single context.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

score: Optional[float]

Score/confidence, from zero to one.

spans: Optional[List[Span]]

Ordered list of candidate spans.

class curategpt.agents.concept_recognition_agent.Span(**data)

Bases: BaseModel

An individual span of text containing a single concept.

concept_id: str

Concept ID.

concept_label: Optional[str]

Concept label.

end: Optional[int]
is_suspect: Optional[bool]

Potential hallucination due to ID/label mismatch.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

start: Optional[int]
text: str
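
One way to think about the is_suspect flag is as a consistency check between the concept ID and the label returned by the LLM. The sketch below uses a dataclass stand-in (`SpanSketch`) rather than the real pydantic model, and the rule in `flag_suspect` (unknown ID, or label differing from the known label) is an assumption for illustration; the library's actual criteria are not documented here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SpanSketch:
    """Stand-in for Span with a subset of its documented fields."""

    text: str
    concept_id: str
    concept_label: Optional[str] = None
    is_suspect: Optional[bool] = None


def flag_suspect(span: SpanSketch, known_labels: Dict[str, str]) -> SpanSketch:
    """Mark a span as suspect if its ID is unknown or its label disagrees."""
    expected = known_labels.get(span.concept_id)
    if expected is None:
        span.is_suspect = True  # ID not found in the reference vocabulary
    elif span.concept_label is None:
        span.is_suspect = False  # nothing to contradict the ID
    else:
        span.is_suspect = expected.lower() != span.concept_label.lower()
    return span


known = {"HP:0000923": "Bifid ribs"}
print(flag_suspect(SpanSketch("bifid ribs", "HP:0000923", "bifid ribs"), known))
print(flag_suspect(SpanSketch("bifid ribs", "HP:0000923", "Cleft palate"), known))
```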
curategpt.agents.concept_recognition_agent.parse_annotations(text, marker_char=None)

Parse annotations from text.

>>> text = ("A minimum diagnostic criterion is the combination of either the [skin tumours] or multiple "
...        "[odontogenic keratocysts HP:0010603] of the jaw plus a positive [family history HP:0032316] "
...        "for this disorder, [bifid ribs HP:0000923], lamellar [calcification of falx cerebri HP:0005462] "
...        "or any one of the skeletal abnormalities typical of this syndrome")
>>> for ann in parse_annotations(text):
...    print(ann)
('skin tumours', None)
('odontogenic keratocysts', 'HP:0010603')
('family history', 'HP:0032316')
('bifid ribs', 'HP:0000923')
('calcification of falx cerebri', 'HP:0005462')

For texts with marker characters:

>>> text = "for this disorder, [bifid ribs | HP:0000923], lamellar [calcification of falx cerebri | HP:0005462] "
>>> for ann in parse_annotations(text, "|"):
...    print(ann)
('bifid ribs', 'HP:0000923')
('calcification of falx cerebri', 'HP:0005462')
Parameters:

text

Return type:

List[Tuple[str, str]]

Returns:
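
For readers who want to see how bracketed annotations of this shape could be parsed, here is a standalone regex-based sketch. It is not the library's implementation: `parse_bracketed` is a hypothetical name, and the assumption that an ID is a trailing prefix:local token (for example HP:0010603) is made only for illustration.

```python
import re
from typing import List, Optional, Tuple

BRACKETED = re.compile(r"\[([^\[\]]+)\]")
# Assumed ID shape: a final whitespace-separated token of the form PREFIX:LOCAL.
LABEL_AND_ID = re.compile(r"^(?P<label>.*?)\s+(?P<id>[A-Za-z][\w.]*:\S+)$")


def parse_bracketed(
    text: str, marker_char: Optional[str] = None
) -> List[Tuple[str, Optional[str]]]:
    """Return (label, concept_id) pairs; concept_id is None when no ID is present."""
    results: List[Tuple[str, Optional[str]]] = []
    for body in BRACKETED.findall(text):
        body = body.strip()
        if marker_char and marker_char in body:
            # Split on the last marker so labels may contain the marker too.
            label, _, concept_id = body.rpartition(marker_char)
            results.append((label.strip(), concept_id.strip()))
            continue
        match = LABEL_AND_ID.match(body)
        if match:
            results.append((match.group("label"), match.group("id")))
        else:
            results.append((body, None))
    return results


print(parse_bracketed("the [skin tumours] or [odontogenic keratocysts HP:0010603]"))
print(parse_bracketed("[bifid ribs | HP:0000923]", "|"))
```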

curategpt.agents.concept_recognition_agent.parse_spans(text, concept_dict=None)
Return type:

List[Span]

curategpt.agents.dase_agent module

Autocomplete objects using RAG.

class curategpt.agents.dase_agent.DatabaseAugmentedStructuredExtraction(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=1000, background_document_limit=3, default_masked_fields=<factory>)

Bases: BaseAgent

Extracts structured objects from unstructured documents.

This implements a standard knowledge base retrieval-augmented generation pattern: the knowledge_source is queried for relevant objects, which are then presented as examples in an LLM query via an extractor.

background_document_limit: int = 3

Number of background documents to use. TODO: more sophisticated way to estimate.

conversation: List[Dict[str, Any]] = None
conversation_mode: bool = False
default_masked_fields: List[str]
default_target_class: ClassVar[str] = 'Thing'
document_adapter: DBAdapter = None

Adapter to supplementary knowledge in unstructured form.

document_adapter_collection: str = None

Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters

extract(text, target_class=None, feature_fields=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)

Populate structured object from text.

Parameters:
  • text

  • target_class (Optional[str])

  • context_property

  • generate_background

  • collection (Optional[str])

  • rules (Optional[List[str]])

  • kwargs

Return type:

AnnotatedObject

Returns:

max_background_document_size: int = 1000

TODO: more sophisticated way to estimate size of background document.

relevance_factor: float = 0.5

Relevance factor for diversifying search results using MMR.

class curategpt.agents.dase_agent.PredictedFieldValue(**data)

Bases: BaseModel

current_value: Optional[str]
field_predicted: Optional[str]
id: str
model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

original_id: Optional[str]
predicted_value: Optional[str]

curategpt.agents.dragon_agent module

Retrieval Augmented Generation (RAG) Base Class.

class curategpt.agents.dragon_agent.DragonAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=2000, background_document_limit=3, default_masked_fields=<factory>)

Bases: BaseAgent

Retrieves objects in response to a query using a structured knowledge source.

(essentially a structured object autocomplete)

This implements a standard knowledge base retrieval-augmented generation pattern: the knowledge_source is queried for relevant objects, which are then presented as examples in an LLM query via an extractor.

background_document_limit: int = 3

Number of background documents to use. TODO: more sophisticated way to estimate.

complete(seed, target_class=None, context_property=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)

Populate structured object from partially populated object.

If a string is passed, then an object of form {context_property: seed} is used.

Parameters:
  • seed (Union[str, Dict[str, Any]])

  • target_class (Optional[str])

  • context_property (Optional[str])

  • generate_background

  • collection (Optional[str])

  • rules (Optional[List[str]]) – these are included in the prompt

  • kwargs

Return type:

AnnotatedObject

Returns:
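
The rule that a string seed is wrapped as {context_property: seed} can be shown in a few lines. This is a standalone sketch: `normalize_seed` is a hypothetical helper, and the fallback key "name" is an assumption borrowed from the default of generate_queries, not documented behavior of complete.

```python
from typing import Any, Dict, Union


def normalize_seed(
    seed: Union[str, Dict[str, Any]], context_property: str = "name"
) -> Dict[str, Any]:
    """Turn a seed into a partial object suitable for completion."""
    if isinstance(seed, str):
        # A bare string becomes a one-field object keyed by context_property.
        return {context_property: seed}
    # Copy so the caller's dictionary is not modified downstream.
    return dict(seed)


print(normalize_seed("Parkinson disease", context_property="label"))
print(normalize_seed({"name": "Parkinson disease", "category": "Disease"}))
```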

conversation: List[Dict[str, Any]] = None
conversation_mode: bool = False
default_masked_fields: List[str]
default_target_class: ClassVar[str] = 'Thing'
document_adapter: DBAdapter = None

Adapter to supplementary knowledge in unstructured form.

document_adapter_collection: str = None

Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters

generate_all(collection, field_to_predict, missing_only=True, object_ids=None, **kwargs)

Generate missing value for a field for all objects in a collection.

Parameters:
  • collection (str)

  • field_to_predict (str)

  • missing_only

  • object_ids (Optional[Iterable[str]])

  • kwargs

Return type:

Iterable[Tuple[str, str, Any, Any]]

Returns:

generate_queries(context_property='name', n=5, **kwargs)
Return type:

List[str]

max_background_document_size: int = 2000

TODO: more sophisticated way to estimate size of background document.

relevance_factor: float = 0.5

Relevance factor for diversifying search results using MMR.

review(obj, context_property=None, rules=None, collection=None, fields_to_predict=None, primary_key=None, **kwargs)

Review an object for correctness, completeness, and consistency.

Parameters:

obj (dict)

Return type:

AnnotatedObject

class curategpt.agents.dragon_agent.PredictedFieldValue(**data)

Bases: BaseModel

current_value: Optional[str]
field_predicted: Optional[str]
id: str
model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

original_id: Optional[str]
predicted_value: Optional[str]

curategpt.agents.evidence_agent module

class curategpt.agents.evidence_agent.EvidenceAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, chat_agent=None, evidence_update_policy=EvidenceUpdatePolicyEnum.skip)

Bases: BaseAgent

An agent to find evidence for an object by querying a reference source.

The evidence agent is able to find (supporting and refuting) evidence for any of the following:

  • A simple statement in natural language

  • A simple structured dictionary object of key-value pairs

  • A complex structured dictionary object with nested key-value pairs

The default source is PubMed, accessed with PubmedWrapper via ChatAgent.

chat_agent: Union[ChatAgent, BaseWrapper] = None
evidence_update_policy: EvidenceUpdatePolicyEnum = 'skip'
find_evidence(obj)
Return type:

ChatResponse

find_evidence_complex(obj, label_field=None, statement_fields=None)
Return type:

Dict

find_evidence_simple(query, limit=10, **kwargs)
Return type:

Optional[List[Dict]]

class curategpt.agents.evidence_agent.EvidenceUpdatePolicyEnum(value)

Bases: str, Enum

An enumeration.

append = 'append'
replace = 'replace'
skip = 'skip'
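
The three policy values suggest how newly found evidence might be combined with evidence already attached to an object. The sketch below encodes one plausible reading; the semantics of each value, the `evidence` field name, and the `apply_policy` helper are all assumptions for illustration and are not confirmed by this reference.

```python
from enum import Enum
from typing import Any, Dict, List


class EvidenceUpdatePolicy(str, Enum):
    """Local copy of the documented values, for illustration only."""

    skip = "skip"
    append = "append"
    replace = "replace"


def apply_policy(
    obj: Dict[str, Any],
    new_evidence: List[str],
    policy: EvidenceUpdatePolicy,
    field: str = "evidence",
) -> Dict[str, Any]:
    """Return a copy of obj with evidence updated according to the policy."""
    updated = dict(obj)
    existing = obj.get(field)
    if not existing:
        updated[field] = list(new_evidence)  # nothing to preserve
    elif policy == EvidenceUpdatePolicy.append:
        updated[field] = list(existing) + list(new_evidence)
    elif policy == EvidenceUpdatePolicy.replace:
        updated[field] = list(new_evidence)
    # Assumed meaning of skip: leave objects that already have evidence untouched.
    return updated


obj = {"id": "x", "evidence": ["PMID:1"]}
for policy in EvidenceUpdatePolicy:
    print(policy.value, apply_policy(obj, ["PMID:2"], policy)["evidence"])
```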

curategpt.agents.huggingface_agent module

class curategpt.agents.huggingface_agent.HuggingFaceAgent(api=None, **_kwargs)

Bases: object

api: HfApi = None
cached_download(repo_id, repo_type, filename)
upload(objects, metadata, repo_id, private=False, **kwargs)

Upload an entire collection to a Hugging Face repository.

Parameters:
  • objects – The objects to upload.

  • metadata – The metadata associated with the collection.

  • repo_id – The repository ID on Hugging Face.

  • private – Whether the repository should be private.

  • kwargs – Additional arguments such as batch size or metadata options.

upload_duckdb(objects, metadata, repo_id, private=False, **kwargs)

Upload an entire collection to a Hugging Face repository.

Parameters:
  • objects – The objects to upload.

  • metadata – The metadata associated with the collection.

  • repo_id – The repository ID on Hugging Face.

  • private – Whether the repository should be private.

  • kwargs – Additional arguments such as batch size or metadata options.

curategpt.agents.mapping_agent module

Map and align entities using a KB.

class curategpt.agents.mapping_agent.Mapping(**data)

Bases: BaseModel

A single mapping between a subject and an object.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

object_id: str
predicate_id: Optional[MappingPredicate]
subject_id: str
class curategpt.agents.mapping_agent.MappingAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=1.0)

Bases: BaseAgent

An agent to map/align entities.

categorize_mappings(query, kb_results, **kwargs)

Categorize the predicate of each mapping.

Parameters:
  • query (Union[str, Dict[str, Any]])

  • kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]])

Return type:

Iterator[Mapping]

Returns:

Find links between elements in this collection and another collection

Parameters:

other_collection (str)

Return type:

Iterator[Tuple[str, str, str]]

Returns:

match(query, limit=None, randomize_order=False, include_predicates=False, fields=None, id_field='id', **kwargs)

Match entities

Parameters:
  • query (Union[str, Dict[str, Any]])

  • limit (Optional[int])

  • randomize_order (bool) – randomize the order in which candidates are presented (mostly for testing purposes)

  • kwargs

Return type:

MappingSet

Returns:

relevance_factor: float = 1.0

Relevance factor for diversifying search results using MMR. A high value is recommended for this task.

class curategpt.agents.mapping_agent.MappingPredicate(value)

Bases: str, Enum

An enumeration.

BROAD_MATCH = 'BROAD_MATCH'
CLOSE_MATCH = 'CLOSE_MATCH'
DIFFERENT_FROM = 'DIFFERENT_FROM'
NARROW_MATCH = 'NARROW_MATCH'
RELATED_MATCH = 'RELATED_MATCH'
SAME_AS = 'SAME_AS'
UNKNOWN = 'UNKNOWN'
class curategpt.agents.mapping_agent.MappingSet(**data)

Bases: BaseModel

mappings: List[Mapping]
model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: str
response_text: str
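
When predicates come back from an LLM as free text, it helps to coerce them into the enumeration and fall back to UNKNOWN for anything unrecognized. The sketch below uses local dataclass and enum stand-ins so that it runs standalone; `to_predicate` and `MappingSketch` are hypothetical names, and the normalization rule is an assumption rather than the library's parsing logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MappingPredicate(str, Enum):
    """Local copy of the documented predicate values."""

    BROAD_MATCH = "BROAD_MATCH"
    CLOSE_MATCH = "CLOSE_MATCH"
    DIFFERENT_FROM = "DIFFERENT_FROM"
    NARROW_MATCH = "NARROW_MATCH"
    RELATED_MATCH = "RELATED_MATCH"
    SAME_AS = "SAME_AS"
    UNKNOWN = "UNKNOWN"


@dataclass
class MappingSketch:
    """Stand-in for Mapping with the same three fields."""

    subject_id: str
    object_id: str
    predicate_id: Optional[MappingPredicate] = None


def to_predicate(raw: Optional[str]) -> MappingPredicate:
    """Normalize free text such as 'close match' to a predicate, else UNKNOWN."""
    if raw is None:
        return MappingPredicate.UNKNOWN
    key = raw.strip().upper().replace(" ", "_").replace("-", "_")
    try:
        return MappingPredicate(key)
    except ValueError:
        return MappingPredicate.UNKNOWN


mapping = MappingSketch("HP:0000923", "MP:0000001", to_predicate("close match"))
print(mapping)
print(to_predicate("exact"))
```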

curategpt.agents.summarization_agent module

class curategpt.agents.summarization_agent.SummarizationAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)

Bases: BaseAgent

An agent to summarize entities

AKA SPINDOCTOR/TALISMAN

summarize(object_ids, description_field, name_field, strict=False, system_prompt=None)

Summarize a list of objects.

Example:

>>> extractor = BasicExtractor()
>>> wrapper = get_wrapper("alliance_gene")
>>> agent = SummarizationAgent(wrapper, extractor=extractor)
>>> gene_ids = ["HGNC:9221", "HGNC:11195", "HGNC:6348", "HGNC:7553"]
>>> response = agent.summarize(
...               gene_ids,
...               name_field="symbol",
...               description_field="automatedGeneSynopsis",
...               system_prompt="What function do these genes have in common?",
...           )
>>> print(response)
Parameters:
  • object_ids (List[str])

  • description_field (str)

  • name_field (str)

  • strict (bool)

  • system_prompt (Optional[str])

Returns:

Module contents

CurateGPT Agents.

These chain together different search and generate components.

class curategpt.agents.ChatAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=0.5, conversation_id=None)

Bases: BaseAgent

An agent that allows chat to a knowledge source.

This implements a standard knowledge base retrieval-augmented generation pattern. The knowledge_source is queried for relevant objects (the source can be a local database or a remote source such as PubMed). The objects are then provided as context to an LLM query.

chat(query, conversation=None, limit=10, collection=None, expand=True, **kwargs)

Answer a query using objects retrieved from the knowledge source as context.

Parameters:
  • query

  • kwargs

Return type:

ChatResponse

Returns:

conversation_id: Optional[str] = None
relevance_factor: float = 0.5

Relevance factor for diversifying search results using MMR.

class curategpt.agents.DragonAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=2000, background_document_limit=3, default_masked_fields=<factory>)

Bases: BaseAgent

Retrieves objects in response to a query using a structured knowledge source.

(essentially a structured object autocomplete)

This implements a standard knowledge base retrieval-augmented generation pattern: the knowledge_source is queried for relevant objects, which are then presented as examples in an LLM query via an extractor.

background_document_limit: int = 3

Number of background documents to use. TODO: more sophisticated way to estimate.

complete(seed, target_class=None, context_property=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)

Populate structured object from partially populated object.

If a string is passed, then an object of form {context_property: seed} is used.

Parameters:
  • seed (Union[str, Dict[str, Any]])

  • target_class (Optional[str])

  • context_property (Optional[str])

  • generate_background

  • collection (Optional[str])

  • rules (Optional[List[str]]) – these are included in the prompt

  • kwargs

Return type:

AnnotatedObject

Returns:

conversation: List[Dict[str, Any]] = None
conversation_mode: bool = False
default_masked_fields: List[str]
default_target_class: ClassVar[str] = 'Thing'
document_adapter: DBAdapter = None

Adapter to supplementary knowledge in unstructured form.

document_adapter_collection: str = None

Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters

generate_all(collection, field_to_predict, missing_only=True, object_ids=None, **kwargs)

Generate missing value for a field for all objects in a collection.

Parameters:
  • collection (str)

  • field_to_predict (str)

  • missing_only

  • object_ids (Optional[Iterable[str]])

  • kwargs

Return type:

Iterable[Tuple[str, str, Any, Any]]

Returns:

generate_queries(context_property='name', n=5, **kwargs)
Return type:

List[str]

max_background_document_size: int = 2000

TODO: more sophisticated way to estimate size of background document.

relevance_factor: float = 0.5

Relevance factor for diversifying search results using MMR.

review(obj, context_property=None, rules=None, collection=None, fields_to_predict=None, primary_key=None, **kwargs)

Review an object for correctness, completeness, and consistency.

Parameters:

obj (dict)

Return type:

AnnotatedObject

class curategpt.agents.EvidenceAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, chat_agent=None, evidence_update_policy=EvidenceUpdatePolicyEnum.skip)

Bases: BaseAgent

An agent to find evidence for an object by querying a reference source.

The evidence agent is able to find (supporting and refuting) evidence for any of the following:

  • A simple statement in natural language

  • A simple structured dictionary object of key-value pairs

  • A complex structured dictionary object with nested key-value pairs

The default source is PubMed, accessed with PubmedWrapper via ChatAgent.

chat_agent: Union[ChatAgent, BaseWrapper] = None
evidence_update_policy: EvidenceUpdatePolicyEnum = 'skip'
find_evidence(obj)
Return type:

ChatResponse

find_evidence_complex(obj, label_field=None, statement_fields=None)
Return type:

Dict

find_evidence_simple(query, limit=10, **kwargs)
Return type:

Optional[List[Dict]]

class curategpt.agents.MappingAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=1.0)

Bases: BaseAgent

An agent to map/align entities.

categorize_mappings(query, kb_results, **kwargs)

Categorize the predicate of each mapping.

Parameters:
  • query (Union[str, Dict[str, Any]])

  • kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]])

Return type:

Iterator[Mapping]

Returns:

Find links between elements in this collection and another collection

Parameters:

other_collection (str)

Return type:

Iterator[Tuple[str, str, str]]

Returns:

match(query, limit=None, randomize_order=False, include_predicates=False, fields=None, id_field='id', **kwargs)

Match entities

Parameters:
  • query (Union[str, Dict[str, Any]])

  • limit (Optional[int])

  • randomize_order (bool) – randomize the order in which candidates are presented (mostly for testing purposes)

  • kwargs

Return type:

MappingSet

Returns:

relevance_factor: float = 1.0

Relevance factor for diversifying search results using MMR. A high value is recommended for this task.