curategpt.agents package

Submodules

curategpt.agents.agent_utils module

Agent Utilities.

curategpt.agents.agent_utils.select_from_options_prompt(kb_results, model, obj_type='Reference', query=None, prompt_template=None, id_field=None)

Prompt user to select from a list of options.

Parameters:

kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]) – search results, ordered from most to least relevant
model (Model)
obj_type (str)
query (Optional[str])
prompt_template (Optional[str])

Return type:

Tuple[str, Dict[str, str], Dict]

curategpt.agents.base_agent module

Base Agent.

class curategpt.agents.base_agent.BaseAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)

Bases: ABC

Base class for agents.

An agent is capable of composing together different actions to achieve a goal.

An agent typically has a knowledge source that it uses to search for information. An agent also has access to a model through an extractor.

extractor: Extractor = None: Engine performing LLM operations, including extracting from prompt responses

knowledge_source: Union[DBAdapter, BaseWrapper] = None: A searchable source of information

knowledge_source_collection: str = None

search()
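
As a rough illustration of how an agent ties a knowledge source to a search call, the sketch below uses stand-in classes. ToyKnowledgeSource, ToyAgent, and the search(text, limit) signature are assumptions made for this example; they are not the real DBAdapter, BaseWrapper, or BaseAgent API.

```python
from abc import ABC
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


class ToyKnowledgeSource:
    """Hypothetical stand-in for a DBAdapter or BaseWrapper."""

    def __init__(self, objects: List[Dict[str, Any]]):
        self.objects = objects

    def search(self, text: str, limit: int = 10) -> List[Dict[str, Any]]:
        # Naive substring match; a real adapter would use embeddings.
        hits = [o for o in self.objects if text.lower() in str(o).lower()]
        return hits[:limit]


@dataclass
class ToyAgent(ABC):
    """Mirrors the three documented BaseAgent fields; not the real class."""

    knowledge_source: Optional[ToyKnowledgeSource] = None
    knowledge_source_collection: Optional[str] = None
    extractor: Optional[Any] = None

    def search(self, text: str, limit: int = 10) -> List[Dict[str, Any]]:
        # Delegate the lookup to whatever knowledge source is attached.
        return self.knowledge_source.search(text, limit=limit)


source = ToyKnowledgeSource(
    [{"id": "X:1", "name": "liver"}, {"id": "X:2", "name": "heart"}]
)
agent = ToyAgent(knowledge_source=source)
hits = agent.search("liver")
print(hits)
```

Subclasses would add task-specific methods (chat, complete, match, and so on) on top of this shared search step.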

curategpt.agents.bootstrap_agent module

class curategpt.agents.bootstrap_agent.BootstrapAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)

Bases: BaseAgent

bootstrap_data(specification=None, schema=None)

Bootstrap data for a knowledge base.

Parameters:

specification (KnowledgeBaseSpecification) – Specification for the knowledge base.
schema (Dict) – Schema for the knowledge base.

Return type:

str

bootstrap_schema(specification)

Bootstrap a schema for a knowledge base.

Parameters:: specification (KnowledgeBaseSpecification) – Specification for the knowledge base.
Return type:: AnnotatedObject

class curategpt.agents.bootstrap_agent.KnowledgeBaseSpecification(**data)

Bases: BaseModel

attributes: str

description: str

kb_name: str

main_class: str

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

curategpt.agents.chat_agent module

Chat with a KB.

class curategpt.agents.chat_agent.ChatAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=0.5, conversation_id=None)

Bases: BaseAgent

An agent that allows chat to a knowledge source.

This implements a standard knowledge base retrieval-augmented generation (RAG) pattern. The knowledge_source is queried for relevant objects (the source can be a local database or a remote source such as PubMed). The objects are then provided as context to an LLM query.
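
The retrieve-then-prompt flow described above can be sketched with the standard library alone. Everything here is an assumption for illustration: toy_search, toy_model, build_prompt, and the prompt wording are hypothetical and do not reflect the prompt that ChatAgent actually sends.

```python
from typing import Callable, Dict, List, Tuple

DOCS = [
    {"title": "BRCA1 review", "text": "BRCA1 participates in DNA repair."},
    {"title": "Gardening notes", "text": "Tomatoes need full sun."},
]


def toy_search(query: str) -> List[Dict[str, str]]:
    # Keep documents that share at least one word with the query.
    words = set(query.lower().split())
    return [
        d for d in DOCS
        if words & set(d["text"].lower().rstrip(".").split())
    ]


def toy_model(prompt: str) -> str:
    # Stand-in for an LLM call; returns a canned, cited answer.
    return "BRCA1 takes part in DNA repair [1]."


def build_prompt(query: str, objects: List[Dict[str, str]]) -> str:
    # Number each retrieved object so the model can cite it as [1], [2], ...
    lines = ["Answer the question using only the references below.", ""]
    for i, obj in enumerate(objects, start=1):
        lines.append(f"[{i}] {obj['title']}: {obj['text']}")
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)


def rag_chat(
    query: str,
    search: Callable[[str], List[Dict[str, str]]],
    model: Callable[[str], str],
    limit: int = 10,
) -> Tuple[str, str]:
    objects = search(query)[:limit]       # 1. retrieve
    prompt = build_prompt(query, objects)  # 2. add retrieved context
    return prompt, model(prompt)           # 3. generate


prompt, answer = rag_chat("What does BRCA1 do in DNA repair?", toy_search, toy_model)
print(prompt)
print(answer)
```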

chat(query, conversation=None, limit=10, collection=None, expand=True, **kwargs)

Answer a query, using objects retrieved from the knowledge source as context.

Parameters:

query
conversation
limit
collection
expand
kwargs

Return type:

ChatResponse

conversation_id: Optional[str] = None

relevance_factor: float = 0.5: Relevance factor for diversifying search results using MMR.
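
For readers unfamiliar with MMR (maximal marginal relevance), the sketch below shows the textbook formulation, with relevance_factor playing the role of the usual lambda weight. This is an assumption about how the parameter is interpreted; the function names and the toy vectors are hypothetical and the library's own implementation may differ.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    relevance_factor: float = 0.5,
    top_n: int = 2,
) -> List[int]:
    """Return indices of documents chosen by maximal marginal relevance."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_n:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return relevance_factor * relevance - (1 - relevance_factor) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]  # docs 0 and 1 are near-duplicates

only_relevance = mmr(query, docs, relevance_factor=1.0)
balanced = mmr(query, docs, relevance_factor=0.5)
print(only_relevance, balanced)
```

With a factor of 1.0 the two near-duplicate documents are both returned; lowering the factor trades some relevance for a more diverse second pick.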

class curategpt.agents.chat_agent.ChatAgentAlz(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=0.5, conversation_id=None)

Bases: BaseAgent

An agent that allows chat to a knowledge source.

This implements a standard knowledge base retrieval-augmented generation (RAG) pattern. The knowledge_source is queried for relevant objects (the source can be a local database or a remote source such as PubMed). The objects are then provided as context to an LLM query.

chat(query, conversation=None, limit=10, collection=None, expand=True, **kwargs)

Return type:: ChatResponse

conversation_id: Optional[str] = None

relevance_factor: float = 0.5: Relevance factor for diversifying search results using MMR.

class curategpt.agents.chat_agent.ChatResponse(**data)

Bases: BaseModel

Response from chat engine.

TODO: Rename class to indicate that it is provenance-enabled chat

body: str: Text of response.

formatted_body: str: Body formatted with markdown links to references.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: str: Prompt used to generate response.

references: Optional[Dict[str, Any]]: References for citations detected in response.

uncited_references: Optional[Dict[str, Any]]: Potential references for which there was no detected citation.

curategpt.agents.chat_agent.replace_references_with_links(text): Replace references with links.
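
The exact citation and link formats are not documented here. As a sketch only, assuming numeric citations such as [1] are turned into markdown links to anchors such as #ref-1, the transformation could look like this (link_numeric_citations is a hypothetical name, not the library function):

```python
import re


def link_numeric_citations(text: str) -> str:
    """Rewrite [N] as the markdown link [N](#ref-N); assumed format."""
    return re.sub(r"\[(\d+)\]", lambda m: f"[{m.group(1)}](#ref-{m.group(1)})", text)


linked = link_numeric_citations("BRCA1 acts in DNA repair [1] and [12].")
print(linked)
```

A result of this kind would correspond to the formatted_body field of ChatResponse, while body keeps the unlinked text.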

curategpt.agents.concept_recognition_agent module

Annotation (Concept Recognition) in texts.

class curategpt.agents.concept_recognition_agent.AnnotatedText(**data)

Bases: BaseModel

An input text annotated with concept instances.

annotated_text: Optional[str]: Text with concepts annotated (not all methods produce this).

concepts: Optional[Dict[str, str]]: Dictionary of concepts found in the text. TODO: change to list of spans.

input_text: str: Text that is supplied for annotation.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: Optional[str]: Prompt used to generate the annotated text.

spans: Optional[List[Span]]

summary: Optional[str]: Summary of the results.

class curategpt.agents.concept_recognition_agent.AnnotationMethod(*values)

Bases: str, Enum

Strategy or algorithm used for CR.

CONCEPT_LIST = 'concept_list': LLM creates a list of concepts

INLINE = 'inline': LLM creates an annotated document

TWO_PASS = 'two_pass': LLM annotates a document using NER and then grounds the concepts
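
Because AnnotationMethod derives from both str and Enum, its members compare equal to plain strings and can be built from them. The sketch below redefines the enum locally so it runs on its own; the method_name helper is hypothetical and only shows that the values line up with the annotate_* method names, not how annotate() dispatches internally.

```python
from enum import Enum


class AnnotationMethod(str, Enum):
    # Local copy of the documented members, for illustration only.
    CONCEPT_LIST = "concept_list"
    INLINE = "inline"
    TWO_PASS = "two_pass"


def method_name(method) -> str:
    """Map a member or its string value to the matching annotate_* name."""
    return f"annotate_{AnnotationMethod(method).value}"


print(AnnotationMethod("two_pass") is AnnotationMethod.TWO_PASS)
print(AnnotationMethod.INLINE == "inline")
print(method_name("concept_list"))
```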

class curategpt.agents.concept_recognition_agent.ConceptRecognitionAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, identifier_field=None, label_field=None, split_input_text=None, relevance_factor=0.8, prefixes=None)

Bases: BaseAgent

annotate(text, collection=None, method=AnnotationMethod.INLINE, **kwargs)

Return type:: AnnotatedText

annotate_concept_list(text, collection=None, categories=None, **kwargs)

Return type:: AnnotatedText

annotate_inline(text, collection=None, categories=None, **kwargs)

Return type:: AnnotatedText

annotate_two_pass(text, collection=None, categories=None, **kwargs)

Return type:: AnnotatedText

ground_concept(text, collection=None, categories=None, include_category_in_search=True, context=None, **kwargs)

Return type:: GroundingResult

identifier_field: str = None: Field to use as identifier for objects.

label_field: str = None: Field to use as label for objects.

prefixes: List[str] = None: List of prefixes to use for concept IDs.

relevance_factor: float = 0.8: Relevance factor for diversifying search results using MMR.

split_input_text: bool = None

class curategpt.agents.concept_recognition_agent.GroundingResult(**data)

Bases: BaseModel

Result of grounding text.

input_text: str: Text that is supplied for grounding, assumed to contain a single context.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

score: Optional[float]: Score/confidence, from zero to one.

spans: Optional[List[Span]]: Ordered list of candidate spans.

class curategpt.agents.concept_recognition_agent.Span(**data)

Bases: BaseModel

An individual span of text containing a single concept.

concept_id: str: Concept ID.

concept_label: Optional[str]: Concept label.

end: Optional[int]

is_suspect: Optional[bool]: Potential hallucination due to ID/label mismatch.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

start: Optional[int]

text: str
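
How is_suspect, start, and end get populated is not documented here. One plausible reading, sketched below with a dataclass stand-in, is that a span is flagged when its ID is unknown or its label disagrees with the label stored for that ID. ToySpan, flag_suspect, locate, and the known_labels lookup are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToySpan:
    """Stand-in with the same field names as Span; not the pydantic model."""

    text: str
    concept_id: str
    concept_label: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None
    is_suspect: Optional[bool] = None


def flag_suspect(span: ToySpan, known_labels: Dict[str, str]) -> ToySpan:
    # Suspect if the ID is unknown, or the label does not match the known one.
    expected = known_labels.get(span.concept_id)
    mismatch = (
        span.concept_label is not None
        and expected is not None
        and expected.lower() != span.concept_label.lower()
    )
    span.is_suspect = expected is None or mismatch
    return span


def locate(span: ToySpan, document: str) -> ToySpan:
    # Record character offsets of the first occurrence, if any.
    idx = document.find(span.text)
    if idx >= 0:
        span.start, span.end = idx, idx + len(span.text)
    return span


labels = {"HP:0000923": "Bifid ribs"}
doc = "Patient has bifid ribs."
good = locate(flag_suspect(ToySpan("bifid ribs", "HP:0000923", "bifid ribs"), labels), doc)
bad = flag_suspect(ToySpan("bifid ribs", "HP:0000923", "cleft palate"), labels)
print(good, bad)
```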

curategpt.agents.concept_recognition_agent.parse_annotations(text, marker_char=None)

Parse annotations from text.

>>> text = ("A minimum diagnostic criterion is the combination of either the [skin tumours] or multiple "
...        "[odontogenic keratocysts HP:0010603] of the jaw plus a positive [family history HP:0032316] "
...        "for this disorder, [bifid ribs HP:0000923], lamellar [calcification of falx cerebri HP:0005462] "
...        "or any one of the skeletal abnormalities typical of this syndrome")
>>> for ann in parse_annotations(text):
...    print(ann)
('skin tumours', None)
('odontogenic keratocysts', 'HP:0010603')
('family history', 'HP:0032316')
('bifid ribs', 'HP:0000923')
('calcification of falx cerebri', 'HP:0005462')

For texts with marker characters:

>>> text = "for this disorder, [bifid ribs | HP:0000923], lamellar [calcification of falx cerebri | HP:0005462] "
>>> for ann in parse_annotations(text, "|"):
...    print(ann)
('bifid ribs', 'HP:0000923')
('calcification of falx cerebri', 'HP:0005462')

Parameters:: text
Return type:: List[Tuple[str, str]]
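
To make the bracket format in the doctests concrete, here is a small regex-based sketch that yields the same pairs for those inputs. It is an independent approximation, not the library's implementation; parse_bracketed and the CURIE pattern are assumptions.

```python
import re
from typing import List, Optional, Tuple

# Assumed shape of a concept ID: prefix, colon, local part (e.g. HP:0000923).
CURIE = r"[A-Za-z][\w.]*:\S+"


def parse_bracketed(
    text: str, marker_char: Optional[str] = None
) -> List[Tuple[str, Optional[str]]]:
    results: List[Tuple[str, Optional[str]]] = []
    for inner in re.findall(r"\[([^\[\]]+)\]", text):
        inner = inner.strip()
        if marker_char and marker_char in inner:
            # Explicit separator between the mention and its ID.
            label, concept_id = inner.rsplit(marker_char, 1)
            results.append((label.strip(), concept_id.strip() or None))
            continue
        # Otherwise treat a trailing CURIE-like token as the ID.
        m = re.match(rf"^(.*?)\s+({CURIE})$", inner)
        if m and not marker_char:
            results.append((m.group(1), m.group(2)))
        else:
            results.append((inner, None))
    return results


plain = parse_bracketed("either the [skin tumours] or [bifid ribs HP:0000923]")
marked = parse_bracketed("lamellar [calcification of falx cerebri | HP:0005462]", "|")
print(plain)
print(marked)
```

Note that unannotated mentions come back with None in the second position, as in the first doctest above.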

curategpt.agents.concept_recognition_agent.parse_spans(text, concept_dict=None)

Return type:: List[Span]

curategpt.agents.dase_agent module

Autocomplete objects using RAG.

class curategpt.agents.dase_agent.DatabaseAugmentedStructuredExtraction(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=1000, background_document_limit=3, default_masked_fields=<factory>)

Bases: BaseAgent

Extracts structured objects from unstructured documents.

This implements a standard knowledge base retrieval-augmented generation pattern: the knowledge_source is queried for relevant objects, which are presented as examples to an LLM query via an extractor.

background_document_limit: int = 3: Number of background documents to use. TODO: more sophisticated way to estimate.

conversation: List[Dict[str, Any]] = None

conversation_mode: bool = False

default_masked_fields: List[str] = <factory>

default_target_class: ClassVar[str] = 'Thing'

document_adapter: DBAdapter = None: Adapter to supplementary knowledge in unstructured form.

document_adapter_collection: str = None: Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters

extract(text, target_class=None, feature_fields=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)

Populate a structured object from text.

Parameters:

text
target_class (str)
feature_fields
generate_background
collection (str)
rules (List[str])
fields_to_mask
fields_to_predict
merge
kwargs

Return type:

AnnotatedObject

max_background_document_size: int = 1000: TODO: more sophisticated way to estimate size of background document.

relevance_factor: float = 0.5: Relevance factor for diversifying search results using MMR.

class curategpt.agents.dase_agent.PredictedFieldValue(**data)

Bases: BaseModel

current_value: Optional[str]

field_predicted: Optional[str]

id: str

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

original_id: Optional[str]

predicted_value: Optional[str]

curategpt.agents.dragon_agent module

Retrieval Augmented Generation (RAG) Base Class.

class curategpt.agents.dragon_agent.DragonAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=2000, background_document_limit=3, default_masked_fields=<factory>)

Bases: BaseAgent

Retrieves objects in response to a query using a structured knowledge source.

(essentially a structured object autocomplete)

This implements a standard knowledge base retrieval-augmented generation pattern: the knowledge_source is queried for relevant objects, which are presented as examples to an LLM query via an extractor.

background_document_limit: int = 3: Number of background documents to use. TODO: more sophisticated way to estimate.

complete(seed, target_class=None, context_property=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)

Populate structured object from partially populated object.

If a string is passed, then an object of form {context_property: seed} is used.

Parameters:

seed (Union[str, Dict[str, Any]])
target_class (str)
context_property (str)
generate_background
collection (str)
rules (List[str]) – these are included in the prompt
kwargs

Return type:

AnnotatedObject
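
The rule that a string seed becomes an object of the form {context_property: seed} can be sketched as follows. normalize_seed is a hypothetical helper, and the fallback of "name" for context_property is an assumption made for this example (complete() itself documents a default of None).

```python
from typing import Any, Dict, Union


def normalize_seed(
    seed: Union[str, Dict[str, Any]], context_property: str = "name"
) -> Dict[str, Any]:
    """Wrap a string seed in a one-field object; copy dict seeds unchanged."""
    if isinstance(seed, str):
        return {context_property: seed}
    return dict(seed)


print(normalize_seed("Marfan syndrome"))
print(normalize_seed("Marfan syndrome", context_property="label"))
print(normalize_seed({"id": "X:1", "name": "Marfan syndrome"}))
```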

conversation: List[Dict[str, Any]] = None

conversation_mode: bool = False

default_masked_fields: List[str] = <factory>

default_target_class: ClassVar[str] = 'Thing'

document_adapter: DBAdapter = None: Adapter to supplementary knowledge in unstructured form.

document_adapter_collection: str = None: Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters

generate_all(collection, field_to_predict, missing_only=True, object_ids=None, **kwargs)

Generate missing value for a field for all objects in a collection.

Parameters:

collection (str)
field_to_predict (str)
missing_only
object_ids (Optional[Iterable[str]])
kwargs

Return type:

Iterable[Tuple[str, str, Any, Any]]
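
The meaning of the four tuple positions is not spelled out above. The sketch below assumes (object id, field name, current value, predicted value) and shows how missing_only and object_ids might filter the work. generate_all_sketch, the predict callback, and the id_field default are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Tuple


def generate_all_sketch(
    objects: Iterable[Dict[str, Any]],
    field_to_predict: str,
    predict: Callable[[Dict[str, Any]], Any],
    missing_only: bool = True,
    object_ids: Optional[Iterable[str]] = None,
    id_field: str = "id",
) -> Iterator[Tuple[str, str, Any, Any]]:
    wanted = set(object_ids) if object_ids is not None else None
    for obj in objects:
        oid = obj[id_field]
        if wanted is not None and oid not in wanted:
            continue  # restrict to the requested objects
        current = obj.get(field_to_predict)
        if missing_only and current:
            continue  # field already populated; nothing to predict
        yield oid, field_to_predict, current, predict(obj)


kb = [
    {"id": "A", "name": "alpha", "definition": "already defined"},
    {"id": "B", "name": "beta"},
]


def fake_predict(obj: Dict[str, Any]) -> str:
    # Stand-in for an LLM-backed completion.
    return f"Definition of {obj['name']}"


missing = list(generate_all_sketch(kb, "definition", fake_predict))
everything = list(generate_all_sketch(kb, "definition", fake_predict, missing_only=False))
print(missing)
print(everything)
```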

generate_queries(context_property='name', n=5, **kwargs)

Return type:: List[str]

max_background_document_size: int = 2000: TODO: more sophisticated way to estimate size of background document.

relevance_factor: float = 0.5: Relevance factor for diversifying search results using MMR.

review(obj, context_property=None, rules=None, collection=None, fields_to_predict=None, primary_key=None, **kwargs)

Review an object for correctness, completeness, and consistency.

Parameters:: obj (dict)
Return type:: AnnotatedObject

class curategpt.agents.dragon_agent.PredictedFieldValue(**data)

Bases: BaseModel

current_value: Optional[str]

field_predicted: Optional[str]

id: str

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

original_id: Optional[str]

predicted_value: Optional[str]

curategpt.agents.evidence_agent module

class curategpt.agents.evidence_agent.EvidenceAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, chat_agent=None, evidence_update_policy=EvidenceUpdatePolicyEnum.skip)

Bases: BaseAgent

An agent to find evidence for an object by querying a reference source.

The evidence agent is able to find (supporting and refuting) evidence for any of the following:

A simple statement in natural language
A simple structured dictionary object of key-value pairs
A complex structured dictionary object with nested key-value pairs

The default source is PubMed, accessed through PubmedWrapper via ChatAgent.

chat_agent: Union[ChatAgent, BaseWrapper] = None

evidence_update_policy: EvidenceUpdatePolicyEnum = 'skip'

find_evidence(obj)

Return type:: ChatResponse

find_evidence_complex(obj, label_field=None, statement_fields=None)

Return type:: Dict

find_evidence_simple(query, limit=10, **kwargs)

Return type:: Optional[List[Dict]]

class curategpt.agents.evidence_agent.EvidenceUpdatePolicyEnum(*values)

Bases: str, Enum

append = 'append'

replace = 'replace'

skip = 'skip'
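
The three policy values are not described further. A natural reading, assumed here, is that they control what happens when an object already carries evidence: leave it alone, add to it, or overwrite it. The sketch below uses a local copy of the enum; apply_policy and the "evidence" field name are hypothetical.

```python
from enum import Enum
from typing import Any, Dict, List


class EvidenceUpdatePolicy(str, Enum):
    # Local copy of the documented values, for illustration only.
    append = "append"
    replace = "replace"
    skip = "skip"


def apply_policy(
    obj: Dict[str, Any],
    new_evidence: List[str],
    policy: EvidenceUpdatePolicy,
    field: str = "evidence",
) -> Dict[str, Any]:
    updated = dict(obj)
    existing = obj.get(field)
    if not existing:
        updated[field] = list(new_evidence)  # nothing to preserve
    elif policy is EvidenceUpdatePolicy.append:
        updated[field] = list(existing) + list(new_evidence)
    elif policy is EvidenceUpdatePolicy.replace:
        updated[field] = list(new_evidence)
    # skip: keep the existing evidence untouched
    return updated


record = {"id": "X:1", "evidence": ["PMID:1"]}
for policy in EvidenceUpdatePolicy:
    print(policy.value, apply_policy(record, ["PMID:2"], policy)["evidence"])
```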

curategpt.agents.huggingface_agent module

class curategpt.agents.huggingface_agent.HuggingFaceAgent(api=None)

Bases: object

api: HfApi = None

cached_download(repo_id, repo_type, filename)

upload(objects, metadata, repo_id, private=False, **kwargs)

Upload an entire collection to a Hugging Face repository.

Parameters:

objects – The objects to upload.
metadata – The metadata associated with the collection.
repo_id – The repository ID on Hugging Face.
private – Whether the repository should be private.
kwargs – Additional arguments such as batch size or metadata options.

upload_duckdb(objects, metadata, repo_id, private=False, **kwargs)

Upload an entire collection to a Hugging Face repository.

Parameters:

objects – The objects to upload.
metadata – The metadata associated with the collection.
repo_id – The repository ID on Hugging Face.
private – Whether the repository should be private.
kwargs – Additional arguments such as batch size or metadata options.

curategpt.agents.mapping_agent module

An agent to map/align entities.

class curategpt.agents.mapping_agent.Mapping(**data)

Bases: BaseModel

A single mapping between a subject and an object, optionally qualified by a predicate.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

object_id: str

predicate_id: Optional[MappingPredicate]

subject_id: str

class curategpt.agents.mapping_agent.MappingAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=1.0)

Bases: BaseAgent

An agent to map/align entities.

categorize_mappings(query, kb_results, **kwargs)

Categorize the predicate of each candidate mapping.

Parameters:

query (Union[str, Dict[str, Any]])
kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]])

Return type:

Iterator[Mapping]

find_links(other_collection)

Find links between elements in this collection and another collection

Parameters:: other_collection (str)
Return type:: Iterator[Tuple[str, str, str]]

match(query, limit=None, randomize_order=False, include_predicates=False, fields=None, id_field='id', **kwargs)

Match entities

Parameters:

query (Union[str, Dict[str, Any]])
limit (int)
randomize_order (bool) – randomize the order in which candidates are presented (mostly for testing purposes)
kwargs

Return type:

MappingSet

relevance_factor: float = 1.0: Relevance factor for diversifying search results using MMR. A high value is recommended for this task.

class curategpt.agents.mapping_agent.MappingPredicate(*values)

Bases: str, Enum

BROAD_MATCH = 'BROAD_MATCH'

CLOSE_MATCH = 'CLOSE_MATCH'

DIFFERENT_FROM = 'DIFFERENT_FROM'

NARROW_MATCH = 'NARROW_MATCH'

RELATED_MATCH = 'RELATED_MATCH'

SAME_AS = 'SAME_AS'

UNKNOWN = 'UNKNOWN'
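
When predicates arrive as free text (for example from a model response), they need to be coerced into one of the members above. The sketch below uses a local copy of the enum and a hypothetical to_predicate helper that falls back to UNKNOWN; how MappingAgent itself parses responses is not documented here.

```python
from enum import Enum


class MappingPredicate(str, Enum):
    # Local copy of the documented members, for illustration only.
    BROAD_MATCH = "BROAD_MATCH"
    CLOSE_MATCH = "CLOSE_MATCH"
    DIFFERENT_FROM = "DIFFERENT_FROM"
    NARROW_MATCH = "NARROW_MATCH"
    RELATED_MATCH = "RELATED_MATCH"
    SAME_AS = "SAME_AS"
    UNKNOWN = "UNKNOWN"


def to_predicate(raw: str) -> MappingPredicate:
    """Normalize case and spacing, then look up the member by value."""
    key = raw.strip().upper().replace(" ", "_")
    try:
        return MappingPredicate(key)
    except ValueError:
        return MappingPredicate.UNKNOWN


print(to_predicate("close match"))
print(to_predicate(" same_as "))
print(to_predicate("exactMatch"))
```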

class curategpt.agents.mapping_agent.MappingSet(**data)

Bases: BaseModel

mappings: List[Mapping]

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

prompt: str

response_text: str

curategpt.agents.summarization_agent module

Agent to summarize entities

class curategpt.agents.summarization_agent.SummarizationAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)

Bases: BaseAgent

An agent to summarize entities

AKA SPINDOCTOR/TALISMAN

summarize(object_ids, description_field, name_field, strict=False, system_prompt=None)

Summarize a list of objects.

Example:

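>>> # The import paths for BasicExtractor and get_wrapper are assumed, not taken from this page.
>>> from curategpt.extract import BasicExtractor
>>> from curategpt.wrappers import get_wrapper
>>> from curategpt.agents.summarization_agent import SummarizationAgent
>>> extractor = BasicExtractor()
>>> wrapper = get_wrapper("alliance_gene")
>>> agent = SummarizationAgent(wrapper, extractor=extractor)
>>> gene_ids = ["HGNC:9221", "HGNC:11195", "HGNC:6348", "HGNC:7553"]
>>> response = agent.summarize(
...               gene_ids,
...               name_field="symbol",
...               description_field="automatedGeneSynopsis",
...               system_prompt="What function do these genes have in common?",
...           )
>>> print(response)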

Parameters:

object_ids (List[str])
description_field (str)
name_field (str)
strict (bool)
system_prompt (str)

Module contents

CurateGPT Agents.

These chain together different search and generate components.

class curategpt.agents.ChatAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=0.5, conversation_id=None)

Bases: BaseAgent

An agent that allows chat to a knowledge source.

This implements a standard knowledge base retrieval-augmented generation (RAG) pattern. The knowledge_source is queried for relevant objects (the source can be a local database or a remote source such as PubMed). The objects are then provided as context to an LLM query.

chat(query, conversation=None, limit=10, collection=None, expand=True, **kwargs)

Answer a query, using objects retrieved from the knowledge source as context.

Parameters:

query
conversation
limit
collection
expand
kwargs

Return type:

ChatResponse

conversation_id: Optional[str] = None

relevance_factor: float = 0.5: Relevance factor for diversifying search results using MMR.

class curategpt.agents.DragonAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=2000, background_document_limit=3, default_masked_fields=<factory>)

Bases: BaseAgent

Retrieves objects in response to a query using a structured knowledge source.

(essentially a structured object autocomplete)

This implements a standard knowledge base retrieval-augmented generation pattern: the knowledge_source is queried for relevant objects, which are presented as examples to an LLM query via an extractor.

background_document_limit: int = 3: Number of background documents to use. TODO: more sophisticated way to estimate.

complete(seed, target_class=None, context_property=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)

Populate structured object from partially populated object.

If a string is passed, then an object of form {context_property: seed} is used.

Parameters:

seed (Union[str, Dict[str, Any]])
target_class (str)
context_property (str)
generate_background
collection (str)
rules (List[str]) – these are included in the prompt
kwargs

Return type:

AnnotatedObject

conversation: List[Dict[str, Any]] = None

conversation_mode: bool = False

default_masked_fields: List[str] = <factory>

default_target_class: ClassVar[str] = 'Thing'

document_adapter: DBAdapter = None: Adapter to supplementary knowledge in unstructured form.

document_adapter_collection: str = None: Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters

generate_all(collection, field_to_predict, missing_only=True, object_ids=None, **kwargs)

Generate missing value for a field for all objects in a collection.

Parameters:

collection (str)
field_to_predict (str)
missing_only
object_ids (Optional[Iterable[str]])
kwargs

Return type:

Iterable[Tuple[str, str, Any, Any]]

generate_queries(context_property='name', n=5, **kwargs)

Return type:: List[str]

max_background_document_size: int = 2000: TODO: more sophisticated way to estimate size of background document.

relevance_factor: float = 0.5: Relevance factor for diversifying search results using MMR.

review(obj, context_property=None, rules=None, collection=None, fields_to_predict=None, primary_key=None, **kwargs)

Review an object for correctness, completeness, and consistency.

Parameters:: obj (dict)
Return type:: AnnotatedObject

class curategpt.agents.EvidenceAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, chat_agent=None, evidence_update_policy=EvidenceUpdatePolicyEnum.skip)

Bases: BaseAgent

An agent to find evidence for an object by querying a reference source.

The evidence agent is able to find (supporting and refuting) evidence for any of the following:

A simple statement in natural language
A simple structured dictionary object of key-value pairs
A complex structured dictionary object with nested key-value pairs

The default source is PubMed, accessed through PubmedWrapper via ChatAgent.

chat_agent: Union[ChatAgent, BaseWrapper] = None

evidence_update_policy: EvidenceUpdatePolicyEnum = 'skip'

find_evidence(obj)

Return type:: ChatResponse

find_evidence_complex(obj, label_field=None, statement_fields=None)

Return type:: Dict

find_evidence_simple(query, limit=10, **kwargs)

Return type:: Optional[List[Dict]]

class curategpt.agents.MappingAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=1.0)

Bases: BaseAgent

An agent to map/align entities.

categorize_mappings(query, kb_results, **kwargs)

Categorize the predicate of each candidate mapping.

Parameters:

query (Union[str, Dict[str, Any]])
kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]])

Return type:

Iterator[Mapping]

find_links(other_collection)

Find links between elements in this collection and another collection

Parameters:: other_collection (str)
Return type:: Iterator[Tuple[str, str, str]]

match(query, limit=None, randomize_order=False, include_predicates=False, fields=None, id_field='id', **kwargs)

Match entities

Parameters:

query (Union[str, Dict[str, Any]])
limit (int)
randomize_order (bool) – randomize the order in which candidates are presented (mostly for testing purposes)
kwargs

Return type:

MappingSet

relevance_factor: float = 1.0: Relevance factor for diversifying search results using MMR. A high value is recommended for this task.