curategpt.agents package
Submodules
curategpt.agents.agent_utils module
Agent Utilities.
- curategpt.agents.agent_utils.select_from_options_prompt(kb_results, model, obj_type='Reference', query=None, prompt_template=None, id_field=None)
Prompt user to select from a list of options.
- Parameters:
kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]) – search results, ordered from most relevant
model (Model)
obj_type (str)
query (Optional[str])
prompt_template (Optional[str])
- Return type:
Tuple[str, Dict[str, str], Dict]
- Returns:
curategpt.agents.base_agent module
Base Agent.
- class curategpt.agents.base_agent.BaseAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)
Bases:
ABC
Base class for agents.
An agent is capable of composing together different actions to achieve a goal.
An agent typically has a knowledge source that it uses to search for information. An agent also has access to a model through an extractor.
- extractor: Extractor = None
Engine performing LLM operations, including extracting from prompt responses
- knowledge_source: Union[DBAdapter, BaseWrapper] = None
A searchable source of information
- knowledge_source_collection: str = None
- search()
curategpt.agents.bootstrap_agent module
- class curategpt.agents.bootstrap_agent.BootstrapAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)
Bases:
BaseAgent
- bootstrap_data(specification=None, schema=None)
Bootstrap data for a knowledge base.
- Parameters:
specification (Optional[KnowledgeBaseSpecification]) – Specification for the knowledge base.
schema (Optional[Dict]) – Schema for the knowledge base.
- Return type:
str
- Returns:
- bootstrap_schema(specification)
Bootstrap a schema for a knowledge base.
- Parameters:
specification (KnowledgeBaseSpecification) – Specification for the knowledge base.
- Return type:
- Returns:
- class curategpt.agents.bootstrap_agent.KnowledgeBaseSpecification(**data)
Bases:
BaseModel
- attributes: str
- description: str
- kb_name: str
- main_class: str
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
curategpt.agents.chat_agent module
Chat with a KB.
- class curategpt.agents.chat_agent.ChatAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=0.5, conversation_id=None)
Bases:
BaseAgent
An agent that allows chat to a knowledge source.
This implements a standard knowledge base retrieval-augmented generation (RAG) pattern. The knowledge_source is queried for relevant objects (the source can be a local database or a remote source such as PubMed). The objects are then provided as context to an LLM query.
- chat(query, conversation=None, limit=10, collection=None, expand=True, **kwargs)
Extract structured object using text seed and background knowledge.
- Parameters:
text
kwargs
- Return type:
- Returns:
- conversation_id: Optional[str] = None
- relevance_factor: float = 0.5
Relevance factor for diversifying search results using MMR.
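The relevance_factor attribute above refers to maximal marginal relevance (MMR). As a rough illustration of the general technique, and not of CurateGPT's own implementation, the sketch below treats relevance_factor as the usual MMR weight: 1.0 ranks purely by similarity to the query, while lower values penalize candidates that resemble results already chosen. The function names (cosine, mmr_select) and the toy vectors are assumptions made for this example.

```python
from math import sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    relevance_factor: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen greedily by MMR.

    Hypothetical helper: relevance_factor is assumed to play the role of
    the lambda weight in the standard MMR formula.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Similarity to the closest already-selected result (0 if none yet).
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return relevance_factor * relevance - (1 - relevance_factor) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.5]]
print(mmr_select(query, docs, k=2, relevance_factor=1.0))
print(mmr_select(query, docs, k=2, relevance_factor=0.5))
```

With the weight at 1.0 the two near-duplicate vectors are both returned; at 0.5 the second pick shifts to the more distinct third vector.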
- class curategpt.agents.chat_agent.ChatResponse(**data)
Bases:
BaseModel
Response from chat engine.
TODO: Rename class to indicate that it is provenance-enabled chat
- body: str
Text of response.
- formatted_body: str
Body formatted with markdown links to references.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- prompt: str
Prompt used to generate response.
- references: Optional[Dict[str, Any]]
References for citations detected in response.
- uncited_references: Optional[Dict[str, Any]]
Potential references for which there was no detected citation.
- curategpt.agents.chat_agent.replace_references_with_links(text)
Replace references with links.
curategpt.agents.concept_recognition_agent module
Annotation (Concept Recognition) in texts.
- class curategpt.agents.concept_recognition_agent.AnnotatedText(**data)
Bases:
BaseModel
An input text annotated with concept instances.
- annotated_text: Optional[str]
Text with concepts annotated (not all methods produce this).
- concepts: Optional[Dict[str, str]]
Dictionary of concepts found in the text. TODO: change to list of spans.
- input_text: str
Text that is supplied for annotation.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- prompt: Optional[str]
Prompt used to generate the annotated text.
- summary: Optional[str]
Summary of the results.
- class curategpt.agents.concept_recognition_agent.AnnotationMethod(value)
Bases:
str, Enum
Strategy or algorithm used for concept recognition (CR).
- CONCEPT_LIST = 'concept_list'
LLM creates a list of concepts
- INLINE = 'inline'
LLM creates an annotated document
- TWO_PASS = 'two_pass'
LLM annotates a document using NER and then grounds the concepts
- class curategpt.agents.concept_recognition_agent.ConceptRecognitionAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, identifier_field=None, label_field=None, split_input_text=None, relevance_factor=0.8, prefixes=None)
Bases:
BaseAgent
- annotate(text, collection=None, method=AnnotationMethod.INLINE, **kwargs)
- Return type:
- annotate_concept_list(text, collection=None, categories=None, **kwargs)
- Return type:
- annotate_inline(text, collection=None, categories=None, **kwargs)
- Return type:
- annotate_two_pass(text, collection=None, categories=None, **kwargs)
- Return type:
- ground_concept(text, collection=None, categories=None, include_category_in_search=True, context=None, **kwargs)
- Return type:
- identifier_field: str = None
Field to use as identifier for objects.
- label_field: str = None
Field to use as label for objects.
- prefixes: List[str] = None
List of prefixes to use for concept IDs.
- relevance_factor: float = 0.8
Relevance factor for diversifying search results using MMR.
- split_input_text: bool = None
- class curategpt.agents.concept_recognition_agent.GroundingResult(**data)
Bases:
BaseModel
Result of grounding text.
- input_text: str
Text that is supplied for grounding, assumed to contain a single context.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- score: Optional[float]
Score/confidence, from zero to one.
- class curategpt.agents.concept_recognition_agent.Span(**data)
Bases:
BaseModel
An individual span of text containing a single concept.
- concept_id: str
Concept ID.
- concept_label: Optional[str]
Concept label.
- end: Optional[int]
- is_suspect: Optional[bool]
Potential hallucination due to ID/label mismatch.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- start: Optional[int]
- text: str
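The is_suspect field on Span is described as flagging a potential hallucination due to an ID/label mismatch. One way such a check could work is sketched below, under loud assumptions: SpanSketch is a stand-in dataclass (not the pydantic model), flag_suspect and the known_labels lookup table are hypothetical, and the "EX:" identifiers are made up for illustration. How the library actually sets this flag is not stated in this reference.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class SpanSketch:
    """Stand-in for Span, carrying only the fields needed for the check."""

    text: str
    concept_id: str
    concept_label: Optional[str] = None
    is_suspect: Optional[bool] = None


def flag_suspect(span: SpanSketch, known_labels: Dict[str, str]) -> SpanSketch:
    """Return a copy of span with is_suspect set from an ID-to-label lookup.

    Assumed policy: an unknown ID is suspect; a known ID with no label
    supplied is not; otherwise labels are compared case-insensitively.
    """
    expected = known_labels.get(span.concept_id)
    if expected is None:
        suspect = True
    elif span.concept_label is None:
        suspect = False
    else:
        suspect = expected.casefold() != span.concept_label.casefold()
    return replace(span, is_suspect=suspect)


# Made-up identifiers and labels, for illustration only.
labels = {"EX:001": "Example concept"}
ok = flag_suspect(SpanSketch("example concept", "EX:001", "example concept"), labels)
bad = flag_suspect(SpanSketch("example concept", "EX:001", "Other concept"), labels)
print(ok.is_suspect, bad.is_suspect)
```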
- curategpt.agents.concept_recognition_agent.parse_annotations(text, marker_char=None)
Parse annotations from text.
>>> from curategpt.agents.concept_recognition_agent import parse_annotations
>>> text = ("A minimum diagnostic criterion is the combination of either the [skin tumours] or multiple "
...         "[odontogenic keratocysts HP:0010603] of the jaw plus a positive [family history HP:0032316] "
...         "for this disorder, [bifid ribs HP:0000923], lamellar [calcification of falx cerebri HP:0005462] "
...         "or any one of the skeletal abnormalities typical of this syndrome")
>>> for ann in parse_annotations(text):
...     print(ann)
('skin tumours', None)
('odontogenic keratocysts', 'HP:0010603')
('family history', 'HP:0032316')
('bifid ribs', 'HP:0000923')
('calcification of falx cerebri', 'HP:0005462')
For texts with marker characters:
>>> text = "for this disorder, [bifid ribs | HP:0000923], lamellar [calcification of falx cerebri | HP:0005462] "
>>> for ann in parse_annotations(text, "|"):
...     print(ann)
('bifid ribs', 'HP:0000923')
('calcification of falx cerebri', 'HP:0005462')
- Parameters:
text
- Return type:
List[Tuple[str, str]]
- Returns:
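To make the bracket format in the examples above concrete, here is a minimal standalone sketch that reproduces the documented outputs with a regular expression. It is not the library's implementation: parse_annotations_sketch, the CURIE pattern, and the handling of spans that lack a marker are all assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Assumed shape of a concept identifier: prefix, colon, local part (e.g. HP:0010603).
CURIE = re.compile(r"^[A-Za-z][\w.]*:\S+$")
# A bracketed span with no nested brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def parse_annotations_sketch(
    text: str, marker_char: Optional[str] = None
) -> List[Tuple[str, Optional[str]]]:
    """Return (label, concept_id) pairs for each [bracketed] span in text."""
    results: List[Tuple[str, Optional[str]]] = []
    for inner in BRACKETED.findall(text):
        if marker_char:
            # Split on the last marker; spans without one are kept unannotated.
            label, sep, concept_id = inner.rpartition(marker_char)
            if not sep:
                label, concept_id = inner, ""
        else:
            # Treat the final whitespace-separated token as the ID if it looks like one.
            label, _, concept_id = inner.rpartition(" ")
            if not CURIE.match(concept_id):
                label, concept_id = inner, ""
        results.append((label.strip(), concept_id.strip() or None))
    return results


sample = "either the [skin tumours] or multiple [odontogenic keratocysts HP:0010603]"
for ann in parse_annotations_sketch(sample):
    print(ann)
```

Splitting on the last separator keeps multi-word labels intact, which is why rpartition is used rather than a plain split.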
curategpt.agents.dase_agent module
Autocomplete objects using RAG.
- class curategpt.agents.dase_agent.DatabaseAugmentedStructuredExtraction(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=1000, background_document_limit=3, default_masked_fields=<factory>)
Bases:
BaseAgent
Extracts structured objects from unstructured documents.
This implements a standard knowledge base retrieval-augmented generation (RAG) pattern: the knowledge_source is queried for relevant objects, and these are presented as examples to an LLM query via an extractor.
- background_document_limit: int = 3
Number of background documents to use. TODO: more sophisticated way to estimate.
- conversation: List[Dict[str, Any]] = None
- conversation_mode: bool = False
- default_masked_fields: List[str]
- default_target_class: ClassVar[str] = 'Thing'
- document_adapter_collection: str = None
Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters
- extract(text, target_class=None, feature_fields=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)
Populate structured object from text
- Parameters:
seed
target_class (Optional[str])
context_property
generate_background
collection (Optional[str])
rules (Optional[List[str]])
kwargs
- Return type:
- Returns:
- max_background_document_size: int = 1000
TODO: more sophisticated way to estimate size of background document.
- relevance_factor: float = 0.5
Relevance factor for diversifying search results using MMR.
- class curategpt.agents.dase_agent.PredictedFieldValue(**data)
Bases:
BaseModel
- current_value: Optional[str]
- field_predicted: Optional[str]
- id: str
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- original_id: Optional[str]
- predicted_value: Optional[str]
curategpt.agents.dragon_agent module
Retrieval Augmented Generation (RAG) Base Class.
- class curategpt.agents.dragon_agent.DragonAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, document_adapter=None, document_adapter_collection=None, conversation=None, conversation_mode=False, relevance_factor=0.5, max_background_document_size=2000, background_document_limit=3, default_masked_fields=<factory>)
Bases:
BaseAgent
Retrieves objects in response to a query using a structured knowledge source.
(essentially a structured object autocomplete)
This implements a standard knowledge base retrieval-augmented generation (RAG) pattern: the knowledge_source is queried for relevant objects, and these are presented as examples to an LLM query via an extractor.
- background_document_limit: int = 3
Number of background documents to use. TODO: more sophisticated way to estimate.
- complete(seed, target_class=None, context_property=None, generate_background=False, collection=None, rules=None, fields_to_mask=None, fields_to_predict=None, merge=True, **kwargs)
Populate structured object from partially populated object.
If a string is passed, then an object of form {context_property: seed} is used.
- Parameters:
seed (Union[str, Dict[str, Any]])
target_class (Optional[str])
context_property (Optional[str])
generate_background
collection (Optional[str])
rules (Optional[List[str]]) – these are included in the prompt
kwargs
- Return type:
- Returns:
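The string-versus-dictionary handling of seed described for complete() can be pictured with the small sketch below. It only illustrates the documented rule that a string seed becomes {context_property: seed}; normalize_seed is a hypothetical helper, and what the library does when context_property is left as None is not covered here.

```python
from typing import Any, Dict, Union


def normalize_seed(
    seed: Union[str, Dict[str, Any]], context_property: str
) -> Dict[str, Any]:
    """Turn a string seed into a one-field object; copy a dict seed as-is.

    Hypothetical helper mirroring the rule stated for complete():
    a string is wrapped as {context_property: seed}.
    """
    if isinstance(seed, str):
        return {context_property: seed}
    # Copy so that later edits to the result do not mutate the caller's dict.
    return dict(seed)


print(normalize_seed("example term", "name"))
print(normalize_seed({"name": "example term", "category": "demo"}, "name"))
```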
- conversation: List[Dict[str, Any]] = None
- conversation_mode: bool = False
- default_masked_fields: List[str]
- default_target_class: ClassVar[str] = 'Thing'
- document_adapter_collection: str = None
Collection to use for document adapter. NOTE: may be deprecated as now collections can be bound to adapters
- generate_all(collection, field_to_predict, missing_only=True, object_ids=None, **kwargs)
Generate missing value for a field for all objects in a collection.
- Parameters:
collection (str)
field_to_predict (str)
missing_only
object_ids (Optional[Iterable[str]])
kwargs
- Return type:
Iterable[Tuple[str, str, Any, Any]]
- Returns:
- generate_queries(context_property='name', n=5, **kwargs)
- Return type:
List[str]
- max_background_document_size: int = 2000
TODO: more sophisticated way to estimate size of background document.
- relevance_factor: float = 0.5
Relevance factor for diversifying search results using MMR.
- review(obj, context_property=None, rules=None, collection=None, fields_to_predict=None, primary_key=None, **kwargs)
Review an object for correctness, completeness, and consistency.
- Parameters:
obj (dict)
- Return type:
- class curategpt.agents.dragon_agent.PredictedFieldValue(**data)
Bases:
BaseModel
- current_value: Optional[str]
- field_predicted: Optional[str]
- id: str
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- original_id: Optional[str]
- predicted_value: Optional[str]
curategpt.agents.evidence_agent module
- class curategpt.agents.evidence_agent.EvidenceAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, chat_agent=None, evidence_update_policy=EvidenceUpdatePolicyEnum.skip)
Bases:
BaseAgent
An agent to find evidence for an object by querying a reference source.
The evidence agent is able to find (supporting and refuting) evidence for any of the following:
A simple statement in natural language
A simple structured dictionary object of key-value pairs
A complex structured dictionary object with nested key-value pairs
The default source used is PubMed, using PubmedWrapper via ChatAgent.
- chat_agent: Union[ChatAgent, BaseWrapper] = None
- evidence_update_policy: EvidenceUpdatePolicyEnum = 'skip'
- find_evidence(obj)
- Return type:
- find_evidence_complex(obj, label_field=None, statement_fields=None)
- Return type:
Dict
- find_evidence_simple(query, limit=10, **kwargs)
- Return type:
Optional[List[Dict]]
curategpt.agents.huggingface_agent module
- class curategpt.agents.huggingface_agent.HuggingFaceAgent(api=None, **_kwargs)
Bases:
object
- api: HfApi = None
- cached_download(repo_id, repo_type, filename)
- upload(objects, metadata, repo_id, private=False, **kwargs)
Upload an entire collection to a Hugging Face repository.
- Parameters:
objects – The objects to upload.
metadata – The metadata associated with the collection.
repo_id – The repository ID on Hugging Face.
private – Whether the repository should be private.
kwargs – Additional arguments such as batch size or metadata options.
- upload_duckdb(objects, metadata, repo_id, private=False, **kwargs)
Upload an entire collection to a Hugging Face repository.
- Parameters:
objects – The objects to upload.
metadata – The metadata associated with the collection.
repo_id – The repository ID on Hugging Face.
private – Whether the repository should be private.
kwargs – Additional arguments such as batch size or metadata options.
curategpt.agents.mapping_agent module
Chat with a KB.
- class curategpt.agents.mapping_agent.Mapping(**data)
Bases:
BaseModel
Response from chat engine.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- object_id: str
- predicate_id: Optional[MappingPredicate]
- subject_id: str
- class curategpt.agents.mapping_agent.MappingAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None, relevance_factor=1.0)
Bases:
BaseAgent
An agent to map/align entities.
- categorize_mappings(query, kb_results, **kwargs)
Categorize mappings predicate
- Parameters:
query (Union[str, Dict[str, Any]])
kb_results (List[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]])
- Return type:
Iterator[Mapping]
- Returns:
- find_links(other_collection)
Find links between elements in this collection and another collection
- Parameters:
other_collection (str)
- Return type:
Iterator[Tuple[str, str, str]]
- Returns:
- match(query, limit=None, randomize_order=False, include_predicates=False, fields=None, id_field='id', **kwargs)
Match entities
- Parameters:
query (Union[str, Dict[str, Any]])
limit (Optional[int])
randomize_order (bool) – randomize the order in which candidates are presented (mostly for testing purposes)
kwargs
- Return type:
- Returns:
- relevance_factor: float = 1.0
Relevance factor for diversifying search results using MMR. A high value is recommended for this task.
- class curategpt.agents.mapping_agent.MappingPredicate(value)
Bases:
str, Enum
An enumeration.
- BROAD_MATCH = 'BROAD_MATCH'
- CLOSE_MATCH = 'CLOSE_MATCH'
- DIFFERENT_FROM = 'DIFFERENT_FROM'
- NARROW_MATCH = 'NARROW_MATCH'
- RELATED_MATCH = 'RELATED_MATCH'
- SAME_AS = 'SAME_AS'
- UNKNOWN = 'UNKNOWN'
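Because MappingPredicate derives from both str and Enum, its members compare equal to their plain string values and can be looked up by value. The standalone sketch below mirrors the members listed above in a local class (MappingPredicateSketch) so that it runs without the library. The parse_predicate helper, including its fallback to UNKNOWN for unrecognized text, is an assumption for illustration and not a documented function.

```python
from enum import Enum


class MappingPredicateSketch(str, Enum):
    """Local mirror of the MappingPredicate members listed above."""

    BROAD_MATCH = "BROAD_MATCH"
    CLOSE_MATCH = "CLOSE_MATCH"
    DIFFERENT_FROM = "DIFFERENT_FROM"
    NARROW_MATCH = "NARROW_MATCH"
    RELATED_MATCH = "RELATED_MATCH"
    SAME_AS = "SAME_AS"
    UNKNOWN = "UNKNOWN"


def parse_predicate(raw: str) -> MappingPredicateSketch:
    """Map free text to a predicate, falling back to UNKNOWN (assumed policy)."""
    key = raw.strip().upper().replace(" ", "_")
    try:
        # Enum lookup by value raises ValueError for unrecognized strings.
        return MappingPredicateSketch(key)
    except ValueError:
        return MappingPredicateSketch.UNKNOWN


# str mixin: members compare equal to their string values.
print(MappingPredicateSketch.SAME_AS == "SAME_AS")
print(parse_predicate(" close match ").name)
print(parse_predicate("exact").name)
```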
curategpt.agents.summarization_agent module
- class curategpt.agents.summarization_agent.SummarizationAgent(knowledge_source=None, knowledge_source_collection=None, extractor=None)
Bases:
BaseAgent
An agent to summarize entities
AKA SPINDOCTOR/TALISMAN
- summarize(object_ids, description_field, name_field, strict=False, system_prompt=None)
Summarize a list of objects.
Example:
>>> extractor = BasicExtractor()
>>> wrapper = get_wrapper("alliance_gene")
>>> agent = SummarizationAgent(wrapper, extractor=extractor)
>>> gene_ids = ["HGNC:9221", "HGNC:11195", "HGNC:6348", "HGNC:7553"]
>>> response = agent.summarize(
...     gene_ids,
...     name_field="symbol",
...     description_field="automatedGeneSynopsis",
...     system_prompt="What function do these genes have in common?",
... )
>>> print(response)
- Parameters:
object_ids (List[str])
description_field (str)
name_field (str)
strict (bool)
system_prompt (Optional[str])
- Returns:
Module contents
CurateGPT Agents.
These chain together different search and generate components.
The package namespace re-exports the following classes. Each has the same signature, attributes, and methods as the class documented under its module above:
- class curategpt.agents.ChatAgent: see curategpt.agents.chat_agent.ChatAgent
- class curategpt.agents.DragonAgent: see curategpt.agents.dragon_agent.DragonAgent
- class curategpt.agents.EvidenceAgent: see curategpt.agents.evidence_agent.EvidenceAgent
- class curategpt.agents.MappingAgent: see curategpt.agents.mapping_agent.MappingAgent