curategpt package
Subpackages
- curategpt.adhoc package
- Submodules
- curategpt.adhoc.gocam_predictor module
GOCAMPredictor
GOCAMPredictor.collection_name
GOCAMPredictor.database_path
GOCAMPredictor.database_type
GOCAMPredictor.extractor
GOCAMPredictor.fix_yaml()
GOCAMPredictor.gocam_by_id()
GOCAMPredictor.gocam_wrapper
GOCAMPredictor.include_standard_annotations
GOCAMPredictor.model_name
GOCAMPredictor.predict_activity_unit()
GOCAMPredictor.store
GOCAMPredictor.strict
- Module contents
- curategpt.agents package
- Submodules
- curategpt.agents.agent_utils module
- curategpt.agents.base_agent module
- curategpt.agents.bootstrap_agent module
- curategpt.agents.chat_agent module
- curategpt.agents.concept_recognition_agent module
AnnotatedText
AnnotationMethod
ConceptRecognitionAgent
ConceptRecognitionAgent.annotate()
ConceptRecognitionAgent.annotate_concept_list()
ConceptRecognitionAgent.annotate_inline()
ConceptRecognitionAgent.annotate_two_pass()
ConceptRecognitionAgent.ground_concept()
ConceptRecognitionAgent.identifier_field
ConceptRecognitionAgent.label_field
ConceptRecognitionAgent.prefixes
ConceptRecognitionAgent.relevance_factor
ConceptRecognitionAgent.split_input_text
GroundingResult
Span
parse_annotations()
parse_spans()
- curategpt.agents.dase_agent module
DatabaseAugmentedStructuredExtraction
DatabaseAugmentedStructuredExtraction.background_document_limit
DatabaseAugmentedStructuredExtraction.conversation
DatabaseAugmentedStructuredExtraction.conversation_mode
DatabaseAugmentedStructuredExtraction.default_masked_fields
DatabaseAugmentedStructuredExtraction.default_target_class
DatabaseAugmentedStructuredExtraction.document_adapter
DatabaseAugmentedStructuredExtraction.document_adapter_collection
DatabaseAugmentedStructuredExtraction.extract()
DatabaseAugmentedStructuredExtraction.max_background_document_size
DatabaseAugmentedStructuredExtraction.relevance_factor
PredictedFieldValue
- curategpt.agents.dragon_agent module
DragonAgent
DragonAgent.background_document_limit
DragonAgent.complete()
DragonAgent.conversation
DragonAgent.conversation_mode
DragonAgent.default_masked_fields
DragonAgent.default_target_class
DragonAgent.document_adapter
DragonAgent.document_adapter_collection
DragonAgent.generate_all()
DragonAgent.generate_queries()
DragonAgent.max_background_document_size
DragonAgent.relevance_factor
DragonAgent.review()
PredictedFieldValue
- curategpt.agents.evidence_agent module
- curategpt.agents.huggingface_agent module
- curategpt.agents.mapping_agent module
- curategpt.agents.summarization_agent module
- Module contents
ChatAgent
DragonAgent
DragonAgent.background_document_limit
DragonAgent.complete()
DragonAgent.conversation
DragonAgent.conversation_mode
DragonAgent.default_masked_fields
DragonAgent.default_target_class
DragonAgent.document_adapter
DragonAgent.document_adapter_collection
DragonAgent.generate_all()
DragonAgent.generate_queries()
DragonAgent.max_background_document_size
DragonAgent.relevance_factor
DragonAgent.review()
EvidenceAgent
MappingAgent
- curategpt.app package
- curategpt.conf package
- curategpt.evaluation package
- Submodules
- curategpt.evaluation.base_evaluator module
- curategpt.evaluation.calc_statistics module
- curategpt.evaluation.dae_evaluator module
- curategpt.evaluation.evaluation_datamodel module
AggregationMethod
ClassificationMetrics
ClassificationMetrics.accuracy
ClassificationMetrics.f1_score
ClassificationMetrics.false_negatives
ClassificationMetrics.false_positives
ClassificationMetrics.model_computed_fields
ClassificationMetrics.model_config
ClassificationMetrics.model_fields
ClassificationMetrics.precision
ClassificationMetrics.recall
ClassificationMetrics.specificity
ClassificationMetrics.true_negatives
ClassificationMetrics.true_positives
ClassificationOutcome
StratifiedCollection
StratifiedCollection.model_computed_fields
StratifiedCollection.model_config
StratifiedCollection.model_fields
StratifiedCollection.source
StratifiedCollection.testing_set
StratifiedCollection.testing_set_collection
StratifiedCollection.training_set
StratifiedCollection.training_set_collection
StratifiedCollection.validation_set
StratifiedCollection.validation_set_collection
Task
Task.additional_collections
Task.agent
Task.embedding_model_name
Task.executed_on
Task.extractor
Task.fields_to_mask
Task.fields_to_predict
Task.generate_background
Task.id
Task.method
Task.model_computed_fields
Task.model_config
Task.model_fields
Task.model_name
Task.num_testing
Task.num_training
Task.num_validation
Task.report_path
Task.results
Task.source_collection
Task.source_db_path
Task.stratified_collection
Task.target_db_path
Task.task_finished
Task.task_started
Task.working_directory
- curategpt.evaluation.runner module
- curategpt.evaluation.splitter module
- Module contents
- curategpt.extract package
- curategpt.formatters package
- curategpt.store package
- Submodules
- curategpt.store.chromadb_adapter module
ChromaDBAdapter
ChromaDBAdapter.client
ChromaDBAdapter.collection_metadata()
ChromaDBAdapter.collections()
ChromaDBAdapter.default_max_document_length
ChromaDBAdapter.default_model
ChromaDBAdapter.diversified_search()
ChromaDBAdapter.dump_then_load()
ChromaDBAdapter.fetch_all_objects_memory_safe()
ChromaDBAdapter.find()
ChromaDBAdapter.id_field
ChromaDBAdapter.id_to_object
ChromaDBAdapter.insert()
ChromaDBAdapter.list_collection_names()
ChromaDBAdapter.lookup()
ChromaDBAdapter.matches()
ChromaDBAdapter.name
ChromaDBAdapter.peek()
ChromaDBAdapter.populate_venomx()
ChromaDBAdapter.remove_collection()
ChromaDBAdapter.reset()
ChromaDBAdapter.search()
ChromaDBAdapter.set_collection_metadata()
ChromaDBAdapter.text_lookup
ChromaDBAdapter.update()
ChromaDBAdapter.update_collection_metadata()
ChromaDBAdapter.upsert()
- curategpt.store.db_adapter module
DBAdapter
DBAdapter.collection
DBAdapter.collection_metadata()
DBAdapter.create_view()
DBAdapter.delete()
DBAdapter.dump()
DBAdapter.dump_then_load()
DBAdapter.fetch_all_objects_memory_safe()
DBAdapter.field_names()
DBAdapter.find()
DBAdapter.identifier_field()
DBAdapter.insert()
DBAdapter.label_field()
DBAdapter.list_collection_names()
DBAdapter.lookup()
DBAdapter.lookup_multiple()
DBAdapter.matches()
DBAdapter.name
DBAdapter.path
DBAdapter.peek()
DBAdapter.remove_collection()
DBAdapter.schema_proxy
DBAdapter.search()
DBAdapter.set_collection()
DBAdapter.set_collection_metadata()
DBAdapter.update()
DBAdapter.update_collection_metadata()
DBAdapter.upsert()
- curategpt.store.db_metadata module
- curategpt.store.duckdb_adapter module
DuckDBAdapter
DuckDBAdapter.M
DuckDBAdapter.collection_metadata()
DuckDBAdapter.conn
DuckDBAdapter.create_index()
DuckDBAdapter.default_max_document_length
DuckDBAdapter.default_model
DuckDBAdapter.determine_fields_to_include()
DuckDBAdapter.distance_metric
DuckDBAdapter.dump_then_load()
DuckDBAdapter.ef_construction
DuckDBAdapter.ef_search
DuckDBAdapter.fetch_all_objects_memory_safe()
DuckDBAdapter.find()
DuckDBAdapter.get_raw_objects()
DuckDBAdapter.id_field
DuckDBAdapter.id_to_object
DuckDBAdapter.identifier_field()
DuckDBAdapter.insert()
DuckDBAdapter.kill_process()
DuckDBAdapter.list_collection_names()
DuckDBAdapter.lookup()
DuckDBAdapter.matches()
DuckDBAdapter.name
DuckDBAdapter.openai_client
DuckDBAdapter.parse_duckdb_result()
DuckDBAdapter.peek()
DuckDBAdapter.populate_venomx()
DuckDBAdapter.remove_collection()
DuckDBAdapter.search()
DuckDBAdapter.set_collection_metadata()
DuckDBAdapter.text_lookup
DuckDBAdapter.update()
DuckDBAdapter.update_collection_metadata()
DuckDBAdapter.update_or_create_venomx()
DuckDBAdapter.upsert()
DuckDBAdapter.vec_dimension
- curategpt.store.duckdb_connection_handler module
- curategpt.store.duckdb_result module
DuckDBSearchResult
DuckDBSearchResult.distances
DuckDBSearchResult.documents
DuckDBSearchResult.embeddings
DuckDBSearchResult.ids
DuckDBSearchResult.include
DuckDBSearchResult.metadatas
DuckDBSearchResult.model_computed_fields
DuckDBSearchResult.model_config
DuckDBSearchResult.model_fields
DuckDBSearchResult.to_dict()
DuckDBSearchResult.to_json()
- curategpt.store.in_memory_adapter module
Collection
CollectionIndex
InMemoryAdapter
InMemoryAdapter.collection_index
InMemoryAdapter.collection_metadata()
InMemoryAdapter.delete()
InMemoryAdapter.fetch_all_objects_memory_safe()
InMemoryAdapter.find()
InMemoryAdapter.insert()
InMemoryAdapter.list_collection_names()
InMemoryAdapter.lookup()
InMemoryAdapter.matches()
InMemoryAdapter.name
InMemoryAdapter.peek()
InMemoryAdapter.populate_venomx()
InMemoryAdapter.remove_collection()
InMemoryAdapter.search()
InMemoryAdapter.set_collection_metadata()
InMemoryAdapter.update()
InMemoryAdapter.update_collection_metadata()
InMemoryAdapter.upsert()
- curategpt.store.metadata module
- curategpt.store.schema_proxy module
- curategpt.store.vocab module
- Module contents
ChromaDBAdapter
ChromaDBAdapter.client
ChromaDBAdapter.collection_metadata()
ChromaDBAdapter.collections()
ChromaDBAdapter.default_max_document_length
ChromaDBAdapter.default_model
ChromaDBAdapter.diversified_search()
ChromaDBAdapter.dump_then_load()
ChromaDBAdapter.fetch_all_objects_memory_safe()
ChromaDBAdapter.find()
ChromaDBAdapter.id_field
ChromaDBAdapter.id_to_object
ChromaDBAdapter.insert()
ChromaDBAdapter.list_collection_names()
ChromaDBAdapter.lookup()
ChromaDBAdapter.matches()
ChromaDBAdapter.name
ChromaDBAdapter.peek()
ChromaDBAdapter.populate_venomx()
ChromaDBAdapter.remove_collection()
ChromaDBAdapter.reset()
ChromaDBAdapter.search()
ChromaDBAdapter.set_collection_metadata()
ChromaDBAdapter.text_lookup
ChromaDBAdapter.update()
ChromaDBAdapter.update_collection_metadata()
ChromaDBAdapter.upsert()
DBAdapter
DBAdapter.collection
DBAdapter.collection_metadata()
DBAdapter.create_view()
DBAdapter.delete()
DBAdapter.dump()
DBAdapter.dump_then_load()
DBAdapter.fetch_all_objects_memory_safe()
DBAdapter.field_names()
DBAdapter.find()
DBAdapter.identifier_field()
DBAdapter.insert()
DBAdapter.label_field()
DBAdapter.list_collection_names()
DBAdapter.lookup()
DBAdapter.lookup_multiple()
DBAdapter.matches()
DBAdapter.name
DBAdapter.path
DBAdapter.peek()
DBAdapter.remove_collection()
DBAdapter.schema_proxy
DBAdapter.search()
DBAdapter.set_collection()
DBAdapter.set_collection_metadata()
DBAdapter.update()
DBAdapter.update_collection_metadata()
DBAdapter.upsert()
DuckDBAdapter
DuckDBAdapter.M
DuckDBAdapter.collection_metadata()
DuckDBAdapter.conn
DuckDBAdapter.create_index()
DuckDBAdapter.default_max_document_length
DuckDBAdapter.default_model
DuckDBAdapter.determine_fields_to_include()
DuckDBAdapter.distance_metric
DuckDBAdapter.dump_then_load()
DuckDBAdapter.ef_construction
DuckDBAdapter.ef_search
DuckDBAdapter.fetch_all_objects_memory_safe()
DuckDBAdapter.find()
DuckDBAdapter.get_raw_objects()
DuckDBAdapter.id_field
DuckDBAdapter.id_to_object
DuckDBAdapter.identifier_field()
DuckDBAdapter.insert()
DuckDBAdapter.kill_process()
DuckDBAdapter.list_collection_names()
DuckDBAdapter.lookup()
DuckDBAdapter.matches()
DuckDBAdapter.name
DuckDBAdapter.openai_client
DuckDBAdapter.parse_duckdb_result()
DuckDBAdapter.peek()
DuckDBAdapter.populate_venomx()
DuckDBAdapter.remove_collection()
DuckDBAdapter.search()
DuckDBAdapter.set_collection_metadata()
DuckDBAdapter.text_lookup
DuckDBAdapter.update()
DuckDBAdapter.update_collection_metadata()
DuckDBAdapter.update_or_create_venomx()
DuckDBAdapter.upsert()
DuckDBAdapter.vec_dimension
Metadata
SchemaProxy
get_store()
- curategpt.utils package
- Submodules
- curategpt.utils.eval_utils module
Outcome
Outcome.append_outcomes()
Outcome.by_field
Outcome.calculate_metrics()
Outcome.expected
Outcome.f1
Outcome.flatten()
Outcome.fn
Outcome.fp
Outcome.ixn_by_field
Outcome.model_computed_fields
Outcome.model_config
Outcome.model_fields
Outcome.parameters
Outcome.precision
Outcome.prediction
Outcome.recall
Outcome.tn
Outcome.tp
best_matches()
score_prediction()
- curategpt.utils.llm_utils module
- curategpt.utils.patch_utils module
- curategpt.utils.search module
- curategpt.utils.tokens module
- curategpt.utils.vector_algorithms module
- curategpt.utils.vectordb_operations module
- Module contents
- curategpt.views package
- curategpt.wrappers package
- Subpackages
- curategpt.wrappers.bio package
- Submodules
- curategpt.wrappers.bio.alliance_gene_wrapper module
- curategpt.wrappers.bio.bacdive_wrapper module
- curategpt.wrappers.bio.gocam_wrapper module
- curategpt.wrappers.bio.mediadive_wrapper module
- curategpt.wrappers.bio.omicsdi_wrapper module
- curategpt.wrappers.bio.reactome_wrapper module
- curategpt.wrappers.bio.uniprot_wrapper module
- Module contents
- curategpt.wrappers.clinical package
- curategpt.wrappers.general package
- Submodules
- curategpt.wrappers.general.filesystem_wrapper module
- curategpt.wrappers.general.github_wrapper module
- curategpt.wrappers.general.google_drive_wrapper module
- curategpt.wrappers.general.gspread_wrapper module
- curategpt.wrappers.general.json_wrapper module
- curategpt.wrappers.general.linkml_schema_wrapper module
- Module contents
- curategpt.wrappers.investigation package
- Submodules
- curategpt.wrappers.investigation.ess_deepdive_wrapper module
- curategpt.wrappers.investigation.fairsharing_wrapper module
- curategpt.wrappers.investigation.jgi_wrapper module
- curategpt.wrappers.investigation.ncbi_bioproject_wrapper module
- curategpt.wrappers.investigation.ncbi_biosample_wrapper module
- curategpt.wrappers.investigation.nmdc_wrapper module
- Module contents
- curategpt.wrappers.legal package
- curategpt.wrappers.literature package
- curategpt.wrappers.ontology package
- curategpt.wrappers.bio package
- Submodules
- curategpt.wrappers.base_wrapper module
BaseWrapper
BaseWrapper.chat()
BaseWrapper.create_curie()
BaseWrapper.default_embedding_model
BaseWrapper.default_object_type
BaseWrapper.external_search()
BaseWrapper.extract_concepts_from_text()
BaseWrapper.extractor
BaseWrapper.local_store
BaseWrapper.max_text_length
BaseWrapper.name
BaseWrapper.objects()
BaseWrapper.objects_by_ids()
BaseWrapper.prefix
BaseWrapper.search()
BaseWrapper.search_limit_multiplier
BaseWrapper.source_locator
BaseWrapper.split_objects()
BaseWrapper.text_overlap
BaseWrapper.unwrap_object()
BaseWrapper.wrap_object()
- Module contents
BaseWrapper
BaseWrapper.chat()
BaseWrapper.create_curie()
BaseWrapper.default_embedding_model
BaseWrapper.default_object_type
BaseWrapper.external_search()
BaseWrapper.extract_concepts_from_text()
BaseWrapper.extractor
BaseWrapper.local_store
BaseWrapper.max_text_length
BaseWrapper.name
BaseWrapper.objects()
BaseWrapper.objects_by_ids()
BaseWrapper.prefix
BaseWrapper.search()
BaseWrapper.search_limit_multiplier
BaseWrapper.source_locator
BaseWrapper.split_objects()
BaseWrapper.text_overlap
BaseWrapper.unwrap_object()
BaseWrapper.wrap_object()
get_wrapper()
- Subpackages
Submodules
curategpt.cli module
Command line interface for curategpt.
Module contents
CurateGPT: A framework for semi-assisted curation of knowledge bases.
Architecture
store: JSON object stores that allow for embedding-based search
wrappers: wraps external APIs and data sources for ingest
extract: extraction of JSON objects from LLMs
agents: agents that chain together search and generate components
formatters: formats data objects for presentation to humans and machine agents
app: Streamlit application
- class curategpt.BasicExtractor(schema_proxy=None, model_name='gpt-4o', api_key=None, raise_error_if_unparsable=False, serialization_format='json')
Bases:
Extractor
Extractor that is purely example driven.
- deserialize(text, format=None, **kwargs)
Deserialize text into an annotated object
- Parameters:
text (str)
- Return type:
- Returns:
- deserialize_yaml(text, multiple=False)
- Return type:
- extract(text, target_class, examples=None, background_text=None, rules=None, min_examples=1, **kwargs)
Schema-guided extraction
- Parameters:
text (str)
kwargs
- Return type:
- Returns:
- model_config = {'protected_namespaces': ()}
- model_name: str = 'gpt-4o'
- serialization_format: str = 'json'
- serialize(ao)
- Return type:
str
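As a rough illustration of what "purely example driven" extraction with a JSON serialization format could look like, the sketch below builds a few-shot prompt from example pairs and parses a JSON reply. It is a minimal sketch under stated assumptions, not BasicExtractor's code: `build_prompt`, the prompt wording, and the choice to return None for unparsable text are all hypothetical, and no model is called.

```python
import json
from typing import List, Optional, Tuple


def build_prompt(text: str, target_class: str, examples: List[Tuple[str, dict]]) -> str:
    """Assemble a few-shot prompt (hypothetical layout, not the library's)."""
    lines = [f"Extract a {target_class} object as JSON."]
    for ex_text, ex_obj in examples:
        # Each example pairs input text with its serialized object.
        lines += [f"Text: {ex_text}", f"Object: {json.dumps(ex_obj, sort_keys=True)}"]
    lines += [f"Text: {text}", "Object:"]
    return "\n".join(lines)


def deserialize(response: str, raise_error_if_unparsable: bool = False) -> Optional[dict]:
    """Parse a JSON reply; returning None on failure is an assumption."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        if raise_error_if_unparsable:
            raise
        return None


prompt = build_prompt(
    "Jane is 37 years old.",
    "Person",
    [("John is 42 years old.", {"name": "John", "age": 42})],
)
parsed = deserialize('{"name": "Jane", "age": 37}')
unparsed = deserialize("not json")
```

The flag name `raise_error_if_unparsable` is taken from the constructor signature above; how the real class reacts to unparsable text is not stated in this reference.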
- class curategpt.ChromaDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', client=None, id_field='id', text_lookup='text', id_to_object=<factory>, **_kwargs)
Bases:
DBAdapter
An Adapter that wraps a ChromaDB client
- client: ClientAPI = None
- collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for a collection.
- Parameters:
collection_name (Optional[str])
- Return type:
Optional[Metadata]
- Returns:
Parameters
- collections()
Return the names of all collections in the database.
- Return type:
Iterator[str]
- Returns:
- default_max_document_length: ClassVar[int] = 6000
- default_model: str = 'all-MiniLM-L6-v2'
- diversified_search(text=None, limit=None, relevance_factor=0.5, collection=None, **kwargs)
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
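This entry has no description, but the `relevance_factor` parameter suggests a trade-off between relevance to the query and diversity among the results. One common way to make that trade-off is maximal marginal relevance (MMR). The sketch below shows MMR on toy vectors. It assumes, without confirmation from this reference, that the method works along these lines, and `cosine` and `mmr` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query: Sequence[float], candidates: List[Sequence[float]],
        relevance_factor: float = 0.5, limit: int = 2) -> List[int]:
    """Greedily pick candidate indices, balancing relevance and novelty."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < limit:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy is the similarity to the closest already-picked item.
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return relevance_factor * relevance - (1 - relevance_factor) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.2]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
diverse = mmr(query, candidates, relevance_factor=0.5)
relevant_only = mmr(query, candidates, relevance_factor=1.0)
```

With `relevance_factor=0.5` the second pick is the dissimilar third vector; with `relevance_factor=1.0` redundancy is ignored and the two nearest vectors are returned.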
- dump_then_load(collection=None, target=None)
Dump a collection to a file, then load it into another database.
- Parameters:
collection (Optional[str])
target (Optional[DBAdapter])
- Returns:
- fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- find(where=None, projection=None, collection=None, **kwargs)
Query the database.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
- Parameters:
collection (Optional[str])
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
projection (Union[str, List[str], None])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
- id_field: str = 'id'
- id_to_object: Mapping[str, Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- insert(objs, **kwargs)
Insert an object or list of objects into the store.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection
- Returns:
- list_collection_names()
List all collections in the database.
- Return type:
List[str]
- Returns:
- lookup(id, collection=None, **kwargs)
Lookup an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- Return type:
Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
- Returns:
- matches(obj, **kwargs)
Query the database for matches to an object.
- Parameters:
obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
- name: ClassVar[str] = 'chromadb'
- peek(collection=None, limit=5, offset=0, **kwargs)
Peek at first N objects in a collection.
- Parameters:
collection (Optional[str])
limit
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- Returns:
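The `limit` and `offset` arguments of `peek` read like a simple paging window. A minimal sketch of that idea over any iterable follows. `toy_peek` is a hypothetical helper, and treating `offset` as "skip this many objects first" is an assumption rather than documented behaviour.

```python
from itertools import islice
from typing import Iterable, Iterator


def toy_peek(objs: Iterable, limit: int = 5, offset: int = 0) -> Iterator:
    # islice lazily skips `offset` items, then stops after `limit` more.
    return islice(objs, offset, offset + limit)


first = list(toy_peek(iter(range(100))))
page = list(toy_peek(iter(range(100)), limit=3, offset=10))
```

Because `islice` is lazy, only the objects up to the end of the window are ever pulled from the underlying iterator.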
- static populate_venomx(collection, model, existing_venomx)
Populate venomx with data currently given when inserting
- Parameters:
collection (Optional[str])
model (Optional[str])
existing_venomx
- Return type:
Index
- Returns:
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove a collection from the database.
- Parameters:
collection (Optional[str])
exists_ok
- Returns:
- reset()
Reset/delete the database.
- search(text, **kwargs)
Query the database for a text string.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")
...
NeuronOfTheForebrain 0.28
...
- Parameters:
text (str)
collection
where
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
tuple of object, distance, metadata
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for a collection.
- text_lookup: Union[str, Callable, None] = 'text'
- update(objs, **kwargs)
Update an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection
- Returns:
- update_collection_metadata(collection_name, **kwargs)
Update the metadata for a collection based on the adapter.
- Parameters:
collection_name (str) – Name of the collection.
kwargs – Additional metadata fields.
- Return type:
- Returns:
Updated Metadata instance.
- upsert(objs, **kwargs)
Upsert an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection
- Returns:
- class curategpt.DBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None)
Bases:
ABC
Base class for stores.
This base class provides a common interface for a wide variety of document or object stores. The interface is intended to closely mimic the kind of interface found in document stores such as MongoDB or vector databases such as ChromaDB, but the intention is that it can also be used for SQL databases, SPARQL endpoints, or even file systems.
The store allows storage and retrieval of objects, which are arbitrary dictionaries, each equivalent to a JSON object.
Objects are partitioned into collections, which map to the equivalent concept in MongoDB and ChromaDB.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert({"name": "John", "age": 42}, collection="people")
If you are used to working with the MongoDB and ChromaDB APIs directly, one difference is that here we do not provide a separate Collection object; everything is handled through the store object. You can optionally bind a store object to a collection, which effectively gives you a collection object:
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert({"name": "John", "age": 42})
TODO: decide if this is the final interface
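The collection-binding behaviour described above can be illustrated with a toy stand-in that runs without any database. This is a minimal sketch, not curategpt's implementation: `ToyStore` and its internals are hypothetical, and only the method names `insert` and `set_collection` mirror the documented interface.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Union


class ToyStore:
    """Hypothetical stand-in for a store; not part of curategpt."""

    def __init__(self) -> None:
        self.collection: Optional[str] = None  # default collection, if bound
        self._data: Dict[str, List[dict]] = defaultdict(list)

    def set_collection(self, collection: str) -> None:
        # Bind the store so that later calls may omit `collection`.
        self.collection = collection

    def _resolve(self, collection: Optional[str]) -> str:
        # An explicit argument overrides the bound default.
        name = collection or self.collection
        if name is None:
            raise ValueError("no collection given and none bound")
        return name

    def insert(self, objs: Union[dict, List[dict]], collection: Optional[str] = None) -> None:
        if isinstance(objs, dict):  # accept a single object or a list
            objs = [objs]
        self._data[self._resolve(collection)].extend(objs)


store = ToyStore()
store.insert({"name": "John", "age": 42}, collection="people")
store.set_collection("people")
store.insert({"name": "Jane", "age": 37})  # goes to the bound collection
people = store._data["people"]
```

Both inserts end up in the same collection, which is the sense in which a bound store behaves like a collection object.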
- collection: Optional[str] = None
Default collection
- abstract collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for a collection.
- Parameters:
collection_name (Optional[str])
include_derived – Include derived metadata, e.g. counts
- Return type:
Optional[Metadata]
- Returns:
- create_view(view_name, collection, expression, **kwargs)
Create a view in the database.
Todo:
- param view:
- return:
- delete(id, collection=None, **kwargs)
Delete an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- Returns:
- dump(collection=None, to_file=None, metadata_to_file=None, format=None, include=None, **kwargs)
Dump the database to a file.
- Parameters:
collection (Optional[str])
kwargs
- Returns:
- dump_then_load(collection=None, target=None)
Dump a collection to a file, then load it into another database.
- Parameters:
collection (Optional[str])
target (Optional[DBAdapter])
- Returns:
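The dump-then-load round trip can be pictured with plain dictionaries and a temporary JSON file. The sketch below is only an illustration of the idea: `ToyDB` and `toy_dump_then_load` are hypothetical, and the file format and location the real method uses are not stated in this reference.

```python
import json
import os
import tempfile
from typing import Dict, List

# Hypothetical stand-in: a "database" maps collection names to object lists.
ToyDB = Dict[str, List[dict]]


def toy_dump_then_load(source: ToyDB, target: ToyDB, collection: str) -> None:
    """Write one collection to a temporary JSON file, then read it into target."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
        json.dump(source.get(collection, []), fh)  # dump step
        path = fh.name
    try:
        with open(path) as fh:
            target.setdefault(collection, []).extend(json.load(fh))  # load step
    finally:
        os.remove(path)  # clean up the intermediate file


source_db: ToyDB = {"people": [{"name": "John", "age": 42}]}
target_db: ToyDB = {}
toy_dump_then_load(source_db, target_db, "people")
```

Going through a file means the target receives copies of the objects rather than references to the source's objects.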
- abstract fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
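Fetching "in batches to avoid memory overload" usually means holding only one slice of the collection in memory at a time while still exposing a flat iterator. A minimal sketch of that pattern follows. `fetch_in_batches`, `load_batch`, and the backing dictionary are hypothetical names for illustration, not the adapters' internals.

```python
from typing import Callable, Iterator, List, Sequence


def fetch_in_batches(ids: Sequence[str],
                     load_batch: Callable[[Sequence[str]], List[dict]],
                     batch_size: int = 100) -> Iterator[dict]:
    """Yield objects one by one while loading at most `batch_size` at once."""
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        # Only this slice is materialised before being handed to the caller.
        yield from load_batch(batch_ids)


# Hypothetical backing data and loader that records each batch size.
backing = {f"id{i}": {"id": f"id{i}"} for i in range(250)}
batch_sizes: List[int] = []


def load_batch(batch_ids: Sequence[str]) -> List[dict]:
    batch_sizes.append(len(batch_ids))
    return [backing[i] for i in batch_ids]


objs = list(fetch_in_batches(list(backing), load_batch, batch_size=100))
```

The caller sees a single stream of 250 objects, while the loader is only ever asked for 100, 100, and then 50.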
- field_names(collection=None)
Return the names of all top level fields in the database for a collection.
- Parameters:
collection (Optional[str])
- Return type:
List[str]
- Returns:
- find(where=None, projection=None, collection=None, **kwargs)
Query the database.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
- Parameters:
collection (Optional[str])
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
projection (Union[str, List[str], None])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
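To make the roles of `where` and `projection` concrete, here is a toy filter over a list of dictionaries. It is a sketch under assumptions: exact-match semantics for every key in `where` is assumed rather than documented, `toy_find` is a hypothetical helper, the sample identifiers are made up, and the toy yields plain dictionaries instead of the tuples listed under the return type.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Union


def toy_find(
    objs: Iterable[dict],
    where: Optional[Dict[str, object]] = None,
    projection: Union[str, List[str], None] = None,
) -> Iterator[dict]:
    # A single field name is treated as a one-element projection.
    fields = [projection] if isinstance(projection, str) else projection
    for obj in objs:
        # Assumed semantics: every key in `where` must match exactly.
        if where and any(obj.get(k) != v for k, v in where.items()):
            continue
        # Keep only the projected fields when a projection is given.
        yield {k: obj[k] for k in fields if k in obj} if fields else dict(obj)


cells = [
    {"id": "x1", "name": "NeuronOfTheForebrain", "rank": 1},
    {"id": "x2", "name": "Hepatocyte", "rank": 2},
]
matches = list(toy_find(cells, where={"name": "NeuronOfTheForebrain"}, projection="id"))
everything = list(toy_find(cells))
```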
- identifier_field(collection=None)
- Return type:
str
- abstract insert(objs, collection=None, **kwargs)
Insert an object or list of objects into the store.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
- Returns:
- label_field(collection=None)
- Return type:
str
- abstract list_collection_names()
List all collections in the database.
- Return type:
List[str]
- Returns:
names of collections
- abstract lookup(id, collection=None, **kwargs)
Lookup an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- Return type:
Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
- Returns:
- lookup_multiple(ids, **kwargs)
Look up multiple objects by their IDs.
- Parameters:
ids
collection
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- Returns:
- abstract matches(obj, **kwargs)
Query the database for matches to an object.
- Parameters:
obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
- name: ClassVar[str] = 'base'
- path: str = None
Path to a location where the database is stored, on disk or on the network.
- abstract peek(collection=None, limit=5, **kwargs)
Peek at first N objects in a collection.
- Parameters:
collection (Optional[str])
limit
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- Returns:
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove a collection from the database.
- Parameters:
collection (Optional[str])
- Returns:
- schema_proxy: Optional[SchemaProxy] = None
Schema manager
- abstract search(text, where=None, collection=None, **kwargs)
Query the database for a text string.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")
...
NeuronOfTheForebrain 0.28
...
- Parameters:
text (str)
collection (Optional[str])
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
tuple of object, distance, metadata
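The shape of the results (object, distance, metadata), ordered from nearest to farthest, can be reproduced with a toy embedding. The sketch below uses bag-of-words counts and cosine distance purely for illustration. Real adapters use a learned embedding model, and `embed`, `cosine_distance`, `toy_search`, and the use of a `text` field are assumptions of this example.

```python
import math
from collections import Counter
from typing import Iterator, List, Optional, Tuple


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a stand-in for a learned model.
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def toy_search(text: str, objs: List[dict],
               limit: int = 10) -> Iterator[Tuple[dict, float, Optional[dict]]]:
    query = embed(text)
    # Score every object, then return the closest ones first.
    scored = [(obj, cosine_distance(query, embed(obj["text"])), None) for obj in objs]
    scored.sort(key=lambda row: row[1])
    yield from scored[:limit]


docs = [
    {"id": "Hepatocyte", "text": "liver cell"},
    {"id": "NeuronOfTheForebrain", "text": "neuron of the forebrain"},
]
results = list(toy_search("forebrain neuron", docs))
```

Smaller distances mean closer matches, so the forebrain entry is returned first even though it appears second in the input.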
- set_collection(collection)
Set the current collection.
If this is set, then all subsequent operations will be performed on this collection, unless overridden.
This allows the following
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert([{"name": "John", "age": 42}])
to be written in place of
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
collection (str)
- Returns:
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for a collection.
>>> from curategpt.store import get_store
>>> from curategpt.store import Metadata
>>> store = get_store("in_memory")
>>> md = store.collection_metadata("people")
>>> md.venomx.id == "People"
>>> md.venomx.embedding_model.name == "openai:"
>>> store.set_collection_metadata("people", md)
- Parameters:
collection_name (Optional[str])
- Returns:
- update(objs, collection=None, **kwargs)
Update an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
- Returns:
- update_collection_metadata(collection_name, **kwargs)
Update the metadata for a collection.
- Parameters:
collection_name (str)
kwargs
- Return type:
- Returns:
- upsert(objs, collection=None, **kwargs)
Upsert an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
- Returns:
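"Upsert" conventionally means updating an object whose identifier already exists and inserting it otherwise. A minimal sketch of that convention with a dictionary keyed by ID is shown below. `toy_upsert` is hypothetical, and whether the adapters replace the stored object wholesale or merge fields is not stated in this reference, so whole-object replacement is an assumption.

```python
from typing import Dict, List, Union


def toy_upsert(table: Dict[str, dict], objs: Union[dict, List[dict]],
               id_field: str = "id") -> None:
    if isinstance(objs, dict):  # accept a single object or a list
        objs = [objs]
    for obj in objs:
        # Replace the stored object when the ID exists, otherwise add it.
        table[obj[id_field]] = dict(obj)


table: Dict[str, dict] = {}
toy_upsert(table, {"id": "p1", "name": "John", "age": 42})
toy_upsert(table, [
    {"id": "p1", "name": "John", "age": 43},  # existing ID: updated
    {"id": "p2", "name": "Jane", "age": 37},  # new ID: inserted
])
```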
- class curategpt.Extractor(schema_proxy=None, model_name=None, api_key=None, raise_error_if_unparsable=False)
Bases:
ABC
- api_key: str = None
- deserialize(text, **kwargs)
Deserialize text into an annotated object
- Parameters:
text (str)
- Return type:
- Returns:
- abstract extract(text, target_class, examples=None, **kwargs)
Schema-guided extraction
- Parameters:
text (str)
kwargs
- Return type:
- Returns:
- property model
Get the model
- Parameters:
model_name
- Returns:
- model_name: str = None
- property pydantic_root_model: BaseModel
- raise_error_if_unparsable: bool = False
- schema_proxy: SchemaProxy = None
- property schemaview: SchemaView