curategpt.store package
Submodules
curategpt.store.chromadb_adapter module
ChromaDB adapter.
- class curategpt.store.chromadb_adapter.ChromaDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', client=None, id_field='id', text_lookup='text', id_to_object=<factory>, **_kwargs)
Bases: DBAdapter
An Adapter that wraps a ChromaDB client.
- client: ClientAPI = None
- collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for a collection.
- Parameters:
collection_name (Optional[str])
- Return type:
Optional[Metadata]
- collections()
Return the names of all collections in the database.
- Return type:
Iterator[str]
- default_max_document_length: ClassVar[int] = 6000
- default_model: str = 'all-MiniLM-L6-v2'
- diversified_search(text=None, limit=None, relevance_factor=0.5, collection=None, **kwargs)
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
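This reference does not describe how diversified_search trades relevance against diversity. One common technique with a similar knob is maximal marginal relevance (MMR); the sketch below assumes that reading of relevance_factor and is not taken from the adapter's source. The names cosine and mmr and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    limit: int,
    relevance_factor: float = 0.5,
) -> List[str]:
    """Greedy MMR: reward similarity to the query, penalise similarity to picks."""
    selected: List[Tuple[str, Sequence[float]]] = []
    remaining = list(candidates)

    def score(item: Tuple[str, Sequence[float]]) -> float:
        _, vec = item
        relevance = cosine(query, vec)
        # Redundancy is the closest match among already selected items.
        redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
        return relevance_factor * relevance - (1 - relevance_factor) * redundancy

    while remaining and len(selected) < limit:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [name for name, _ in selected]


# Toy data: "a" and "b" are near-duplicates, "c" points elsewhere.
query = [1.0, 0.2]
candidates = [("a", [1.0, 0.0]), ("b", [0.98, 0.05]), ("c", [0.6, 0.8])]
pure_relevance = mmr(query, candidates, limit=2, relevance_factor=1.0)
diversified = mmr(query, candidates, limit=2, relevance_factor=0.5)
```

With relevance_factor=1.0 the two near-duplicates are returned; lowering it to 0.5 swaps the second pick for the more distinct candidate.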
- dump_then_load(collection=None, target=None)
Dump a collection to a file, then load it into another database.
- Parameters:
collection (Optional[str])
target (Optional[DBAdapter])
- fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- find(where=None, projection=None, collection=None, **kwargs)
Query the database.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
- Parameters:
collection (Optional[str])
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
projection (Union[str, List[str], None])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- id_field: str = 'id'
- id_to_object: Mapping[str, Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- insert(objs, **kwargs)
Insert an object or list of objects into the store.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection
- list_collection_names()
List all collections in the database.
- Return type:
List[str]
- lookup(id, collection=None, **kwargs)
Lookup an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- Return type:
Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
- matches(obj, **kwargs)
Query the database for matches to an object.
- Parameters:
obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- name: ClassVar[str] = 'chromadb'
- peek(collection=None, limit=5, offset=0, **kwargs)
Peek at first N objects in a collection.
- Parameters:
collection (Optional[str])
limit
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- static populate_venomx(collection, model, existing_venomx)
Populate venomx with data currently given when inserting
- Parameters:
collection (Optional[str])
model (Optional[str])
existing_venomx
- Return type:
Index
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove a collection from the database.
- Parameters:
collection (Optional[str])
exists_ok
- reset()
Reset/delete the database.
- search(text, **kwargs)
Query the database for a text string.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")
...
NeuronOfTheForebrain 0.28
...
- Parameters:
text (str)
collection
where
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
tuple of object, distance, metadata
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for a collection.
- text_lookup: Union[str, Callable, None] = 'text'
- update(objs, **kwargs)
Update an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection
- update_collection_metadata(collection_name, **kwargs)
Update the metadata for a collection based on the adapter.
- Parameters:
collection_name (str) – Name of the collection.
kwargs – Additional metadata fields.
- Return type:
- Returns:
Updated Metadata instance.
- upsert(objs, **kwargs)
Upsert an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection
curategpt.store.db_adapter module
Abstract DB adapter.
- class curategpt.store.db_adapter.DBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None)
Bases: ABC
Base class for stores.
This base class provides a common interface for a wide variety of document or object stores. The interface is intended to closely mimic the kind of interface found for document stores such as MongoDB or vector databases such as ChromaDB, but the intention is that it can be used for SQL databases, SPARQL endpoints, or even file systems.
The store allows for storage and retrieval of objects, which are arbitrary dictionary objects, equivalent to a JSON object.
Objects are partitioned into collections, which map to the equivalent concept in MongoDB and ChromaDB.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert({"name": "John", "age": 42}, collection="people")
If you are used to working with the MongoDB and ChromaDB APIs directly, one difference is that here we do not provide a separate Collection object; everything is handled through the store object. You can optionally bind a store object to a collection, which effectively gives you a collection object:
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert({"name": "John", "age": 42})
TODO: decide if this is the final interface
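The binding of a store to a default collection can be illustrated without any backend. The sketch below is a hypothetical stand-in (ToyStore is not part of curategpt and is not how InMemoryAdapter is implemented); it only shows how an explicit collection argument can override a bound default.

```python
from typing import Dict, Iterator, List, Optional, Union


class ToyStore:
    """Hypothetical stand-in for a store; not curategpt's implementation."""

    def __init__(self) -> None:
        self.collection: Optional[str] = None  # default collection, if bound
        self._data: Dict[str, List[dict]] = {}

    def set_collection(self, collection: str) -> None:
        """Bind the store to a default collection."""
        self.collection = collection

    def _resolve(self, collection: Optional[str]) -> str:
        # An explicit argument wins; otherwise fall back to the bound default.
        name = collection or self.collection
        if name is None:
            raise ValueError("no collection given and no default bound")
        return name

    def insert(
        self, objs: Union[dict, List[dict]], collection: Optional[str] = None
    ) -> None:
        if isinstance(objs, dict):
            objs = [objs]
        self._data.setdefault(self._resolve(collection), []).extend(objs)

    def peek(self, collection: Optional[str] = None, limit: int = 5) -> Iterator[dict]:
        yield from self._data.get(self._resolve(collection), [])[:limit]


store = ToyStore()
store.set_collection("people")
store.insert({"name": "John", "age": 42})  # goes to the bound collection
store.insert([{"name": "Ada", "age": 36}], collection="staff")  # explicit override
```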
- collection: Optional[str] = None
Default collection
- abstract collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for a collection.
- Parameters:
collection_name (Optional[str])
include_derived – Include derived metadata, e.g. counts
- Return type:
Optional[Metadata]
- create_view(view_name, collection, expression, **kwargs)
Create a view in the database.
Todo:
- param view:
- return:
- delete(id, collection=None, **kwargs)
Delete an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- dump(collection=None, to_file=None, metadata_to_file=None, format=None, include=None, **kwargs)
Dump the database to a file.
- Parameters:
collection (Optional[str])
kwargs
- dump_then_load(collection=None, target=None)
Dump a collection to a file, then load it into another database.
- Parameters:
collection (Optional[str])
target (Optional[DBAdapter])
- abstract fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
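Batched fetching is usually written as a generator that pulls one page at a time. The sketch below assumes a hypothetical fetch_page(offset, limit) callback standing in for the backend query; it is an illustration of the pattern, not the adapters' actual code.

```python
from typing import Callable, Dict, Iterator, List


def fetch_all_in_batches(
    fetch_page: Callable[[int, int], List[Dict]], batch_size: int = 100
) -> Iterator[Dict]:
    """Yield every object while holding at most one page in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)  # hypothetical paging callback
        if not page:
            break
        yield from page
        offset += len(page)


# Toy backend: 250 records served by slicing a list.
records = [{"id": str(i)} for i in range(250)]
calls = []


def fetch_page(offset: int, limit: int) -> List[Dict]:
    calls.append((offset, limit))
    return records[offset : offset + limit]


fetched = list(fetch_all_in_batches(fetch_page, batch_size=100))
```

The final call returns an empty page, which is what ends the loop.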
- field_names(collection=None)
Return the names of all top level fields in the database for a collection.
- Parameters:
collection (Optional[str])
- Return type:
List[str]
- find(where=None, projection=None, collection=None, **kwargs)
Query the database.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
- Parameters:
collection (Optional[str])
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
projection (Union[str, List[str], None])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
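A rough picture of what a dictionary-valued where filter and a projection do, using plain dictionaries. This is an assumption-laden sketch: toy_find is hypothetical, handles only exact-match dict filters, and yields plain objects instead of the search-result tuples the adapters return.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Union


def toy_find(
    objects: Iterable[Dict],
    where: Optional[Dict] = None,
    projection: Optional[Union[str, List[str]]] = None,
) -> Iterator[Dict]:
    """Yield objects whose fields equal every key/value pair in `where`."""
    if isinstance(projection, str):
        projection = [projection]
    for obj in objects:
        if where and any(obj.get(k) != v for k, v in where.items()):
            continue  # at least one filter field does not match
        if projection:
            # Keep only the requested top-level fields.
            yield {k: obj[k] for k in projection if k in obj}
        else:
            yield dict(obj)


# Toy collection; the ids are made up for illustration.
cells = [
    {"id": "x1", "name": "NeuronOfTheForebrain", "part_of": "forebrain"},
    {"id": "x2", "name": "Astrocyte", "part_of": "brain"},
]
hits = list(toy_find(cells, where={"name": "NeuronOfTheForebrain"}))
ids_only = list(toy_find(cells, projection="id"))
```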
- identifier_field(collection=None)
- Return type:
str
- abstract insert(objs, collection=None, **kwargs)
Insert an object or list of objects into the store.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
- label_field(collection=None)
- Return type:
str
- abstract list_collection_names()
List all collections in the database.
- Return type:
List[str]
- Returns:
names of collections
- abstract lookup(id, collection=None, **kwargs)
Lookup an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- Return type:
Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
- lookup_multiple(ids, **kwargs)
Look up multiple objects by their IDs.
- Parameters:
ids
collection
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- abstract matches(obj, **kwargs)
Query the database for matches to an object.
- Parameters:
obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- name: ClassVar[str] = 'base'
- path: str = None
Path to a location where the database is stored on disk or the network.
- abstract peek(collection=None, limit=5, **kwargs)
Peek at first N objects in a collection.
- Parameters:
collection (Optional[str])
limit
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove a collection from the database.
- Parameters:
collection (Optional[str])
- schema_proxy: Optional[SchemaProxy] = None
Schema manager
- abstract search(text, where=None, collection=None, **kwargs)
Query the database for a text string.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")
...
NeuronOfTheForebrain 0.28
...
- Parameters:
text (str)
collection (Optional[str])
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
tuple of object, distance, metadata
- set_collection(collection)
Set the current collection.
If this is set, then all subsequent operations will be performed on this collection, unless overridden.
This allows the following
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert([{"name": "John", "age": 42}])
to be written in place of
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
collection (str)
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for a collection.
>>> from curategpt.store import get_store
>>> from curategpt.store import Metadata
>>> store = get_store("in_memory")
>>> md = store.collection_metadata("people")
>>> md.venomx.id == "People"
>>> md.venomx.embedding_model.name == "openai:"
>>> store.set_collection_metadata("people", md)
- Parameters:
collection_name (Optional[str])
- update(objs, collection=None, **kwargs)
Update an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
- update_collection_metadata(collection_name, **kwargs)
Update the metadata for a collection.
- Parameters:
collection_name (str)
kwargs
- Return type:
- Returns:
- upsert(objs, collection=None, **kwargs)
Upsert an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
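Upsert conventionally means "update the object if its ID already exists, otherwise insert it". The sketch below shows that behaviour over a plain dictionary keyed by ID. toy_upsert is hypothetical, and whether the adapters replace or merge an existing object is not stated in this reference; replacement is assumed here.

```python
from typing import Dict, List, Union


def toy_upsert(
    table: Dict[str, dict], objs: Union[dict, List[dict]], id_field: str = "id"
) -> None:
    """Insert new objects and overwrite existing ones that share an ID."""
    if isinstance(objs, dict):
        objs = [objs]
    for obj in objs:
        # Assumption: an existing object is replaced wholesale, not merged.
        table[obj[id_field]] = dict(obj)


people: Dict[str, dict] = {}
toy_upsert(people, {"id": "p1", "name": "John", "age": 42})
toy_upsert(people, [{"id": "p1", "name": "John", "age": 43}, {"id": "p2", "name": "Ada"}])
```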
curategpt.store.db_metadata module
- class curategpt.store.db_metadata.DBSettings(**data)
Bases: BaseModel
- M: int
M parameter for hnsw index
- ef_construction: int
Construction parameter for hnsw index. Higher values are more accurate but slower.
- ef_search: int
Search parameter for hnsw index. Higher values are more accurate but slower.
- hnsw_space: str
Space used for hnsw index (e.g. ‘cosine’).
- load_config(path)
- model: str
Name of any embedding model
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'M': FieldInfo(annotation=int, required=False, default=16), 'ef_construction': FieldInfo(annotation=int, required=False, default=128), 'ef_search': FieldInfo(annotation=int, required=False, default=64), 'hnsw_space': FieldInfo(annotation=str, required=False, default='cosine'), 'model': FieldInfo(annotation=str, required=False, default='all-MiniLM-L6-v2'), 'name': FieldInfo(annotation=str, required=False, default='duckdb')}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- name: str
Name of the database.
curategpt.store.duckdb_adapter module
This is a DuckDB adapter for the Vector Similarity Search (VSS) extension using the experimental persistence feature
- class curategpt.store.duckdb_adapter.DuckDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', ef_construction=128, ef_search=64, M=16, distance_metric='cosine', id_field='id', text_lookup='text', id_to_object=<factory>, openai_client=None, **_kwargs)
Bases: DBAdapter
- M: int = 16
- collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for the collection.
- Parameters:
collection_name (Optional[str])
include_derived
kwargs
- Return type:
Optional[Metadata]
- conn: DuckDBPyConnection
- create_index(collection)
Create an index for the given collection.
- Parameters:
collection
- default_max_document_length: ClassVar[int] = 6000
- default_model: str = 'all-MiniLM-L6-v2'
- static determine_fields_to_include(include=None)
Determine which fields to include in the SQL query based on the ‘include’ parameter.
- Parameters:
include (Optional[List[str]]) – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]
- Return type:
str
- Returns:
Comma-separated string of fields to include
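A helper of this shape might look like the sketch below. Only the three field names come from the description above; always prepending an id column, defaulting to all three fields, and rejecting unknown names are assumptions, and fields_to_include is a hypothetical name.

```python
from typing import List, Optional

ALLOWED = ("metadata", "embeddings", "documents")


def fields_to_include(include: Optional[List[str]] = None) -> str:
    """Build the comma-separated column list for a SELECT statement."""
    if include is None:
        include = list(ALLOWED)  # assumption: default to every field
    unknown = [name for name in include if name not in ALLOWED]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    # Assumption: the id column is always selected.
    return ", ".join(["id", *include])


all_fields = fields_to_include()
docs_only = fields_to_include(["documents"])
```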
- distance_metric: str = 'cosine'
- dump_then_load(collection=None, target=None)
Dump the collection to a file and then load it into the target adapter.
- Parameters:
collection (Optional[str])
target (Optional[DBAdapter])
- ef_construction: int = 128
- ef_search: int = 64
- fetch_all_objects_memory_safe(collection=None, batch_size=100, include=None, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- find(where=None, projection=None, collection=None, include=None, limit=10, **kwargs)
Find objects in the collection that match the given query and projection.
- Parameters:
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None]) – the query to filter the results
projection (Union[str, List[str], None])
collection (Optional[str]) – name of the collection to search
include – fields to be included in output
limit (int) – maximum number of results to return
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- get_raw_objects(collection)
Get all raw objects in the collection as they were inserted into the database.
- Parameters:
collection
- Return type:
Iterator[Dict]
- id_field: str = 'id'
- id_to_object: Mapping[str, dict]
- identifier_field(collection=None)
- Return type:
str
- insert(objs, **kwargs)
Insert objects into the collection.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
kwargs
- static kill_process(pid)
Kill the process with the given PID.
- list_collection_names()
List the names of all collections in the database.
- lookup(id, collection=None, include=None, **kwargs)
Lookup an object by its id.
- Parameters:
id (str) – ID of the object to lookup
collection (Optional[str]) – Name of the collection to search
include – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]
kwargs
- Return type:
Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
- matches(obj, include=None, **kwargs)
Find objects in the collection that match the given object.
- Parameters:
obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
include
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- name: ClassVar[str] = 'duckdb'
- openai_client: OpenAI = None
- static parse_duckdb_result(results, include)
Parse the results from the SQL.
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
DuckDBSearchResultIterator
- peek(collection=None, limit=5, include=None, offset=0, **kwargs)
Peek at the first N objects in the collection.
- Parameters:
collection (Optional[str])
limit
include
offset (int)
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- static populate_venomx(collection, model, distance, object_type, embeddings_dimension)
Populate venomx with data currently given when inserting
- Parameters:
collection (Optional[str])
model (Optional[str])
distance (str)
object_type (str)
embeddings_dimension (int)
- Return type:
- Returns:
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove the collection from the database.
- Parameters:
collection (Optional[str])
exists_ok
kwargs
- search(text, where=None, collection=None, limit=10, relevance_factor=None, include=None, **kwargs)
Search for objects in the collection that match the given text.
- Parameters:
text (str)
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
collection (Optional[str])
limit (int)
relevance_factor (Optional[float])
include
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for the collection.
- Parameters:
collection_name (Optional[str])
metadata (Metadata)
kwargs
- text_lookup: Union[str, Callable, None] = 'text'
- update(objs, **kwargs)
Update objects in the collection.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
kwargs
- update_collection_metadata(collection, **kwargs)
Update the metadata for a collection. This function will merge new metadata provided via kwargs with existing metadata, if any, ensuring that only the specified fields are updated.
- Parameters:
collection (str)
kwargs
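The merge described for update_collection_metadata can be pictured with plain dictionaries: fields named in kwargs are overwritten, everything else is carried over. merge_metadata is a hypothetical helper for illustration; the adapter itself works with Metadata objects.

```python
from typing import Any, Dict, Optional


def merge_metadata(existing: Optional[Dict[str, Any]], **updates: Any) -> Dict[str, Any]:
    """Return a copy of `existing` with only the given fields replaced."""
    merged = dict(existing or {})  # tolerate a collection with no metadata yet
    merged.update(updates)
    return merged


current = {"object_type": "Person", "hnsw_space": "cosine", "object_count": 10}
updated = merge_metadata(current, object_count=11)
created = merge_metadata(None, object_type="Person")
```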
- update_or_create_venomx(venomx, collection, model, distance, object_type, embeddings_dimension)
Updates an existing Index instance (venomx) with additional values or creates a new one if none is provided.
- Return type:
- upsert(objs, **kwargs)
Upsert objects into the collection.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
kwargs
- vec_dimension: int
curategpt.store.duckdb_connection_handler module
- class curategpt.store.duckdb_connection_handler.DuckDBConnectionAndRecoveryHandler(path)
Bases: object
- close()
Safely close the database connection.
- Return type:
None
- connect()
Establish database connection with error handling and recovery.
Workflow as described in: https://duckdb.org/docs/extensions/vss.html#persistence
In case of any WAL related issue:
- Create a temporary workspace (in-memory database with VSS)
- Temporarily bring in the broken database (ATTACH)
- Fix it (WAL recovery happens)
- Save changes (CHECKPOINT)
- Put the fixed database back (DETACH)
- Clean up our temporary workspace (close)
- Now safely open the fixed database normally
- Return type:
DuckDBPyConnection
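The recovery steps listed for connect() map onto a short sequence of SQL statements run from a separate in-memory session. The sketch below only builds that sequence as strings so it can be inspected without a database; the exact statements the handler issues are not shown in this reference, and recovery_statements, the alias, and the example path are hypothetical. The setting name follows the DuckDB VSS documentation.

```python
from typing import List


def recovery_statements(db_path: str, alias: str = "recovery_db") -> List[str]:
    """SQL to run in a fresh in-memory DuckDB session to replay a broken WAL."""
    quoted = db_path.replace("'", "''")  # escape single quotes for the SQL literal
    return [
        "INSTALL vss;",  # make the extension available
        "LOAD vss;",  # HNSW support must be loaded before WAL replay
        "SET hnsw_enable_experimental_persistence = true;",
        f"ATTACH '{quoted}' AS {alias};",  # WAL recovery happens on attach
        f"CHECKPOINT {alias};",  # persist the recovered state
        f"DETACH {alias};",  # release the file before reopening it normally
    ]


statements = recovery_statements("data/my.duckdb")
```

After these statements succeed, the temporary session would be closed and the database opened in the usual way.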
curategpt.store.duckdb_result module
- class curategpt.store.duckdb_result.DuckDBSearchResult(**data)
Bases: BaseModel
- distances: Optional[float]
- documents: Optional[str]
- embeddings: Optional[List[float]]
- ids: Optional[str]
- include: Optional[Set[str]]
- metadatas: Optional[Dict[str, Any]]
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'distances': FieldInfo(annotation=Union[float, NoneType], required=False, default=0), 'documents': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'embeddings': FieldInfo(annotation=Union[List[float], NoneType], required=False, default=None), 'ids': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'include': FieldInfo(annotation=Union[Set[str], NoneType], required=False, default=None), 'metadatas': FieldInfo(annotation=Union[Dict[str, Any], NoneType], required=False, default=None)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- to_dict()
- to_json(indent=2)
curategpt.store.in_memory_adapter module
Simple default adapter for an object store.
- class curategpt.store.in_memory_adapter.Collection(**data)
Bases: BaseModel
- add(object)
- Return type:
None
- add_metadata(venomx)
- Return type:
None
- delete(key_value, key)
- Return type:
None
- metadata: Dict
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'metadata': FieldInfo(annotation=Dict, required=False, default={}), 'objects': FieldInfo(annotation=List[Dict], required=False, default=[])}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- objects: List[Dict]
- class curategpt.store.in_memory_adapter.CollectionIndex(**data)
Bases: BaseModel
- collections: Dict[str, Collection]
- get_collection(name)
- Return type:
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'collections': FieldInfo(annotation=Dict[str, curategpt.store.in_memory_adapter.Collection], required=False, default={})}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- class curategpt.store.in_memory_adapter.InMemoryAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, collection_index=<factory>)
Bases: DBAdapter
Simple in-memory adapter for an object store.
- collection_index: CollectionIndex
- collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for a collection.
- Parameters:
collection_name (Optional[str])
include_derived – Include derived metadata, e.g. counts
- Return type:
Optional[Metadata]
- delete(id, collection=None, **kwargs)
Delete an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- find(where=None, projection=None, collection=None, **kwargs)
Query the database.
- Parameters:
where
projection
collection (Optional[str])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- insert(objs, collection=None, **kwargs)
Insert an object or list of objects into the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
- list_collection_names()
List all collections in the database.
- Return type:
List[str]
- lookup(id, collection=None, **kwargs)
Lookup an object by its ID.
- Parameters:
id (str)
collection (Optional[str])
- Return type:
Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
- matches(obj, **kwargs)
Query the database for matches to an object.
- Parameters:
obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- name: ClassVar[str] = 'in_memory'
- peek(collection=None, limit=5, **kwargs)
Peek at first N objects in a collection.
- Parameters:
collection (Optional[str])
limit
- Return type:
Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
- static populate_venomx(collection, model=None, distance=None, object_type=None, embeddings_dimension=None, index_fields=None)
Populate venomx with data currently given when inserting
- Parameters:
collection (Optional[str])
model (Optional[str])
distance (Optional[str])
object_type (Optional[str])
embeddings_dimension (Optional[int])
index_fields (Union[List[str], Tuple[str], None])
- Return type:
- Returns:
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove a collection from the database.
- Parameters:
collection (Optional[str])
- search(text, where=None, collection=None, **kwargs)
Query the database for a text string.
- Parameters:
text (str)
collection (Optional[str])
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for a collection.
- Parameters:
collection_name (Optional[str])
- update(objs, collection=None, **kwargs)
Update an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
- update_collection_metadata(collection_name, **kwargs)
Update the metadata for a collection.
- Parameters:
collection_name (str)
kwargs
- Return type:
- Returns:
- upsert(objs, collection=None, **kwargs)
Upsert an object or list of objects in the store.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (Optional[str])
curategpt.store.metadata module
- class curategpt.store.metadata.Metadata(**data)
Bases: BaseModel
- classmethod deserialize_venomx_metadata_from_adapter(metadata_dict, adapter)
Create a Metadata instance from adapter-specific metadata dictionary.
ChromaDB: _venomx is deserialized back into venomx (str to dict).
DuckDB: venomx is accessed directly as a nested object.
- Parameters:
metadata_dict (dict) – Metadata dictionary from the adapter.
adapter (str) – Adapter name (e.g., ‘chroma’, ‘duckdb’).
- Return type:
Dict
- Returns:
Metadata instance.
- hnsw_space: Optional[str]
Space used for hnsw index (e.g. ‘cosine’)
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'hnsw_space': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'object_count': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'object_type': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'venomx': FieldInfo(annotation=Union[Index, NoneType], required=False, default=None)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- model_post_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Return type:
None
- Args:
self: The BaseModel instance.
context: The context.
- object_count: Optional[int]
- object_type: Optional[str]
Type of object in the collection
- serialize_venomx_metadata_for_adapter(adapter)
Convert the Metadata instance to a dictionary suitable for the specified adapter.
ChromaDB: venomx is serialized into _venomx before storing (dict to str).
DuckDB: venomx remains as an Index object without serialization.
- Parameters:
adapter (str) – Adapter name (e.g., ‘chroma’, ‘duckdb’).
- Return type:
dict
- Returns:
Metadata dictionary.
- venomx: Optional[Index]
Retains the complex venomx Index object for internal application use. Index is the main object of venomx: https://github.com/cmungall/venomx
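The two adapter-specific behaviours described for Metadata (a JSON string stored under _venomx for ChromaDB, a nested object for DuckDB) can be sketched with plain dictionaries. The function names below are hypothetical, the venomx payload is made-up toy data, and accepting both 'chroma' and 'chromadb' as adapter names is an assumption.

```python
import json
from typing import Any, Dict

CHROMA_NAMES = {"chroma", "chromadb"}  # assumption: either spelling is accepted


def serialize_for_adapter(metadata: Dict[str, Any], adapter: str) -> Dict[str, Any]:
    """Flatten venomx to a JSON string for ChromaDB; leave it nested otherwise."""
    out = dict(metadata)
    if adapter in CHROMA_NAMES:
        venomx = out.pop("venomx", None)
        if venomx is not None:
            out["_venomx"] = json.dumps(venomx)  # dict to str
    return out


def deserialize_from_adapter(metadata_dict: Dict[str, Any], adapter: str) -> Dict[str, Any]:
    """Inverse of serialize_for_adapter."""
    out = dict(metadata_dict)
    if adapter in CHROMA_NAMES and "_venomx" in out:
        out["venomx"] = json.loads(out.pop("_venomx"))  # str to dict
    return out


meta = {
    "object_type": "Person",
    "venomx": {"id": "people", "embedding_model": {"name": "all-MiniLM-L6-v2"}},
}
chroma = serialize_for_adapter(meta, "chromadb")
duck = serialize_for_adapter(meta, "duckdb")
restored = deserialize_from_adapter(chroma, "chromadb")
```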
curategpt.store.schema_proxy module
- class curategpt.store.schema_proxy.SchemaProxy(schema_source=None, _pydantic_root_model=None, _schemaview=None)
Bases:
object
Manage connection to a schema
- json_schema()
Get the JSON schema translation of the schema.
- Return type:
Dict
- Returns:
- model_config = {'protected_namespaces': ()}
- property name: str | None
Get the name of the schema.
- Returns:
- property pydantic_root_model: BaseModel
Get the pydantic root model.
If none is set, then generate it from the schema.
- Returns:
- property schema: SchemaDefinition
Get the schema
- Returns:
-
schema_source:
Union
[str
,Path
,SchemaDefinition
] = None
- property schemaview: SchemaView
Get the schema view.
- Returns:
curategpt.store.vocab module
Module contents
Adapters for different document stores.
Base class:
DBAdapter
Implementations are currently provided for ChromaDB and DuckDB. In future this may include:
MongoDB
ElasticSearch
Solr
Postgres
SQLite
Note: this package may become an independent project called linkml-store in the future.
- class curategpt.store.ChromaDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', client=None, id_field='id', text_lookup='text', id_to_object=<factory>, **_kwargs)
Bases:
DBAdapter
An Adapter that wraps a ChromaDB client
-
client:
ClientAPI
= None
- collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for a collection.
- Parameters:
collection_name (
Optional
[str
])- Return type:
Optional
[Metadata
]- Returns:
Parameters
- collections()
Return the names of all collections in the database.
- Return type:
Iterator
[str
]- Returns:
-
default_max_document_length:
ClassVar
[int
] = 6000
-
default_model:
str
= 'all-MiniLM-L6-v2'
- diversified_search(text=None, limit=None, relevance_factor=0.5, collection=None, **kwargs)
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]
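The reference above does not say how relevance_factor is applied. One common way to trade relevance against diversity is maximal marginal relevance (MMR); the sketch below shows that technique on plain Python lists. It is an assumption made for illustration, not a description of this method's internals, and the names cosine and mmr_select are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    limit: int,
    relevance_factor: float = 0.5,
) -> List[int]:
    """Return candidate indices chosen greedily by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        # Reward similarity to the query, penalise similarity to earlier picks.
        relevance = cosine(query, candidates[i])
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return relevance_factor * relevance - (1 - relevance_factor) * redundancy

    while remaining and len(selected) < limit:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical two-dimensional embeddings, made up for the example.
query = [0.9, 0.5]
candidates = [[1.0, 0.0], [0.98, 0.2], [0.2, 0.98]]
balanced = mmr_select(query, candidates, limit=2, relevance_factor=0.5)
relevance_only = mmr_select(query, candidates, limit=2, relevance_factor=1.0)
```

With relevance_factor=1.0 the sketch reduces to a plain nearest-neighbour ranking; lower values push near-duplicate candidates down the list.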
- dump_then_load(collection=None, target=None)
Dump a collection to a file, then load it into another database.
- Parameters:
collection (
Optional
[str
])target (
Optional
[DBAdapter
])
- Returns:
- fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]
- find(where=None, projection=None, collection=None, **kwargs)
Query the database.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
- Parameters:
collection (
Optional
[str
])where (
Union
[str
,YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,None
])projection (
Union
[str
,List
[str
],None
])kwargs
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]- Returns:
-
id_field:
str
= 'id'
-
id_to_object:
Mapping
[str
,Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]
- insert(objs, **kwargs)
Insert an object or list of objects into the store.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
objs (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,Iterable
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]])collection
- Returns:
- list_collection_names()
List all collections in the database.
- Return type:
List
[str
]- Returns:
- lookup(id, collection=None, **kwargs)
Lookup an object by its ID.
- Parameters:
id (
str
)collection (
Optional
[str
])
- Return type:
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]- Returns:
- matches(obj, **kwargs)
Query the database for matches to an object.
- Parameters:
obj (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
])kwargs
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]- Returns:
-
name:
ClassVar
[str
] = 'chromadb'
- peek(collection=None, limit=5, offset=0, **kwargs)
Peek at first N objects in a collection.
- Parameters:
collection (
Optional
[str
])limit
- Return type:
Iterator
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]- Returns:
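As a rough illustration of how limit and offset select a window of objects, the following sketch applies the same idea to any Python iterable. toy_peek is a hypothetical helper written for this example; it does not call the adapter.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def toy_peek(objs: Iterable[T], limit: int = 5, offset: int = 0) -> Iterator[T]:
    """Yield at most `limit` items, skipping the first `offset` items."""
    return islice(objs, offset, offset + limit)


first_five = list(toy_peek(range(20)))
window = list(toy_peek(range(20), limit=3, offset=10))
```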
- static populate_venomx(collection, model, existing_venomx)
Populate venomx with data currently given when inserting
- Parameters:
collection (
Optional
[str
])model (
Optional
[str
])
existing_venomx
- Return type:
Index
- Returns:
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove a collection from the database.
- Parameters:
collection (
Optional
[str
])exists_ok
- Returns:
- reset()
Reset/delete the database.
- search(text, **kwargs)
Query the database for a text string.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")
...
NeuronOfTheForebrain 0.28
...
- Parameters:
text (
str
)collection
where
kwargs
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]- Returns:
tuple of object, distance, metadata
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for a collection.
-
text_lookup:
Union
[str
,Callable
,None
] = 'text'
- update(objs, **kwargs)
Update an object or list of objects in the store.
- Parameters:
objs (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,List
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]])collection
- Returns:
- update_collection_metadata(collection_name, **kwargs)
Update the metadata for a collection based on the adapter.
- Parameters:
collection_name (
str
) – Name of the collection.kwargs – Additional metadata fields.
- Return type:
- Returns:
Updated Metadata instance.
- upsert(objs, **kwargs)
Upsert an object or list of objects in the store.
- Parameters:
objs (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,List
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]])collection
- Returns:
- class curategpt.store.DBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None)
Bases:
ABC
Base class for stores.
This base class provides a common interface for a wide variety of document or object stores. The interface is intended to closely mimic the kind of interface found for document stores such as MongoDB or vector databases such as ChromaDB, but the intention is that it can also be used for SQL databases, SPARQL endpoints, or even file systems.
The store allows for storage and retrieval of objects, which are arbitrary dictionary objects, each equivalent to a JSON object.
Objects are partitioned into collections, which maps to the equivalent concept in MongoDB and ChromaDB.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert({"name": "John", "age": 42}, collection="people")
If you are used to working with MongoDB and ChromaDB APIs directly, one difference is that here we do not provide a separate Collection object, everything is handled through the store object. You can optionally bind a store object to a collection, which effectively gives you a collection object:
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert({"name": "John", "age": 42})
TODO: decide if this is the final interface
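To make the collection-binding idea concrete without depending on a running database, here is a minimal stand-in that mimics only the behaviour described above: one store object, an optional default collection, and a per-call override. ToyStore and its internals are hypothetical and are not the curategpt implementation.

```python
from typing import Dict, List, Optional


class ToyStore:
    """Hypothetical in-memory stand-in for a store with collection binding."""

    def __init__(self) -> None:
        self.collection: Optional[str] = None  # default collection, if bound
        self._data: Dict[str, List[dict]] = {}

    def set_collection(self, collection: str) -> None:
        # Bind the store so that later calls may omit collection=...
        self.collection = collection

    def insert(self, obj: dict, collection: Optional[str] = None) -> None:
        # An explicit argument overrides the bound default.
        name = collection or self.collection
        if name is None:
            raise ValueError("no collection given and none bound")
        self._data.setdefault(name, []).append(obj)

    def list_collection_names(self) -> List[str]:
        return list(self._data)


store = ToyStore()
store.set_collection("people")
store.insert({"name": "John", "age": 42})          # goes to the bound "people"
store.insert({"name": "Fido"}, collection="pets")  # explicit override
```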
-
collection:
Optional
[str
] = None Default collection
- abstract collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for a collection.
- Parameters:
collection_name (
Optional
[str
])include_derived – Include derived metadata, e.g. counts
- Return type:
Optional
[Metadata
]- Returns:
- create_view(view_name, collection, expression, **kwargs)
Create a view in the database.
Todo:
- param view:
- return:
- delete(id, collection=None, **kwargs)
Delete an object by its ID.
- Parameters:
id (
str
)collection (
Optional
[str
])
- Returns:
- dump(collection=None, to_file=None, metadata_to_file=None, format=None, include=None, **kwargs)
Dump the database to a file.
- Parameters:
collection (
Optional
[str
])kwargs
- Returns:
- dump_then_load(collection=None, target=None)
Dump a collection to a file, then load it into another database.
- Parameters:
collection (
Optional
[str
])target (
Optional
[DBAdapter
])
- Returns:
- abstract fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]
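The batching idea can be sketched as a generator that pulls one page at a time and yields objects individually, so only batch_size objects are held at once. This is a generic pattern under stated assumptions: fetch_page is a hypothetical callable standing in for whatever paged query a backend offers.

```python
from typing import Callable, Iterator, List


def fetch_in_batches(
    fetch_page: Callable[[int, int], List[dict]], batch_size: int = 100
) -> Iterator[dict]:
    """Yield every object, requesting `batch_size` objects per page."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            break  # an empty page signals the end of the collection
        yield from page
        offset += len(page)


# Hypothetical backing data for the example.
rows = [{"id": str(i)} for i in range(250)]


def fetch_page(offset: int, limit: int) -> List[dict]:
    return rows[offset : offset + limit]


fetched = list(fetch_in_batches(fetch_page, batch_size=100))
```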
- field_names(collection=None)
Return the names of all top level fields in the database for a collection.
- Parameters:
collection (
Optional
[str
])- Return type:
List
[str
]- Returns:
- find(where=None, projection=None, collection=None, **kwargs)
Query the database.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
- Parameters:
collection (
Optional
[str
])where (
Union
[str
,YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,None
])projection (
Union
[str
,List
[str
],None
])kwargs
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]- Returns:
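The roles of where and projection can be illustrated on a plain list of dictionaries. The sketch assumes exact-match semantics for where and yields bare objects rather than the result tuples listed above; both are simplifications, and toy_find is a hypothetical helper.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Union


def toy_find(
    objs: Iterable[Dict],
    where: Optional[Dict] = None,
    projection: Union[str, List[str], None] = None,
) -> Iterator[Dict]:
    """Yield objects whose fields equal every key/value pair in `where`."""
    if isinstance(projection, str):
        projection = [projection]
    for obj in objs:
        if where and any(obj.get(k) != v for k, v in where.items()):
            continue  # at least one condition failed
        if projection:
            # Keep only the requested fields that are present.
            yield {k: obj[k] for k in projection if k in obj}
        else:
            yield dict(obj)


# Hypothetical records; the ids are made up for the example.
cells = [
    {"id": "x1", "name": "NeuronOfTheForebrain", "kind": "neuron"},
    {"id": "x2", "name": "Astrocyte", "kind": "glia"},
]
hits = list(toy_find(cells, {"name": "NeuronOfTheForebrain"}, projection=["id", "name"]))
```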
- identifier_field(collection=None)
- Return type:
str
- abstract insert(objs, collection=None, **kwargs)
Insert an object or list of objects into the store.
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
objs (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,Iterable
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]])collection (
Optional
[str
])
- Returns:
- label_field(collection=None)
- Return type:
str
- abstract list_collection_names()
List all collections in the database.
- Return type:
List
[str
]- Returns:
names of collections
- abstract lookup(id, collection=None, **kwargs)
Lookup an object by its ID.
- Parameters:
id (
str
)collection (
Optional
[str
])
- Return type:
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]- Returns:
- lookup_multiple(ids, **kwargs)
Look up multiple objects by their IDs.
- Parameters:
ids
collection
- Return type:
Iterator
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]- Returns:
- abstract matches(obj, **kwargs)
Query the database for matches to an object.
- Parameters:
obj (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
])kwargs
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]- Returns:
-
name:
ClassVar
[str
] = 'base'
-
path:
str
= None Path to a location where the database is stored on disk or the network.
- abstract peek(collection=None, limit=5, **kwargs)
Peek at first N objects in a collection.
- Parameters:
collection (
Optional
[str
])limit
- Return type:
Iterator
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]- Returns:
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove a collection from the database.
- Parameters:
collection (
Optional
[str
])- Returns:
-
schema_proxy:
Optional
[SchemaProxy
] = None Schema manager
- abstract search(text, where=None, collection=None, **kwargs)
Query the database for a text string.
>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")
...
NeuronOfTheForebrain 0.28
...
- Parameters:
text (
str
)collection (
Optional
[str
])where (
Union
[str
,YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,None
])kwargs
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]- Returns:
tuple of object, distance, metadata
- set_collection(collection)
Set the current collection.
If this is set, then all subsequent operations will be performed on this collection, unless overridden.
This allows the following
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert([{"name": "John", "age": 42}])
to be written in place of
>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
- Parameters:
collection (
str
)- Returns:
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for a collection.
>>> from curategpt.store import get_store
>>> from curategpt.store import Metadata
>>> store = get_store("in_memory")
>>> md = store.collection_metadata("people")
>>> md.venomx.id == "People"
>>> md.venomx.embedding_model.name == "openai:"
>>> store.set_collection_metadata("people", md)
- Parameters:
collection_name (
Optional
[str
])- Returns:
- update(objs, collection=None, **kwargs)
Update an object or list of objects in the store.
- Parameters:
objs (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,List
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]])collection (
Optional
[str
])
- Returns:
- update_collection_metadata(collection_name, **kwargs)
Update the metadata for a collection.
- Parameters:
collection_name (
str
)kwargs
- Return type:
- Returns:
- upsert(objs, collection=None, **kwargs)
Upsert an object or list of objects in the store.
- Parameters:
objs (
Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,List
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]])collection (
Optional
[str
])
- Returns:
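Upsert semantics (insert when the identifier is new, otherwise update) can be sketched over a dictionary keyed by the identifier field. The sketch replaces a stored object wholesale on update; whether a real adapter merges or replaces is not stated here, so treat that choice as an assumption. toy_upsert is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple, Union


def toy_upsert(
    table: Dict[str, dict],
    objs: Union[dict, Iterable[dict]],
    id_field: str = "id",
) -> Tuple[int, int]:
    """Insert new objects and overwrite existing ones; return (inserted, updated)."""
    if isinstance(objs, dict):
        objs = [objs]  # accept a single object as well as a list
    inserted = updated = 0
    for obj in objs:
        key = obj[id_field]
        if key in table:
            updated += 1
        else:
            inserted += 1
        table[key] = dict(obj)  # assumption: replace, do not merge
    return inserted, updated


table: Dict[str, dict] = {}
first = toy_upsert(table, [{"id": "p1", "name": "John", "age": 42}])
second = toy_upsert(table, [{"id": "p1", "name": "John", "age": 43}, {"id": "p2", "name": "Ann"}])
```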
- class curategpt.store.DuckDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', ef_construction=128, ef_search=64, M=16, distance_metric='cosine', id_field='id', text_lookup='text', id_to_object=<factory>, openai_client=None, **_kwargs)
Bases:
DBAdapter
-
M:
int
= 16
- collection_metadata(collection_name=None, include_derived=False, **kwargs)
Get the metadata for the collection.
- Parameters:
collection_name (Optional[str])
include_derived
kwargs
- Return type:
Optional[Metadata]
- Returns:
-
conn:
DuckDBPyConnection
- create_index(collection)
Create an index for the given collection.
- Parameters:
collection
- Returns:
-
default_max_document_length:
ClassVar
[int
] = 6000
-
default_model:
str
= 'all-MiniLM-L6-v2'
- static determine_fields_to_include(include=None)
Determine which fields to include in the SQL query based on the ‘include’ parameter.
- Parameters:
include (
Optional
[List
[str
]]) – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]- Return type:
str
- Returns:
Comma-separated string of fields to include
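A minimal sketch of turning an include list into a comma-separated field string follows. The three allowed names come from the parameter description above; the default of selecting all three and the validation step are assumptions, and toy_fields_to_include is a hypothetical name.

```python
from typing import List, Optional

ALLOWED_FIELDS = ("metadata", "embeddings", "documents")


def toy_fields_to_include(include: Optional[List[str]] = None) -> str:
    """Return a comma-separated field list for use in a SELECT clause."""
    if not include:
        include = list(ALLOWED_FIELDS)  # assumption: no filter means all fields
    unknown = [name for name in include if name not in ALLOWED_FIELDS]
    if unknown:
        # Rejecting unknown names also keeps arbitrary text out of the SQL string.
        raise ValueError(f"unknown fields: {unknown}")
    return ", ".join(include)


some_fields = toy_fields_to_include(["metadata", "documents"])
all_fields = toy_fields_to_include()
```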
-
distance_metric:
str
= 'cosine'
- dump_then_load(collection=None, target=None)
Dump the collection to a file and then load it into the target adapter.
- Parameters:
collection (Optional[str])
target (Optional[DBAdapter])
temp_file
format
- Returns:
-
ef_construction:
int
= 128
-
ef_search:
int
= 64
- fetch_all_objects_memory_safe(collection=None, batch_size=100, include=None, **kwargs)
Fetch all objects from a collection, in batches to avoid memory overload.
- Return type:
Iterator
[Union
[YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
]]
- find(where=None, projection=None, collection=None, include=None, limit=10, **kwargs)
Find objects in the collection that match the given query and projection
- Parameters:
where (
Union
[str
,YAMLRoot
,BaseModel
,Dict
,DuckDBSearchResult
,None
]) – the query to filter the resultsprojection (
Union
[str
,List
[str
],None
])collection (
Optional
[str
]) – name of the collection to searchinclude – fields to be included in output
limit (
int
) – maximum number of results to returnkwargs
- Return type:
Iterator
[Tuple
[DuckDBSearchResult
,Dict
,float
,Optional
[Dict
]]]- Returns:
Parameters
- get_raw_objects(collection)
Get all raw objects in the collection as they were inserted into the database.
- Parameters:
collection
- Return type:
Iterator[Dict]
- Returns:
-
id_field:
str
= 'id'
-
id_to_object:
Mapping
[str
,dict
]
- identifier_field(collection=None)
- Return type:
str
- insert(objs, **kwargs)
Insert objects into the collection.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
kwargs
- Returns:
- static kill_process(pid)
Kill the process with the given PID.
- Returns:
- list_collection_names()
List the names of all collections in the database.
- Returns:
- lookup(id, collection=None, include=None, **kwargs)
Lookup an object by its id.
- Parameters:
id (str) – ID of the object to lookup
collection (Optional[str]) – Name of the collection to search
include – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]
kwargs
- Return type:
Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
- Returns:
- matches(obj, include=None, **kwargs)
Find objects in the collection that match the given object.
- Parameters:
obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
include
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
-
name:
ClassVar
[str
] = 'duckdb'
-
openai_client:
OpenAI
= None
- static parse_duckdb_result(results, include)
Parse the results from the SQL.
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
DuckDBSearchResultIterator
- peek(collection=None, limit=5, include=None, offset=0, **kwargs)
Peek at the first N objects in the collection.
- Parameters:
collection (Optional[str])
limit
include
offset (int)
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
- static populate_venomx(collection, model, distance, object_type, embeddings_dimension)
Populate venomx with data currently given when inserting
- Parameters:
collection (
Optional
[str
])model (
Optional
[str
])distance (
str
)object_type (
str
)embeddings_dimension (
int
)
- Return type:
- Returns:
- remove_collection(collection=None, exists_ok=False, **kwargs)
Remove the collection from the database.
- Parameters:
collection (Optional[str])
exists_ok
kwargs
- Returns:
- search(text, where=None, collection=None, limit=10, relevance_factor=None, include=None, **kwargs)
Search for objects in the collection that match the given text.
- Parameters:
text (str)
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])
collection (Optional[str])
limit (int)
relevance_factor (Optional[float])
include
kwargs
- Return type:
Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
- Returns:
- set_collection_metadata(collection_name, metadata, **kwargs)
Set the metadata for the collection.
- Parameters:
collection_name (Optional[str])
metadata (Metadata)
kwargs
- Returns:
-
text_lookup:
Union
[str
,Callable
,None
] = 'text'
- update(objs, **kwargs)
Update objects in the collection.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
kwargs
- Returns:
- update_collection_metadata(collection, **kwargs)
Update the metadata for a collection. This function will merge new metadata provided via kwargs with existing metadata, if any, ensuring that only the specified fields are updated.
- Parameters:
collection (str)
kwargs
- Returns:
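The merge behaviour described for this method (new values from kwargs overlaid on existing metadata, other fields untouched) can be sketched with plain dictionaries. merge_metadata is a hypothetical helper, and the real method works on Metadata instances rather than dicts.

```python
from typing import Optional


def merge_metadata(existing: Optional[dict], **kwargs) -> dict:
    """Return a copy of `existing` with only the given fields overwritten."""
    merged = dict(existing or {})  # tolerate a collection with no metadata yet
    merged.update(kwargs)
    return merged


current = {"object_type": "Person", "object_count": 10, "hnsw_space": "cosine"}
updated = merge_metadata(current, object_count=11)
fresh = merge_metadata(None, object_type="Person")
```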
- update_or_create_venomx(venomx, collection, model, distance, object_type, embeddings_dimension)
Updates an existing Index instance (venomx) with additional values or creates a new one if none is provided.
- Return type:
- upsert(objs, **kwargs)
Upsert objects into the collection.
- Parameters:
objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
kwargs
- Returns:
-
vec_dimension:
int
- class curategpt.store.Metadata(**data)
Bases:
BaseModel
- classmethod deserialize_venomx_metadata_from_adapter(metadata_dict, adapter)
Create a Metadata instance from adapter-specific metadata dictionary.
ChromaDB: _venomx is deserialized back into venomx (str to dict).
DuckDB: venomx is accessed directly as a nested object.
- Parameters:
metadata_dict (dict) – Metadata dictionary from the adapter.
adapter (str) – Adapter name (e.g., ‘chroma’, ‘duckdb’).
- Return type:
Dict
- Returns:
Metadata instance.
-
hnsw_space:
Optional
[str
] Space used for hnsw index (e.g. ‘cosine’)
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'hnsw_space': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'object_count': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'object_type': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'venomx': FieldInfo(annotation=Union[Index, NoneType], required=False, default=None)}
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- model_post_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Return type:
None
- Args:
self: The BaseModel instance. context: The context.
-
object_count:
Optional
[int
]
-
object_type:
Optional
[str
] Type of object in the collection
- serialize_venomx_metadata_for_adapter(adapter)
Convert the Metadata instance to a dictionary suitable for the specified adapter.
ChromaDB: venomx is serialized into _venomx before storing (dict to str).
DuckDB: venomx remains as an Index object without serialization.
- Parameters:
adapter (str) – Adapter name (e.g., ‘chroma’, ‘duckdb’).
- Return type:
dict
- Returns:
Metadata dictionary.
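The ChromaDB round trip described above (a nested venomx value flattened to a string under _venomx, then restored) can be sketched with plain dictionaries. JSON is assumed as the string encoding, and to_chroma_metadata / from_chroma_metadata are hypothetical names; the actual methods operate on Metadata and Index objects.

```python
import json


def to_chroma_metadata(metadata: dict) -> dict:
    """Flatten the nested 'venomx' value into a string stored under '_venomx'."""
    out = {k: v for k, v in metadata.items() if k != "venomx"}
    if metadata.get("venomx") is not None:
        out["_venomx"] = json.dumps(metadata["venomx"])  # assumption: JSON encoding
    return out


def from_chroma_metadata(stored: dict) -> dict:
    """Restore the nested 'venomx' value from the '_venomx' string."""
    out = {k: v for k, v in stored.items() if k != "_venomx"}
    if "_venomx" in stored:
        out["venomx"] = json.loads(stored["_venomx"])
    return out


# Hypothetical metadata values for the example.
md = {
    "object_type": "Person",
    "hnsw_space": "cosine",
    "venomx": {"id": "people", "embedding_model": {"name": "all-MiniLM-L6-v2"}},
}
stored = to_chroma_metadata(md)
restored = from_chroma_metadata(stored)
```

Flattening is useful when a backend accepts only scalar metadata values; a backend that stores nested objects can skip both steps.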
-
venomx:
Optional
[Index
] Retains the complex venomx Index object for internal application use. Index is the main object of venomx https://github.com/cmungall/venomx
- class curategpt.store.SchemaProxy(schema_source=None, _pydantic_root_model=None, _schemaview=None)
Bases:
object
Manage connection to a schema
- json_schema()
Get the JSON schema translation of the schema.
- Return type:
Dict
- Returns:
- model_config = {'protected_namespaces': ()}
- property name: str | None
Get the name of the schema.
- Returns:
- property pydantic_root_model: BaseModel
Get the pydantic root model.
If none is set, then generate it from the schema.
- Returns:
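The "generate it if none is set" behaviour is a lazy-initialisation pattern. The sketch below shows that pattern in isolation; LazyModelHolder and build_model are hypothetical, and the real property derives a pydantic model from the schema rather than calling an injected builder.

```python
from typing import Callable, List, Optional


class LazyModelHolder:
    """Hypothetical holder that builds its model on first access only."""

    def __init__(self, build: Callable[[], type], model: Optional[type] = None) -> None:
        self._build = build
        self._model = model  # may be supplied up front, like _pydantic_root_model

    @property
    def root_model(self) -> type:
        if self._model is None:
            self._model = self._build()  # generated once, then cached
        return self._model


calls: List[int] = []


def build_model() -> type:
    calls.append(1)  # record each (potentially expensive) build
    return type("Root", (), {})


holder = LazyModelHolder(build_model)
first = holder.root_model
second = holder.root_model
```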
- property schema: SchemaDefinition
Get the schema
- Returns:
-
schema_source:
Union
[str
,Path
,SchemaDefinition
] = None
- property schemaview: SchemaView
Get the schema view.
- Returns: