curate_gpt.store package

Submodules

curate_gpt.store.chromadb_adapter module

ChromaDB adapter.

class curate_gpt.store.chromadb_adapter.ChromaDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, client=None, id_field='id', text_lookup='text', id_to_object=<factory>, **_kwargs)

Bases: DBAdapter

An Adapter that wraps a ChromaDB client

client: ClientAPI = None
collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for a collection.

Parameters:
  • collection_name (Optional[str])

  • include_derived – Include derived metadata, e.g. counts

Return type:

Optional[CollectionMetadata]

Returns:

collections()

Return the names of all collections in the database.

Return type:

Iterator[str]

Returns:

default_max_document_length: ClassVar[int] = 6000
default_model = 'all-MiniLM-L6-v2'
Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

dump_then_load(collection=None, target=None)

Dump a collection to a file, then load it into another database.

Parameters:
  • collection (Optional[str])

  • target (Optional[DBAdapter])

Returns:

fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

find(where=None, projection=None, collection=None, **kwargs)

Query the database.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
Parameters:
  • collection (Optional[str])

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • projection (Union[str, List[str], None])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

id_field: str = 'id'
id_to_object: Mapping[str, Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
insert(objs, **kwargs)

Insert an object or list of objects into the store.

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
Parameters:
Returns:

list_collection_names()

List all collections in the database.

Return type:

List[str]

Returns:

lookup(id, collection=None, **kwargs)

Lookup an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

Returns:

matches(obj, **kwargs)

Query the database for matches to an object.

Parameters:
Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

name: ClassVar[str] = 'chromadb'
peek(collection=None, limit=5, offset=0, **kwargs)

Peek at first N objects in a collection.

Parameters:
  • collection (Optional[str])

  • limit

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove a collection from the database.

Parameters:
  • collection (Optional[str])

  • exists_ok

Returns:

reset()

Reset/delete the database.

search(text, **kwargs)

Query the database for a text string.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")

...
NeuronOfTheForebrain 0.28
...
Parameters:
  • text (str)

  • collection

  • where

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

tuple of object, distance, metadata

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for a collection.

Parameters:
Returns:

text_lookup: Union[str, Callable, None] = 'text'
update(objs, **kwargs)

Update an object or list of objects in the store.

Parameters:
Returns:

update_collection_metadata(collection_name, **kwargs)

Update the metadata for a collection.

Parameters:
  • collection_name (str)

  • kwargs

Return type:

CollectionMetadata

Returns:

upsert(objs, **kwargs)

Upsert an object or list of objects in the store.

Parameters:
Returns:

curate_gpt.store.db_adapter module

Abstract DB adapter.

class curate_gpt.store.db_adapter.DBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None)

Bases: ABC

Base class for stores.

This base class provides a common interface for a wide variety of document or object stores. The interface is intended to closely mimic the kind of interface found for document stores such as MongoDB or vector databases such as ChromaDB, but the intention is that it can also be used for SQL databases, SPARQL endpoints, or even file systems.

The store allows for storage and retrieval of objects, which are arbitrary dictionary objects, each equivalent to a JSON object.

Objects are partitioned into collections, which map to the equivalent concept in MongoDB and ChromaDB.

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert({"name": "John", "age": 42}, collection="people")

If you are used to working with the MongoDB and ChromaDB APIs directly, one difference is that here we do not provide a separate Collection object; everything is handled through the store object. You can optionally bind a store object to a collection, which effectively gives you a collection object:

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert({"name": "John", "age": 42})
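
The binding behaviour described above can be pictured with a small stand-in class. This is a minimal sketch and not the curate_gpt implementation: TinyStore and its internals are hypothetical, and only the method names insert and set_collection are taken from the interface documented here.

```python
from typing import Dict, List, Optional, Union

Obj = Dict[str, object]


class TinyStore:
    """Hypothetical stand-in for a store; not part of curate_gpt."""

    def __init__(self) -> None:
        self.collection: Optional[str] = None  # default collection, if bound
        self._data: Dict[str, List[Obj]] = {}

    def set_collection(self, collection: str) -> None:
        # Bind the store so that later calls may omit collection=...
        self.collection = collection

    def _resolve(self, collection: Optional[str]) -> str:
        # An explicit argument overrides the bound default.
        name = collection or self.collection
        if name is None:
            raise ValueError("no collection given and none bound")
        return name

    def insert(self, objs: Union[Obj, List[Obj]], collection: Optional[str] = None) -> None:
        # Accept a single object or a list of objects.
        batch = [objs] if isinstance(objs, dict) else list(objs)
        self._data.setdefault(self._resolve(collection), []).extend(batch)


store = TinyStore()
store.set_collection("people")
store.insert({"name": "John", "age": 42})           # goes to the bound "people"
store.insert([{"name": "Rex"}], collection="pets")  # explicit override
```

The design choice this illustrates is that a single store object plays both roles: unbound it behaves like a database handle, and bound it behaves like a collection handle.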

TODO: decide if this is the final interface

collection: Optional[str] = None

Default collection

abstract collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for a collection.

Parameters:
  • collection_name (Optional[str])

  • include_derived – Include derived metadata, e.g. counts

Return type:

Optional[CollectionMetadata]

Returns:

create_view(view_name, collection, expression, **kwargs)

Create a view in the database.

Todo:

param view:

return:

delete(id, collection=None, **kwargs)

Delete an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Returns:

dump(collection=None, to_file=None, metadata_to_file=None, format=None, include=None, **kwargs)

Dump the database to a file.

Parameters:
  • collection (Optional[str])

  • kwargs

Returns:

dump_then_load(collection=None, target=None)

Dump a collection to a file, then load it into another database.

Parameters:
  • collection (Optional[str])

  • target (Optional[DBAdapter])

Returns:
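
As a rough illustration of the dump-then-load pattern, the sketch below writes one collection to a temporary JSON file and reads it back into a second store. Plain dictionaries stand in for the source and target adapters, and JSON as the intermediate format is an assumption; the format the real method uses is not stated here.

```python
import json
import os
import tempfile
from typing import Dict, List

Obj = Dict[str, object]


def dump_then_load_sketch(
    source: Dict[str, List[Obj]],
    target: Dict[str, List[Obj]],
    collection: str,
) -> int:
    """Write one collection to a temporary file, then read it into target."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(source[collection], handle)  # dump step
        with open(path, encoding="utf-8") as handle:
            loaded = json.load(handle)             # load step
    finally:
        os.remove(path)  # always clean up the temporary file
    target.setdefault(collection, []).extend(loaded)
    return len(loaded)


source = {"people": [{"name": "John", "age": 42}]}
target: Dict[str, List[Obj]] = {}
copied = dump_then_load_sketch(source, target, "people")
```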

abstract fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
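
One way to picture batched fetching is a generator that pulls one page at a time and yields its objects before requesting the next page. The sketch below assumes a hypothetical fetch_page(offset, limit) callable; how each adapter actually pages through its backend is not described here.

```python
from typing import Callable, Dict, Iterator, List

Obj = Dict[str, object]


def fetch_all_in_batches(
    fetch_page: Callable[[int, int], List[Obj]],
    batch_size: int = 100,
) -> Iterator[Obj]:
    """Yield every object while holding at most one page in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)  # hypothetical page reader
        if not page:
            break  # an empty page means the collection is exhausted
        yield from page
        offset += len(page)


# Toy backing data standing in for a real collection.
rows = [{"id": str(i)} for i in range(250)]
requested_offsets: List[int] = []


def fetch_page(offset: int, limit: int) -> List[Obj]:
    requested_offsets.append(offset)
    return rows[offset : offset + limit]


fetched = list(fetch_all_in_batches(fetch_page, batch_size=100))
```

Because the function is a generator, a caller that stops iterating early never triggers the remaining page requests.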

field_names(collection=None)

Return the names of all top level fields in the database for a collection.

Parameters:

collection (Optional[str])

Return type:

List[str]

Returns:

find(where=None, projection=None, collection=None, **kwargs)

Query the database.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
Parameters:
  • collection (Optional[str])

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • projection (Union[str, List[str], None])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:
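
To make the where and projection arguments concrete, here is a minimal sketch of dictionary-based filtering over a list of objects. Exact-match semantics and the helper name toy_find are assumptions made for illustration; real adapters may support richer query operators, and the object IDs below are invented.

```python
from typing import Dict, Iterator, List, Optional, Union

Obj = Dict[str, object]


def toy_find(
    objects: List[Obj],
    where: Optional[Obj] = None,
    projection: Union[str, List[str], None] = None,
) -> Iterator[Obj]:
    """Yield objects whose fields equal every key/value pair in `where`."""
    fields = [projection] if isinstance(projection, str) else projection
    for obj in objects:
        if where and any(obj.get(key) != value for key, value in where.items()):
            continue  # at least one field differs
        if fields:
            # With a projection, keep only the requested fields.
            yield {key: obj[key] for key in fields if key in obj}
        else:
            yield obj


cells = [
    {"id": "x1", "name": "NeuronOfTheForebrain"},
    {"id": "x2", "name": "Hepatocyte"},
]
hits = list(toy_find(cells, {"name": "NeuronOfTheForebrain"}, projection=["id"]))
everything = list(toy_find(cells))
```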

identifier_field(collection=None)
Return type:

str

abstract insert(objs, collection=None, **kwargs)

Insert an object or list of objects into the store.

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
Parameters:
Returns:

label_field(collection=None)
Return type:

str

abstract list_collection_names()

List all collections in the database.

Return type:

List[str]

Returns:

names of collections

abstract lookup(id, collection=None, **kwargs)

Lookup an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

Returns:

lookup_multiple(ids, **kwargs)

Lookup multiple objects by their IDs.

Parameters:
  • ids

  • collection

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

abstract matches(obj, **kwargs)

Query the database for matches to an object.

Parameters:
Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

name: ClassVar[str] = 'base'
path: str = None

Path to a location where the database is stored on disk or the network.

abstract peek(collection=None, limit=5, **kwargs)

Peek at first N objects in a collection.

Parameters:
  • collection (Optional[str])

  • limit

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove a collection from the database.

Parameters:

collection (Optional[str])

Returns:

schema_proxy: Optional[SchemaProxy] = None

Schema manager

abstract search(text, where=None, collection=None, **kwargs)

Query the database for a text string.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")

...
NeuronOfTheForebrain 0.28
...
Parameters:
  • text (str)

  • collection (Optional[str])

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

tuple of object, distance, metadata
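
The shape of the search results, tuples of object, distance and metadata ordered by distance, can be sketched with a toy nearest-neighbour ranking. Everything below is illustrative: the two-dimensional vectors are made up, toy_search is a hypothetical helper, and a real adapter would obtain vectors from an embedding model and an index instead of a linear scan.

```python
import math
from typing import Dict, Iterator, List, Optional, Tuple

Obj = Dict[str, object]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def toy_search(
    query_vec: List[float],
    indexed: List[Tuple[Obj, List[float]]],
    limit: int = 10,
) -> Iterator[Tuple[Obj, float, Optional[Dict]]]:
    """Yield (object, distance, metadata) tuples, closest first."""
    scored = [(obj, cosine_distance(query_vec, vec)) for obj, vec in indexed]
    scored.sort(key=lambda pair: pair[1])
    for obj, distance in scored[:limit]:
        yield obj, distance, None  # no extra metadata in this sketch


indexed = [
    ({"id": "Hepatocyte"}, [0.0, 1.0]),
    ({"id": "NeuronOfTheForebrain"}, [1.0, 0.0]),
]
results = list(toy_search([0.9, 0.1], indexed))
```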

set_collection(collection)

Set the current collection.

If this is set, then all subsequent operations will be performed on this collection, unless overridden.

This allows the following

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert([{"name": "John", "age": 42}])

to be written in place of

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
Parameters:

collection (str)

Returns:

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for a collection.

>>> from curate_gpt.store import get_store
>>> from curate_gpt.store import CollectionMetadata
>>> store = get_store("in_memory")
>>> cm = CollectionMetadata(name="People", description="People in the database")
>>> store.set_collection_metadata("people", cm)
Parameters:

collection_name (Optional[str])

Returns:

update(objs, collection=None, **kwargs)

Update an object or list of objects in the store.

Parameters:
Returns:

update_collection_metadata(collection_name, **kwargs)

Update the metadata for a collection.

Parameters:
  • collection_name (str)

  • kwargs

Return type:

CollectionMetadata

Returns:

upsert(objs, collection=None, **kwargs)

Upsert an object or list of objects in the store.

Parameters:
Returns:
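
The difference between update and upsert can be shown with a short sketch: an upsert replaces an object whose ID is already present and appends it otherwise. Whole-object replacement and the helper name toy_upsert are assumptions made for illustration; the documentation above only states that objects are upserted.

```python
from typing import Dict, List

Obj = Dict[str, object]


def toy_upsert(collection: List[Obj], objs: List[Obj], id_field: str = "id") -> None:
    """Replace objects whose ID already exists; append the rest."""
    position = {obj[id_field]: i for i, obj in enumerate(collection)}
    for obj in objs:
        key = obj[id_field]
        if key in position:
            collection[position[key]] = obj   # update path
        else:
            position[key] = len(collection)   # insert path
            collection.append(obj)


people = [{"id": "p1", "name": "John", "age": 42}]
toy_upsert(
    people,
    [
        {"id": "p1", "name": "John", "age": 43},  # existing ID: replaced
        {"id": "p2", "name": "Ann", "age": 30},   # new ID: appended
    ],
)
```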

curate_gpt.store.db_metadata module

class curate_gpt.store.db_metadata.DBSettings(**data)

Bases: BaseModel

M: int

M parameter for hnsw index

ef_construction: int

Construction parameter for hnsw index. Higher values are more accurate but slower.

ef_search: int

Search parameter for hnsw index. Higher values are more accurate but slower.

hnsw_space: str

Space used for hnsw index (e.g. ‘cosine’).

load_config(path)
model: str

Name of any embedding model

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'M': FieldInfo(annotation=int, required=False, default=16), 'ef_construction': FieldInfo(annotation=int, required=False, default=128), 'ef_search': FieldInfo(annotation=int, required=False, default=64), 'hnsw_space': FieldInfo(annotation=str, required=False, default='cosine'), 'model': FieldInfo(annotation=str, required=False, default='all-MiniLM-L6-v2'), 'name': FieldInfo(annotation=str, required=False, default='duckdb')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

name: str

Name of the database.
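
The documented defaults can be summarised in a small settings object. The sketch below uses a standard-library dataclass as a stand-in for the pydantic model, so SettingsSketch is hypothetical; the field names and default values are copied from the model_fields listing above.

```python
from dataclasses import asdict, dataclass


@dataclass
class SettingsSketch:
    """Stdlib stand-in mirroring the documented DBSettings defaults."""

    name: str = "duckdb"
    model: str = "all-MiniLM-L6-v2"
    hnsw_space: str = "cosine"
    ef_construction: int = 128
    ef_search: int = 64
    M: int = 16


defaults = SettingsSketch()
# Trade some speed for accuracy by raising both ef parameters.
accurate = SettingsSketch(ef_construction=256, ef_search=128)
as_mapping = asdict(defaults)
```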

curate_gpt.store.duckdb_adapter module

This is a DuckDB adapter for the Vector Similarity Search (VSS) extension, using the experimental persistence feature.

class curate_gpt.store.duckdb_adapter.DuckDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', ef_construction=128, ef_search=64, M=16, distance_metric='cosine', id_field='id', text_lookup='text', id_to_object=<factory>, openai_client=None, **_kwargs)

Bases: DBAdapter

M: int = 16
collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for the collection.

Parameters:
  • collection_name (Optional[str])

  • include_derived

  • kwargs

Return type:

Optional[CollectionMetadata]

conn: DuckDBPyConnection
create_index(collection)

Create an index for the given collection.

Parameters:

collection

default_max_document_length: ClassVar[int] = 6000
default_model: str = 'all-MiniLM-L6-v2'
static determine_fields_to_include(include=None)

Determine which fields to include in the SQL query based on the ‘include’ parameter.

Parameters:

include (Optional[List[str]]) – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]

Return type:

str

Returns:

Comma-separated string of fields to include
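
A helper of this kind might look like the sketch below. It is written under stated assumptions and is not the adapter's code: the function name fields_to_include, the behaviour when include is None (all three fields), and the rejection of unknown names are all guesses made for illustration.

```python
from typing import List, Optional

# The three field names listed in the parameter description above.
ALLOWED = ("metadata", "embeddings", "documents")


def fields_to_include(include: Optional[List[str]] = None) -> str:
    """Return a comma-separated field list suitable for a SELECT clause."""
    if not include:
        chosen = list(ALLOWED)  # assumption: no filter means every field
    else:
        unknown = [name for name in include if name not in ALLOWED]
        if unknown:
            raise ValueError(f"unsupported fields: {unknown}")
        # Keep a stable order regardless of how the caller ordered the list.
        chosen = [name for name in ALLOWED if name in include]
    return ", ".join(chosen)


all_fields = fields_to_include()
some_fields = fields_to_include(["documents", "metadata"])
```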

distance_metric: str = 'cosine'
dump_then_load(collection=None, target=None)

Dump the collection to a file and then load it into the target adapter.

Parameters:
  • collection (Optional[str])

  • target (Optional[DBAdapter])

  • temp_file

  • format

ef_construction: int = 128
fetch_all_objects_memory_safe(collection=None, batch_size=100, include=None, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

find(where=None, projection=None, collection=None, include=None, limit=10, **kwargs)

Find objects in the collection that match the given query and projection

Parameters:
  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None]) – the query to filter the results

  • projection (Union[str, List[str], None])

  • collection (Optional[str]) – name of the collection to search

  • include – fields to be included in output

  • limit (int) – maximum number of results to return

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

get_raw_objects(collection)

Get all raw objects in the collection as they were inserted into the database.

Parameters:

collection

Return type:

Iterator[Dict]

id_field: str = 'id'
id_to_object: Mapping[str, dict]
identifier_field(collection=None)
Return type:

str

insert(objs, **kwargs)

Insert objects into the collection.

Parameters:
  • objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])

  • kwargs

static kill_process(pid)

Kill the process with the given PID.

list_collection_names()

List the names of all collections in the database.

lookup(id, collection=None, include=None, **kwargs)

Lookup an object by its ID.

Parameters:
  • id (str) – ID of the object to look up

  • collection (Optional[str]) – Name of the collection to search

  • include – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]

  • kwargs

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

matches(obj, include=None, **kwargs)

Find objects in the collection that match the given object.

Parameters:
  • obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])

  • include

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

name: ClassVar[str] = 'duckdb'
openai_client: OpenAI = None
static parse_duckdb_result(results, include)

Parse the results from the SQL.

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

DuckDBSearchResultIterator

peek(collection=None, limit=5, include=None, offset=0, **kwargs)

Peek at the first N objects in the collection.

Parameters:
  • collection (Optional[str])

  • limit

  • include

  • offset (int)

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove the collection from the database.

Parameters:
  • collection (Optional[str])

  • exists_ok

  • kwargs

search(text, where=None, collection=None, limit=10, relevance_factor=None, include=None, **kwargs)

Search for objects in the collection that match the given text.

Parameters:
  • text (str)

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • collection (Optional[str])

  • limit (int)

  • relevance_factor (Optional[float])

  • include

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for the collection.

Parameters:
  • collection_name (Optional[str])

  • metadata (CollectionMetadata)

  • kwargs

text_lookup: Union[str, Callable, None] = 'text'
update(objs, **kwargs)

Update objects in the collection.

Parameters:
  • objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])

  • kwargs

update_collection_metadata(collection, **kwargs)

Update the metadata for a collection.

This function will merge new metadata provided via kwargs with existing metadata, if any, ensuring that only the specified fields are updated.

Parameters:
  • collection (str)

  • kwargs
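
The merge behaviour described for this method can be sketched with plain dictionaries. The helper merge_metadata is hypothetical and only illustrates the stated rule that fields passed as keyword arguments are replaced while all other existing fields are kept.

```python
from typing import Any, Dict, Optional


def merge_metadata(existing: Optional[Dict[str, Any]], **updates: Any) -> Dict[str, Any]:
    """Return a copy of the existing metadata with only the given fields replaced."""
    merged = dict(existing or {})  # copy, so the stored metadata is not mutated
    merged.update(updates)
    return merged


current = {"name": "People", "description": "People in the database"}
updated = merge_metadata(current, description="Staff directory", object_count=2)
from_scratch = merge_metadata(None, name="Pets")
```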

upsert(objs, **kwargs)

Upsert objects into the collection.

Parameters:
  • objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])

  • kwargs

vec_dimension: int

curate_gpt.store.duckdb_result module

class curate_gpt.store.duckdb_result.DuckDBSearchResult(**data)

Bases: BaseModel

distances: Optional[float]
documents: Optional[str]
embeddings: Optional[List[float]]
ids: Optional[str]
include: Optional[Set[str]]
metadatas: Optional[Dict[str, Any]]
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'distances': FieldInfo(annotation=Union[float, NoneType], required=False, default=0), 'documents': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'embeddings': FieldInfo(annotation=Union[List[float], NoneType], required=False, default=None), 'ids': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'include': FieldInfo(annotation=Union[Set[str], NoneType], required=False, default=None), 'metadatas': FieldInfo(annotation=Union[Dict[str, Any], NoneType], required=False, default=None)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

to_dict()
to_json(indent=2)

curate_gpt.store.in_memory_adapter module

Simple default adapter for an object store.

class curate_gpt.store.in_memory_adapter.Collection(**data)

Bases: BaseModel

add(object)
Return type:

None

delete(key_value, key)
Return type:

None

metadata: Dict
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'metadata': FieldInfo(annotation=Dict, required=False, default={}), 'objects': FieldInfo(annotation=List[Dict], required=False, default=[])}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

objects: List[Dict]
class curate_gpt.store.in_memory_adapter.CollectionIndex(**data)

Bases: BaseModel

collections: Dict[str, Collection]
get_collection(name)
Return type:

Collection

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'collections': FieldInfo(annotation=Dict[str, curate_gpt.store.in_memory_adapter.Collection], required=False, default={})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

class curate_gpt.store.in_memory_adapter.InMemoryAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, collection_index=<factory>)

Bases: DBAdapter

Simple in-memory adapter for an object store.

collection_index: CollectionIndex
collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for a collection.

Parameters:
  • collection_name (Optional[str])

  • include_derived – Include derived metadata, e.g. counts

Return type:

Optional[CollectionMetadata]

Returns:

delete(id, collection=None, **kwargs)

Delete an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Returns:

fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

find(where=None, projection=None, collection=None, **kwargs)

Query the database.

Parameters:
  • where

  • projection

  • collection (Optional[str])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

insert(objs, collection=None, **kwargs)

Insert an object or list of objects into the store.

Parameters:
Returns:

list_collection_names()

List all collections in the database.

Return type:

List[str]

Returns:

lookup(id, collection=None, **kwargs)

Lookup an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

Returns:

matches(obj, **kwargs)

Query the database for matches to an object.

Parameters:
Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

name: ClassVar[str] = 'in_memory'
peek(collection=None, limit=5, **kwargs)

Peek at first N objects in a collection.

Parameters:
  • collection (Optional[str])

  • limit

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove a collection from the database.

Parameters:

collection (Optional[str])

Returns:

search(text, where=None, collection=None, **kwargs)

Query the database for a text string.

Parameters:
  • text (str)

  • collection (Optional[str])

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for a collection.

Parameters:

collection_name (Optional[str])

Returns:

update(objs, collection=None, **kwargs)

Update an object or list of objects in the store.

Parameters:
Returns:

update_collection_metadata(collection_name, **kwargs)

Update the metadata for a collection.

Parameters:
  • collection_name (str)

  • kwargs

Return type:

CollectionMetadata

Returns:

upsert(objs, collection=None, **kwargs)

Upsert an object or list of objects in the store.

Parameters:
Returns:

curate_gpt.store.metadata module

class curate_gpt.store.metadata.CollectionMetadata(**data)

Bases: BaseModel

Metadata about a collection.

This is an open class, so additional metadata can be added.

annotations: Optional[Dict]

Additional metadata

description: Optional[str]

Description of the collection

hnsw_space: Optional[str]

Space used for hnsw index (e.g. ‘cosine’)

model: Optional[str]

Name of any ML model

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'annotations': FieldInfo(annotation=Union[Dict, NoneType], required=False, default=None), 'description': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'hnsw_space': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'model': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'object_count': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'object_type': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'source': FieldInfo(annotation=Union[str, NoneType], required=False, default=None)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

name: Optional[str]

Name of the collection

object_count: Optional[int]

Number of objects in the collection

object_type: Optional[str]

Type of object in the collection

source: Optional[str]

Source of the collection

curate_gpt.store.schema_proxy module

class curate_gpt.store.schema_proxy.SchemaProxy(schema_source=None, _pydantic_root_model=None, _schemaview=None)

Bases: object

Manage connection to a schema

json_schema()

Get the JSON schema translation of the schema.

Return type:

Dict

model_config = {'protected_namespaces': ()}
property name: str | None

Get the name of the schema.

Returns:

property pydantic_root_model: BaseModel

Get the pydantic root model.

If none is set, then generate it from the schema.

Returns:
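
The "generate it if none is set" behaviour is a lazy-initialisation pattern, sketched below with a generic holder class. LazyModelHolder and the build callable are hypothetical, and whether SchemaProxy caches the generated model after the first access is an assumption suggested by the _pydantic_root_model constructor argument.

```python
from typing import Callable, List, Optional


class LazyModelHolder:
    """Hypothetical holder that builds a value on first access, then reuses it."""

    def __init__(self, build: Callable[[], object], preset: Optional[object] = None) -> None:
        self._build = build
        self._model = preset  # an explicitly supplied model wins over generation

    @property
    def root_model(self) -> object:
        if self._model is None:
            self._model = self._build()  # generated once, on demand
        return self._model


build_calls: List[int] = []


def build() -> object:
    build_calls.append(1)
    return {"generated": True}


holder = LazyModelHolder(build)
first = holder.root_model
second = holder.root_model
preset_holder = LazyModelHolder(build, preset={"generated": False})
```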

property schema: SchemaDefinition

Get the schema

Returns:

schema_source: Union[str, Path, SchemaDefinition] = None
property schemaview: SchemaView

Get the schema view.

Returns:

curate_gpt.store.vocab module

Module contents

Adapters for different document stores.

Currently only one implementation is provided, for ChromaDB. In the future this will include:

  • MongoDB

  • ElasticSearch

  • Solr

  • Postgres

  • SQLite

Note: this package may become an independent project called linkml-store in the future.

class curate_gpt.store.ChromaDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, client=None, id_field='id', text_lookup='text', id_to_object=<factory>, **_kwargs)

Bases: DBAdapter

An Adapter that wraps a ChromaDB client

client: ClientAPI = None
collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for a collection.

Parameters:
  • collection_name (Optional[str])

  • include_derived – Include derived metadata, e.g. counts

Return type:

Optional[CollectionMetadata]

Returns:

collections()

Return the names of all collections in the database.

Return type:

Iterator[str]

Returns:

default_max_document_length: ClassVar[int] = 6000
default_model = 'all-MiniLM-L6-v2'
Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

dump_then_load(collection=None, target=None)

Dump a collection to a file, then load it into another database.

Parameters:
  • collection (Optional[str])

  • target (Optional[DBAdapter])

Returns:

fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

find(where=None, projection=None, collection=None, **kwargs)

Query the database.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
Parameters:
  • collection (Optional[str])

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • projection (Union[str, List[str], None])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

id_field: str = 'id'
id_to_object: Mapping[str, Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
insert(objs, **kwargs)

Insert an object or list of objects into the store.

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
Parameters:
Returns:

list_collection_names()

List all collections in the database.

Return type:

List[str]

Returns:

lookup(id, collection=None, **kwargs)

Lookup an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

Returns:

matches(obj, **kwargs)

Query the database for matches to an object.

Parameters:
Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

name: ClassVar[str] = 'chromadb'
peek(collection=None, limit=5, offset=0, **kwargs)

Peek at first N objects in a collection.

Parameters:
  • collection (Optional[str])

  • limit

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove a collection from the database.

Parameters:
  • collection (Optional[str])

  • exists_ok

Returns:

reset()

Reset/delete the database.

search(text, **kwargs)

Query the database for a text string.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")

...
NeuronOfTheForebrain 0.28
...
Parameters:
  • text (str)

  • collection

  • where

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

tuple of object, distance, metadata

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for a collection.

Parameters:
Returns:

text_lookup: Union[str, Callable, None] = 'text'
update(objs, **kwargs)

Update an object or list of objects in the store.

Parameters:
Returns:

update_collection_metadata(collection_name, **kwargs)

Update the metadata for a collection.

Parameters:
  • collection_name (str)

  • kwargs

Return type:

CollectionMetadata

Returns:

upsert(objs, **kwargs)

Upsert an object or list of objects in the store.

Parameters:
Returns:

class curate_gpt.store.CollectionMetadata(**data)

Bases: BaseModel

Metadata about a collection.

This is an open class, so additional metadata can be added.

annotations: Optional[Dict]

Additional metadata

description: Optional[str]

Description of the collection

hnsw_space: Optional[str]

Space used for hnsw index (e.g. ‘cosine’)

model: Optional[str]

Name of any ML model

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'annotations': FieldInfo(annotation=Union[Dict, NoneType], required=False, default=None), 'description': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'hnsw_space': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'model': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'object_count': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'object_type': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'source': FieldInfo(annotation=Union[str, NoneType], required=False, default=None)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

name: Optional[str]

Name of the collection

object_count: Optional[int]

Number of objects in the collection

object_type: Optional[str]

Type of object in the collection

source: Optional[str]

Source of the collection

class curate_gpt.store.DBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None)

Bases: ABC

Base class for stores.

This base class provides a common interface for a wide variety of document or object stores. The interface is intended to closely mimic the kind of interface found in document stores such as MongoDB or vector databases such as ChromaDB, but the intention is that it can also be used for SQL databases, SPARQL endpoints, or even file systems.

The store allows for storage and retrieval of objects, which are arbitrary dictionary objects, equivalent to JSON objects.

Objects are partitioned into collections, which map to the equivalent concept in MongoDB and ChromaDB.

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert({"name": "John", "age": 42}, collection="people")

If you are used to working with the MongoDB and ChromaDB APIs directly, one difference is that there is no separate Collection object here; everything is handled through the store object. You can optionally bind a store object to a collection, which effectively gives you a collection object:

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert({"name": "John", "age": 42})
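
To make the binding behaviour concrete, here is a small self-contained sketch of a store with a default collection that an explicit `collection` argument can override. `TinyStore` is a hypothetical stand-in written for illustration; it is not the curate_gpt implementation and shares only the method names shown in this reference.

```python
from typing import Dict, List, Optional


class TinyStore:
    """Hypothetical in-memory stand-in, not the curate_gpt implementation."""

    def __init__(self) -> None:
        self.collection: Optional[str] = None  # default (bound) collection
        self._data: Dict[str, List[dict]] = {}

    def set_collection(self, collection: str) -> None:
        """Bind the store to a collection used when none is given."""
        self.collection = collection

    def insert(self, obj: dict, collection: Optional[str] = None) -> None:
        # An explicit argument overrides the bound default.
        name = collection or self.collection
        if name is None:
            raise ValueError("no collection given and none bound")
        self._data.setdefault(name, []).append(obj)

    def list_collection_names(self) -> List[str]:
        return list(self._data)


store = TinyStore()
store.set_collection("people")
store.insert({"name": "John", "age": 42})         # goes to "people"
store.insert({"name": "Rex"}, collection="pets")  # explicit override
```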

TODO: decide if this is the final interface

collection: Optional[str] = None

Default collection

abstract collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for a collection.

Parameters:
  • collection_name (Optional[str])

  • include_derived – Include derived metadata, e.g. counts

Return type:

Optional[CollectionMetadata]

Returns:

create_view(view_name, collection, expression, **kwargs)

Create a view in the database.

Todo: document the parameters and return value.

delete(id, collection=None, **kwargs)

Delete an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Returns:

dump(collection=None, to_file=None, metadata_to_file=None, format=None, include=None, **kwargs)

Dump the database to a file.

Parameters:
  • collection (Optional[str])

  • kwargs

Returns:

dump_then_load(collection=None, target=None)

Dump a collection to a file, then load it into another database.

Parameters:
  • collection (Optional[str])

  • target (Optional[DBAdapter])

Returns:

abstract fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
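
One way to read "in batches to avoid memory overload" is a generator that pulls a fixed-size slice per step and yields objects one at a time. The sketch below illustrates that pattern over a list; `fetch_in_batches` and the slicing that stands in for a paged query are assumptions for illustration, not the adapter's actual code.

```python
from typing import Dict, Iterator, List


def fetch_in_batches(rows: List[Dict], batch_size: int = 100) -> Iterator[Dict]:
    """Yield rows one at a time while reading only `batch_size` rows per step."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    offset = 0
    while True:
        # The slice stands in for a paged (limit/offset) read from a backend.
        batch = rows[offset:offset + batch_size]
        if not batch:
            return
        yield from batch
        offset += batch_size


# Hypothetical data: 250 objects are read as batches of 100, 100 and 50.
rows = [{"id": str(i)} for i in range(250)]
ids = [obj["id"] for obj in fetch_in_batches(rows, batch_size=100)]
```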

field_names(collection=None)

Return the names of all top level fields in the database for a collection.

Parameters:

collection (Optional[str])

Return type:

List[str]

Returns:

find(where=None, projection=None, collection=None, **kwargs)

Query the database.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))
Parameters:
  • collection (Optional[str])

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • projection (Union[str, List[str], None])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:
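
The doctest above passes a dictionary as `where`. Assuming that such a dictionary means equality on each listed field, and that `projection` names the fields to keep, the behaviour can be sketched over plain dictionaries as follows. `find_dicts` is a hypothetical helper; real adapters may support richer query operators and return result tuples rather than bare objects.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Union


def find_dicts(
    objs: Iterable[Dict],
    where: Optional[Dict] = None,
    projection: Optional[Union[str, List[str]]] = None,
) -> Iterator[Dict]:
    """Yield objects whose fields equal every key/value pair in `where`."""
    if isinstance(projection, str):
        projection = [projection]  # allow a single field name
    for obj in objs:
        if where and any(obj.get(k) != v for k, v in where.items()):
            continue  # at least one condition failed
        if projection is None:
            yield dict(obj)
        else:
            yield {k: obj[k] for k in projection if k in obj}


# Hypothetical data standing in for a collection.
people = [
    {"id": "p1", "name": "John", "age": 42},
    {"id": "p2", "name": "Ann", "age": 42},
    {"id": "p3", "name": "Bo", "age": 7},
]
same_age = list(find_dicts(people, where={"age": 42}, projection=["id"]))
ann = list(find_dicts(people, where={"name": "Ann"}))
```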

identifier_field(collection=None)
Return type:

str

abstract insert(objs, collection=None, **kwargs)

Insert an object or list of objects into the store.

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
Parameters:
Returns:

label_field(collection=None)
Return type:

str

abstract list_collection_names()

List all collections in the database.

Return type:

List[str]

Returns:

names of collections

abstract lookup(id, collection=None, **kwargs)

Lookup an object by its ID.

Parameters:
  • id (str)

  • collection (Optional[str])

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

Returns:

lookup_multiple(ids, **kwargs)

Look up multiple objects by their IDs.

Parameters:
  • ids

  • collection

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

abstract matches(obj, **kwargs)

Query the database for matches to an object.

Parameters:
Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

name: ClassVar[str] = 'base'
path: str = None

Path to a location where the database is stored, on disk or on the network.

abstract peek(collection=None, limit=5, **kwargs)

Peek at the first N objects in a collection.

Parameters:
  • collection (Optional[str])

  • limit

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove a collection from the database.

Parameters:

collection (Optional[str])

Returns:

schema_proxy: Optional[SchemaProxy] = None

Schema manager

abstract search(text, where=None, collection=None, **kwargs)

Query the database for a text string.

>>> from curate_gpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")

...
NeuronOfTheForebrain 0.28
...
Parameters:
  • text (str)

  • collection (Optional[str])

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

tuple of object, distance, metadata
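
The doctest above unpacks each result as (object, distance, info), with smaller distances printed for closer matches. As a rough illustration of how such a ranking can arise, the sketch below orders toy two-dimensional vectors by cosine distance. The vectors, `cosine_distance`, and `rank` are hypothetical; a real adapter computes embeddings with a model and uses an index rather than a full scan.

```python
import math
from typing import Dict, Iterator, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rank(
    query_vec: Sequence[float],
    indexed: List[Tuple[Dict, Sequence[float]]],
    limit: int = 10,
) -> Iterator[Tuple[Dict, float, Dict]]:
    """Yield (object, distance, info) tuples, closest first."""
    scored = [(obj, cosine_distance(query_vec, vec), {}) for obj, vec in indexed]
    scored.sort(key=lambda item: item[1])
    yield from scored[:limit]


# Toy vectors standing in for embeddings.
indexed = [
    ({"id": "A"}, [1.0, 0.0]),
    ({"id": "B"}, [0.0, 1.0]),
    ({"id": "C"}, [1.0, 1.0]),
]
results = list(rank([1.0, 0.0], indexed))
ranked_ids = [obj["id"] for obj, distance, info in results]
```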

set_collection(collection)

Set the current collection.

If this is set, then all subsequent operations will be performed on this collection, unless overridden.

This allows the following

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert([{"name": "John", "age": 42}])

to be written in place of

>>> from curate_gpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")
Parameters:

collection (str)

Returns:

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for a collection.

>>> from curate_gpt.store import get_store
>>> from curate_gpt.store import CollectionMetadata
>>> store = get_store("in_memory")
>>> cm = CollectionMetadata(name="People", description="People in the database")
>>> store.set_collection_metadata("people", cm)
Parameters:

collection_name (Optional[str])

Returns:

update(objs, collection=None, **kwargs)

Update an object or list of objects in the store.

Parameters:
Returns:

update_collection_metadata(collection_name, **kwargs)

Update the metadata for a collection.

Parameters:
  • collection_name (str)

  • kwargs

Return type:

CollectionMetadata

Returns:

upsert(objs, collection=None, **kwargs)

Upsert an object or list of objects in the store.

Parameters:
Returns:
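
In the usual sense of the term, an upsert inserts an object when its identifier is new and updates the stored object otherwise. The sketch below shows that behaviour over a dictionary keyed by ID. The `upsert` helper here is hypothetical, and it assumes that an update overwrites the stored object wholesale; whether a given adapter overwrites or merges is not stated in this reference.

```python
from typing import Dict, Iterable, Tuple, Union


def upsert(
    table: Dict[str, dict],
    objs: Union[dict, Iterable[dict]],
    id_field: str = "id",
) -> Tuple[int, int]:
    """Insert new objects and overwrite existing ones; return (inserted, updated)."""
    if isinstance(objs, dict):
        objs = [objs]  # accept a single object as well as a list
    inserted = updated = 0
    for obj in objs:
        key = obj[id_field]
        if key in table:
            updated += 1
        else:
            inserted += 1
        table[key] = dict(obj)  # assumption: overwrite rather than merge
    return inserted, updated


table: Dict[str, dict] = {}
first = upsert(table, {"id": "p1", "name": "John", "age": 42})
second = upsert(table, [{"id": "p1", "name": "John", "age": 43}, {"id": "p2", "name": "Ann"}])
```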

class curate_gpt.store.DuckDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', ef_construction=128, ef_search=64, M=16, distance_metric='cosine', id_field='id', text_lookup='text', id_to_object=<factory>, openai_client=None, **_kwargs)

Bases: DBAdapter

M: int = 16
collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for the collection.

Parameters:
  • collection_name (Optional[str])

  • include_derived

  • kwargs

Return type:

Optional[CollectionMetadata]

conn: DuckDBPyConnection
create_index(collection)

Create an index for the given collection.

Parameters:

collection

default_max_document_length: ClassVar[int] = 6000
default_model: str = 'all-MiniLM-L6-v2'
static determine_fields_to_include(include=None)

Determine which fields to include in the SQL query based on the ‘include’ parameter.

Parameters:

include (Optional[List[str]]) – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]

Return type:

str

Returns:

Comma-separated string of fields to include
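
As an illustration of turning an `include` list into a comma-separated field string, consider the sketch below. It is not the library's code: the default of returning all three fields when `include` is None, the fixed output order, and the rejection of unknown names are all assumptions made for this example.

```python
from typing import List, Optional

# Field names taken from the parameter description above.
ALLOWED_FIELDS = ("metadata", "embeddings", "documents")


def fields_to_include(include: Optional[List[str]] = None) -> str:
    """Return a comma-separated field list for use in a query."""
    if include is None:
        return ", ".join(ALLOWED_FIELDS)  # assumption: default to all fields
    unknown = [name for name in include if name not in ALLOWED_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    # Emit fields in a fixed order regardless of the order requested.
    return ", ".join(name for name in ALLOWED_FIELDS if name in include)


all_fields = fields_to_include()
some_fields = fields_to_include(["documents", "metadata"])
```

Validating against a fixed set of names before building the string keeps arbitrary text out of the query.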

distance_metric: str = 'cosine'
dump_then_load(collection=None, target=None)

Dump the collection to a file and then load it into the target adapter.

Parameters:
  • collection (Optional[str])

  • target (Optional[DBAdapter])

  • temp_file

  • format

ef_construction: int = 128
fetch_all_objects_memory_safe(collection=None, batch_size=100, include=None, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

find(where=None, projection=None, collection=None, include=None, limit=10, **kwargs)

Find objects in the collection that match the given query and projection

Parameters:
  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None]) – the query to filter the results

  • projection (Union[str, List[str], None])

  • collection (Optional[str]) – name of the collection to search

  • include – fields to be included in output

  • limit (int) – maximum number of results to return

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:


get_raw_objects(collection)

Get all raw objects in the collection as they were inserted into the database.

Parameters:

collection

Return type:

Iterator[Dict]

id_field: str = 'id'
id_to_object: Mapping[str, dict]
identifier_field(collection=None)
Return type:

str

insert(objs, **kwargs)

Insert objects into the collection.

Parameters:
  • objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])

  • kwargs

static kill_process(pid)

Kill the process with the given PID.

list_collection_names()

List the names of all collections in the database.

lookup(id, collection=None, include=None, **kwargs)

Lookup an object by its ID.

Parameters:
  • id (str) – ID of the object to look up

  • collection (Optional[str]) – Name of the collection to search

  • include – List of fields to include in the output [‘metadata’, ‘embeddings’, ‘documents’]

  • kwargs

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

matches(obj, include=None, **kwargs)

Find objects in the collection that match the given object.

Parameters:
  • obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])

  • include

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

name: ClassVar[str] = 'duckdb'
openai_client: OpenAI = None
static parse_duckdb_result(results, include)

Parse the results from the SQL query.

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

DuckDBSearchResultIterator

peek(collection=None, limit=5, include=None, offset=0, **kwargs)

Peek at the first N objects in the collection.

Parameters:
  • collection (Optional[str])

  • limit

  • include

  • offset (int)

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove the collection from the database.

Parameters:
  • collection (Optional[str])

  • exists_ok

  • kwargs

search(text, where=None, collection=None, limit=10, relevance_factor=None, include=None, **kwargs)

Search for objects in the collection that match the given text.

Parameters:
  • text (str)

  • where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult, None])

  • collection (Optional[str])

  • limit (int)

  • relevance_factor (Optional[float])

  • include

  • kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for the collection.

Parameters:
  • collection_name (Optional[str])

  • metadata (CollectionMetadata)

  • kwargs

text_lookup: Union[str, Callable, None] = 'text'
update(objs, **kwargs)

Update objects in the collection.

Parameters:
  • objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])

  • kwargs

update_collection_metadata(collection, **kwargs)

Update the metadata for a collection. This function will merge new metadata provided via kwargs with existing metadata, if any, ensuring that only the specified fields are updated.

Parameters:
  • collection (str)

  • kwargs
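
The merge described for this method, where only the fields passed as keyword arguments change, can be sketched with plain dictionaries. `merge_metadata` is a hypothetical helper used for illustration; the adapter itself works with CollectionMetadata objects and persists the result.

```python
from typing import Dict, Optional


def merge_metadata(existing: Optional[Dict], **kwargs) -> Dict:
    """Return a copy of `existing` with only the given fields replaced."""
    merged = dict(existing or {})  # tolerate a collection with no metadata yet
    merged.update(kwargs)
    return merged


current = {"name": "people", "description": "People in the database", "model": None}
updated = merge_metadata(current, description="Staff directory", object_count=2)
fresh = merge_metadata(None, name="pets")
```

Copying before updating leaves the original metadata untouched, so a failed write cannot leave a half-modified object behind.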

upsert(objs, **kwargs)

Upsert objects into the collection.

Parameters:
  • objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])

  • kwargs

vec_dimension: int
class curate_gpt.store.SchemaProxy(schema_source=None, _pydantic_root_model=None, _schemaview=None)

Bases: object

Manage connection to a schema

json_schema()

Get the JSON schema translation of the schema.

Return type:

Dict

model_config = {'protected_namespaces': ()}
property name: str | None

Get the name of the schema.

Returns:

property pydantic_root_model: BaseModel

Get the pydantic root model.

If none is set, then generate it from the schema.

Returns:

property schema: SchemaDefinition

Get the schema

Returns:

schema_source: Union[str, Path, SchemaDefinition] = None
property schemaview: SchemaView

Get the schema view.

Returns:

curate_gpt.store.get_store(name, *args, **kwargs)
Return type:

DBAdapter
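
Each adapter in this reference carries a `name` class attribute ('base', 'chromadb', 'duckdb'), and `get_store` takes a name plus constructor arguments. One plausible way to connect the two is a name-to-class registry, sketched below. This is an assumption about the design, not a description of how `get_store` is implemented; `Adapter`, `InMemoryAdapter`, `FileAdapter`, and `get_adapter` are hypothetical names.

```python
from typing import ClassVar, Dict, Optional, Type


class Adapter:
    """Hypothetical base class; subclasses identify themselves by `name`."""

    name: ClassVar[str] = "base"

    def __init__(self, path: Optional[str] = None) -> None:
        self.path = path


class InMemoryAdapter(Adapter):
    name: ClassVar[str] = "in_memory"


class FileAdapter(Adapter):
    name: ClassVar[str] = "file"


def get_adapter(name: str, *args, **kwargs) -> Adapter:
    """Instantiate the adapter subclass registered under `name`."""
    registry: Dict[str, Type[Adapter]] = {
        cls.name: cls for cls in Adapter.__subclasses__()
    }
    if name not in registry:
        raise ValueError(f"unknown store {name!r}; known: {sorted(registry)}")
    # Remaining arguments are forwarded to the adapter constructor.
    return registry[name](*args, **kwargs)


store = get_adapter("file", "db")
```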