curategpt package

Subpackages

Submodules

curategpt.cli module

Command line interface for curategpt.

Module contents

CurateGPT: A framework for semi-assisted curation of knowledge bases.

Architecture

store: JSON object stores that allow embedding-based search
wrappers: wraps external APIs and data sources for ingest
extract: extraction of JSON objects from LLMs
agents: agents that chain together search and generate components
formatters: formats data objects for presentation to humans and machine agents
app: Streamlit application

class curategpt.BasicExtractor(schema_proxy=None, model_name='gpt-4o', api_key=None, raise_error_if_unparsable=False, serialization_format='json')

Bases: Extractor

Extractor that is purely example driven.

deserialize(text, format=None, **kwargs)

Deserialize text into an annotated object

Parameters:: text (str)
Return type:: AnnotatedObject
Returns:
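
As a rough illustration of what deserializing model output can involve when the serialization format is JSON, the sketch below pulls a JSON object out of free text. The function `deserialize_json` and its fallback of returning an empty dict are assumptions made for this example; they are not taken from the curategpt source, and the real method returns an AnnotatedObject rather than a plain dict.

```python
import json
from typing import Any, Dict


def deserialize_json(text: str, raise_error_if_unparsable: bool = False) -> Dict[str, Any]:
    """Parse the span from the first '{' to the last '}' in `text`.

    Hypothetical helper: returns {} when nothing parseable is found,
    unless `raise_error_if_unparsable` is set.
    """
    start, end = text.find("{"), text.rfind("}")
    try:
        if start == -1 or end < start:
            raise ValueError("no JSON object found")
        return json.loads(text[start : end + 1])
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        if raise_error_if_unparsable:
            raise
        return {}


reply = 'Here is the object: {"name": "John", "age": 42} Let me know if it helps.'
parsed = deserialize_json(reply)
```

With `raise_error_if_unparsable=True` the same call raises instead of silently returning an empty result, which is usually preferable in batch pipelines where a silent miss would go unnoticed.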

deserialize_yaml(text, multiple=False)

Return type:: AnnotatedObject

extract(text, target_class, examples=None, background_text=None, rules=None, min_examples=1, **kwargs)

Schema-guided extraction

Parameters:

text (str)
kwargs

Return type:

AnnotatedObject

Returns:

model_config = {'protected_namespaces': ()}

model_name: str = 'gpt-4o'

serialization_format: str = 'json'

serialize(ao)

Return type:: str

class curategpt.ChromaDBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None, default_model='all-MiniLM-L6-v2', client=None, id_field='id', text_lookup='text', id_to_object=<factory>)

Bases: DBAdapter

An Adapter that wraps a ChromaDB client

client: ClientAPI = None

collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for a collection.

Parameters:: collection_name (Optional[str])
Return type:: Optional[Metadata]
Returns:


collections()

Return the names of all collections in the database.

Return type:: Iterator[str]
Returns:

default_max_document_length: ClassVar[int] = 6000

default_model: str = 'all-MiniLM-L6-v2'

diversified_search(text=None, limit=None, relevance_factor=0.5, collection=None, **kwargs)

Return type:: Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]
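
The documentation does not say how results are diversified. One common technique for trading relevance against redundancy is maximal marginal relevance (MMR), sketched below on plain vectors. That `diversified_search` uses MMR, and that `relevance_factor` plays the role of the MMR weight, are assumptions made for illustration only; the helper names are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    limit: int,
    relevance_factor: float = 0.5,
) -> List[int]:
    """Return indices of candidates chosen by maximal marginal relevance.

    Vectors are assumed to be unit length, so the dot product is the
    cosine similarity.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < limit:

        def score(i: int) -> float:
            relevance = dot(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (dot(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return relevance_factor * relevance - (1 - relevance_factor) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = (1.0, 0.0)
# Index 1 duplicates index 0; index 2 is equally relevant but points elsewhere.
candidates = [(0.8, 0.6), (0.8, 0.6), (0.8, -0.6)]
diverse = mmr(query, candidates, limit=2, relevance_factor=0.5)
relevance_only = mmr(query, candidates, limit=2, relevance_factor=1.0)
```

In this toy data the balanced setting skips the duplicate and picks indices 0 and 2, while a factor of 1.0 ignores redundancy and picks 0 and 1.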

dump_then_load(collection=None, target=None)

Dump a collection to a file, then load it into another database.

Parameters:

collection (str)
target (DBAdapter)

Returns:

fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:: Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]
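
The batching idea can be sketched as a generator that requests one page at a time and yields objects as they arrive. The `fetch_page(offset, limit)` callback is a hypothetical stand-in for whatever paged read the backing database offers; it is not a curategpt or ChromaDB function.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def fetch_all_in_batches(
    fetch_page: Callable[[int, int], List[Dict]],
    batch_size: int = 100,
) -> Iterator[Dict]:
    """Yield every object while holding at most one page in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            break
        yield from page
        offset += len(page)


# Toy backing data and page function, for illustration only.
rows = [{"id": str(i)} for i in range(250)]
calls: List[Tuple[int, int]] = []


def page(offset: int, limit: int) -> List[Dict]:
    calls.append((offset, limit))
    return rows[offset : offset + limit]


fetched = list(fetch_all_in_batches(page, batch_size=100))
```

Because the function is a generator, a caller that stops iterating early never triggers the remaining page reads.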

find(where=None, projection=None, collection=None, **kwargs)

Query the database.

>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))

Parameters:

collection (str)
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
projection (Union[str, List[str]])
kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:
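
To illustrate how `where` and `projection` interact, here is a deliberately simplified sketch over plain dictionaries. It assumes exact-equality matching and yields bare dicts rather than the result tuples of the real method; the function and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Union


def find(
    objs: Iterable[Dict],
    where: Optional[Dict] = None,
    projection: Optional[Union[str, List[str]]] = None,
) -> Iterator[Dict]:
    """Yield objects whose fields equal every key/value pair in `where`."""
    where = where or {}
    if isinstance(projection, str):
        projection = [projection]
    for obj in objs:
        if all(obj.get(key) == value for key, value in where.items()):
            if projection is None:
                yield obj
            else:
                # Keep only the requested fields that are present.
                yield {key: obj[key] for key in projection if key in obj}


cells = [
    {"id": "ex:1", "name": "NeuronOfTheForebrain", "kind": "neuron"},
    {"id": "ex:2", "name": "Hepatocyte", "kind": "epithelial"},
]
hits = list(find(cells, {"name": "NeuronOfTheForebrain"}, projection=["id", "name"]))
```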

id_field: str = 'id'

id_to_object: Mapping[str, Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]] = <dataclasses._MISSING_TYPE object>

insert(objs, **kwargs)

Insert an object or list of objects into the store.

>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")

Parameters:

objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection

Returns:

insert_from_huggingface(objs, collection=None, batch_size=None, text_field=None, venomx=None, method_name='add', **kwargs)

list_collection_names()

List all collections in the database.

Return type:: List[str]
Returns:

lookup(id, collection=None, **kwargs)

Look up an object by its ID.

Parameters:

id (str)
collection (str)

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

Returns:

matches(obj, **kwargs)

Query the database for matches to an object.

Parameters:

obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

name: ClassVar[str] = 'chromadb'

normalize_metadata(metadata)

Normalize metadata downloaded from Hugging Face. Conversion to Parquet forces nested lists into array types, so these are flattened back into lists.

Return type:: Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]
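
A minimal sketch of this kind of normalization follows. It treats any value with a `tolist()` method as array-like, which covers NumPy arrays as well as the standard-library `array` type used here to keep the example dependency-free. The function name `to_plain` and the sample metadata are hypothetical, and the real method may handle more cases.

```python
from array import array
from typing import Any


def to_plain(value: Any) -> Any:
    """Recursively replace array-like values (anything with .tolist()) by lists."""
    if hasattr(value, "tolist"):
        return to_plain(value.tolist())
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


metadata = {
    "name": "demo",
    "dims": array("i", [384]),
    "nested": {"scores": array("d", [0.5, 1.0])},
}
plain = to_plain(metadata)
```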

nparray_to_list(obj)

peek(collection=None, limit=5, offset=0, **kwargs)

Peek at first N objects in a collection.

Parameters:

collection (str)
limit

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:
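
The `limit` and `offset` arguments describe a window over the collection. A sketch of that windowing on any iterable, using only the standard library (the function below is illustrative and is not the adapter's implementation):

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def peek(objs: Iterable[Dict], limit: int = 5, offset: int = 0) -> Iterator[Dict]:
    """Return an iterator over at most `limit` objects, skipping the first `offset`."""
    return islice(objs, offset, offset + limit)


rows = ({"id": i} for i in range(1_000_000))  # lazy; never fully materialized
window = list(peek(rows, limit=3, offset=2))
```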

static populate_venomx(collection, model, existing_venomx)

Return type:: Index

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove a collection from the database.

Parameters:

collection (str)
exists_ok

Returns:

reset(): Reset/delete the database.

search(text, **kwargs)

Query the database for a text string.

>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")

...
NeuronOfTheForebrain 0.28
...

Parameters:

text (str)
collection
where
kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

tuple of object, distance, metadata
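
The shape of the results (object, distance, metadata, ordered by increasing distance) can be sketched with a toy embedding. The bag-of-words `embed` function below stands in for a real embedding model, and the empty metadata dict is a placeholder; none of these helpers come from curategpt.

```python
import math
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real store would call an embedding model.
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def search(text: str, objs: List[Dict], limit: int = 10) -> Iterator[Tuple[Dict, float, Dict]]:
    query = embed(text)
    scored = [(obj, cosine_distance(query, embed(obj["text"])), {}) for obj in objs]
    scored.sort(key=lambda row: row[1])  # smallest distance first
    yield from scored[:limit]


docs = [
    {"id": "a", "text": "neuron of the forebrain"},
    {"id": "b", "text": "cell of the liver"},
]
results = list(search("forebrain neuron", docs))
```

A smaller distance means a closer match, which is why the loop in the doctest above prints the distance next to each ID.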

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for a collection.

Parameters:

collection_name (Optional[str])
metadata (Metadata)

Return type:

Union[Metadata, Dict]

Returns:

text_lookup: Union[str, Callable, None] = 'text'

update(objs, **kwargs)

Update an object or list of objects in the store.

Parameters:

objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection

Returns:

update_collection_metadata(collection_name, **kwargs)

Update the metadata for a collection based on the adapter.

Parameters:

collection_name (str) – Name of the collection.
kwargs – Additional metadata fields.

Return type:

Metadata

Returns:

Updated Metadata instance.

upsert(objs, **kwargs)

Upsert an object or list of objects in the store.

Parameters:

objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection

Returns:

class curategpt.DBAdapter(path=None, schema_proxy=None, collection=None, _field_names_by_collection=None)

Bases: ABC

Base class for stores.

This base class provides a common interface for a wide variety of document or object stores. The interface is intended to closely mimic the kind of interface found in document stores such as MongoDB or vector databases such as ChromaDB, but the intention is that it can also be used for SQL databases, SPARQL endpoints, or even file systems.

The store allows for storage and retrieval of objects, which are arbitrary dictionary objects, each equivalent to a JSON object.

Objects are partitioned into collections, which map to the equivalent concept in MongoDB and ChromaDB.

>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert({"name": "John", "age": 42}, collection="people")

If you are used to working with the MongoDB and ChromaDB APIs directly, one difference is that no separate Collection object is provided here; everything is handled through the store object. You can optionally bind a store object to a collection, which effectively gives you a collection object:

>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert({"name": "John", "age": 42})
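
To make the collection binding concrete without requiring a configured backend, the sketch below uses a toy stand-in class. `ToyStore` and its internals are hypothetical and are not part of curategpt; the class only mirrors the `insert` and `set_collection` call shapes shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Union


class ToyStore:
    """Hypothetical stand-in for a store; not the curategpt implementation."""

    def __init__(self) -> None:
        self.collection: Optional[str] = None  # default (bound) collection
        self._data: Dict[str, List[dict]] = defaultdict(list)

    def set_collection(self, collection: str) -> None:
        self.collection = collection

    def _resolve(self, collection: Optional[str]) -> str:
        # An explicit argument overrides the bound collection.
        name = collection or self.collection
        if name is None:
            raise ValueError("no collection given and none bound")
        return name

    def insert(
        self, objs: Union[dict, Iterable[dict]], collection: Optional[str] = None
    ) -> None:
        if isinstance(objs, dict):
            objs = [objs]
        self._data[self._resolve(collection)].extend(objs)

    def peek(self, collection: Optional[str] = None, limit: int = 5) -> List[dict]:
        return self._data[self._resolve(collection)][:limit]


store = ToyStore()
store.set_collection("people")
store.insert({"name": "John", "age": 42})           # goes to "people"
store.insert([{"name": "Rex"}], collection="pets")  # explicit override
```

Keeping the default collection on the store avoids a second object type, at the cost of making the store stateful: code that shares one store instance needs to agree on which collection is bound.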

TODO: decide if this is the final interface

collection: Optional[str] = None: Default collection

abstractmethod collection_metadata(collection_name=None, include_derived=False, **kwargs)

Get the metadata for a collection.

Parameters:

collection_name (Optional[str])
include_derived – Include derived metadata, e.g. counts

Return type:

Optional[Metadata]

Returns:

create_view(view_name, collection, expression, **kwargs)

Create a view in the database.

Todo: document the view parameter and the return value.

delete(id, collection=None, **kwargs)

Delete an object by its ID.

Parameters:

id (str)
collection (str)

Returns:

dump(collection=None, to_file=None, metadata_to_file=None, format=None, include=None, **kwargs)

Dump the database to a file.

Parameters:

collection (str)
kwargs

Returns:

dump_then_load(collection=None, target=None)

Dump a collection to a file, then load it into another database.

Parameters:

collection (str)
target (DBAdapter)

Returns:
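
The dump-then-load pattern can be sketched with two plain dictionaries standing in for the source and target databases. The JSON file format and the helper below are assumptions for illustration; the real method operates on DBAdapter instances and may use a different on-disk format.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def dump_then_load(
    source: Dict[str, List[dict]],
    target: Dict[str, List[dict]],
    collection: str,
) -> None:
    """Copy one collection from `source` to `target` via a file on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"{collection}.json"
        path.write_text(json.dumps(source[collection]))  # dump
        loaded = json.loads(path.read_text())            # load
        target.setdefault(collection, []).extend(loaded)


src = {"people": [{"name": "John", "age": 42}]}
dst: Dict[str, List[dict]] = {}
dump_then_load(src, dst, "people")
```

Going through a file decouples the two stores: the target never needs a live connection to the source, only to the serialized dump.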

abstractmethod fetch_all_objects_memory_safe(collection=None, batch_size=100, **kwargs)

Fetch all objects from a collection, in batches to avoid memory overload.

Return type:: Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

field_names(collection=None)

Return the names of all top level fields in the database for a collection.

Parameters:: collection (str)
Return type:: List[str]
Returns:
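
A sketch of what collecting top-level field names can look like for a list of dictionary objects. The function is illustrative only; how the adapter actually derives field names (for example, from a schema or from a sample of objects) is not stated here.

```python
from typing import Dict, Iterable, List


def field_names(objs: Iterable[Dict]) -> List[str]:
    """Collect top-level keys across all objects, in first-seen order."""
    seen: Dict[str, None] = {}
    for obj in objs:
        for key in obj:
            seen.setdefault(key, None)
    return list(seen)


people = [
    {"name": "John", "age": 42},
    {"name": "Ann", "address": {"city": "Oslo"}},  # nested keys are ignored
]
names = field_names(people)
```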

find(where=None, projection=None, collection=None, **kwargs)

Query the database.

>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> objs = list(store.find({"name": "NeuronOfTheForebrain"}, collection="ont_cl"))

Parameters:

collection (str)
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
projection (Union[str, List[str]])
kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

identifier_field(collection=None)

Return type:: str

abstractmethod insert(objs, collection=None, **kwargs)

Insert an object or list of objects into the store.

>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")

Parameters:

objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, Iterable[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (str)

Returns:

insert_from_huggingface(objs, collection=None, **kwargs)

label_field(collection=None)

Return type:: str

abstractmethod list_collection_names()

List all collections in the database.

Return type:: List[str]
Returns:: names of collections

abstractmethod lookup(id, collection=None, **kwargs)

Look up an object by its ID.

Parameters:

id (str)
collection (str)

Return type:

Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]

Returns:

lookup_multiple(ids, **kwargs)

Look up multiple objects by their IDs.

Parameters:

ids
collection

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

abstractmethod matches(obj, **kwargs)

Query the database for matches to an object.

Parameters:

obj (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

name: ClassVar[str] = 'base'

path: str = None: Path to a location where the database is stored on disk or the network.

abstractmethod peek(collection=None, limit=5, **kwargs)

Peek at first N objects in a collection.

Parameters:

collection (str)
limit

Return type:

Iterator[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]

Returns:

remove_collection(collection=None, exists_ok=False, **kwargs)

Remove a collection from the database.

Parameters:: collection (str)
Returns:

schema_proxy: Optional[SchemaProxy] = None: Schema manager

abstractmethod search(text, where=None, collection=None, **kwargs)

Query the database for a text string.

>>> from curategpt.store import get_store
>>> store = get_store("chromadb", "db")
>>> for obj, distance, info in store.search("forebrain neurons", collection="ont_cl"):
...     obj_id = obj["id"]
...     # print at precision of 2 decimal places
...     print(f"{obj_id} {distance:.2f}")

...
NeuronOfTheForebrain 0.28
...

Parameters:

text (str)
collection (str)
where (Union[str, YAMLRoot, BaseModel, Dict, DuckDBSearchResult])
kwargs

Return type:

Iterator[Tuple[DuckDBSearchResult, Dict, float, Optional[Dict]]]

Returns:

tuple of object, distance, metadata

set_collection(collection)

Set the current collection.

If this is set, then all subsequent operations will be performed on this collection, unless overridden.

This allows the following

>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.set_collection("people")
>>> store.insert([{"name": "John", "age": 42}])

to be written in place of

>>> from curategpt.store import get_store
>>> store = get_store("in_memory")
>>> store.insert([{"name": "John", "age": 42}], collection="people")

Parameters:: collection (str)
Returns:

set_collection_metadata(collection_name, metadata, **kwargs)

Set the metadata for a collection.

>>> from curategpt.store import get_store
>>> from curategpt.store import Metadata
>>> store = get_store("in_memory")
>>> md = store.collection_metadata("people")
>>> md.venomx.id = "People"
>>> md.venomx.embedding_model.name = "openai:"
>>> store.set_collection_metadata("people", md)

Parameters:: collection_name (Optional[str])
Returns:

update(objs, collection=None, **kwargs)

Update an object or list of objects in the store.

Parameters:

objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (str)

Returns:

update_collection_metadata(collection_name, **kwargs)

Update the metadata for a collection.

Parameters:

collection_name (str)
kwargs

Return type:

Metadata

Returns:

upsert(objs, collection=None, **kwargs)

Upsert an object or list of objects in the store.

Parameters:

objs (Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult, List[Union[YAMLRoot, BaseModel, Dict, DuckDBSearchResult]]])
collection (str)

Returns:
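
Upsert semantics (insert when the ID is new, overwrite when it already exists) can be sketched on a dictionary keyed by ID. Whether the adapters replace the stored object wholesale or merge fields is not stated above; this sketch assumes replacement, and the helper is hypothetical.

```python
from typing import Dict, Iterable


def upsert(table: Dict[str, dict], objs: Iterable[dict], id_field: str = "id") -> None:
    """Insert objects whose ID is new; overwrite objects whose ID already exists."""
    for obj in objs:
        table[obj[id_field]] = obj


people: Dict[str, dict] = {}
upsert(people, [{"id": "p1", "name": "John", "age": 42}])
upsert(people, [{"id": "p1", "name": "John", "age": 43}, {"id": "p2", "name": "Ann"}])
```

After the second call, `p1` carries the updated age and `p2` has been added, with no duplicate entry for `p1`.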

class curategpt.Extractor(schema_proxy=None, model_name=None, api_key=None, raise_error_if_unparsable=False)

Bases: ABC

api_key: str = None

deserialize(text, **kwargs)

Deserialize text into an annotated object

Parameters:: text (str)
Return type:: AnnotatedObject
Returns:

abstractmethod extract(text, target_class, examples=None, **kwargs)

Schema-guided extraction

Parameters:

text (str)
kwargs

Return type:

AnnotatedObject

Returns:

property model

Get the model

Parameters:: model_name
Returns:

model_name: str = None

property pydantic_root_model: BaseModel

raise_error_if_unparsable: bool = False

schema_proxy: SchemaProxy = None

property schemaview: SchemaView