curategpt.extract package

Submodules

curategpt.extract.basic_extractor module

Basic Extractor that is purely example driven.

class curategpt.extract.basic_extractor.BasicExtractor(schema_proxy=None, model_name='gpt-4o', api_key=None, raise_error_if_unparsable=False, serialization_format='json')

Bases: Extractor

Extractor that is purely example driven.

deserialize(text, format=None, **kwargs)

Deserialize text into an annotated object

Parameters:

text (str)

Return type:

AnnotatedObject

deserialize_yaml(text, multiple=False)

Return type:

AnnotatedObject

extract(text, target_class, examples=None, background_text=None, rules=None, min_examples=1, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

model_config = {'protected_namespaces': ()}
model_name: str = 'gpt-4o'
serialization_format: str = 'json'
serialize(ao)

Return type:

str
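
The raise_error_if_unparsable flag and the deserialize method suggest a choice between strict and lenient handling of model output that does not parse. The following is a minimal sketch of that idea using only the standard library. The function name deserialize_sketch, the returned dictionary layout, and the lenient fallback are assumptions for illustration, not the library's implementation.

```python
import json
from typing import Any, Dict


def deserialize_sketch(text: str, raise_error_if_unparsable: bool = False) -> Dict[str, Any]:
    """Hypothetical helper: parse JSON output, optionally tolerating failures."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        if raise_error_if_unparsable:
            # Strict mode: let the caller see the parse failure.
            raise
        # Assumed lenient behaviour: empty object, raw text kept as an annotation.
        return {"object": {}, "annotations": {"text": text}}
    return {"object": parsed, "annotations": {"text": text}}


ok = deserialize_sketch('{"name": "aspirin"}')
bad = deserialize_sketch("not json")
```

In this sketch the raw text is kept in both cases, so a caller can inspect what the model actually produced even when parsing fails.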

curategpt.extract.extractor module

Retrieval Augmented Generation (RAG) Base Class.

class curategpt.extract.extractor.AnnotatedObject(**data)

Bases: BaseModel

Annotated object shadows a basic dictionary object

annotations: Dict[str, Any]
as_single_object()

Return as a standard dictionary object.

Each annotation is prefixed with an underscore.

Return type:

Dict[str, Any]

Returns:

dictionary object
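
The underscore-prefix rule described for as_single_object can be illustrated with a small stand-in. AnnotatedObjectSketch below is a hypothetical dataclass written for this example; it is not the pydantic model documented here, and the merge logic is an assumption based only on the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AnnotatedObjectSketch:
    """Hypothetical stand-in for AnnotatedObject; not the library class."""

    object: Dict[str, Any] = field(default_factory=dict)
    annotations: Dict[str, Any] = field(default_factory=dict)

    def as_single_object(self) -> Dict[str, Any]:
        # Copy first so the wrapped object is left untouched.
        merged = dict(self.object)
        # Each annotation key gets a leading underscore.
        for key, value in self.annotations.items():
            merged[f"_{key}"] = value
        return merged


ao = AnnotatedObjectSketch(
    object={"name": "aspirin"},
    annotations={"text": "raw model output"},
)
single = ao.as_single_object()
```

Here single holds the key "name" alongside "_text", so annotations travel with the object without colliding with its own field names.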

key_values: Dict[str, AnnotatedObject]
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'annotations': FieldInfo(annotation=Dict[str, Any], required=False, default={}), 'key_values': FieldInfo(annotation=Dict[str, curategpt.extract.extractor.AnnotatedObject], required=False, default={}), 'object': FieldInfo(annotation=Any, required=False, default={})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

object: Any
property text: str | None

Get the text annotation of the object.

class curategpt.extract.extractor.Extractor(schema_proxy=None, model_name=None, api_key=None, raise_error_if_unparsable=False)

Bases: ABC

api_key: str = None
deserialize(text, **kwargs)

Deserialize text into an annotated object

Parameters:

text (str)

Return type:

AnnotatedObject

abstract extract(text, target_class, examples=None, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

property model

Get the model

Parameters:

model_name

model_name: str = None
property pydantic_root_model: BaseModel
raise_error_if_unparsable: bool = False
schema_proxy: SchemaProxy = None
property schemaview: SchemaView
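
Because extract is abstract, a concrete extractor has to supply it. The sketch below shows that pattern with standard-library stand-ins. ExtractorSketch and KeyValueExtractor are hypothetical names, the return value is a plain dictionary rather than an AnnotatedObject, and the line-parsing logic replaces any LLM call; none of this is taken from the library's source.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class ExtractorSketch(ABC):
    """Hypothetical stand-in for the abstract Extractor interface."""

    @abstractmethod
    def extract(
        self,
        text: str,
        target_class: str,
        examples: Optional[List[Dict[str, Any]]] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Return a structured object of type target_class found in text."""


class KeyValueExtractor(ExtractorSketch):
    """Toy subclass: parses 'key: value' lines instead of calling an LLM."""

    def extract(self, text, target_class, examples=None, **kwargs):
        obj = {}
        for line in text.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip lines without a colon
                obj[key.strip()] = value.strip()
        return {"type": target_class, "object": obj}


result = KeyValueExtractor().extract("name: aspirin\nformula: C9H8O4", "Chemical")
```

Instantiating ExtractorSketch directly raises TypeError, which is how the abstract method enforces that every subclass provides its own extract.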

curategpt.extract.openai_extractor module

Extractor that uses OpenAI functions.

class curategpt.extract.openai_extractor.OpenAIExtractor(schema_proxy=None, model_name=None, api_key=None, raise_error_if_unparsable=False, max_tokens=3000, model='gpt-4')

Bases: Extractor

Extractor that uses OpenAI functions.

extract(text, target_class, examples=None, examples_as_functions=False, conversation=None, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

functions()
max_tokens: int = 3000
model: str = 'gpt-4'

curategpt.extract.recursive_extractor module

Extractor that recursively extracts objects from text.

class curategpt.extract.recursive_extractor.RecursiveExtractor(schema_proxy=None, model_name='gpt-3.5-turbo', api_key=None, raise_error_if_unparsable=False, serialization_format='json')

Bases: Extractor

Extractor that recursively extracts objects from text.

See SPIRES

deserialize(text)

Deserialize text into an annotated object

Parameters:

text (str)

Return type:

AnnotatedObject

extract(text, target_class, examples=None, path=None, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

model_name: str = 'gpt-3.5-turbo'
partially_serialize(object, path)

Return type:

str

serialization_format: str = 'json'

Module contents

CurateGPT Extractors.

These handle connections to (remote or local) LLMs, and can also extract structured objects from text.

class curategpt.extract.AnnotatedObject(**data)

Bases: BaseModel

Annotated object shadows a basic dictionary object

annotations: Dict[str, Any]
as_single_object()

Return as a standard dictionary object.

Each annotation is prefixed with an underscore.

Return type:

Dict[str, Any]

Returns:

dictionary object

key_values: Dict[str, AnnotatedObject]
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'protected_namespaces': ()}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'annotations': FieldInfo(annotation=Dict[str, Any], required=False, default={}), 'key_values': FieldInfo(annotation=Dict[str, curategpt.extract.extractor.AnnotatedObject], required=False, default={}), 'object': FieldInfo(annotation=Any, required=False, default={})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

object: Any
property text: str | None

Get the text annotation of the object.

class curategpt.extract.BasicExtractor(schema_proxy=None, model_name='gpt-4o', api_key=None, raise_error_if_unparsable=False, serialization_format='json')

Bases: Extractor

Extractor that is purely example driven.

deserialize(text, format=None, **kwargs)

Deserialize text into an annotated object

Parameters:

text (str)

Return type:

AnnotatedObject

deserialize_yaml(text, multiple=False)

Return type:

AnnotatedObject

extract(text, target_class, examples=None, background_text=None, rules=None, min_examples=1, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

model_config = {'protected_namespaces': ()}
model_name: str = 'gpt-4o'
serialization_format: str = 'json'
serialize(ao)

Return type:

str

class curategpt.extract.Extractor(schema_proxy=None, model_name=None, api_key=None, raise_error_if_unparsable=False)

Bases: ABC

api_key: str = None
deserialize(text, **kwargs)

Deserialize text into an annotated object

Parameters:

text (str)

Return type:

AnnotatedObject

abstract extract(text, target_class, examples=None, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

property model

Get the model

Parameters:

model_name

model_name: str = None
property pydantic_root_model: BaseModel
raise_error_if_unparsable: bool = False
schema_proxy: SchemaProxy = None
property schemaview: SchemaView

class curategpt.extract.OpenAIExtractor(schema_proxy=None, model_name=None, api_key=None, raise_error_if_unparsable=False, max_tokens=3000, model='gpt-4')

Bases: Extractor

Extractor that uses OpenAI functions.

extract(text, target_class, examples=None, examples_as_functions=False, conversation=None, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

functions()
max_tokens: int = 3000
model: str = 'gpt-4'

class curategpt.extract.RecursiveExtractor(schema_proxy=None, model_name='gpt-3.5-turbo', api_key=None, raise_error_if_unparsable=False, serialization_format='json')

Bases: Extractor

Extractor that recursively extracts objects from text.

See SPIRES

deserialize(text)

Deserialize text into an annotated object

Parameters:

text (str)

Return type:

AnnotatedObject

extract(text, target_class, examples=None, path=None, **kwargs)

Schema-guided extraction

Parameters:
  • text (str)

  • kwargs

Return type:

AnnotatedObject

model_name: str = 'gpt-3.5-turbo'
partially_serialize(object, path)

Return type:

str

serialization_format: str = 'json'