generative_ai.dataset_generation package

Module contents

Define functionalities for dataset generation.

class JSONDataset(*, retrieval_documents: list[str], tuning_documents: list[JSONDocument])

Bases: BaseModel

Store all details for querying a package's documentation in JSON format.

retrieval_documents

chunks of text to be used for retrieval

Type:

list[str]

tuning_documents

question-and-answer pairs to be used for tuning

Type:

list[JSONDocument]

retrieval_documents: list[str]
tuning_documents: list[JSONDocument]
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[dict[str, FieldInfo]] = {'retrieval_documents': FieldInfo(annotation=list[str], required=True), 'tuning_documents': FieldInfo(annotation=list[JSONDocument], required=True)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

class JSONDocument(*, context: str, question: str, answer: str, split: SplitName)

Bases: BaseModel

Store details of a document in JSON format.

context

details containing the description

Type:

str

question

question to be answered or instructions to follow using the context

Type:

str

answer

answer to the question or instruction based on the context

Type:

str

split

split allocation of the document

Type:

SplitName

context: str
question: str
answer: str
split: SplitName
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[dict[str, FieldInfo]] = {'answer': FieldInfo(annotation=str, required=True), 'context': FieldInfo(annotation=str, required=True), 'question': FieldInfo(annotation=str, required=True), 'split': FieldInfo(annotation=SplitName, required=True)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.
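
The shape of the two models documented above can be pictured with a small standard-library sketch. It is not the package's implementation: the real classes are pydantic models, the dataclasses here are stand-ins, and the members of SplitName are an assumption because this reference does not list them.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class SplitName(str, Enum):
    """Hypothetical split names; the real enum members are not documented here."""

    TRAIN = "train"
    VALIDATION = "validation"
    TEST = "test"


@dataclass(frozen=True)
class JSONDocument:
    """Stand-in with the four documented fields."""

    context: str
    question: str
    answer: str
    split: SplitName


@dataclass(frozen=True)
class JSONDataset:
    """Stand-in with the two documented fields."""

    retrieval_documents: list[str]
    tuning_documents: list[JSONDocument]


document = JSONDocument(
    context="'json' is a Python package.",
    question="What is 'json'?",
    answer="'json' is a Python package.",
    split=SplitName.TRAIN,
)
dataset = JSONDataset(
    retrieval_documents=["'json' is a Python package."],
    tuning_documents=[document],
)

# asdict() recurses into the nested dataclasses; the str-based enum
# serializes as its plain string value.
payload = json.dumps(asdict(dataset), indent=2)
```

Retrieval documents are plain strings, while each tuning document carries its own context, question, answer and split allocation.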

generate_json_dataset(raw_datasets: list[Dataset]) -> JSONDataset

Convert raw documents into JSON format.

Parameters:

raw_datasets (list[Dataset]) -- all retrieval and tuning documents for the root package and its contents

Returns:

all details for querying a package's documentation in JSON format

Return type:

JSONDataset
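
One way to picture this conversion is as a flattening step: each raw dataset contributes its retrieval chunks and tuning pairs to a single combined structure. The sketch below is only an illustration under assumptions. RawDataset and its field names (retrieval_chunks, tuning_pairs) are hypothetical, since the fields of Dataset are not described in this reference, and the result is a plain dictionary rather than a JSONDataset.

```python
from dataclasses import dataclass, field


@dataclass
class RawDataset:
    """Hypothetical stand-in for Dataset; the field names are assumptions."""

    retrieval_chunks: list[str] = field(default_factory=list)
    tuning_pairs: list[dict[str, str]] = field(default_factory=list)


def merge_raw_datasets(raw_datasets: list[RawDataset]) -> dict[str, list]:
    """Flatten many raw datasets into one JSON-ready dictionary."""
    merged: dict[str, list] = {"retrieval_documents": [], "tuning_documents": []}
    for raw in raw_datasets:
        merged["retrieval_documents"].extend(raw.retrieval_chunks)
        merged["tuning_documents"].extend(raw.tuning_pairs)
    return merged


package_level = RawDataset(
    retrieval_chunks=["'pkg' is a Python package."],
    tuning_pairs=[{"question": "What is 'pkg'?", "answer": "A Python package."}],
)
module_level = RawDataset(retrieval_chunks=["'pkg.core' is a module of 'pkg'."])

merged = merge_raw_datasets([package_level, module_level])
```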

generate_member_dataset(member_details: MemberDetails) -> tuple[Dataset, ...]

Create a dataset for a member.

Parameters:

member_details (MemberDetails) -- all details of the member

Returns:

all documents for retrieval and tuning for querying member documentation

Return type:

tuple[Dataset, ...]

Raises:

ValueError -- if the member type is not supported

Notes

  • The returned tuple contains a single dataset if the member type is not enum, class or function.

  • Otherwise, it contains two datasets: one for the member and one specific to the member type.
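
The variable-length return described in the notes above can be sketched as follows. This is a simplified illustration, not the package's code: the datasets are placeholder strings, and the kind names, including "attribute" as an example of a supported non-enum, non-class, non-function member, are assumptions.

```python
# Hypothetical member kinds; the package's real type names are not listed here.
DETAILED_KINDS = {"enum", "class", "function"}
SIMPLE_KINDS = {"attribute"}  # assumed example of another supported kind


def member_dataset_sketch(member_name: str, member_kind: str) -> tuple[str, ...]:
    """Return one placeholder dataset, or two for enum, class and function members."""
    general = f"general dataset for {member_name}"
    if member_kind in DETAILED_KINDS:
        return (general, f"{member_kind} dataset for {member_name}")
    if member_kind in SIMPLE_KINDS:
        return (general,)
    raise ValueError(f"unsupported member type: {member_kind!r}")


# Callers can flatten the tuples without checking how many items each one holds.
collected: list[str] = []
for name, kind in [("VERSION", "attribute"), ("Parser", "class")]:
    collected.extend(member_dataset_sketch(name, kind))
```

Extending a list with each returned tuple keeps the calling code identical for one-item and two-item results.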

generate_module_dataset(module_contents: ModuleDetails) -> Dataset

Create relevant questions and answers based on module details.

Parameters:

module_contents (ModuleDetails) -- details of a Python module

Returns:

all documents for retrieval and tuning for querying module documentation

Return type:

Dataset

generate_package_dataset(package_contents: PackageDetails) -> Dataset

Create relevant questions and answers based on package details.

Parameters:

package_contents (PackageDetails) -- details of a Python package

Returns:

all documents for retrieval and tuning for querying package documentation

Return type:

Dataset

generate_raw_datasets(package_name: str) -> list[Dataset]

Generate all retrieval and tuning documents for exploring the documentation of a package.

Parameters:

package_name (str) -- import name of the root package

Returns:

all retrieval and tuning documents for the root package and its contents

Return type:

list[Dataset]

get_all_member_details(module_name: str, member_name: str, member_object: Any) -> MemberDetails

Extract all details of an object defined in a module.

Parameters:
  • module_name (str) -- fully qualified name of the module

  • member_name (str) -- name of the object

  • member_object (Any) -- original object

Returns:

all details of the object

Return type:

MemberDetails
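
As a rough illustration of what extracting member details can involve, the sketch below uses the standard inspect module to classify an object and collect its docstring and signature. MemberSummary, its fields and the kind labels are hypothetical stand-ins; the actual contents of MemberDetails are not described in this reference.

```python
import inspect
import json
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


@dataclass(frozen=True)
class MemberSummary:
    """Hypothetical stand-in for MemberDetails."""

    qualified_name: str
    kind: str
    docstring: Optional[str]
    signature: Optional[str]


def classify(member_object: Any) -> str:
    """Label an object as enum, class, function or attribute."""
    if inspect.isclass(member_object) and issubclass(member_object, Enum):
        return "enum"
    if inspect.isclass(member_object):
        return "class"
    if inspect.isroutine(member_object):
        return "function"
    return "attribute"


def summarize_member(module_name: str, member_name: str, member_object: Any) -> MemberSummary:
    signature = None
    if callable(member_object):
        try:
            signature = str(inspect.signature(member_object))
        except (TypeError, ValueError):
            # Some builtins expose no introspectable signature.
            signature = None
    return MemberSummary(
        qualified_name=f"{module_name}.{member_name}",
        kind=classify(member_object),
        docstring=inspect.getdoc(member_object),
        signature=signature,
    )


class Color(Enum):
    """Example enum used only for this sketch."""

    RED = 1


function_summary = summarize_member("json", "dumps", json.dumps)
enum_summary = summarize_member("example", "Color", Color)
constant_summary = summarize_member("example", "LIMIT", 10)
```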

get_all_module_contents(module_name: str) -> ModuleDetails

Extract all details of a module.

Parameters:

module_name (str) -- import name of the module

Returns:

details of the module

Return type:

ModuleDetails
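
A minimal sketch of collecting a module's contents, assuming the module is imported by name and its public members are then enumerated with inspect.getmembers. Whether the package honours __all__ or applies a different filter is not stated in this reference; the helper name list_public_members is hypothetical.

```python
import importlib
import inspect


def list_public_members(module_name: str) -> dict[str, object]:
    """Import a module by name and return its public members."""
    module = importlib.import_module(module_name)
    exported = getattr(module, "__all__", None)

    members: dict[str, object] = {}
    for name, member_object in inspect.getmembers(module):
        if exported is not None:
            # Prefer the module's explicit export list when it has one.
            if name not in exported:
                continue
        elif name.startswith("_"):
            # Otherwise fall back to the leading-underscore convention.
            continue
        members[name] = member_object
    return members


json_members = list_public_members("json")
```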

get_all_package_contents(package_name: str) -> list[PackageDetails]

Extract all details of a root package.

Parameters:

package_name (str) -- import name of the root package

Returns:

all details of the root package and its sub-packages

Return type:

list[PackageDetails]
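
Discovering a root package together with its sub-packages can be sketched with the standard pkgutil module. This is an assumption about the general approach, not the package's actual code, and the sketch returns package names only rather than PackageDetails objects; list_packages is a hypothetical helper.

```python
import importlib
import pkgutil


def list_packages(package_name: str) -> list[str]:
    """Return the root package name followed by the names of its sub-packages."""
    root = importlib.import_module(package_name)
    names = [root.__name__]

    search_path = getattr(root, "__path__", None)
    if search_path is None:
        # A plain module has no __path__ and therefore no sub-packages.
        return names

    # walk_packages imports each sub-package it finds in order to recurse into it.
    for info in pkgutil.walk_packages(search_path, prefix=f"{root.__name__}."):
        if info.ispkg:
            names.append(info.name)
    return names


email_packages = list_packages("email")
```

The prefix argument makes pkgutil report fully qualified names such as email.mime instead of bare sub-package names.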

load_json_dataset(file_path: Path) -> JSONDataset

Load a JSON dataset from a JSON file.

Parameters:

file_path (pathlib.Path) -- path to load the JSON dataset from

Returns:

all details for querying a package's documentation in JSON format

Return type:

JSONDataset

store_json_dataset(json_dataset: JSONDataset, file_path: Path) -> None

Dump a JSON dataset into a JSON file.

Parameters:
  • json_dataset (JSONDataset) -- all details for querying a package's documentation in JSON format

  • file_path (pathlib.Path) -- path to store the JSON dataset at
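
A store and load round trip can be sketched with the standard library alone. The helpers below (store_payload, load_payload) are hypothetical and operate on a plain dictionary shaped like JSONDataset; they are not the package's implementation. With Pydantic V2 models, BaseModel.model_dump_json and BaseModel.model_validate_json offer the corresponding serialization and validation steps, although this reference does not say which calls the package uses.

```python
import json
import tempfile
from pathlib import Path


def store_payload(payload: dict, file_path: Path) -> None:
    """Write a JSON-ready dictionary to file_path, creating parent directories."""
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_payload(file_path: Path) -> dict:
    """Read a dictionary back from a JSON file."""
    return json.loads(file_path.read_text(encoding="utf-8"))


payload = {
    "retrieval_documents": ["'pkg' is a Python package."],
    "tuning_documents": [
        {
            "context": "'pkg' is a Python package.",
            "question": "What is 'pkg'?",
            "answer": "A Python package.",
            "split": "train",  # assumed split value
        }
    ],
}

with tempfile.TemporaryDirectory() as directory:
    target = Path(directory) / "datasets" / "dataset.json"
    store_payload(payload, target)
    restored = load_payload(target)
```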