generative_ai.information_retrieval.step_1_retrieval module

Define functions for loading, chunking, embedding, and storing retrieval documents.

create_document_embedder(embedding_model: str) -> HuggingFaceEmbeddings

Prepare a Sentence Transformers model for document embedding.

Parameters:

embedding_model (str) -- name of Sentence Transformers model from Hugging Face

Returns:

document embedder

Return type:

HuggingFaceEmbeddings

create_vector_store(embedder: HuggingFaceEmbeddings, directory_path: pathlib.Path) -> Chroma

Initialise a Chroma vector store.

Parameters:
  • embedder (HuggingFaceEmbeddings) -- document embedder

  • directory_path (pathlib.Path) -- path to directory for storing vector store

Returns:

vector store

Return type:

Chroma
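
A minimal sketch of how the embedder and the vector store might be wired together. The function names ending in _sketch are hypothetical, and the import locations (the langchain-huggingface and langchain-chroma packages) are assumptions; the documented module may import these classes from elsewhere or pass additional arguments.

```python
from pathlib import Path


def create_document_embedder_sketch(embedding_model: str):
    """Hypothetical stand-in for create_document_embedder."""
    # Assumption: HuggingFaceEmbeddings comes from langchain-huggingface.
    # Imported lazily so this file loads even without the package.
    from langchain_huggingface import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(model_name=embedding_model)


def create_vector_store_sketch(embedder, directory_path: Path):
    """Hypothetical stand-in for create_vector_store."""
    # Assumption: Chroma comes from langchain-chroma.
    from langchain_chroma import Chroma

    return Chroma(
        embedding_function=embedder,
        # Chroma persists its collection under this directory.
        persist_directory=str(directory_path),
    )
```

Calling these functions requires the two packages to be installed, and the first call to the embedder typically downloads the named model from Hugging Face.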

load_json_documents(file_path: pathlib.Path) -> list[Document]

Load retrieval documents from a JSON file.

Parameters:

file_path (pathlib.Path) -- path to JSON file

Returns:

retrieval documents

Return type:

list[Document]
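
The JSON layout expected by this function is not stated here. The sketch below assumes a JSON array of objects, each with a "text" field, and uses a hypothetical SimpleDocument dataclass in place of the Document type; both are illustrative assumptions, not the module's actual behaviour.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for the Document type in the signature."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_json_documents_sketch(file_path: Path) -> list[SimpleDocument]:
    # Assumed layout: a JSON array of objects with a "text" field.
    records = json.loads(file_path.read_text(encoding="utf-8"))
    documents = []
    for record in records:
        # Keep every field other than the text as metadata.
        metadata = {key: value for key, value in record.items() if key != "text"}
        metadata["source"] = str(file_path)
        documents.append(
            SimpleDocument(page_content=record["text"], metadata=metadata)
        )
    return documents
```

Recording the source path in the metadata is a design choice of this sketch; it makes it possible to trace a retrieved chunk back to the file it came from.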

partition_documents(raw_documents: list[Document]) -> list[Document]

Partition retrieval documents into chunks.

Parameters:

raw_documents (list[Document]) -- retrieval documents

Returns:

chunks of retrieval documents

Return type:

list[Document]

Notes

  • Each chunk is at most 512 tokens long.

  • Adjacent chunks from the same document overlap by 64 tokens.
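
The interaction of the two limits in the notes can be illustrated with a plain sliding window. This is a sketch only: the tokenizer and the splitter actually used by partition_documents are not stated here, so the hypothetical chunk_tokens function below works on an already tokenized list.

```python
def chunk_tokens(
    tokens: list[str], max_tokens: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split tokens into windows of at most max_tokens sharing overlap tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    # Each new window starts this many tokens after the previous one.
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Placeholder tokens standing in for a tokenized document.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [512, 512, 104]
```

With a window of 512 and an overlap of 64, each window starts 448 tokens after the previous one, so a 1000-token document yields three chunks, and the last 64 tokens of one chunk reappear as the first 64 tokens of the next.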