generative_ai.information_retrieval.step_1_retrieval module
Define functions for embedding retrieval documents and storing the embeddings.
- create_document_embedder(embedding_model: str) -> HuggingFaceEmbeddings
Prepare a Sentence Transformers model for document embedding.
- Parameters:
embedding_model (str) -- name of Sentence Transformers model from Hugging Face
- Returns:
document embedder
- Return type:
HuggingFaceEmbeddings
- create_vector_store(embedder: HuggingFaceEmbeddings, directory_path: pathlib.Path) -> Chroma
Initialise a Chroma vector store.
- Parameters:
embedder (HuggingFaceEmbeddings) -- document embedder
directory_path (pathlib.Path) -- path to directory for storing vector store
- Returns:
vector store
- Return type:
Chroma
- load_json_documents(file_path: pathlib.Path) -> list[Document]
Load retrieval documents from a JSON file.
- Parameters:
file_path (pathlib.Path) -- path to JSON file
- Returns:
retrieval documents
- Return type:
list[Document]
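The layout of the JSON file is not specified here. As a rough sketch only, the loader below assumes a JSON array of objects with a "text" key and an optional "metadata" key; both key names, the Doc class (a stand-in for a Document with page_content and metadata) and the function name load_json_docs are hypothetical and not taken from this module.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_json_docs(file_path: Path) -> list[Doc]:
    """Load documents from a JSON array (assumed layout, see above)."""
    with file_path.open(encoding="utf-8") as f:
        records = json.load(f)
    # "text" and "metadata" are assumed key names, not the module's schema.
    return [Doc(r["text"], r.get("metadata", {})) for r in records]


# Usage with a throwaway file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs.json"
    sample = [{"text": "hello", "metadata": {"source": "a"}}]
    path.write_text(json.dumps(sample), encoding="utf-8")
    docs = load_json_docs(path)

print(docs)
```

The real function may read a different schema; adjust the key names to match the actual file.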
- partition_documents(raw_documents: list[Document]) -> list[Document]
Partition retrieval documents into chunks.
- Parameters:
raw_documents (list[Document]) -- retrieval documents
- Returns:
chunks of retrieval documents
- Return type:
list[Document]
Notes
Each chunk is at most 512 tokens long.
Adjacent chunks from the same document overlap by 64 tokens.
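To illustrate how a 512-token limit and a 64-token overlap interact, here is a minimal sliding-window sketch over a plain list of tokens. It is not the module's implementation: the function name chunk_tokens is hypothetical, and the actual splitter may cut at sentence or separator boundaries, which would produce shorter chunks and an overlap that is not exactly 64 tokens.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens (hypothetical helper).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    stride = chunk_size - overlap  # 448 new tokens per chunk by default
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


tokens = list(range(1000))  # stand-in for the token ids of one document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 104]
print(chunks[0][-64:] == chunks[1][:64])  # True
```

With these values each new chunk starts 448 tokens after the previous one, so a 1000-token document yields three chunks, the last of which is shorter than the limit.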