Working retrieval with the CLI
@@ -32,7 +32,7 @@ Chosen data folder: relative ./../../../data - from the current folder

 # Phase 5 (preparation for the retrieval feature)

-- [ ] Create file `retrieval.py` with the retrieval configuration for the chosen RAG framework, which retrieves data from the vector storage based on a query. Use a retrieval library/plugin that supports the chosen vector storage within the chosen RAG framework. The retrieval configuration should search for the text passed as the function's query argument and return the matching information together with the stored metadata (paragraph, section, page, etc.).
+- [x] Create file `retrieval.py` with the retrieval configuration for the chosen RAG framework, which retrieves data from the vector storage based on a query. Use a retrieval library/plugin that supports the chosen vector storage within the chosen RAG framework. The retrieval configuration should search for the text passed as the function's query argument and return the matching information together with the stored metadata (paragraph, section, page, etc.). Important: if the chosen RAG framework does not require separating retrieval from the chosen vector storage, this step may be skipped and marked done.

 # Phase 6 (chat feature, as agent, for usage in the cli)
@@ -71,8 +71,8 @@ The project is organized into 6 development phases as outlined in `PLANNING.md`:
 - [x] Integrate with CLI

 ### Phase 5: Retrieval Feature
-- [ ] Create `retrieval.py` for querying vector storage
-- [ ] Implement metadata retrieval (filename, page, section, etc.)
+- [x] Create `retrieval.py` for querying vector storage
+- [x] Implement metadata retrieval (filename, page, section, etc.)

 ### Phase 6: Chat Agent
 - [ ] Create `agent.py` with Ollama-powered chat agent
@@ -126,3 +126,50 @@ Since the project is in early development stages, the following steps are planned
 ## Current Status

 The project is in early development phase. The virtual environment is set up and dependencies are defined, but the core functionality (CLI, document loading, vector storage, etc.) is yet to be implemented according to the planned phases.
+
+## Important Implementation Notes
+
+### OCR and Computer Vision Setup
+- Added Tesseract OCR support for image text extraction
+- Installed `pytesseract` and `unstructured-pytesseract` packages
+- Configured image loaders to use the OCR strategy (`"ocr_only"`) for extracting text from images
+- This resolves the "OCRAgent instance" error when processing image files
+
+### Russian Language Processing Configuration
+- Installed the spaCy library and the Russian language model (`ru_core_news_sm`)
+- Configured unstructured loaders to use Russian as the primary language (`"languages": ["rus"]`)
+- This improves processing accuracy for Russian documents in the dataset
+
+### Qdrant Collection Auto-Creation Fix
+- Fixed an issue where Qdrant collections were not being created automatically
+- Implemented logic to check whether a collection exists and create it if needed
+- Uses the Qdrant client's `create_collection` method with proper vector size detection
+- Resolves the "Collection doesn't exist" 404 error during document insertion
+
+### Document Tracking Improvement
+- Modified document tracking to mark documents as processed only after successful vector storage insertion
+- This prevents documents from being marked as processed if vector storage insertion fails
+
+### Dependency Management
+- Added several new dependencies for enhanced functionality:
+  - `pdf2image` for PDF-to-image conversion
+  - `unstructured-inference` for advanced document analysis
+  - `python-pptx` for PowerPoint processing
+  - `pi-heif` for HEIF image format support
+  - `spacy` and `ru-core-news-sm` for Russian NLP
+
+### Error Handling Improvements
+- Enhanced error handling for optional dependencies in document loaders
+- Added graceful degradation when optional modules are not available
+
+### Phase 5 Implementation Notes
+- Created `retrieval.py` module with LangChain Retriever functionality
+- Implemented search functions that retrieve documents with metadata from the Qdrant vector storage
+- Added CLI command `retrieve` to search the vector database based on a query
+- Retrieval returns documents with metadata including source, filename, page number, file extension, etc.
+- Used `QdrantVectorStore` from the `langchain-qdrant` package for compatibility with newer LangChain versions
+
+### Troubleshooting Notes
+- If encountering a "No module named 'unstructured_inference'" error, install `unstructured-inference`
+- If seeing OCR-related errors, ensure Tesseract is installed at the system level and `unstructured-pytesseract` is available
+- For language detection issues, verify that the appropriate spaCy models are downloaded
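The collection auto-creation fix described in the notes above can be sketched as follows. `FakeQdrantClient` is a hypothetical in-memory stand-in so the sketch runs without a live Qdrant server; the real `qdrant_client.QdrantClient` exposes `get_collection` (which raises when the collection is missing) and `create_collection` with the same shape, and the 768-dimension embedding is an assumption for illustration.

```python
# Hypothetical in-memory stand-in for qdrant_client.QdrantClient, so this
# sketch runs without a live Qdrant server.
class FakeQdrantClient:
    def __init__(self):
        self.collections = {}

    def get_collection(self, name):
        if name not in self.collections:
            # The real client raises an error carrying the 404 status here
            raise ValueError(f"Collection `{name}` doesn't exist")
        return self.collections[name]

    def create_collection(self, collection_name, vectors_config):
        self.collections[collection_name] = vectors_config


def ensure_collection(client, name, embed_query):
    """Create the collection if it is missing, detecting the vector size
    from a sample embedding (as the commit's fix does)."""
    try:
        client.get_collection(name)
        return False  # collection already exists
    except Exception:
        pass
    vector_size = len(embed_query("sample text for dimension detection"))
    client.create_collection(
        collection_name=name,
        vectors_config={"size": vector_size, "distance": "Cosine"},
    )
    return True


client = FakeQdrantClient()
embed = lambda text: [0.0] * 768  # assumed embedding dimension for illustration
print(ensure_collection(client, "documents_langchain", embed))  # True: created
print(ensure_collection(client, "documents_langchain", embed))  # False: already exists
```

Checking existence before every insert is what turns the "Collection doesn't exist" 404 into a one-time, idempotent creation step.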
@@ -62,5 +62,56 @@ def enrich(data_dir, collection_name):
        click.echo(f"Error: {str(e)}")


+@cli.command(
+    name="retrieve",
+    help="Retrieve documents from vector database based on a query",
+)
+@click.argument("query")
+@click.option(
+    "--collection-name",
+    default="documents_langchain",
+    help="Name of the vector store collection",
+)
+@click.option(
+    "--top-k",
+    default=5,
+    help="Number of documents to retrieve",
+)
+def retrieve(query, collection_name, top_k):
+    """Retrieve documents from vector database based on a query"""
+    logger.info(f"Starting retrieval process for query: {query}")
+
+    try:
+        # Import here to avoid circular dependencies
+        from retrieval import search_documents_with_metadata
+
+        # Perform retrieval
+        results = search_documents_with_metadata(
+            query=query,
+            collection_name=collection_name,
+            top_k=top_k
+        )
+
+        if not results:
+            click.echo("No relevant documents found for the query.")
+            return
+
+        click.echo(f"Found {len(results)} relevant documents:\n")
+
+        for i, result in enumerate(results, 1):
+            click.echo(f"{i}. Source: {result['source']}")
+            click.echo(f"   Filename: {result['filename']}")
+            click.echo(f"   Page: {result['page_number']}")
+            click.echo(f"   File Extension: {result['file_extension']}")
+            click.echo(f"   Content Preview: {result['content'][:200]}...")
+            click.echo(f"   Metadata: {result['metadata']}\n")
+
+        logger.info("Retrieval process completed successfully!")
+
+    except Exception as e:
+        logger.error(f"Error during retrieval process: {str(e)}")
+        click.echo(f"Error: {str(e)}")


 if __name__ == "__main__":
     cli()
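The option handling of the `retrieve` command can be exercised in isolation with click's `CliRunner`. This is a self-contained sketch, not the project's `cli.py`: `fake_search` is a stub standing in for `retrieval.search_documents_with_metadata`, so no vector store is needed.

```python
import click
from click.testing import CliRunner

# Stub standing in for retrieval.search_documents_with_metadata, so the
# command's argument/option plumbing can be tested without Qdrant.
def fake_search(query, collection_name, top_k):
    return [{"source": "data/doc.pdf", "content": f"match for {query!r}"}][:top_k]

@click.command(name="retrieve",
               help="Retrieve documents from vector database based on a query")
@click.argument("query")
@click.option("--collection-name", default="documents_langchain",
              help="Name of the vector store collection")
@click.option("--top-k", default=5, help="Number of documents to retrieve")
def retrieve(query, collection_name, top_k):
    results = fake_search(query, collection_name, top_k)
    if not results:
        click.echo("No relevant documents found for the query.")
        return
    click.echo(f"Found {len(results)} relevant documents:")
    for i, result in enumerate(results, 1):
        click.echo(f"{i}. Source: {result['source']}")

runner = CliRunner()
result = runner.invoke(retrieve, ["qdrant setup", "--top-k", "1"])
print(result.output)
```

`CliRunner.invoke` captures the echoed output and exit code, which makes the late import inside the real command (used there to avoid circular dependencies) easy to keep out of the test path.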
services/rag/langchain/retrieval.py (new file, 154 lines)
@@ -0,0 +1,154 @@
+"""Retrieval module for querying vector storage and returning relevant documents with metadata."""
+
+import os
+from typing import List, Optional
+from langchain_core.retrievers import BaseRetriever
+from langchain_core.callbacks import CallbackManagerForRetrieverRun
+from langchain_core.documents import Document
+from loguru import logger
+
+from vector_storage import initialize_vector_store
+
+
+class VectorStoreRetriever(BaseRetriever):
+    """
+    A custom retriever that uses the Qdrant vector store to retrieve relevant documents.
+    """
+
+    vector_store: object  # Qdrant vector store instance
+    top_k: int = 5  # Number of documents to retrieve
+
+    def _get_relevant_documents(
+        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
+    ) -> List[Document]:
+        """
+        Retrieve relevant documents based on the query.
+
+        Args:
+            query: The query string to search for
+            run_manager: Callback manager for the run
+
+        Returns:
+            List of relevant documents with metadata
+        """
+        logger.info(f"Searching for documents related to query: {query[:50]}...")
+
+        try:
+            # Perform similarity search on the vector store
+            results = self.vector_store.similarity_search(query, k=self.top_k)
+            logger.info(f"Found {len(results)} relevant documents")
+            return results
+        except Exception as e:
+            logger.error(f"Error during similarity search: {str(e)}")
+            return []
+
+
+def create_retriever(collection_name: str = "documents_langchain", top_k: int = 5):
+    """
+    Create and return a retriever instance connected to the vector store.
+
+    Args:
+        collection_name: Name of the Qdrant collection to use
+        top_k: Number of documents to retrieve
+
+    Returns:
+        VectorStoreRetriever instance
+    """
+    logger.info(f"Initializing vector store for retrieval from collection: {collection_name}")
+
+    # Initialize the vector store
+    vector_store = initialize_vector_store(collection_name=collection_name)
+
+    # Create and return the retriever
+    retriever = VectorStoreRetriever(vector_store=vector_store, top_k=top_k)
+    return retriever
+
+
+def search_documents(query: str, collection_name: str = "documents_langchain", top_k: int = 5) -> List[Document]:
+    """
+    Search for documents in the vector store based on the query.
+
+    Args:
+        query: The query string to search for
+        collection_name: Name of the Qdrant collection to use
+        top_k: Number of documents to retrieve
+
+    Returns:
+        List of documents with metadata
+    """
+    logger.info(f"Starting document search for query: {query}")
+
+    # Create the retriever
+    retriever = create_retriever(collection_name=collection_name, top_k=top_k)
+
+    # Perform the search
+    results = retriever.invoke(query)
+
+    logger.info(f"Search completed, returned {len(results)} documents")
+
+    return results
+
+
+def search_documents_with_metadata(
+    query: str,
+    collection_name: str = "documents_langchain",
+    top_k: int = 5
+) -> List[dict]:
+    """
+    Search for documents and return them with detailed metadata.
+
+    Args:
+        query: The query string to search for
+        collection_name: Name of the Qdrant collection to use
+        top_k: Number of documents to retrieve
+
+    Returns:
+        List of dictionaries containing document content and metadata
+    """
+    logger.info(f"Starting document search with metadata for query: {query}")
+
+    # Initialize the vector store
+    vector_store = initialize_vector_store(collection_name=collection_name)
+
+    try:
+        # Standard similarity search
+        documents = vector_store.similarity_search(query, k=top_k)
+
+        # Format results to include content and metadata
+        formatted_results = []
+        for doc in documents:
+            formatted_result = {
+                "content": doc.page_content,
+                "metadata": doc.metadata,
+                "source": doc.metadata.get("source", "Unknown"),
+                "filename": doc.metadata.get("filename", "Unknown"),
+                "page_number": doc.metadata.get("page_number", doc.metadata.get("page", "N/A")),
+                "file_extension": doc.metadata.get("file_extension", "N/A"),
+                "file_size": doc.metadata.get("file_size", "N/A")
+            }
+            formatted_results.append(formatted_result)
+
+        logger.info(f"Metadata search completed, returned {len(formatted_results)} documents")
+
+        return formatted_results
+
+    except Exception as e:
+        logger.error(f"Error during document search with metadata: {str(e)}")
+        return []
+
+
+if __name__ == "__main__":
+    # Example usage
+    query = "What is the main topic discussed in the documents?"
+    results = search_documents_with_metadata(query, top_k=5)
+
+    print(f"Found {len(results)} documents:")
+    for i, result in enumerate(results, 1):
+        print(f"\n{i}. Source: {result['source']}")
+        print(f"   Filename: {result['filename']}")
+        print(f"   Page: {result['page_number']}")
+        print(f"   Content preview: {result['content'][:200]}...")
+        print(f"   Metadata: {result['metadata']}")
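The dictionary shaping inside `search_documents_with_metadata` can be isolated as a pure-Python sketch. `FakeDocument` is a minimal stand-in for `langchain_core.documents.Document` (only the `page_content` and `metadata` attributes the formatting touches), so the fallback behavior of the metadata keys can be seen without a vector store.

```python
# Minimal stand-in for langchain_core.documents.Document: only the two
# attributes the formatting code reads.
class FakeDocument:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata


def format_result(doc):
    """Shape one retrieved document the way search_documents_with_metadata does."""
    return {
        "content": doc.page_content,
        "metadata": doc.metadata,
        "source": doc.metadata.get("source", "Unknown"),
        "filename": doc.metadata.get("filename", "Unknown"),
        # Some loaders store the page under "page" rather than "page_number"
        "page_number": doc.metadata.get("page_number", doc.metadata.get("page", "N/A")),
        "file_extension": doc.metadata.get("file_extension", "N/A"),
        "file_size": doc.metadata.get("file_size", "N/A"),
    }


doc = FakeDocument("Full text of the chunk", {"source": "data/a.pdf", "page": 3})
row = format_result(doc)
print(row["page_number"])  # 3 (falls back to the "page" key)
print(row["filename"])     # Unknown (key absent in this document)
```

The chained `.get` fallbacks are what let documents from different unstructured loaders (which disagree on metadata key names) come out in one uniform shape.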
@@ -4,7 +4,7 @@ import os
 from typing import Optional

 from dotenv import load_dotenv
-from langchain_community.vectorstores import Qdrant
+from langchain_qdrant import QdrantVectorStore
 from langchain_core.documents import Document
 from langchain_ollama import OllamaEmbeddings
 from qdrant_client import QdrantClient
@@ -23,7 +23,7 @@ OLLAMA_EMBEDDING_MODEL = os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text")

 def initialize_vector_store(
     collection_name: str = "documents_langchain", recreate_collection: bool = False
-) -> Qdrant:
+) -> QdrantVectorStore:
     """
     Initialize and return a Qdrant vector store with Ollama embeddings.
@@ -34,19 +34,18 @@ def initialize_vector_store(
     Returns:
         Initialized Qdrant vector store
     """
-    # Initialize Qdrant client
-    client = QdrantClient(
-        host=QDRANT_HOST,
-        port=QDRANT_REST_PORT,
-    )
-
     # Initialize Ollama embeddings
     embeddings = OllamaEmbeddings(
         model=OLLAMA_EMBEDDING_MODEL,
         base_url="http://localhost:11434",  # Default Ollama URL
     )

-    # Check if collection exists, if not create it
+    # Check if collection exists and create if needed
+    client = QdrantClient(
+        host=QDRANT_HOST,
+        port=QDRANT_REST_PORT,
+    )
+
     collection_exists = False
     try:
         client.get_collection(collection_name)
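The connection settings used above are read from the environment. A sketch of that configuration layer follows; the `OLLAMA_EMBEDDING_MODEL` default comes from the diff itself, while the `QDRANT_HOST`/`QDRANT_REST_PORT` defaults are assumptions for illustration (6333 is Qdrant's usual REST port).

```python
import os

# Environment-driven settings mirroring the names used in vector_storage.py.
# Defaults for the QDRANT_* variables are assumed; only the Ollama model
# default ("nomic-embed-text") appears verbatim in the diff.
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_REST_PORT = int(os.getenv("QDRANT_REST_PORT", "6333"))
OLLAMA_EMBEDDING_MODEL = os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text")

print(QDRANT_HOST, QDRANT_REST_PORT, OLLAMA_EMBEDDING_MODEL)
```

Keeping these in the environment (loaded via `load_dotenv`) lets the same module point at a local or containerized Qdrant without code changes.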
@@ -63,7 +62,6 @@ def initialize_vector_store(
     if not collection_exists:
         # Create collection using the Qdrant client directly
         from qdrant_client.http.models import Distance, VectorParams
-        import numpy as np

         # First, we need to determine the embedding size by creating a sample embedding
         sample_embedding = embeddings.embed_query("sample text for dimension detection")
@@ -75,25 +73,18 @@ def initialize_vector_store(
             vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
         )

-        # Now create the Qdrant instance connected to the newly created collection
-        vector_store = Qdrant(
-            client=client,
-            collection_name=collection_name,
-            embeddings=embeddings,
-        )
-    else:
-        # Collection exists, just connect to it
-        vector_store = Qdrant(
-            client=client,
-            collection_name=collection_name,
-            embeddings=embeddings,
-        )
+    # Create the Qdrant instance connected to the collection
+    vector_store = QdrantVectorStore(
+        client=client,
+        collection_name=collection_name,
+        embedding=embeddings,
+    )

     return vector_store


 def add_documents_to_vector_store(
-    vector_store: Qdrant, documents: list[Document], batch_size: int = 10
+    vector_store: QdrantVectorStore, documents: list[Document], batch_size: int = 10
 ) -> None:
     """
     Add documents to the vector store.
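The `batch_size` parameter of `add_documents_to_vector_store` implies a fixed-size slicing loop over the document list. A minimal sketch, with a list-collecting stub in place of `vector_store.add_documents`:

```python
def add_in_batches(add_documents, documents, batch_size=10):
    """Feed documents to add_documents in fixed-size slices, as the
    batch_size parameter of add_documents_to_vector_store implies."""
    for start in range(0, len(documents), batch_size):
        add_documents(documents[start:start + batch_size])


batches = []  # stub: collect the batches instead of writing to Qdrant
add_in_batches(batches.append, list(range(25)), batch_size=10)
print([len(b) for b in batches])  # [10, 10, 5]
```

Batching keeps each embedding/upsert request small, which matters when a single oversized request to the embedding model or to Qdrant would time out.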
@@ -109,7 +100,7 @@ def add_documents_to_vector_store(
         vector_store.add_documents(batch)


-def search_vector_store(vector_store: Qdrant, query: str, top_k: int = 5) -> list:
+def search_vector_store(vector_store: QdrantVectorStore, query: str, top_k: int = 5) -> list:
     """
     Search the vector store for similar documents.
@@ -138,7 +129,7 @@ OPENROUTER_EMBEDDING_MODEL = os.getenv("OPENROUTER_EMBEDDING_MODEL", "openai/tex

 def initialize_vector_store_with_openrouter(
     collection_name: str = "documents_langchain"
-) -> Qdrant:
+) -> QdrantVectorStore:
     # Initialize Qdrant client
     client = QdrantClient(
         host=QDRANT_HOST,
@@ -153,7 +144,7 @@ def initialize_vector_store_with_openrouter(
     )

     # Create or get the vector store
-    vector_store = Qdrant(
+    vector_store = QdrantVectorStore(
         client=client,
         collection_name=collection_name,
         embeddings=embeddings,