Working retrieval with the CLI

2026-02-03 23:25:24 +03:00
parent 4cbd5313d2
commit 299ee0acb5
5 changed files with 274 additions and 31 deletions
--- a/services/rag/langchain/QWEN.md
+++ b/services/rag/langchain/QWEN.md
@@ -71,8 +71,8 @@ The project is organized into 6 development phases as outlined in `PLANNING.md`:
 - [x] Integrate with CLI

 ### Phase 5: Retrieval Feature
- [ ] Create `retrieval.py` for querying vector storage
- [ ] Implement metadata retrieval (filename, page, section, etc.)
+- [x] Create `retrieval.py` for querying vector storage
+- [x] Implement metadata retrieval (filename, page, section, etc.)

 ### Phase 6: Chat Agent
 - [ ] Create `agent.py` with Ollama-powered chat agent
@@ -125,4 +125,51 @@ Since the project is in early development stages, the following steps are planne

 ## Current Status

-The project is in early development phase. The virtual environment is set up and dependencies are defined, but the core functionality (CLI, document loading, vector storage, etc.) is yet to be implemented according to the planned phases.
+The project is in early development phase. The virtual environment is set up and dependencies are defined, but the core functionality (CLI, document loading, vector storage, etc.) is yet to be implemented according to the planned phases.
+
+## Important Implementation Notes
+
+### OCR and Computer Vision Setup
+- Added Tesseract OCR support for image text extraction
+- Installed `pytesseract` and `unstructured-pytesseract` packages
+- Configured image loaders to use OCR strategy ("ocr_only") for extracting text from images
+- This resolves the "OCRAgent instance" error when processing image files
+
+### Russian Language Processing Configuration
+- Installed spaCy library and Russian language model (`ru_core_news_sm`)
+- Configured unstructured loaders to use Russian as the primary language (`"languages": ["rus"]`)
+- This improves processing accuracy for Russian documents in the dataset
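The OCR strategy and language setting noted above could be passed to an image loader roughly as follows. This is a minimal sketch: the source does not name the loader class, so `UnstructuredImageLoader` from `langchain_community` is an assumption, and `IMAGE_LOADER_KWARGS` and `make_image_loader` are hypothetical names used only for illustration.

```python
from typing import Any

# Keyword arguments forwarded to unstructured's partitioning step.
# The values come from the notes above: OCR-only strategy, Russian language.
IMAGE_LOADER_KWARGS: dict[str, Any] = {
    "strategy": "ocr_only",
    "languages": ["rus"],
}


def make_image_loader(path: str) -> Any:
    """Build an image loader configured for OCR of Russian text."""
    # Imported lazily: the loader class is an assumption, and the import
    # needs langchain-community and unstructured to be installed.
    from langchain_community.document_loaders import UnstructuredImageLoader

    return UnstructuredImageLoader(path, **IMAGE_LOADER_KWARGS)
```

Keeping the keyword arguments in one dictionary means every image loader picks up the same strategy and language, which is presumably the point of the configuration change.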
+
+### Qdrant Collection Auto-Creation Fix
+- Fixed an issue where Qdrant collections were not being created automatically
+- Implemented logic to check whether the collection exists and create it if needed
+- Uses the Qdrant client's `create_collection` method with proper vector size detection
+- Resolves the "Collection doesn't exist" 404 error during document insertion
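The check-then-create logic described above might look like the sketch below. It is not the project's actual code: `detect_vector_size` and `ensure_collection` are hypothetical helpers, probing the embedding function to find the vector size is one possible reading of "vector size detection", and cosine distance is an assumed default.

```python
from typing import Any, Callable, Sequence


def detect_vector_size(embed_query: Callable[[str], Sequence[float]]) -> int:
    """Embed a throwaway probe string and measure the vector length."""
    return len(embed_query("dimension probe"))


def ensure_collection(client: Any, name: str, vector_size: int) -> bool:
    """Create `name` if it is missing. Returns True when it was created."""
    if client.collection_exists(name):
        return False
    # Imported lazily so the helper can be exercised with a stub client.
    from qdrant_client.models import Distance, VectorParams

    client.create_collection(
        collection_name=name,
        # Cosine distance is an assumption; the source does not state the metric.
        vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
    )
    return True
```

Calling `ensure_collection` once before the first insertion should be enough to avoid the 404, since later calls return early when the collection already exists.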
+
+### Document Tracking Improvement
+- Modified document tracking to mark documents as processed only after successful vector storage insertion
+- This prevents a document from being marked as processed when its vector storage insertion fails
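The ordering that matters here is "insert first, mark second". A minimal sketch of that ordering, assuming a set of processed paths and a hypothetical `store_file` callable that loads, embeds, and inserts one file (neither name comes from the source):

```python
from typing import Callable, Iterable


def ingest_files(
    paths: Iterable[str],
    store_file: Callable[[str], None],
    processed: set[str],
) -> list[str]:
    """Store each unprocessed file; return the paths whose insertion failed."""
    failed: list[str] = []
    for path in paths:
        if path in processed:
            continue  # already stored in an earlier run
        try:
            store_file(path)  # load, embed, and insert into vector storage
        except Exception:
            failed.append(path)  # left unmarked so a later run can retry it
            continue
        processed.add(path)  # reached only after a successful insertion
    return failed
```

Because a failed file never enters `processed`, rerunning the ingestion picks it up again without any extra bookkeeping.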
+
+### Dependency Management
+- Added several new dependencies for enhanced functionality:
+  - `pdf2image` for PDF-to-image conversion
+  - `unstructured-inference` for advanced document analysis
+  - `python-pptx` for PowerPoint processing
+  - `pi-heif` for HEIF image format support
+  - `spacy` and `ru-core-news-sm` for Russian NLP
+
+### Error Handling Improvements
+- Enhanced error handling for optional dependencies in document loaders
+- Added graceful degradation when optional modules are not available
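One common way to get this kind of graceful degradation is to attempt the import and fall back to skipping the affected loaders. The sketch below illustrates that pattern only; `optional_import` and `available_extensions` are hypothetical names, and the project's loaders may handle missing modules differently.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("Optional module %r is missing; related loaders are skipped", module_name)
        return None


def available_extensions(requirements: dict[str, str]) -> list[str]:
    """Keep only the file extensions whose backing module can be imported."""
    return [ext for ext, module in requirements.items() if optional_import(module)]
```

For example, `available_extensions({".pptx": "pptx", ".heic": "pi_heif"})` would list only the formats whose packages are installed; that mapping is illustrative, not taken from the project.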
+
+### Phase 5 Implementation Notes
+- Created `retrieval.py` module with LangChain Retriever functionality
+- Implemented search functions that retrieve documents with metadata from Qdrant vector storage
+- Added CLI command `retrieve` to search the vector database based on a query
+- Retrieval returns documents with metadata including source, filename, page number, file extension, etc.
+- Used `QdrantVectorStore` from the `langchain-qdrant` package for compatibility with newer LangChain versions
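A retrieval call along the lines described above could be sketched as follows. This is not the contents of `retrieval.py`: `open_store` and `search` are hypothetical helpers, and the embeddings object, collection name, and URL are parameters because the source does not specify them.

```python
from typing import Any


def open_store(embeddings: Any, collection_name: str, url: str) -> Any:
    """Connect to an existing Qdrant collection through LangChain."""
    # Imported lazily: requires the langchain-qdrant package.
    from langchain_qdrant import QdrantVectorStore

    return QdrantVectorStore.from_existing_collection(
        embedding=embeddings,
        collection_name=collection_name,
        url=url,
    )


def search(vector_store: Any, query: str, k: int = 4) -> list[dict[str, Any]]:
    """Run a similarity search and return each hit's text and metadata."""
    # `similarity_search` belongs to LangChain's vector store interface;
    # each hit is a Document carrying `page_content` and `metadata`.
    hits = vector_store.similarity_search(query, k=k)
    return [{"text": doc.page_content, "metadata": dict(doc.metadata)} for doc in hits]
```

Returning plain dictionaries keeps the CLI layer independent of LangChain types, so the `retrieve` command only has to print the text and whichever metadata keys are present.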
+
+### Troubleshooting Notes
+- If you encounter a "No module named 'unstructured_inference'" error, install `unstructured-inference`
+- If you see OCR-related errors, ensure that Tesseract is installed at the system level and that `unstructured-pytesseract` is available
+- For language detection issues, verify that the appropriate spaCy models are downloaded
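The three checks above can be bundled into a quick diagnostic. This is a hedged sketch rather than part of the project: `check_environment` is a hypothetical helper, and the import names `unstructured_pytesseract` and `ru_core_news_sm` are assumed to match the installed packages.

```python
import importlib.util
import shutil


def check_environment() -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    # Tesseract is a system binary, so look for it on PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract binary not found on PATH")
    # The remaining requirements are Python packages.
    for module in ("unstructured_inference", "unstructured_pytesseract", "ru_core_news_sm"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module {module!r} is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script before ingestion gives a single list of what is missing instead of one error per failed document.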