Retrieval and also update on Russian language

2026-02-04 16:51:50 +03:00
parent 3dea3605ad
commit ea4ce23cd9
7 changed files with 572 additions and 17 deletions

@@ -4,6 +4,8 @@
This is a Retrieval Augmented Generation (RAG) solution built using LlamaIndex as the primary framework and Qdrant as the vector storage. The project is designed to load documents from a shared data directory, store them in a vector database, and enable semantic search and chat capabilities using local Ollama models.
The system has been enhanced to handle Russian-language documents with Cyrillic characters, preserving correct encoding during document loading, storage, and retrieval.
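
As a rough illustration of the encoding handling described above, the sketch below reads a text file by trying a short list of encodings in order. The function name `read_text_with_fallback` and the candidate list are assumptions made for this example; the project's actual loader may work differently.

```python
from pathlib import Path

# Candidate encodings, tried in order. This list is an assumption:
# UTF-8 (with or without BOM) first, then the legacy Windows Cyrillic code page.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1251")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep the document loadable and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Single-byte code pages such as cp1251 accept almost any byte sequence, so this is only a heuristic: a file in another legacy encoding would decode without an error but produce garbled text. Trying UTF-8 first is the safer order because invalid UTF-8 fails loudly.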
### Key Technologies
- **RAG Framework**: LlamaIndex
- **Vector Storage**: Qdrant
@@ -64,6 +66,7 @@ This is a Retrieval Augmented Generation (RAG) solution built using LlamaIndex a
- Use text splitters appropriate for each document type
- Store metadata (filename, page, section, paragraph) with embeddings
- Track processed documents to avoid re-processing (using SQLite if needed)
- Handle the encoding of Russian/Cyrillic text correctly during loading and retrieval
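
The document-tracking bullet above could be sketched as follows, assuming a content hash is used to decide whether a file needs re-processing. The table name `processed_documents` and all function names are hypothetical; the source only says that SQLite may be used.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file contents, used to detect changed documents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def open_tracker(db_path):
    """Open (or create) the tracking database. The schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_documents ("
        "path TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def needs_processing(conn, path):
    """True if the file is new or its contents changed since the last run."""
    row = conn.execute(
        "SELECT digest FROM processed_documents WHERE path = ?", (str(path),)
    ).fetchone()
    return row is None or row[0] != file_digest(path)


def mark_processed(conn, path):
    """Record the current digest once a document has been embedded."""
    conn.execute(
        "INSERT OR REPLACE INTO processed_documents (path, digest) VALUES (?, ?)",
        (str(path), file_digest(path)),
    )
    conn.commit()
```

An enrichment run would call `needs_processing` before loading each file and `mark_processed` after its embeddings are stored, so unchanged files are skipped on later runs.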
### Vector Storage
- Collection name: "documents_llamaindex"
@@ -95,10 +98,12 @@ This is a Retrieval Augmented Generation (RAG) solution built using LlamaIndex a
- [x] Text splitting strategies implementation
- [x] Document tracking mechanism
- [x] CLI command for enrichment
- [x] Russian language/Cyrillic text encoding support during document loading
### Phase 5: Retrieval Feature
-- [ ] Retrieval module configuration
-- [ ] Query processing with metadata retrieval
+- [x] Retrieval module configuration
+- [x] Query processing with metadata retrieval
+- [x] Russian language/Cyrillic text encoding support
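
One piece of Cyrillic support at query time might be Unicode normalization, so that a query typed with decomposed characters (for example "и" followed by a combining breve) matches stored text that uses the composed "й". The sketch below is an assumption about what such a step could look like; the source does not say how the retrieval module handles this, and `normalize_query` is a hypothetical name.

```python
import unicodedata


def normalize_query(text):
    """Normalize a query to NFC and collapse runs of whitespace."""
    # NFC composes base letter + combining mark pairs into single code points.
    normalized = unicodedata.normalize("NFC", text)
    # split() with no arguments also trims leading and trailing whitespace.
    return " ".join(normalized.split())
```

Applying the same normalization to document text during enrichment would keep stored and queried strings comparable.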
### Phase 6: Chat Agent
- [ ] Agent module with Ollama integration
@@ -134,4 +139,5 @@ The system expects documents to be placed in `./../../../data` relative to the p
- Ensure Ollama is running on port 11434
- Verify Qdrant is accessible on ports 6333 (REST) and 6334 (gRPC)
- Check that the data directory contains supported file types
- Review logs in `logs/dev.log` for detailed error information
- For Russian/Cyrillic text issues, verify that encoding handling is configured consistently in both the enrichment and retrieval modules
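
The first two checks in the list above can be automated with a small connectivity probe. This is a minimal sketch: the ports come from the notes above, while the `localhost` host and the function names are assumptions.

```python
import socket

# Ports taken from the troubleshooting notes; the host is assumed to be local.
SERVICES = {"Ollama": 11434, "Qdrant REST": 6333, "Qdrant gRPC": 6334}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}
```

Calling `check_services()` returns a dictionary of service names and booleans. An open port only shows that something is listening there, not that the service behind it is healthy.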