llamaindex update + unpacking archives in data

2026-02-09 19:00:23 +03:00
parent 0adbc29692
commit f9c47c772f
11 changed files with 478 additions and 100 deletions


@@ -16,6 +16,7 @@ The system has been enhanced to properly handle Russian language documents with
### Architecture Components
- CLI entry point (`cli.py`)
- Configuration module (`config.py`) - manages model strategies and environment variables
- Document enrichment module (`enrichment.py`)
- Vector storage configuration (`vector_storage.py`)
- Retrieval module (`retrieval.py`)
@@ -57,9 +58,15 @@ The system has been enhanced to properly handle Russian language documents with
- Use appropriate log levels (DEBUG, INFO, WARNING, ERROR)
### Environment Variables
- `CHAT_STRATEGY`: Strategy for chat models ("ollama" or "openai")
- `EMBEDDING_STRATEGY`: Strategy for embedding models ("ollama" or "openai")
- `OLLAMA_EMBEDDING_MODEL`: Name of the Ollama model to use for embeddings
- `OLLAMA_CHAT_MODEL`: Name of the Ollama model to use for chat functionality
- API keys for external services (OpenRouter option available but commented out)
- `OPENAI_CHAT_URL`: URL for OpenAI-compatible chat API (when using OpenAI strategy)
- `OPENAI_CHAT_KEY`: API key for OpenAI-compatible chat API (when using OpenAI strategy)
- `OPENAI_EMBEDDING_MODEL`: Name of the OpenAI embedding model (when using OpenAI strategy)
- `OPENAI_EMBEDDING_BASE_URL`: Base URL for OpenAI-compatible embedding API (when using OpenAI strategy)
- `OPENAI_EMBEDDING_API_KEY`: API key for OpenAI-compatible embedding API (when using OpenAI strategy)
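The variables above could be read by a small helper along these lines. This is only a sketch, not the project's actual `config.py`: the function name `resolve_model_settings`, the returned dictionary layout, and the `"ollama"` default are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

VALID_STRATEGIES = ("ollama", "openai")


def _strategy(env: Mapping[str, str], name: str) -> str:
    # Assumption: fall back to "ollama" when the variable is unset.
    value = env.get(name, "ollama").strip().lower()
    if value not in VALID_STRATEGIES:
        raise ValueError(f"{name} must be one of {VALID_STRATEGIES}, got {value!r}")
    return value


def resolve_model_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: map the strategy variables to a plain settings dict."""
    env = os.environ if env is None else env
    chat = _strategy(env, "CHAT_STRATEGY")
    embedding = _strategy(env, "EMBEDDING_STRATEGY")

    settings = {"chat": {"strategy": chat}, "embedding": {"strategy": embedding}}

    if chat == "ollama":
        settings["chat"]["model"] = env.get("OLLAMA_CHAT_MODEL")
    else:
        settings["chat"]["base_url"] = env.get("OPENAI_CHAT_URL")
        settings["chat"]["api_key"] = env.get("OPENAI_CHAT_KEY")

    if embedding == "ollama":
        settings["embedding"]["model"] = env.get("OLLAMA_EMBEDDING_MODEL")
    else:
        settings["embedding"]["model"] = env.get("OPENAI_EMBEDDING_MODEL")
        settings["embedding"]["base_url"] = env.get("OPENAI_EMBEDDING_BASE_URL")
        settings["embedding"]["api_key"] = env.get("OPENAI_EMBEDDING_API_KEY")

    return settings
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to test, and the chat and embedding strategies are resolved independently, so they can be mixed.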
### Document Processing
- Support multiple file formats based on EXTENSIONS.md
@@ -105,7 +112,19 @@ The system has been enhanced to properly handle Russian language documents with
- [x] Query processing with metadata retrieval
- [x] Russian language/Cyrillic text encoding support
### Phase 6: Chat Agent
### Phase 6: Model Strategy
- [x] Add `CHAT_STRATEGY` and `EMBEDDING_STRATEGY` environment variables
- [x] Add OpenAI configuration options to .env files
- [x] Create reusable model configuration function
- [x] Update all modules to use the new configuration system
- [x] Ensure proper .env loading across all modules
### Phase 7: Enhanced Logging and Progress Tracking
- [x] Add a `tqdm` progress bar for document processing
- [x] Log the total file count and the processed count during document enrichment
- [x] Show percentage and counts as user feedback during document processing
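The progress tracking listed above might look roughly like the following. This is a sketch under stated assumptions, not the code in `enrichment.py`: `process_files`, its `handle` callback, and the logger name are hypothetical, and the import fallback exists only so the example runs where `tqdm` is not installed.

```python
import logging

try:
    from tqdm import tqdm
except ImportError:
    # Fallback for illustration only: iterate without a progress bar.
    def tqdm(iterable, **kwargs):
        return iterable

logger = logging.getLogger("enrichment")


def process_files(paths, handle):
    """Hypothetical loop: run `handle` on each path and report progress."""
    paths = list(paths)  # materialize so the total is known up front
    logger.info("Found %d files to process", len(paths))

    processed = 0
    for path in tqdm(paths, desc="Enriching", unit="file"):
        try:
            handle(path)
            processed += 1
        except Exception:
            # Record the failure and continue with the remaining files.
            logger.exception("Failed to process %s", path)

    logger.info("Processed %d of %d files", processed, len(paths))
    return processed
```

Because the total is known before the loop starts, `tqdm` can display a percentage alongside the running count.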
### Phase 8: Chat Agent
- [ ] Agent module with Ollama integration
- [ ] Integration with retrieval module
- [ ] CLI command for chat functionality
@@ -115,9 +134,10 @@ The system has been enhanced to properly handle Russian language documents with
llamaindex/
├── venv/ # Python virtual environment
├── cli.py # CLI entry point
├── config.py # Configuration module for model strategies
├── vector_storage.py # Vector storage configuration
├── enrichment.py # Document loading and processing (to be created)
├── retrieval.py # Search and retrieval functionality (to be created)
├── enrichment.py # Document loading and processing
├── retrieval.py # Search and retrieval functionality
├── agent.py # Chat agent implementation (to be created)
├── EXTENSIONS.md # Supported file extensions and loaders
├── .env.dist # Environment variable template
@@ -140,4 +160,8 @@ The system expects documents to be placed in `./../../../data` relative to the p
- Verify Qdrant is accessible on ports 6333 (REST) and 6334 (gRPC)
- Check that the data directory contains supported file types
- Review logs in `logs/dev.log` for detailed error information
- For Russian/Cyrillic text issues, ensure proper encoding handling is configured in both enrichment and retrieval modules
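The Qdrant port check from the list above can be done with a plain TCP probe. This is a minimal sketch that assumes Qdrant runs on `localhost`; the helper name `port_is_open` is made up for this example, and a successful connection only shows that something is listening on the port, not that it is Qdrant.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the troubleshooting list: 6333 (REST) and 6334 (gRPC).
    for name, port in (("REST", 6333), ("gRPC", 6334)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"Qdrant {name} port {port}: {state}")
```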
## Important Notes
- Do not test long-running or heavy system scripts during development as they can consume significant system resources and take hours to complete
- The enrich command processes all files in the data directory and may require substantial memory and processing time