idchlife 9188b672c2 preparations for demo html page

2026-02-04 22:50:24 +03:00

RAG Solution with Langchain

Project Overview

This is a Retrieval-Augmented Generation (RAG) solution built on the LangChain framework. It loads documents from a data directory, stores them in a Qdrant vector database, and provides semantic search and chat through local LLMs served by Ollama or through OpenAI-compatible APIs.

Development is phased, and each capability (document loading, retrieval, chat) is exposed through its own CLI entry point.

Key Technologies:

Framework: Langchain
Vector Storage: Qdrant
Embeddings: Ollama (with fallback option for OpenAI via OpenRouter)
Chat Models: Ollama and OpenAI-compatible APIs
Data Directory: ./../../../data (relative to project root)
Virtual Environment: Python venv in venv/ directory

Project Structure

rag-solution/services/rag/langchain/
├── .env               # Environment variables
├── .env.dist          # Environment variable template
├── .gitignore         # Git ignore rules
├── app.py             # Main application file (currently empty)
├── cli.py             # CLI entrypoint with click library
├── EXTENSIONS.md      # Supported file extensions and LangChain loaders
├── enrichment.py      # Document enrichment module for loading documents to vector storage
├── PLANNING.md        # Development roadmap and phases
├── QWEN.md            # Current file - project context
├── requirements.txt   # Python dependencies
├── server.py          # Web server with API endpoints for the RAG agent
├── vector_storage.py  # Vector storage module with Qdrant and Ollama embeddings
└── venv/              # Virtual environment

Dependencies

The project relies on several key libraries:

langchain and related ecosystem (langchain-community, langchain-core, langchain-ollama, langchain-openai)
langgraph for workflow management
qdrant-client for vector storage (to be installed)
ollama for local LLM interaction
click for CLI interface
loguru for logging (to be installed per requirements)
python-dotenv for environment management

Development Phases

The project is organized into 8 development phases as outlined in PLANNING.md:

Phase 1: CLI Entrypoint

Virtual environment setup
Create CLI with click library
Implement "ping" command that outputs "pong"

Phase 2: Framework Installation & Data Analysis

Install Langchain as base RAG framework
Analyze data folder extensions and create EXTENSIONS.md
Install required loader libraries
Configure environment variables

Phase 3: Vector Storage Setup

Install Qdrant client library
Create vector_storage.py for initialization
Configure Ollama embeddings using OLLAMA_EMBEDDING_MODEL
Prepare OpenAI fallback (commented)

Phase 4: Document Loading Module

Create enrichment.py for loading documents to vector storage
Implement text splitting strategies
Add document tracking to prevent re-processing
Integrate with CLI
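
The text splitting step can be illustrated with a minimal sketch. It assumes a plain fixed-size character splitter with overlap, which is a stand-in and not necessarily the strategy used in enrichment.py (the plan does not name one). The names split_text, chunk_size, and overlap are hypothetical.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between the starts of consecutive chunks
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.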

Phase 5: Retrieval Feature

Create retrieval.py for querying vector storage
Implement metadata retrieval (filename, page, section, etc.)

Phase 6: Chat Agent

Create agent.py with Ollama-powered chat agent
Integrate with retrieval functionality
Add CLI command for chat interaction

Phase 7: OpenAI Integration for Chat Model

Create OpenAI-compatible integration using .env variables OPENAI_CHAT_URL and OPENAI_CHAT_KEY
Make this integration optional using .env variable CHAT_MODEL_STRATEGY with "ollama" as default
Allow switching between "ollama" and "openai" strategies

Phase 8: HTTP Endpoint

Create web framework with POST endpoint /api/test-query for agent queries
Implement server using FastAPI and LangServe
Add request/response validation with Pydantic models
Include CORS middleware for cross-origin requests
Add health check endpoint

Environment Configuration

The project uses environment variables for configuration:

OLLAMA_EMBEDDING_MODEL=MODEL      # Name of the Ollama model for embeddings
OLLAMA_CHAT_MODEL=MODEL           # Name of the Ollama model for chat
OPENAI_CHAT_URL=URL               # OpenAI-compatible API URL
OPENAI_CHAT_KEY=KEY               # Authorization token for OpenAI-compatible API
OPENAI_CHAT_MODEL=MODEL           # Name of the OpenAI-compatible model to use
CHAT_MODEL_STRATEGY=ollama        # Strategy to use: "ollama" (default) or "openai"
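
As a hedged illustration of how these variables might be read, the sketch below parses them from a mapping that defaults to os.environ. It assumes .env has already been loaded into the process environment (for example by python-dotenv). ChatSettings and load_chat_settings are hypothetical names, and both the case-insensitive strategy match and the early failure on a missing URL or key are assumptions, not documented project behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_STRATEGIES = ("ollama", "openai")


@dataclass(frozen=True)
class ChatSettings:
    strategy: str
    ollama_chat_model: Optional[str]
    openai_chat_url: Optional[str]
    openai_chat_key: Optional[str]
    openai_chat_model: Optional[str]


def load_chat_settings(env: Mapping[str, str] = os.environ) -> ChatSettings:
    """Read chat-related settings; CHAT_MODEL_STRATEGY defaults to "ollama"."""
    strategy = env.get("CHAT_MODEL_STRATEGY", "ollama").strip().lower()
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"CHAT_MODEL_STRATEGY must be one of {VALID_STRATEGIES}, got {strategy!r}"
        )

    settings = ChatSettings(
        strategy=strategy,
        ollama_chat_model=env.get("OLLAMA_CHAT_MODEL"),
        openai_chat_url=env.get("OPENAI_CHAT_URL"),
        openai_chat_key=env.get("OPENAI_CHAT_KEY"),
        openai_chat_model=env.get("OPENAI_CHAT_MODEL"),
    )

    # Assumption: the "openai" strategy is unusable without a URL and a key,
    # so report the missing names up front instead of failing on first request.
    if strategy == "openai":
        required = {
            "OPENAI_CHAT_URL": settings.openai_chat_url,
            "OPENAI_CHAT_KEY": settings.openai_chat_key,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"Missing required variables: {', '.join(missing)}")
    return settings
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary.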

Building and Running

Since the project is in early development stages, the following steps are planned:

Setup Virtual Environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Install Missing Dependencies (as development progresses):

pip install loguru qdrant-client  # Examples of needed libraries

Configure Environment:

cp .env.dist .env
# Edit .env with appropriate values

Run CLI Commands:
```
python cli.py ping
```

Development Conventions

Use loguru for logging with rotation to logs/dev.log and stdout
Follow Langchain best practices for RAG implementations
Prioritize open-source solutions that don't require external API keys
Implement proper error handling and document processing tracking
Use modular code organization with separate files for different components

Current Status

The virtual environment is set up and dependencies are defined. Phases 5 through 8 (retrieval, chat agent, OpenAI-compatible chat models, HTTP endpoint) are implemented, as recorded in the implementation notes below together with fixes to document loading and vector storage.

Important Implementation Notes

OCR and Computer Vision Setup

Added Tesseract OCR support for image text extraction
Installed pytesseract and unstructured-pytesseract packages
Configured image loaders to use OCR strategy ("ocr_only") for extracting text from images
This resolves the "OCRAgent instance" error when processing image files

Russian Language Processing Configuration

Installed spaCy library and Russian language model (ru_core_news_sm)
Configured unstructured loaders to use Russian as the primary language ("languages": ["rus"])
This improves processing accuracy for Russian documents in the dataset

Qdrant Collection Auto-Creation Fix

Fixed issue where Qdrant collections were not being created automatically
Implemented logic to check if collection exists and create it if needed
Uses Qdrant client's create_collection method with proper vector size detection
Resolves the "Collection doesn't exist" 404 error during document insertion
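
The check-then-create logic can be sketched independently of the client library. The helper below takes three callables so that it runs without a Qdrant server. With qdrant-client they would wrap the client's collection_exists and create_collection calls, and embed_query would be the embedding model's method of that name. The helper name ensure_collection and the probe text are hypothetical, and detecting the vector size by embedding a probe string is an assumption about how "proper vector size detection" could work.

```python
from typing import Callable, Sequence


def ensure_collection(
    name: str,
    collection_exists: Callable[[str], bool],
    create_collection: Callable[[str, int], None],
    embed_query: Callable[[str], Sequence[float]],
) -> bool:
    """Create collection `name` if it is missing; return True if it was created."""
    if collection_exists(name):
        return False

    # Embed a short probe string and use the length of the resulting vector
    # as the collection's dimensionality, so it always matches the model.
    vector_size = len(embed_query("dimension probe"))
    create_collection(name, vector_size)
    return True
```

Calling this once before the first insertion is idempotent: a second call finds the collection and does nothing.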

Document Tracking Improvement

Modified document tracking to only mark documents as processed after successful vector storage insertion
This prevents documents from being marked as processed if vector storage insertion fails
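
A minimal sketch of this ordering follows. It assumes documents are identified by a SHA-256 hash of their content and that the processed set is persisted to a JSON file. Both choices, and the names ProcessedTracker and process, are illustrative and not taken from enrichment.py.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class ProcessedTracker:
    """Remember which files were stored, keyed by a hash of their content."""

    def __init__(self, state_file: Path) -> None:
        self.state_file = state_file
        self.processed: set[str] = set()
        if state_file.exists():
            self.processed = set(json.loads(state_file.read_text(encoding="utf-8")))

    @staticmethod
    def fingerprint(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def process(self, path: Path, insert: Callable[[Path], None]) -> bool:
        """Insert `path` unless already processed; return True if it was inserted."""
        key = self.fingerprint(path)
        if key in self.processed:
            return False

        # If insert() raises, the lines below never run, so the file stays
        # unmarked and will be retried on the next run.
        insert(path)

        self.processed.add(key)
        self.state_file.write_text(json.dumps(sorted(self.processed)), encoding="utf-8")
        return True
```

Here insert stands for whatever function loads, splits, and stores one document in vector storage.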

Dependency Management

Added several new dependencies for enhanced functionality:
- pdf2image for PDF-to-image conversion
- unstructured-inference for advanced document analysis
- python-pptx for PowerPoint processing
- pi-heif for HEIF image format support
- spacy and ru-core-news-sm for Russian NLP

Error Handling Improvements

Enhanced error handling for optional dependencies in document loaders
Added graceful degradation when optional modules are not available
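
One common way to degrade gracefully is to attempt the import and fall back to None, as in the sketch below. It uses the standard logging module as a stand-in for loguru so that it has no third-party requirements. The helper name optional_import and the PPTX_AVAILABLE flag are hypothetical.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None (with a warning) if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning(
            "Optional dependency %r is not installed; related loaders are disabled",
            module_name,
        )
        return None


# The python-pptx package is imported under the name "pptx".
PPTX_AVAILABLE = optional_import("pptx") is not None
```

A loader registry can then skip PowerPoint files when PPTX_AVAILABLE is false instead of aborting the whole run.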

Phase 5 Implementation Notes

Created retrieval.py module with LangChain Retriever functionality
Implemented search functions that retrieve documents with metadata from Qdrant vector storage
Added CLI command retrieve to search the vector database based on a query
Retrieval returns documents with metadata including source, filename, page number, file extension, etc.
Used QdrantVectorStore from langchain-qdrant package for compatibility with newer LangChain versions
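
To make the idea of retrieval with metadata concrete, here is a toy in-memory version. It is a stand-in for what Qdrant does server-side, not the code in retrieval.py: it ranks stored chunks by cosine similarity to a query vector and returns each hit with its metadata. StoredChunk, cosine_similarity, and search are hypothetical names, and the vectors would in practice come from the embedding model.

```python
import math
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class StoredChunk:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as dissimilar


def search(
    query_vector: Sequence[float], chunks: Sequence[StoredChunk], top_k: int = 3
) -> list[tuple[float, StoredChunk]]:
    """Return the top_k (score, chunk) pairs, best match first."""
    scored = [(cosine_similarity(query_vector, c.vector), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because each chunk carries its own metadata dictionary, fields such as filename and page number come back with every hit without a second lookup.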

Phase 6 Implementation Notes

Created agent.py module with Ollama-powered chat agent using LangGraph
Integrated the agent with retrieval functionality to provide context-aware responses
Added CLI command chat for interactive conversation with the RAG agent
Agent uses document retrieval tool to fetch relevant information based on user queries
Implemented proper error handling and conversation history management

Phase 7 Implementation Notes

Enhanced agent.py to support both Ollama and OpenAI-compatible chat models
Added conditional logic to select chat model based on CHAT_MODEL_STRATEGY environment variable
When strategy is "openai", uses ChatOpenAI with OPENAI_CHAT_URL and OPENAI_CHAT_KEY from environment
When strategy is "ollama" (default), uses existing ChatOllama implementation
Updated CLI chat command to show which model strategy is being used

Phase 8 Implementation Notes

Created server.py with FastAPI and integrated with existing agent functionality
Implemented /api/test-query POST endpoint that accepts JSON with "query" field
Added request/response validation using Pydantic models
Included CORS middleware to support cross-origin requests
Added health check endpoint at root path
Server runs on port 8000 by default
Supports both Ollama and OpenAI strategies through existing configuration
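
A client call against this endpoint could look like the sketch below, which uses only the standard library. The host localhost is an assumption (the notes state only the port), the function names are hypothetical, and the response is assumed to be a JSON object whose exact fields are not documented here.

```python
import json
import urllib.request


def build_query_request(
    query: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST request for /api/test-query with a JSON body."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/test-query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(
    query: str, base_url: str = "http://localhost:8000", timeout: float = 30.0
) -> dict:
    """Send the query to a running server and decode the JSON response."""
    request = build_query_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload shape easy to inspect without a running server.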

Issue Fix Notes

Fixed DocumentRetrievalTool class to properly declare and initialize the retriever field
Resolved Pydantic field declaration issue that caused "object has no field" error
Ensured proper initialization sequence for the retriever within the tool class

Troubleshooting Notes

If you encounter a "No module named 'unstructured_inference'" error, install the unstructured-inference package
If you see OCR-related errors, ensure Tesseract is installed at the system level and unstructured-pytesseract is available
For language detection issues, verify that the required spaCy models (for example ru_core_news_sm) are downloaded
