# RAG Solution with Langchain

## Project Overview

This is a Retrieval-Augmented Generation (RAG) solution built using the Langchain framework. The project is designed to load documents from a data directory, store them in a vector database (Qdrant), and enable semantic search and chat capabilities using local LLMs via Ollama or OpenAI-compatible APIs.

The project follows a phased development approach with CLI entry points for different functionalities like document loading, retrieval, and chat.

### Key Technologies:
- **Framework**: Langchain
- **Vector Storage**: Qdrant
- **Embeddings**: Ollama (with fallback option for OpenAI via OpenRouter)
- **Chat Models**: Ollama and OpenAI-compatible APIs
- **Data Directory**: `./../../../data` (relative to project root)
- **Virtual Environment**: Python venv in `venv/` directory

## Project Structure

```
rag-solution/services/rag/langchain/
├── .env               # Environment variables
├── .env.dist          # Environment variable template
├── .gitignore         # Git ignore rules
├── agent.py           # Chat agent built with LangGraph (Ollama or OpenAI-compatible models)
├── app.py             # Main application file (currently empty)
├── cli.py             # CLI entrypoint with click library
├── EXTENSIONS.md      # Supported file extensions and LangChain loaders
├── enrichment.py      # Document enrichment module for loading documents to vector storage
├── PLANNING.md        # Development roadmap and phases
├── QWEN.md            # Current file - project context
├── requirements.txt   # Python dependencies
├── retrieval.py       # Retrieval module for querying vector storage
├── server.py          # Web server with API endpoints for the RAG agent
├── vector_storage.py  # Vector storage module with Qdrant and Ollama embeddings
└── venv/              # Virtual environment
```

## Dependencies

The project relies on several key libraries:
- `langchain` and related ecosystem (`langchain-community`, `langchain-core`, `langchain-ollama`, `langchain-openai`)
- `langgraph` for workflow management
- `qdrant-client` for vector storage
- `ollama` for local LLM interaction
- `click` for CLI interface
- `loguru` for logging (to be installed per requirements)
- `python-dotenv` for environment management

## Development Phases

The project is organized into 8 development phases as outlined in `PLANNING.md`:

### Phase 1: CLI Entrypoint
- [x] Virtual environment setup
- [x] Create CLI with `click` library
- [x] Implement "ping" command that outputs "pong"

### Phase 2: Framework Installation & Data Analysis
- [x] Install Langchain as base RAG framework
- [x] Analyze data folder extensions and create `EXTENSIONS.md`
- [x] Install required loader libraries
- [x] Configure environment variables

### Phase 3: Vector Storage Setup
- [x] Install Qdrant client library
- [x] Create `vector_storage.py` for initialization
- [x] Configure Ollama embeddings using `OLLAMA_EMBEDDING_MODEL`
- [x] Prepare OpenAI fallback (commented)

### Phase 4: Document Loading Module
- [x] Create `enrichment.py` for loading documents to vector storage
- [x] Implement text splitting strategies
- [x] Add document tracking to prevent re-processing
- [x] Integrate with CLI

### Phase 5: Retrieval Feature
- [x] Create `retrieval.py` for querying vector storage
- [x] Implement metadata retrieval (filename, page, section, etc.)

### Phase 6: Chat Agent
- [x] Create `agent.py` with Ollama-powered chat agent
- [x] Integrate with retrieval functionality
- [x] Add CLI command for chat interaction

### Phase 7: OpenAI Integration for Chat Model
- [x] Create OpenAI-compatible integration using `.env` variables `OPENAI_CHAT_URL` and `OPENAI_CHAT_KEY`
- [x] Make this integration optional using `.env` variable `CHAT_MODEL_STRATEGY` with "ollama" as default
- [x] Allow switching between "ollama" and "openai" strategies

### Phase 8: HTTP Endpoint
- [x] Create web framework with POST endpoint `/api/test-query` for agent queries
- [x] Implement server using FastAPI and LangServe
- [x] Add request/response validation with Pydantic models
- [x] Include CORS middleware for cross-origin requests
- [x] Add health check endpoint

## Environment Configuration

The project uses environment variables for configuration:

```env
OLLAMA_EMBEDDING_MODEL=MODEL      # Name of the Ollama model for embeddings
OLLAMA_CHAT_MODEL=MODEL           # Name of the Ollama model for chat
OPENAI_CHAT_URL=URL               # OpenAI-compatible API URL
OPENAI_CHAT_KEY=KEY               # Authorization token for OpenAI-compatible API
OPENAI_CHAT_MODEL=MODEL           # Name of the OpenAI-compatible model to use
CHAT_MODEL_STRATEGY=ollama        # Strategy to use: "ollama" (default) or "openai"
```

## Building and Running

To set up and run the project:

1. **Setup Virtual Environment**:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

2. **Install Missing Dependencies** (as development progresses):
   ```bash
   pip install loguru qdrant-client  # Examples of needed libraries
   ```

3. **Configure Environment**:
   ```bash
   cp .env.dist .env
   # Edit .env with appropriate values
   ```

4. **Run CLI Commands**:
   ```bash
   python cli.py ping
   ```
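For orientation, the `ping` command from Phase 1 could be defined roughly as follows. This is a minimal sketch using `click`, not the actual contents of `cli.py`; the group name `cli` is an assumption.

```python
import click


@click.group()
def cli() -> None:
    """Entry point that groups the project's CLI commands."""


@cli.command()
def ping() -> None:
    """Print "pong" to confirm the CLI is wired up."""
    click.echo("pong")
```

To make `python cli.py ping` work, the file would also call `cli()` under an `if __name__ == "__main__":` guard.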

## Development Conventions

- Use `loguru` for logging with rotation to `logs/dev.log` and stdout
- Follow Langchain best practices for RAG implementations
- Prioritize open-source solutions that don't require external API keys
- Implement proper error handling and document processing tracking
- Use modular code organization with separate files for different components

## Current Status

All eight phases listed above are marked complete: the CLI, document loading into Qdrant, retrieval, the chat agent (Ollama or OpenAI-compatible), and the HTTP endpoint are implemented. `app.py` is still empty.

## Important Implementation Notes

### OCR and Computer Vision Setup
- Added Tesseract OCR support for image text extraction
- Installed `pytesseract` and `unstructured-pytesseract` packages
- Configured image loaders to use OCR strategy ("ocr_only") for extracting text from images
- This resolves the "OCRAgent instance" error when processing image files

### Russian Language Processing Configuration
- Installed spaCy library and Russian language model (`ru_core_news_sm`)
- Configured unstructured loaders to use Russian as the primary language (`"languages": ["rus"]`)
- This improves processing accuracy for Russian documents in the dataset
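
One way to combine the OCR strategy and the Russian language setting is a small helper that builds the keyword arguments passed to the unstructured-based loaders. This is a sketch only: the helper name `unstructured_kwargs` and the set of image extensions are assumptions, while `"ocr_only"` and `["rus"]` are the values noted above.

```python
from pathlib import Path
from typing import Any

# Extensions treated as images; this set is an illustrative assumption.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".heic"}


def unstructured_kwargs(path: str) -> dict[str, Any]:
    """Build keyword arguments for an unstructured-based loader."""
    # Russian is configured as the primary language for every document.
    kwargs: dict[str, Any] = {"languages": ["rus"]}
    # Images carry no text layer, so force the OCR-only strategy for them.
    if Path(path).suffix.lower() in IMAGE_EXTENSIONS:
        kwargs["strategy"] = "ocr_only"
    return kwargs
```

The result could then be unpacked into a loader, for example `UnstructuredImageLoader(path, **unstructured_kwargs(path))` from `langchain_community.document_loaders`.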

### Qdrant Collection Auto-Creation Fix
- Fixed issue where Qdrant collections were not being created automatically
- Implemented logic to check if collection exists and create it if needed
- Uses Qdrant client's `create_collection` method with proper vector size detection
- Resolves the "Collection doesn't exist" 404 error during document insertion
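
The check-then-create logic might look like the sketch below. It assumes a `qdrant-client` version that provides `collection_exists`, detects the vector size by embedding a probe string, and uses cosine distance; the function names and the distance metric are assumptions rather than the code in `vector_storage.py`.

```python
from typing import Any


def detect_vector_size(embeddings: Any) -> int:
    """Embed a short probe string and return the vector's dimensionality."""
    return len(embeddings.embed_query("dimension probe"))


def ensure_collection(client: Any, name: str, embeddings: Any) -> bool:
    """Create the collection if it is missing; return True when created."""
    if client.collection_exists(name):
        return False

    # Imported lazily so the existence check works without these models loaded.
    from qdrant_client.models import Distance, VectorParams

    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(
            size=detect_vector_size(embeddings),
            distance=Distance.COSINE,  # distance metric is an assumption
        ),
    )
    return True
```

Running `ensure_collection` before the first insert avoids the 404, and detecting the size from the embedding model keeps the collection consistent with whatever `OLLAMA_EMBEDDING_MODEL` is configured.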

### Document Tracking Improvement
- Modified document tracking to only mark documents as processed after successful vector storage insertion
- This prevents documents from being marked as processed if vector storage insertion fails
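
The ordering described above (insert first, record afterwards) can be sketched as follows. The JSON tracking file, the function names, and the `insert` callback are hypothetical; the actual tracking format in `enrichment.py` may differ.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_processed(tracking_file: Path) -> set[str]:
    """Return the set of document paths already stored successfully."""
    if tracking_file.exists():
        return set(json.loads(tracking_file.read_text(encoding="utf-8")))
    return set()


def process_documents(
    paths: Iterable[str],
    insert: Callable[[str], None],
    tracking_file: Path,
) -> list[str]:
    """Insert each new document; record it only if insertion succeeded."""
    processed = load_processed(tracking_file)
    failed: list[str] = []
    for path in paths:
        if path in processed:
            continue  # skip documents stored in an earlier run
        try:
            insert(path)
        except Exception:
            failed.append(path)  # left untracked so the next run retries it
            continue
        processed.add(path)
        tracking_file.write_text(json.dumps(sorted(processed)), encoding="utf-8")
    return failed
```

Writing the tracking file after every successful insert means an interrupted run loses at most the document that was in flight.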

### Dependency Management
- Added several new dependencies for enhanced functionality:
  - `pdf2image` for PDF-to-image conversion
  - `unstructured-inference` for advanced document analysis
  - `python-pptx` for PowerPoint processing
  - `pi-heif` for HEIF image format support
  - `spacy` and `ru-core-news-sm` for Russian NLP

### Error Handling Improvements
- Enhanced error handling for optional dependencies in document loaders
- Added graceful degradation when optional modules are not available
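
Graceful degradation for optional modules can be approached by probing for each module before enabling the feature that needs it. The sketch below uses only the standard library; the feature names and the module mapping are illustrative assumptions based on the packages listed above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_features() -> dict[str, bool]:
    """Report which optional document-processing features can be used."""
    # Import names for the optional packages; the mapping is illustrative.
    optional_modules = {
        "ocr": "pytesseract",
        "pdf_images": "pdf2image",
        "layout_analysis": "unstructured_inference",
        "powerpoint": "pptx",
    }
    return {
        feature: is_installed(module)
        for feature, module in optional_modules.items()
    }
```

A loader registry could consult `available_features()` at startup and log a warning for each disabled feature instead of failing on import.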

### Phase 5 Implementation Notes
- Created `retrieval.py` module with LangChain Retriever functionality
- Implemented search functions that retrieve documents with metadata from Qdrant vector storage
- Added CLI command `retrieve` to search the vector database based on a query
- Retrieval returns documents with metadata including source, filename, page number, file extension, etc.
- Used QdrantVectorStore from langchain-qdrant package for compatibility with newer LangChain versions
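
A retrieval helper along these lines could flatten search hits into plain dictionaries. It is a sketch that works with any LangChain vector store exposing `similarity_search`, such as `QdrantVectorStore`; the function name and the metadata key names are assumptions and may not match those written by `enrichment.py`.

```python
from typing import Any


def search_documents(vector_store: Any, query: str, k: int = 4) -> list[dict[str, Any]]:
    """Run a similarity search and flatten each hit into a plain dict."""
    hits = vector_store.similarity_search(query, k=k)
    results = []
    for doc in hits:
        metadata = doc.metadata or {}
        results.append(
            {
                "content": doc.page_content,
                # Metadata key names are assumptions; missing keys become None.
                "source": metadata.get("source"),
                "filename": metadata.get("filename"),
                "page": metadata.get("page_number"),
                "extension": metadata.get("file_extension"),
            }
        )
    return results
```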

### Phase 6 Implementation Notes
- Created `agent.py` module with Ollama-powered chat agent using LangGraph
- Integrated the agent with retrieval functionality to provide context-aware responses
- Added CLI command `chat` for interactive conversation with the RAG agent
- Agent uses document retrieval tool to fetch relevant information based on user queries
- Implemented proper error handling and conversation history management

### Phase 7 Implementation Notes
- Enhanced `agent.py` to support both Ollama and OpenAI-compatible chat models
- Added conditional logic to select chat model based on `CHAT_MODEL_STRATEGY` environment variable
- When strategy is "openai", uses `ChatOpenAI` with `OPENAI_CHAT_URL` and `OPENAI_CHAT_KEY` from environment
- When strategy is "ollama" (default), uses existing `ChatOllama` implementation
- Updated CLI chat command to show which model strategy is being used
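
The strategy switch can be sketched as a single factory function. The name `build_chat_model` is hypothetical; the environment variable names are the ones documented in the configuration section, and the provider classes are imported lazily so that only the selected integration needs to be installed.

```python
import os
from typing import Any, Mapping


def build_chat_model(env: Mapping[str, str] = os.environ) -> Any:
    """Return a chat model chosen by CHAT_MODEL_STRATEGY (default "ollama")."""
    strategy = env.get("CHAT_MODEL_STRATEGY", "ollama").strip().lower()

    if strategy == "ollama":
        from langchain_ollama import ChatOllama

        return ChatOllama(model=env["OLLAMA_CHAT_MODEL"])

    if strategy == "openai":
        from langchain_openai import ChatOpenAI

        return ChatOpenAI(
            model=env["OPENAI_CHAT_MODEL"],
            base_url=env["OPENAI_CHAT_URL"],
            api_key=env["OPENAI_CHAT_KEY"],
        )

    # Fail loudly on typos instead of silently falling back to a default.
    raise ValueError(f"Unsupported CHAT_MODEL_STRATEGY: {strategy!r}")
```

Passing the environment as a mapping makes the selection logic easy to exercise without touching the real process environment.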

### Phase 8 Implementation Notes
- Created `server.py` with FastAPI and integrated with existing agent functionality
- Implemented `/api/test-query` POST endpoint that accepts JSON with "query" field
- Added request/response validation using Pydantic models
- Included CORS middleware to support cross-origin requests
- Added health check endpoint at root path
- Server runs on port 8000 by default
- Supports both Ollama and OpenAI strategies through existing configuration

### Issue Fix Notes
- Fixed DocumentRetrievalTool class to properly declare and initialize the retriever field
- Resolved Pydantic field declaration issue that caused "object has no field" error
- Ensured proper initialization sequence for the retriever within the tool class
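
The underlying mechanism can be reproduced with plain Pydantic models, since LangChain tools are Pydantic models as well. In this sketch `BrokenTool` and the `run` method are hypothetical illustrations, not the project's classes: assigning an undeclared attribute raises the "object has no field" error, while declaring the field lets the retriever be passed at construction.

```python
from typing import Any

from pydantic import BaseModel


class BrokenTool(BaseModel):
    """Assigns an undeclared attribute, which Pydantic rejects."""

    name: str = "document_retrieval"

    def attach(self, retriever: Any) -> None:
        self.retriever = retriever  # raises: object has no field "retriever"


class DocumentRetrievalTool(BaseModel):
    """Declares the retriever as a field so it can be set at construction."""

    name: str = "document_retrieval"
    retriever: Any

    def run(self, query: str) -> list[Any]:
        # LangChain retrievers expose invoke(); any object with it works here.
        return self.retriever.invoke(query)
```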

### Troubleshooting Notes
- If you encounter a `No module named 'unstructured_inference'` error, install `unstructured-inference`
- If you see OCR-related errors, ensure Tesseract is installed at the system level and `unstructured-pytesseract` is available
- For language detection issues, verify that the appropriate spaCy models are downloaded