langchain extensions for data files and their libraries

2026-02-03 20:17:13 +03:00
parent d99433d087
commit cd7c96e022
3 changed files with 96 additions and 9 deletions
--- a/services/rag/langchain/EXTENSIONS.md
+++ b/services/rag/langchain/EXTENSIONS.md
@@ -0,0 +1,85 @@
+# Supported File Extensions and LangChain Loaders
+
+This document lists the file extensions found in the data directory and the recommended LangChain loaders for processing them. Only extensions that can be processed with open-source solutions (without requiring external API keys) are included.
+
+## Document Types
+
+| Extension | Count | LangChain Loader | Required Library | Notes |
+|-----------|-------|------------------|------------------|-------|
+| docx | 209 | UnstructuredWordDocumentLoader | unstructured, python-docx | Microsoft Word documents |
+| pdf | 98 | PyPDFLoader or PDFPlumberLoader | pypdf or pdfplumber | PDF documents |
+| pptx | 12 | UnstructuredPowerPointLoader | unstructured | Microsoft PowerPoint presentations |
+| xlsx | 2 | UnstructuredExcelLoader | unstructured, pandas, openpyxl | Microsoft Excel spreadsheets |
+| odt | 2 | UnstructuredODTLoader | unstructured | OpenDocument Text files |
+
+## Image Types
+
+| Extension | Count | LangChain Loader | Required Library | Notes |
+|-----------|-------|------------------|------------------|-------|
+| jpg | 23 | UnstructuredImageLoader | unstructured, pillow | JPEG images; text is extracted via OCR |
+| png | 2 | UnstructuredImageLoader | unstructured, pillow | PNG images; text is extracted via OCR |
+
+## Excluded File Types
+
+The following file types were found in the data directory but are excluded from processing as per requirements:
+
+- Audio and video files: mp4, m4a, ogg, mp3
+- System and archive files: gitignore, DS_Store, zip
+
+## Recommended Loaders Summary
+
+### For Documents
+- **Unstructured loaders**: Best for extracting text from various document formats (docx, pptx, odt)
+  - Libraries needed: `unstructured`, `python-docx`, `pillow`
+  - Advantages: Handles formatting, tables, images within documents
+
+- **PyPDFLoader**: Good for PDF files with text content
+  - Library needed: `pypdf`
+  - Alternative: PDFPlumberLoader for better accuracy with complex layouts
+
+- **UnstructuredExcelLoader**: For Excel spreadsheets
+  - Libraries needed: `unstructured`, `pandas`, `openpyxl`
+  - Can handle workbooks with multiple sheets
+
+### Installation Commands
+
+```bash
+# Install unstructured and its dependencies
+pip install unstructured python-docx pillow
+
+# Install PDF processing libraries
+pip install pypdf pdfplumber
+
+# Install Excel processing libraries
+pip install pandas openpyxl
+
+# Install additional dependencies for unstructured
+pip install "unstructured[all-docs]"
+```
+
+## Usage Example
+
+```python
+from langchain_community.document_loaders import (
+    PyPDFLoader,
+    UnstructuredExcelLoader,
+    UnstructuredPowerPointLoader,
+    UnstructuredWordDocumentLoader,
+)
+
+# Load a PDF (PyPDFLoader returns one Document per page)
+loader = PyPDFLoader("document.pdf")
+documents = loader.load()
+
+# Load a DOCX; mode="elements" returns one Document per detected element
+loader = UnstructuredWordDocumentLoader("document.docx", mode="elements")
+documents = loader.load()
+
+# Load a PPTX
+loader = UnstructuredPowerPointLoader("presentation.pptx")
+documents = loader.load()
+
+# Load an Excel file
+loader = UnstructuredExcelLoader("spreadsheet.xlsx")
+documents = loader.load()
+```
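Review note on `EXTENSIONS.md`: the two tables amount to a lookup from file extension to loader class. Below is a minimal sketch of that lookup, assuming the class names exported by `langchain_community.document_loaders` (with `UnstructuredWordDocumentLoader` and `UnstructuredExcelLoader` used for the docx and xlsx rows). The names `LOADER_BY_EXTENSION`, `EXCLUDED_EXTENSIONS`, `loader_name_for`, and `build_loader` are hypothetical and not part of this commit.

```python
import importlib
from pathlib import Path
from typing import Optional

# Extension -> loader class name in langchain_community.document_loaders.
# Mirrors the tables in EXTENSIONS.md; the dict name itself is hypothetical.
LOADER_BY_EXTENSION = {
    "docx": "UnstructuredWordDocumentLoader",
    "pdf": "PyPDFLoader",
    "pptx": "UnstructuredPowerPointLoader",
    "xlsx": "UnstructuredExcelLoader",
    "odt": "UnstructuredODTLoader",
    "jpg": "UnstructuredImageLoader",
    "png": "UnstructuredImageLoader",
}

# Extensions listed under "Excluded File Types" (compared in lower case).
EXCLUDED_EXTENSIONS = {"mp4", "m4a", "ogg", "mp3", "gitignore", "ds_store", "zip"}


def loader_name_for(path: str) -> Optional[str]:
    """Return the loader class name for a file, or None if it should be skipped."""
    name = Path(path).name
    # Take the text after the last dot, so ".DS_Store" maps to "ds_store".
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if ext in EXCLUDED_EXTENSIONS:
        return None
    return LOADER_BY_EXTENSION.get(ext)


def build_loader(path: str):
    """Instantiate the loader for a file; needs langchain-community installed."""
    class_name = loader_name_for(path)
    if class_name is None:
        raise ValueError(f"No loader configured for {path!r}")
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, class_name)(path)
```

Keeping the mapping as plain strings means the lookup can be checked without importing LangChain; the import happens only when `build_loader` is called.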
--- a/services/rag/langchain/PLANNING.md
+++ b/services/rag/langchain/PLANNING.md
@@ -15,18 +15,18 @@ Chosen data folder: relatve ./../../../data - from the current folder

 # Phase 2 (installation of base framework for RAG solution and preparation for data loading)

- [ ] Install langchain as base framework for RAG solution
- [ ] Analyze the upper `data` folder (./../../../data), to learn all the possible files extensions of files there. Then, create file in the current directory `EXTENSIONS.md` with the list of extensions - and loader/loaders for chosen framework (this can be learned online - search for the info), that is needed to load the data in the provided extension. Prioriize libraries that work without external service that require API keys or paid subscriptions. Important: skip stream media files extensions (audio, video). We are not going to load them now.
- [ ] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)
+- [x] Install langchain as base framework for RAG solution
+- [x] Analyze the upper `data` folder (./../../../data) to learn all the file extensions present there. Then create the file `EXTENSIONS.md` in the current directory with the list of extensions and the loader/loaders of the chosen framework (this can be learned online - search for the info) needed to load data with each extension. Prioritize libraries that work without external services that require API keys or paid subscriptions. Important: skip streaming media file extensions (audio, video). We are not going to load them now.
+- [x] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)

 # Phase 3 (preparation for storing data in the vector storage + embeddings)
 - [ ] Install needed library for using Qdrant connection as vector storage. Ensure ports are used (which are needed in the chosen framework): Rest Api: 6333, gRPC Api: 6334. Database available and running on localhost.
- [ ] Create file called `vector_storage.py`, which will contain vector storage initialization, available for import by other modules of initialized. If needed in chosen RAG framework, add embedding model iniialization in the same file. Use ollama, model name defined in the .env file: OLLAMA_EMBEDDING_MODEL. Ollama available by the default local port: 11434
+- [ ] Create file called `vector_storage.py`, which will contain the vector storage initialization, available for import by other modules once initialized. If needed in the chosen RAG framework, add embedding model initialization in the same file. Use ollama, with the model name defined in the .env file: OLLAMA_EMBEDDING_MODEL. Ollama is available on the default local port: 11434
 - [ ] Just in case add possibility to connect via openai embedding, using openrouter api key. Comment this section, so it can be used in the future.

 # Phase 4 (creating module for loading documents from the folder)

- [ ] Create file `enrichment.py` with the function that will load data with configured data loaders from the data folder into the vector storage. Remember to specify default embeddings meta properties, such as filename, paragraph, page, section, wherever this is possible (documents can have pages, sections, paragraphs, etc). Use text splitters of the chosen RAG framework accordingly to the documents being loaded. Which chunking/text-splitting strategies framework has, can be learned online.
+- [ ] Create file `enrichment.py` with a function that loads data from the data folder into the chosen vector storage, using the data loaders configured for each extension. Remember to specify default embedding metadata properties, such as filename, paragraph, page, section, wherever this is possible (documents can have pages, sections, paragraphs, etc). Use the text splitters of the chosen RAG framework that suit the documents being loaded. The chunking/text-splitting strategies the framework offers can be learned online.
 - [ ] Use built-in strategy for marking which documents loaded (if there is such mechanism) and which are not, to avoid re-reading and re-encriching vector storage with the existing data. If there is no built-in mechanism of this type, install sqlite library and use local sqlite database file to store this information.
 - [ ] Add activation of this function in the cli entrypoint, as a command.

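Review note on the Phase 4 fallback: if the framework offers no built-in way to mark loaded documents, a local SQLite file can record them. Below is a minimal sketch using only the standard library. The table name `processed_files`, the function names, and the choice to key on a SHA-256 content hash (so edited files are re-processed) are all assumptions, not decisions made in this commit.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_tracker(db_path: str = "processed_files.sqlite") -> sqlite3.Connection:
    """Open (or create) the tracking database; the file name is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        "path TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def file_digest(path: str) -> str:
    """Hash the file contents so that a changed file counts as unprocessed."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_processed(conn: sqlite3.Connection, path: str) -> bool:
    """True if this path was recorded with the same content hash."""
    row = conn.execute(
        "SELECT sha256 FROM processed_files WHERE path = ?", (path,)
    ).fetchone()
    return row is not None and row[0] == file_digest(path)


def mark_processed(conn: sqlite3.Connection, path: str) -> None:
    """Record the path and its current hash, replacing any earlier entry."""
    conn.execute(
        "INSERT OR REPLACE INTO processed_files (path, sha256) VALUES (?, ?)",
        (path, file_digest(path)),
    )
    conn.commit()
```

An enrichment function would call `is_processed` before loading each file and `mark_processed` only after its chunks are stored, so an interrupted run does not mark unfinished files.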
--- a/services/rag/langchain/QWEN.md
+++ b/services/rag/langchain/QWEN.md
@@ -18,10 +18,12 @@ The project follows a phased development approach with CLI entry points for diff

 ```
 rag-solution/services/rag/langchain/
+├── .env               # Environment variables
 ├── .env.dist          # Environment variable template
 ├── .gitignore         # Git ignore rules
 ├── app.py             # Main application file (currently empty)
 ├── cli.py             # CLI entrypoint with click library
+├── EXTENSIONS.md      # Supported file extensions and LangChain loaders
 ├── PLANNING.md        # Development roadmap and phases
 ├── QWEN.md            # Current file - project context
 ├── requirements.txt   # Python dependencies
@@ -49,10 +51,10 @@ The project is organized into 6 development phases as outlined in `PLANNING.md`:
 - [x] Implement "ping" command that outputs "pong"

 ### Phase 2: Framework Installation & Data Analysis
- [ ] Install Langchain as base RAG framework
- [ ] Analyze data folder extensions and create `EXTENSIONS.md`
- [ ] Install required loader libraries
- [ ] Configure environment variables
+- [x] Install Langchain as base RAG framework
+- [x] Analyze data folder extensions and create `EXTENSIONS.md`
+- [x] Install required loader libraries
+- [x] Configure environment variables

 ### Phase 3: Vector Storage Setup
 - [ ] Install Qdrant client library