langchain extensions for data files and their libraries

2026-02-03 20:17:13 +03:00
parent d99433d087
commit cd7c96e022
3 changed files with 96 additions and 9 deletions
--- a/services/rag/langchain/EXTENSIONS.md
+++ b/services/rag/langchain/EXTENSIONS.md
@@ -0,0 +1,85 @@
+# Supported File Extensions and LangChain Loaders
+
+This document lists the file extensions found in the data directory and the recommended LangChain loaders for processing them. Only extensions that can be processed with open-source solutions (without requiring external API keys) are included.
+
+## Document Types
+
+| Extension | Count | LangChain Loader | Required Library | Notes |
+|-----------|-------|------------------|------------------|-------|
+| docx | 209 | UnstructuredWordDocumentLoader | unstructured, python-docx | Microsoft Word documents |
+| pdf | 98 | PyPDFLoader or PDFPlumberLoader | pypdf or pdfplumber | PDF documents |
+| pptx | 12 | UnstructuredPowerPointLoader | unstructured, python-pptx | Microsoft PowerPoint presentations |
+| xlsx | 2 | UnstructuredExcelLoader | unstructured, pandas, openpyxl | Microsoft Excel spreadsheets |
+| odt | 2 | UnstructuredODTLoader | unstructured | OpenDocument Text files |
+
+## Image Types
+
+| Extension | Count | LangChain Loader | Required Library | Notes |
+|-----------|-------|------------------|------------------|-------|
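+| jpg | 23 | UnstructuredImageLoader | unstructured, pillow | JPEG images; text is extracted via OCR (needs a local OCR engine such as Tesseract) |
+| png | 2 | UnstructuredImageLoader | unstructured, pillow | PNG images; text is extracted via OCR (needs a local OCR engine such as Tesseract) |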
+
+## Excluded File Types
+
+The following file types were found in the data directory but are excluded from processing as per requirements:
+
+- Audio and video files: mp4, m4a, ogg, mp3
+- System and archive files: gitignore, DS_Store, zip
+
+## Recommended Loaders Summary
+
+### For Documents
+- **Unstructured loaders**: Best for extracting text from office document formats (docx, pptx, odt)
+  - Libraries needed: `unstructured`, `python-docx`, `python-pptx`, `pillow`
+  - Advantages: With `mode="elements"`, returns typed elements (titles, paragraphs, tables) instead of one block of text
+
+- **PyPDFLoader**: Good for PDF files with an embedded text layer; scanned PDFs would need OCR first
+  - Library needed: `pypdf`
+  - Alternative: PDFPlumberLoader for better accuracy with complex layouts
+
+- **UnstructuredExcelLoader**: For Excel spreadsheets
+  - Libraries needed: `unstructured`, `pandas`, `openpyxl`
+  - With `mode="elements"`, an HTML rendering of each table is available under the `text_as_html` metadata key
+
+### Installation Commands
+
+```bash
+# Install the LangChain community loaders, unstructured, and its document dependencies
+pip install langchain-community unstructured python-docx python-pptx pillow
+
+# Install PDF processing libraries
+pip install pypdf pdfplumber
+
+# Install Excel processing libraries
+pip install pandas openpyxl
+
+# Or install all of unstructured's document extras at once
+pip install "unstructured[all-docs]"
+```
+
+## Usage Example
+
+```python
+from langchain_community.document_loaders import (
+    PyPDFLoader,
+    UnstructuredExcelLoader,
+    UnstructuredPowerPointLoader,
+    UnstructuredWordDocumentLoader,
+)
+
+# Load a PDF (one Document per page)
+loader = PyPDFLoader("document.pdf")
+documents = loader.load()
+
+# Load a DOCX; mode="elements" yields one Document per detected element
+loader = UnstructuredWordDocumentLoader("document.docx", mode="elements")
+documents = loader.load()
+
+# Load a PPTX
+loader = UnstructuredPowerPointLoader("presentation.pptx")
+documents = loader.load()
+
+# Load an Excel file
+loader = UnstructuredExcelLoader("spreadsheet.xlsx")
+documents = loader.load()
+```
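
Note on the extension tables in EXTENSIONS.md: they could also be expressed as a lookup, so that ingestion code picks a loader from the file suffix and skips the excluded types. The sketch below is only an illustration, not part of this commit. The helper names (`LOADER_NAMES`, `loader_name_for`, `load_file`) are hypothetical, and the class names are the ones that, to my knowledge, `langchain_community.document_loaders` exports (`UnstructuredWordDocumentLoader` for docx, `UnstructuredExcelLoader` for xlsx).

```python
from pathlib import Path

# Hypothetical mapping from file suffix to a loader class name in
# langchain_community.document_loaders (assumed names, see note above).
LOADER_NAMES = {
    ".docx": "UnstructuredWordDocumentLoader",
    ".pdf": "PyPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".odt": "UnstructuredODTLoader",
    ".jpg": "UnstructuredImageLoader",
    ".png": "UnstructuredImageLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, or None for excluded types."""
    return LOADER_NAMES.get(Path(path).suffix.lower())


def load_file(path):
    """Load one file with its mapped loader; excluded types give an empty list."""
    name = loader_name_for(path)
    if name is None:
        return []
    # Imported lazily so the lookup helpers work without LangChain installed.
    import langchain_community.document_loaders as loaders

    return getattr(loaders, name)(str(path)).load()
```

Keeping the mapping as data means a new extension needs a single dictionary entry, and anything absent from it (mp4, zip, dotfiles such as `.gitignore`) is skipped without a separate exclusion list.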