langchain extensions for data files and their libraries

2026-02-03 20:17:13 +03:00
parent d99433d087
commit cd7c96e022
3 changed files with 96 additions and 9 deletions

@@ -0,0 +1,85 @@
# Supported File Extensions and LangChain Loaders
This document lists the file extensions found in the data directory and the recommended LangChain loaders for processing them. Only extensions that can be processed with open-source solutions (without requiring external API keys) are included.
## Document Types
| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| docx | 209 | UnstructuredWordDocumentLoader | unstructured, python-docx | Microsoft Word documents |
| pdf | 98 | PyPDFLoader or PDFPlumberLoader | pypdf or pdfplumber | PDF documents |
| pptx | 12 | UnstructuredPowerPointLoader | unstructured | Microsoft PowerPoint presentations |
| xlsx | 2 | UnstructuredExcelLoader | unstructured, pandas, openpyxl | Microsoft Excel spreadsheets |
| odt | 2 | UnstructuredODTLoader | unstructured | OpenDocument Text files |
## Image Types
| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| jpg | 23 | UnstructuredImageLoader | unstructured, pillow | JPEG images; text is extracted via OCR |
| png | 2 | UnstructuredImageLoader | unstructured, pillow | PNG images; text is extracted via OCR |
## Excluded File Types
The following file types were found in the data directory but are excluded from processing as per requirements:
- Audio and video files: mp4, m4a, ogg, mp3
- System and archive files: gitignore, DS_Store, zip
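The per-extension counts in the tables above can be reproduced with a short script. The following is a minimal sketch, assuming the data lives under a single directory; the names `EXCLUDED`, `extension_of`, and `count_extensions` are illustrative and not part of the repository.

```python
from collections import Counter
from pathlib import Path

# Extensions excluded from processing, as listed above (lowercased).
EXCLUDED = {"mp4", "m4a", "ogg", "mp3", "gitignore", "ds_store", "zip"}


def extension_of(path: Path) -> str:
    """Return the lowercase extension without the leading dot.

    Dotfiles such as `.gitignore` have an empty `suffix`, so the file name
    itself (minus the dot) is used for them. Files with no extension yield "".
    """
    if path.suffix:
        return path.suffix[1:].lower()
    if path.name.startswith("."):
        return path.name[1:].lower()
    return ""


def count_extensions(data_dir: str) -> Counter:
    """Count files per extension under data_dir, skipping excluded types."""
    counts = Counter()
    for path in Path(data_dir).rglob("*"):
        if path.is_file():
            ext = extension_of(path)
            if ext and ext not in EXCLUDED:
                counts[ext] += 1
    return counts
```

Calling `count_extensions("data")` (a hypothetical directory name) returns a `Counter` whose `most_common()` output can be compared against the Count column.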
## Recommended Loaders Summary
### For Documents
- **Unstructured loaders**: Best for extracting text from various document formats (docx, pptx, odt)
  - Libraries needed: `unstructured`, `python-docx`, `pillow`
  - Advantages: Handles formatting, tables, and images within documents
- **PyPDFLoader**: Good for PDF files with text content
  - Library needed: `pypdf`
  - Alternative: PDFPlumberLoader for better accuracy with complex layouts
- **UnstructuredExcelLoader**: For Excel spreadsheets
  - Libraries needed: `unstructured`, `pandas`, `openpyxl`
  - With `mode="elements"`, an HTML rendering of each table is kept in the document metadata
### Installation Commands
```bash
# Install unstructured and its dependencies
pip install unstructured python-docx pillow
# Install PDF processing libraries
pip install pypdf pdfplumber
# Install Excel processing libraries
pip install pandas openpyxl
# Install additional dependencies for unstructured
pip install "unstructured[all-docs]"
```
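After installation, it can be useful to confirm that each package is importable, because several pip package names differ from their module names (`python-docx` imports as `docx`, `pillow` as `PIL`). The following is a minimal sketch using only the standard library; `MODULE_FOR_PACKAGE` and `missing_packages` are illustrative names, not part of any library.

```python
from importlib.util import find_spec

# pip package name -> importable top-level module name.
MODULE_FOR_PACKAGE = {
    "unstructured": "unstructured",
    "python-docx": "docx",
    "pillow": "PIL",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
}


def missing_packages(packages=MODULE_FOR_PACKAGE):
    """Return the pip names whose module cannot be found in this environment."""
    return [pkg for pkg, module in packages.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located; it does not verify versions or optional system dependencies.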
## Usage Example
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredExcelLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

# Load a PDF (PyPDFLoader returns one Document per page)
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Load a DOCX; mode="elements" returns one Document per detected element
loader = UnstructuredWordDocumentLoader("document.docx", mode="elements")
documents = loader.load()

# Load a PPTX
loader = UnstructuredPowerPointLoader("presentation.pptx")
documents = loader.load()

# Load an Excel file
loader = UnstructuredExcelLoader("spreadsheet.xlsx")
documents = loader.load()
```
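When processing a whole directory, the loader can be chosen from the file extension. The following is a minimal sketch of such a dispatch table, assuming the class names exported by `langchain_community.document_loaders` (`UnstructuredWordDocumentLoader` for docx, `UnstructuredExcelLoader` for xlsx); `LOADER_BY_EXTENSION`, `loader_name_for`, and `build_loader` are hypothetical helper names. The import is deferred so that the lookup works even where LangChain is not installed.

```python
import importlib
from pathlib import Path

# Extension -> loader class name in langchain_community.document_loaders.
LOADER_BY_EXTENSION = {
    "docx": "UnstructuredWordDocumentLoader",
    "pdf": "PyPDFLoader",
    "pptx": "UnstructuredPowerPointLoader",
    "xlsx": "UnstructuredExcelLoader",
    "odt": "UnstructuredODTLoader",
    "jpg": "UnstructuredImageLoader",
    "png": "UnstructuredImageLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, or None if unsupported."""
    ext = Path(path).suffix.lower().lstrip(".")
    return LOADER_BY_EXTENSION.get(ext)


def build_loader(path):
    """Instantiate the matching loader, importing LangChain only when needed."""
    name = loader_name_for(path)
    if name is None:
        return None  # excluded or unknown file type
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, name)(str(path))
```

A caller would skip files for which `build_loader` returns `None` and call `.load()` on the rest.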