langchain extensions for data files and their libraries
services/rag/langchain/EXTENSIONS.md
# Supported File Extensions and LangChain Loaders

This document lists the file extensions found in the data directory and the recommended LangChain loaders for processing them. Only extensions that can be processed with open-source solutions (without requiring external API keys) are included.

## Document Types

| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| docx | 209 | UnstructuredWordDocumentLoader | unstructured, python-docx | Microsoft Word documents |
| pdf | 98 | PyPDFLoader or PDFPlumberLoader | pypdf or pdfplumber | PDF documents |
| pptx | 12 | UnstructuredPowerPointLoader | unstructured | Microsoft PowerPoint presentations |
| xlsx | 2 | UnstructuredExcelLoader | unstructured, pandas, openpyxl | Microsoft Excel spreadsheets |
| odt | 2 | UnstructuredODTLoader | unstructured | OpenDocument Text files |

## Image Types

| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| jpg | 23 | UnstructuredImageLoader | unstructured, pillow | JPEG images |
| png | 2 | UnstructuredImageLoader | unstructured, pillow | PNG images |

## Excluded File Types

The following file types were found in the data directory but are excluded from processing as per requirements:

- Audio and video files: mp4, m4a, ogg, mp3
- System and archive files: gitignore, DS_Store, zip

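The counts and exclusions above could be reproduced with a short script. The following is a minimal sketch, not the method actually used to build the tables: the helper names (`extension_of`, `count_extensions`) and the `EXCLUDED_EXTENSIONS` set are hypothetical, and it assumes an "extension" is the text after the last dot in the file name, so that dotfiles such as `.gitignore` are reported as `gitignore`, matching how they are listed here.

```python
from collections import Counter
from pathlib import Path

# Extensions named in the exclusion list above, lowercased for comparison.
EXCLUDED_EXTENSIONS = {"mp4", "m4a", "ogg", "mp3", "gitignore", "ds_store", "zip"}


def extension_of(path):
    """Return the lowercased text after the last dot in the file name, or ""."""
    name = Path(path).name
    # rpartition (rather than Path.suffix) so ".gitignore" yields "gitignore".
    _, dot, ext = name.rpartition(".")
    return ext.lower() if dot else ""


def count_extensions(root):
    """Tally extensions of all regular files under root, skipping excluded ones."""
    counts = Counter()
    for p in Path(root).rglob("*"):
        if p.is_file():
            ext = extension_of(p)
            if ext not in EXCLUDED_EXTENSIONS:
                counts[ext] += 1
    return counts
```

Calling `count_extensions("data")` on a hypothetical `data` directory returns a `Counter` keyed by extension, which can be compared against the Count columns in the tables.
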
## Recommended Loaders Summary

### For Documents

- **Unstructured loaders**: Best for extracting text from various document formats (docx, pptx, odt)
  - Libraries needed: `unstructured`, `python-docx`, `pillow`
  - Advantages: Handles formatting, tables, images within documents

- **PyPDFLoader**: Good for PDF files with text content
  - Library needed: `pypdf`
  - Alternative: PDFPlumberLoader for better accuracy with complex layouts

- **UnstructuredExcelLoader**: For Excel spreadsheets
  - Libraries needed: `unstructured`, `pandas`, `openpyxl`
  - Can handle workbooks with multiple sheets

### Installation Commands

```bash
# Install the LangChain community loaders
pip install langchain-community

# Install unstructured and its dependencies
pip install unstructured python-docx pillow

# Install PDF processing libraries
pip install pypdf pdfplumber

# Install Excel processing libraries
pip install pandas openpyxl

# Install additional dependencies for unstructured
pip install "unstructured[all-docs]"
```

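After installation, it can help to confirm that each package is importable before running any loader. The sketch below is one possible check, under stated assumptions: the `REQUIRED` mapping and `missing_packages` helper are hypothetical names, and the mapping covers the packages this document relies on, including `langchain_community`, which the usage example imports. Note that some pip names differ from their import names (`python-docx` imports as `docx`, `pillow` as `PIL`).

```python
from importlib.util import find_spec

# pip package name -> top-level import name (they differ for some packages).
REQUIRED = {
    "langchain-community": "langchain_community",
    "unstructured": "unstructured",
    "python-docx": "docx",
    "pillow": "PIL",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]
```

An empty list from `missing_packages()` means every module was found; otherwise the returned names can be passed to `pip install`.
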
## Usage Example

```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredPowerPointLoader,
    UnstructuredExcelLoader,
)

# Load a PDF
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Load a DOCX
loader = UnstructuredWordDocumentLoader("document.docx", mode="elements")
documents = loader.load()

# Load a PPTX
loader = UnstructuredPowerPointLoader("presentation.pptx")
documents = loader.load()

# Load an Excel file
loader = UnstructuredExcelLoader("spreadsheet.xlsx")
documents = loader.load()
```

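The per-format calls above can be folded into a single dispatch on file extension. This is a minimal sketch rather than a definitive implementation: `LOADER_NAMES`, `loader_name_for`, and `load_file` are hypothetical names, and the class names are the ones exported by `langchain_community.document_loaders` to the best of my knowledge (`UnstructuredWordDocumentLoader` for Word files and `UnstructuredExcelLoader` for Excel files). The LangChain import is deferred so the mapping can be inspected without the library installed.

```python
from importlib import import_module
from pathlib import Path

# Extension -> loader class name in langchain_community.document_loaders.
LOADER_NAMES = {
    "pdf": "PyPDFLoader",
    "docx": "UnstructuredWordDocumentLoader",
    "pptx": "UnstructuredPowerPointLoader",
    "xlsx": "UnstructuredExcelLoader",
    "odt": "UnstructuredODTLoader",
    "jpg": "UnstructuredImageLoader",
    "png": "UnstructuredImageLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, or None if unsupported."""
    ext = Path(path).suffix.lstrip(".").lower()
    return LOADER_NAMES.get(ext)


def load_file(path):
    """Load one file with the loader mapped to its extension."""
    name = loader_name_for(path)
    if name is None:
        raise ValueError(f"unsupported file type: {path}")
    # Imported lazily so this module works without LangChain until a load is requested.
    loaders = import_module("langchain_community.document_loaders")
    return getattr(loaders, name)(str(path)).load()
```

Excluded file types simply have no entry in the mapping, so `load_file` raises `ValueError` for them instead of guessing a loader.
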