# Supported File Extensions and LangChain Loaders

This document lists the file extensions found in the data directory and the recommended LangChain loaders for processing them. Only extensions that can be processed with open-source solutions (without requiring external API keys) are included.

## Document Types

| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| docx | 209 | UnstructuredWordDocumentLoader | unstructured, python-docx | Microsoft Word documents |
| pdf | 98 | PyPDFLoader or PDFPlumberLoader | pypdf or pdfplumber | PDF documents |
| pptx | 12 | UnstructuredPowerPointLoader | unstructured | Microsoft PowerPoint presentations |
| xlsx | 2 | UnstructuredExcelLoader | unstructured, pandas, openpyxl | Microsoft Excel spreadsheets |
| odt | 2 | UnstructuredODTLoader | unstructured | OpenDocument Text files |

## Image Types

| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| jpg | 23 | UnstructuredImageLoader | unstructured, pillow | JPEG images |
| png | 2 | UnstructuredImageLoader | unstructured, pillow | PNG images |

## Excluded File Types

The following file types were found in the data directory but are excluded from processing, per the requirements:

- Media files (audio/video): mp4, m4a, ogg, mp3
- System and archive files: gitignore, DS_Store, zip
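
The per-extension counts and the excluded list above could be reproduced with a short scan of the data directory. The following is a minimal sketch using only the standard library; the helper names `extension_of` and `count_extensions` are hypothetical, and the data directory path is whatever the project actually uses.

```python
from collections import Counter
from pathlib import Path

# Extensions this document lists as excluded (compared in lowercase).
EXCLUDED = {"mp4", "m4a", "ogg", "mp3", "gitignore", "ds_store", "zip"}


def extension_of(path: Path) -> str:
    """Return the lowercase text after the last dot, or "" if there is no dot.

    Unlike Path.suffix, this treats dotfiles such as ".gitignore" as having
    the extension "gitignore", which matches how the list above names them.
    """
    name = path.name
    return name.rpartition(".")[2].lower() if "." in name else ""


def count_extensions(data_dir: str) -> tuple[Counter, Counter]:
    """Walk data_dir recursively and count files per extension.

    Returns two counters: (kept, excluded).
    """
    kept, excluded = Counter(), Counter()
    for path in Path(data_dir).rglob("*"):
        if not path.is_file():
            continue
        ext = extension_of(path)
        (excluded if ext in EXCLUDED else kept)[ext] += 1
    return kept, excluded
```

Calling `count_extensions` on the data directory returns the counts for processable extensions and excluded extensions separately, so the tables can be refreshed when the data changes.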

## Recommended Loaders Summary

### For Documents

- **Unstructured loaders**: Best for extracting text from various document formats (docx, pptx, odt)
  - Libraries needed: `unstructured`, `python-docx`, `pillow`
  - Advantages: Handles formatting, tables, and images within documents

- **PyPDFLoader**: Good for PDF files with text content
  - Library needed: `pypdf`
  - Alternative: PDFPlumberLoader for better accuracy with complex layouts

- **UnstructuredExcelLoader**: For Excel spreadsheets
  - Libraries needed: `unstructured`, `pandas`, `openpyxl`
  - Can handle workbooks with multiple sheets
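
One way to apply these recommendations is to pick the loader from the file extension. The sketch below is illustrative only: `LOADER_NAMES`, `loader_name_for`, and `make_loader` are hypothetical names, and the class names are assumed to match what `langchain_community.document_loaders` exports in the installed version. The import is deferred so the mapping can be inspected without the library installed.

```python
import importlib
from pathlib import Path

# Extension -> loader class name (assumed to be exported by
# langchain_community.document_loaders; verify against your version).
LOADER_NAMES = {
    ".docx": "UnstructuredWordDocumentLoader",
    ".pdf": "PyPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".odt": "UnstructuredODTLoader",
    ".jpg": "UnstructuredImageLoader",
    ".png": "UnstructuredImageLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, or None if unsupported."""
    return LOADER_NAMES.get(Path(path).suffix.lower())


def make_loader(path, **kwargs):
    """Instantiate the loader for a path; extra kwargs go to the loader."""
    name = loader_name_for(path)
    if name is None:
        raise ValueError(f"No loader configured for {path!r}")
    # Imported lazily so that only make_loader requires langchain-community.
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, name)(str(path), **kwargs)
```

Keeping the mapping in one dictionary means swapping `PyPDFLoader` for `PDFPlumberLoader` is a one-line change.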

### Installation Commands

```bash
# Install the LangChain community package that provides the loaders
pip install langchain-community

# Install unstructured and its dependencies
pip install unstructured python-docx pillow

# Install PDF processing libraries
pip install pypdf pdfplumber

# Install Excel processing libraries
pip install pandas openpyxl

# Install additional dependencies for unstructured
pip install "unstructured[all-docs]"
```
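
After installing, it can help to confirm that each library is importable before running a long batch job. This is a minimal sketch, not part of any library: `REQUIRED_MODULES` and `missing_packages` are hypothetical names, and the mapping from pip package name to import name reflects the usual names for these packages.

```python
from importlib.util import find_spec

# pip package name -> importable top-level module name.
# langchain-community is included on the assumption that the loaders are
# imported from langchain_community, as in the usage example.
REQUIRED_MODULES = {
    "langchain-community": "langchain_community",
    "unstructured": "unstructured",
    "python-docx": "docx",
    "pillow": "PIL",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found.

    find_spec only locates the module; it does not import it, so the
    check is cheap and has no side effects.
    """
    return [pkg for pkg, module in required.items() if find_spec(module) is None]
```

An empty list from `missing_packages()` suggests the environment has everything the loaders in this document need; any names it returns can be passed to `pip install`.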

## Usage Example

```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredExcelLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

# Load a PDF (one Document per page)
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Load a DOCX; mode="elements" returns one Document per detected element
loader = UnstructuredWordDocumentLoader("document.docx", mode="elements")
documents = loader.load()

# Load a PPTX
loader = UnstructuredPowerPointLoader("presentation.pptx")
documents = loader.load()

# Load an Excel file
loader = UnstructuredExcelLoader("spreadsheet.xlsx")
documents = loader.load()
```
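
Each `loader.load()` call returns a list of LangChain `Document` objects with `page_content` and `metadata` attributes. A quick sanity check on what was loaded might look like the sketch below. The `summarize` helper is hypothetical, and it assumes the loader stores the file path under the `"source"` metadata key, falling back to a placeholder when the key is absent.

```python
from collections import defaultdict


def summarize(documents):
    """Return {source: (document_count, total_characters)}.

    Works with any objects exposing .page_content (str) and .metadata (dict).
    """
    totals = defaultdict(lambda: [0, 0])
    for doc in documents:
        # "source" is assumed to be set by the loader; use a placeholder if not.
        source = doc.metadata.get("source", "<unknown>")
        totals[source][0] += 1
        totals[source][1] += len(doc.page_content)
    return {source: tuple(pair) for source, pair in totals.items()}
```

A source with many documents but very few characters often indicates a scanned PDF or image-only file that needs OCR rather than plain text extraction.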