Supported File Extensions and LangChain Loaders

This document lists the file extensions found in the data directory and the recommended LangChain loaders for processing them. Only extensions that can be processed with open-source solutions (without requiring external API keys) are included.

Document Types

Extension	Count	LangChain Loader	Required Library	Notes
docx	209	UnstructuredWordDocumentLoader	unstructured, python-docx	Microsoft Word documents
pdf	98	PyPDFLoader or PDFPlumberLoader	pypdf or pdfplumber	PDF documents
pptx	12	UnstructuredPowerPointLoader	unstructured, python-pptx	Microsoft PowerPoint presentations
xlsx	2	UnstructuredExcelLoader	unstructured, pandas, openpyxl	Microsoft Excel spreadsheets
odt	2	UnstructuredODTLoader	unstructured	OpenDocument Text files

Image Types

Extension	Count	LangChain Loader	Required Library	Notes
jpg	23	UnstructuredImageLoader	unstructured, pillow	JPEG images (text is extracted via OCR, which needs a local engine such as Tesseract)
png	2	UnstructuredImageLoader	unstructured, pillow	PNG images (same OCR requirement)
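
The two tables above amount to a lookup from file extension to loader. A minimal sketch of that lookup follows. The class names are the ones langchain_community.document_loaders is assumed to export (worth verifying against your installed version), and loader_name_for and make_loader are hypothetical helper names introduced here for illustration.

```python
from importlib import import_module
from pathlib import Path

# Extension -> loader class name. These names are assumptions about what
# langchain_community.document_loaders exports; check your installed version.
LOADER_BY_EXTENSION = {
    "docx": "UnstructuredWordDocumentLoader",
    "pdf": "PyPDFLoader",
    "pptx": "UnstructuredPowerPointLoader",
    "xlsx": "UnstructuredExcelLoader",
    "odt": "UnstructuredODTLoader",
    "jpg": "UnstructuredImageLoader",
    "png": "UnstructuredImageLoader",
}


def loader_name_for(path):
    """Return the loader class name for a file, or None if unsupported."""
    ext = Path(path).suffix.lower().lstrip(".")
    return LOADER_BY_EXTENSION.get(ext)


def make_loader(path):
    """Instantiate the matching loader; LangChain is imported only when called."""
    name = loader_name_for(path)
    if name is None:
        raise ValueError(f"No loader configured for {path!r}")
    module = import_module("langchain_community.document_loaders")
    return getattr(module, name)(str(path))
```

Keeping the mapping as plain strings means the table can be inspected and tested without LangChain installed; the import happens only when a loader is actually built.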

Excluded File Types

The following file types were found in the data directory but are excluded from processing as per requirements:

Media files (audio/video): mp4, m4a, ogg, mp3
System and archive files: gitignore, DS_Store, zip
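
The counts and exclusions in this document can be reproduced with a short directory scan. The sketch below is one way to do it, not necessarily how the original inventory was produced. The names EXCLUDED, extension_of, and count_extensions are hypothetical, and treating dotfiles such as .gitignore as the extension "gitignore" is an assumption chosen to match the list above.

```python
from collections import Counter
from pathlib import Path

# Extensions this document excludes, lower-cased for comparison.
EXCLUDED = {"mp4", "m4a", "ogg", "mp3", "gitignore", "ds_store", "zip"}


def extension_of(path):
    """Lower-case extension without the dot.

    Files with no suffix (including dotfiles such as .gitignore) fall back
    to their own name with any leading dot removed.
    """
    p = Path(path)
    if p.suffix:
        return p.suffix.lower().lstrip(".")
    return p.name.lower().lstrip(".")


def count_extensions(root):
    """Count files under root per extension, split into kept and excluded."""
    kept, excluded = Counter(), Counter()
    for p in Path(root).rglob("*"):
        if not p.is_file():
            continue
        ext = extension_of(p)
        (excluded if ext in EXCLUDED else kept)[ext] += 1
    return kept, excluded
```

Calling count_extensions on the data directory returns two counters, so the kept extensions can be compared directly against the Count columns above.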

Recommended Loaders Summary

For Documents

Unstructured loaders: Best for extracting text from various document formats (docx, pptx, odt)
- Libraries needed: unstructured, python-docx, python-pptx, pillow
- Advantages: Handles formatting, tables, images within documents
PyPDFLoader: Good for PDF files with text content
- Library needed: pypdf
- Alternative: PDFPlumberLoader for better accuracy with complex layouts
UnstructuredExcelLoader: For Excel spreadsheets
- Libraries needed: unstructured, pandas, openpyxl
- Reads the sheets in a workbook; with mode="elements", an HTML rendering of each table is kept in the text_as_html metadata field
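
One way to combine the two PDF loaders is to try PyPDFLoader first and switch to PDFPlumberLoader only when little text comes back. This is a sketch under assumptions: the fallback rule, the min_chars threshold, and the helper names total_text_length and load_pdf are illustrative choices rather than anything this document prescribes. Neither loader runs OCR by default, so scanned PDFs may still come back empty.

```python
def total_text_length(docs):
    """Number of non-whitespace characters across all loaded pages."""
    return sum(len("".join(d.page_content.split())) for d in docs)


def load_pdf(path, min_chars=50):
    """Try PyPDFLoader first; fall back to PDFPlumberLoader on sparse output.

    min_chars is an arbitrary illustrative threshold, not a recommended value.
    """
    # Imported lazily so total_text_length works without LangChain installed.
    from langchain_community.document_loaders import PDFPlumberLoader, PyPDFLoader

    docs = PyPDFLoader(path).load()
    if total_text_length(docs) >= min_chars:
        return docs
    return PDFPlumberLoader(path).load()
```

The fast path stays on pypdf, and pdfplumber is only paid for on files where the first attempt extracted almost nothing.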

Installation Commands

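# Install the LangChain community package that provides the loaders
pip install langchain-community

# Install unstructured and the format-specific libraries it relies on
pip install unstructured python-docx python-pptx pillow

# Install PDF processing libraries
pip install pypdf pdfplumber

# Install Excel processing libraries
pip install pandas openpyxl

# Optional: install all document-format dependencies of unstructured at once
pip install "unstructured[all-docs]"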
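
After installing, it can help to confirm that each package is importable, since pip names and module names often differ (python-docx is imported as docx, pillow as PIL). The check below is a minimal sketch. The REQUIRED mapping is an assumption assembled from the libraries this document names, plus langchain-community (which the usage example imports from) and python-pptx (which unstructured uses for .pptx files). Extend it to match your setup.

```python
from importlib.util import find_spec

# pip distribution name -> importable module name (the two often differ).
REQUIRED = {
    "langchain-community": "langchain_community",
    "unstructured": "unstructured",
    "python-docx": "docx",
    "python-pptx": "pptx",  # used by unstructured for .pptx files
    "pillow": "PIL",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("All set" if not missing else "pip install " + " ".join(missing))
```

Run as a script, it prints either a confirmation or a ready-to-paste pip command for whatever is missing.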

Usage Example

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredExcelLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

# Load a PDF (PyPDFLoader returns one Document per page)
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Load a DOCX; mode="elements" returns one Document per element (title, paragraph, table, ...)
loader = UnstructuredWordDocumentLoader("document.docx", mode="elements")
documents = loader.load()

# Load a PPTX
loader = UnstructuredPowerPointLoader("presentation.pptx")
documents = loader.load()

# Load an Excel file
loader = UnstructuredExcelLoader("spreadsheet.xlsx")
documents = loader.load()
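
With a few hundred files, a single corrupt document should not abort the whole run. The sketch below loads a batch of paths and records failures instead of raising. load_all is a hypothetical helper, and make_loader stands for any caller-supplied function that returns a loader object with a load() method for a given path.

```python
def load_all(paths, make_loader):
    """Load every path with the loader returned by make_loader(path).

    Returns (documents, failures). failures maps each failing path to an
    error message so that one bad file does not stop the batch.
    """
    documents, failures = [], {}
    for path in paths:
        try:
            documents.extend(make_loader(path).load())
        except Exception as exc:  # broad on purpose: loaders raise varied errors
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return documents, failures
```

For example, passing a list of PDF paths together with lambda p: PyPDFLoader(str(p)) returns the loaded pages plus a dictionary of the files that could not be read, which can then be retried with a different loader.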
