Supported File Extensions and LangChain Loaders

This document lists the file extensions found in the data directory and the recommended LangChain loaders for processing them. Only extensions that can be processed with open-source solutions (without requiring external API keys) are included.
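The per-extension counts in the tables below could be reproduced with a short script along these lines. This is a minimal sketch, not the script that produced the numbers: the function name `count_extensions` and the `"(none)"` label are illustrative assumptions, and dotfiles such as `.gitignore` are counted by the text after the last dot so that they appear as `gitignore`, matching how they are listed in this document.

```python
import os
from collections import Counter


def count_extensions(data_dir: str) -> Counter:
    """Count files under data_dir by lower-cased extension (hypothetical helper)."""
    counts: Counter = Counter()
    for _root, _dirs, files in os.walk(data_dir):
        for name in files:
            # Text after the last dot; files without any dot get a placeholder label.
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # "data" is an assumed directory name; adjust to the real data directory.
    for ext, count in count_extensions("data").most_common():
        print(f"{ext}\t{count}")
```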

Document Types

| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| docx | 209 | UnstructuredWordDocumentLoader | unstructured, python-docx | Microsoft Word documents |
| pdf | 98 | PyPDFLoader or PDFPlumberLoader | pypdf or pdfplumber | PDF documents |
| pptx | 12 | UnstructuredPowerPointLoader | unstructured, python-pptx | Microsoft PowerPoint presentations |
| xlsx | 2 | UnstructuredExcelLoader | unstructured, pandas, openpyxl | Microsoft Excel spreadsheets |
| odt | 2 | UnstructuredODTLoader | unstructured | OpenDocument Text files |

Image Types

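| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| jpg | 23 | UnstructuredImageLoader | unstructured, pillow | JPEG images |
| png | 2 | UnstructuredImageLoader | unstructured, pillow | PNG images |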

Excluded File Types

The following file types were found in the data directory but are excluded from processing as per requirements:

  • Audio and video files: mp4, m4a, ogg, mp3
  • System and archive files: gitignore, DS_Store, zip
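One way to apply the supported and excluded lists is a small classifier that runs before any loader is invoked. The sketch below is illustrative only: the names `SUPPORTED`, `EXCLUDED`, and `classify` are assumptions, and the sets simply mirror the extensions listed in this document.

```python
import os

# Extensions taken from the tables and the exclusion list in this document.
SUPPORTED = {"docx", "pdf", "pptx", "xlsx", "odt", "jpg", "png"}
EXCLUDED = {"mp4", "m4a", "ogg", "mp3", "gitignore", "ds_store", "zip"}


def classify(filename: str) -> str:
    """Return 'supported', 'excluded', or 'unknown' for a file name (hypothetical helper)."""
    base = os.path.basename(filename)
    # Use the text after the last dot so that ".DS_Store" maps to "ds_store".
    ext = base.rsplit(".", 1)[-1].lower() if "." in base else ""
    if ext in SUPPORTED:
        return "supported"
    if ext in EXCLUDED:
        return "excluded"
    return "unknown"
```

Files classified as `unknown` are worth logging rather than silently skipping, since they indicate an extension that this document does not yet cover.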

For Documents

  • Unstructured loaders: Best for extracting text from various document formats (docx, pptx, odt)

    • Libraries needed: unstructured, python-docx, pillow
    • Advantages: Handles formatting, tables, images within documents
  • PyPDFLoader: Good for PDF files with text content

    • Library needed: pypdf
    • Alternative: PDFPlumberLoader for better accuracy with complex layouts
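  • UnstructuredExcelLoader: For Excel spreadsheets (langchain_community does not export a PandasExcelLoader)

    • Libraries needed: unstructured, pandas, openpyxl
    • Processes every sheet in the workbook; with mode="elements", an HTML rendering of each table is kept in the text_as_html metadata field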
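The recommendations above can be captured as a lookup from extension to loader class. The following is a minimal sketch under stated assumptions: `LOADER_NAMES`, `loader_name_for`, and `make_loader` are hypothetical names, and the class names are those exported by `langchain_community.document_loaders` (`UnstructuredWordDocumentLoader` for docx, `UnstructuredExcelLoader` for xlsx). The import is deferred so that the mapping can be inspected without LangChain installed.

```python
import importlib
from pathlib import Path
from typing import Optional

# Extension -> loader class name in langchain_community.document_loaders.
LOADER_NAMES = {
    "docx": "UnstructuredWordDocumentLoader",
    "pdf": "PyPDFLoader",
    "pptx": "UnstructuredPowerPointLoader",
    "xlsx": "UnstructuredExcelLoader",
    "odt": "UnstructuredODTLoader",
    "jpg": "UnstructuredImageLoader",
    "png": "UnstructuredImageLoader",
}


def loader_name_for(path: str) -> Optional[str]:
    """Return the loader class name for a path, or None if the extension is not supported."""
    ext = Path(path).suffix.lower().lstrip(".")
    return LOADER_NAMES.get(ext)


def make_loader(path: str):
    """Instantiate the matching loader; requires langchain-community to be installed."""
    name = loader_name_for(path)
    if name is None:
        raise ValueError(f"No loader configured for {path!r}")
    module = importlib.import_module("langchain_community.document_loaders")
    return getattr(module, name)(path)
```

Swapping `PyPDFLoader` for `PDFPlumberLoader` is then a one-line change in the mapping.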

Installation Commands

# Install the LangChain community loaders used in the usage example
pip install langchain-community

# Install unstructured and its dependencies
pip install unstructured python-docx pillow

# Install PDF processing libraries
pip install pypdf pdfplumber

# Install Excel processing libraries
pip install pandas openpyxl

# Install additional dependencies for unstructured
pip install "unstructured[all-docs]"
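After installation, a quick check that the libraries are importable can save a failed ingestion run. This is a sketch only: `REQUIRED_MODULES` and `missing_packages` are illustrative names, and the mapping pairs each pip package with its import name (for example `python-docx` is imported as `docx` and `pillow` as `PIL`).

```python
from importlib.util import find_spec

# pip package name -> top-level module name it provides.
REQUIRED_MODULES = {
    "unstructured": "unstructured",
    "python-docx": "docx",
    "pillow": "PIL",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose modules cannot be found (hypothetical helper)."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All libraries found" if not missing else "Missing: " + ", ".join(missing))
```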

Usage Example

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredExcelLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

# Load a PDF (PyPDFLoader returns one Document per page by default)
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Load a DOCX; mode="elements" returns one Document per detected element
loader = UnstructuredWordDocumentLoader("document.docx", mode="elements")
documents = loader.load()

# Load a PPTX
loader = UnstructuredPowerPointLoader("presentation.pptx")
documents = loader.load()

# Load an Excel file
loader = UnstructuredExcelLoader("spreadsheet.xlsx")
documents = loader.load()
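For a whole data directory, the single-file calls above can be wrapped in a loop that skips unsupported files and records failures instead of aborting. The sketch below is one possible shape, not part of the project: `load_directory` and `loader_factory` are hypothetical names, and `loader_factory` is assumed to be any callable that takes a file path and returns an object with a `load()` method, such as one of the LangChain loader classes.

```python
import os

# Extensions listed as supported in this document.
SUPPORTED_EXTENSIONS = {"docx", "pdf", "pptx", "xlsx", "odt", "jpg", "png"}


def load_directory(data_dir, loader_factory):
    """Load every supported file under data_dir.

    Returns (documents, failures), where failures is a list of
    (path, error message) pairs for files whose loader raised.
    """
    documents = []
    failures = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
            if ext not in SUPPORTED_EXTENSIONS:
                continue  # excluded or unknown file type
            path = os.path.join(root, name)
            try:
                documents.extend(loader_factory(path).load())
            except Exception as exc:  # keep going; one bad file should not stop the run
                failures.append((path, str(exc)))
    return documents, failures


# Example (assumes langchain-community and pypdf are installed, and PDF-only data):
# from langchain_community.document_loaders import PyPDFLoader
# docs, failed = load_directory("data", PyPDFLoader)
```

Collecting failures separately keeps a single corrupt file from hiding the rest of the corpus, and the list can be reviewed after the run.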