# Supported File Extensions and LangChain Loaders

This document lists the file extensions found in the data directory and the recommended LangChain loader for each. Only extensions that can be processed with open-source tools (no external API keys required) are included.

## Document Types

| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| docx | 209 | UnstructuredWordDocumentLoader | unstructured, python-docx | Microsoft Word documents |
| pdf | 98 | PyPDFLoader or PDFPlumberLoader | pypdf or pdfplumber | PDF documents |
| pptx | 12 | UnstructuredPowerPointLoader | unstructured | Microsoft PowerPoint presentations |
| xlsx | 2 | UnstructuredExcelLoader | unstructured, pandas, openpyxl | Microsoft Excel spreadsheets |
| odt | 2 | UnstructuredODTLoader | unstructured | OpenDocument Text files |

## Image Types

| Extension | Count | LangChain Loader | Required Library | Notes |
|-----------|-------|------------------|------------------|-------|
| jpg | 23 | UnstructuredImageLoader | unstructured, pillow | JPEG images |
| png | 2 | UnstructuredImageLoader | unstructured, pillow | PNG images |

## Excluded File Types

The following file types were found in the data directory but are excluded from processing as per requirements:

- Audio and video files: mp4, m4a, ogg, mp3
- System and archive files: gitignore, DS_Store, zip

## Recommended Loaders Summary

### For Documents

- **Unstructured loaders**: Best for extracting text from various document formats (docx, pptx, odt)
  - Libraries needed: `unstructured`, `python-docx`, `pillow`
  - Advantages: Handles formatting, tables, and images within documents
- **PyPDFLoader**: Good for PDF files with text content
  - Library needed: `pypdf`
  - Alternative: PDFPlumberLoader for better accuracy with complex layouts
- **UnstructuredExcelLoader**: For Excel spreadsheets
  - Libraries needed: `unstructured`, `pandas`, `openpyxl`
  - Reads each sheet in the workbook

### Installation Commands

```bash
# Install the package that provides langchain_community.document_loaders
pip install langchain-community

# Install unstructured and its dependencies
pip install unstructured python-docx pillow

# Install PDF processing libraries
pip install pypdf pdfplumber

# Install Excel processing libraries
pip install pandas openpyxl

# Optional: install every document-type extra for unstructured
pip install "unstructured[all-docs]"
```

## Usage Example

```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredExcelLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)

# Load a PDF (one Document per page)
pdf_docs = PyPDFLoader("document.pdf").load()

# Load a DOCX; mode="elements" yields one Document per detected element
docx_docs = UnstructuredWordDocumentLoader("document.docx", mode="elements").load()

# Load a PPTX
pptx_docs = UnstructuredPowerPointLoader("presentation.pptx").load()

# Load an Excel file
xlsx_docs = UnstructuredExcelLoader("spreadsheet.xlsx").load()
```
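
Because the data directory mixes several supported and excluded file types, it can help to choose the loader from the file extension instead of hard-coding one loader per call. The following is a minimal sketch of that idea, not a definitive implementation. The names `LOADER_BY_EXTENSION`, `loader_name_for`, `load_file`, and `load_directory` are hypothetical helpers introduced here for illustration, and the sketch assumes the loader classes are exported from `langchain_community.document_loaders` under the names used in the mapping.

```python
from importlib import import_module
from pathlib import Path
from typing import Optional

# Hypothetical mapping: extension -> loader class name, following the tables above.
# Excluded types (audio, system files, archives) are simply absent.
LOADER_BY_EXTENSION = {
    ".docx": "UnstructuredWordDocumentLoader",
    ".pdf": "PyPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".odt": "UnstructuredODTLoader",
    ".jpg": "UnstructuredImageLoader",
    ".png": "UnstructuredImageLoader",
}


def loader_name_for(path: str) -> Optional[str]:
    """Return the loader class name for a file, or None if its type is excluded."""
    return LOADER_BY_EXTENSION.get(Path(path).suffix.lower())


def load_file(path: str) -> list:
    """Load one file into LangChain documents; return [] for excluded types."""
    name = loader_name_for(path)
    if name is None:
        return []
    # Imported lazily so the mapping can be inspected without LangChain installed.
    loaders = import_module("langchain_community.document_loaders")
    return getattr(loaders, name)(path).load()


def load_directory(root: str) -> list:
    """Walk a directory tree and load every supported file."""
    documents = []
    for file_path in sorted(Path(root).rglob("*")):
        if file_path.is_file():
            documents.extend(load_file(str(file_path)))
    return documents
```

Calling `load_directory("data")` (where `data` is a placeholder path) would return one combined list of documents. Keeping the mapping as plain strings means swapping `PyPDFLoader` for `PDFPlumberLoader` is a one-line change.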