From c37aec1d99692a87c0883e8a7ab58fc30b1a8058 Mon Sep 17 00:00:00 2001
From: idchlife
Date: Wed, 4 Feb 2026 01:02:21 +0300
Subject: [PATCH] File extensions and libraries for llamaindex

---
 services/rag/llamaindex/EXTENSIONS.md | 46 +++++++++++++++++++++++++++
 services/rag/llamaindex/PLANNING.md   |  4 +--
 services/rag/llamaindex/QWEN.md       |  8 ++---
 3 files changed, 52 insertions(+), 6 deletions(-)
 create mode 100644 services/rag/llamaindex/EXTENSIONS.md

diff --git a/services/rag/llamaindex/EXTENSIONS.md b/services/rag/llamaindex/EXTENSIONS.md
new file mode 100644
index 0000000..667e246
--- /dev/null
+++ b/services/rag/llamaindex/EXTENSIONS.md
@@ -0,0 +1,46 @@
+# Supported File Extensions and LlamaIndex Loaders
+
+This document lists the file extensions found in the `./../../../data` directory and the corresponding LlamaIndex loaders that can be used to process them.
+
+## Document Formats
+
+| Extension | File Type | LlamaIndex Loader | Installation Package |
+|-----------|-----------|-------------------|---------------------|
+| `.pdf` | Portable Document Format | `PDFReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
+| `.docx` | Microsoft Word Document | `DocxReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
+| `.xlsx` | Microsoft Excel Spreadsheet | `PandasExcelReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
+| `.pptx` | Microsoft PowerPoint Presentation | `PptxReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
+| `.odt` | OpenDocument Text | `SimpleDirectoryReader` with `UnstructuredReader` | `llama-index-readers-file` |
+
+## Image Formats
+
+| Extension | File Type | LlamaIndex Loader | Installation Package |
+|-----------|-----------|-------------------|---------------------|
+| `.png` | Portable Network Graphics | `ImageReader` | `llama-index-readers-file` |
+| `.jpg` | JPEG Image | `ImageReader` | `llama-index-readers-file` |
+
+## Archive Formats
+
+| Extension | File Type | LlamaIndex Loader | Installation Package |
+|-----------|-----------|-------------------|---------------------|
+| `.zip` | ZIP Archive | `SimpleDirectoryReader` with archive support | `llama-index-readers-file` |
+
+## System/Special Files (Ignored)
+
+- `.DS_Store` - macOS system file
+- `.gitignore` - Git configuration file
+
+## Audio/Video Formats (Skipped as per requirements)
+
+- `.m4a` - Audio file
+- `.mp3` - Audio file
+- `.mp4` - Video file
+- `.ogg` - Audio/Video file
+
+## Notes
+
+1. Many file types can be loaded using the `SimpleDirectoryReader` which automatically detects and handles multiple file formats.
+2. For advanced document parsing, specific readers might offer better performance or more features.
+3. All required dependencies have been installed with `llama-index-readers-file` and `patool` for archive support.
+4. No external API keys are required for the supported file types, as we're using local processing solutions.
+5. The system prioritizes local processing over cloud services as per requirements.
\ No newline at end of file
diff --git a/services/rag/llamaindex/PLANNING.md b/services/rag/llamaindex/PLANNING.md
index 10b9c47..77d26ba 100644
--- a/services/rag/llamaindex/PLANNING.md
+++ b/services/rag/llamaindex/PLANNING.md
@@ -16,8 +16,8 @@ Chosen data folder: relatve ./../../../data - from the current folder
 
 # Phase 2 (installation of base framework for RAG solution and preparation for data loading)
 - [x] Install llamaindex as base framework for RAG solution.
-- [ ] Analyze the upper `data` folder (./../../../data), to learn all the possible files extensions of files there. Then, create file in the current directory `EXTENSIONS.md` with the list of extensions - and loader/loaders for chosen framework (this can be learned online - search for the info), that is needed to load the data in the provided extension. Prioriize libraries that work without external service that require API keys or paid subscriptions. Important: skip stream media files extensions (audio, video). We are not going to load them now.
-- [ ] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)
+- [x] Analyze the upper `data` folder (./../../../data), to learn all the possible files extensions of files there. Then, create file in the current directory `EXTENSIONS.md` with the list of extensions - and loader/loaders for chosen framework (this can be learned online - search for the info), that is needed to load the data in the provided extension. Prioriize libraries that work without external service that require API keys or paid subscriptions. Important: skip stream media files extensions (audio, video). We are not going to load them now.
+- [x] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)
 
 # Phase 3 (preparation for storing data in the vector storage + embeddings)
 - [ ] Install needed library for using Qdrant connection as vector storage. Ensure ports are used (which are needed in the chosen framework): Rest Api: 6333, gRPC Api: 6334. Database available and running on localhost.
diff --git a/services/rag/llamaindex/QWEN.md b/services/rag/llamaindex/QWEN.md
index a6092de..4ac932a 100644
--- a/services/rag/llamaindex/QWEN.md
+++ b/services/rag/llamaindex/QWEN.md
@@ -80,8 +80,8 @@ This is a Retrieval Augmented Generation (RAG) solution built using LlamaIndex a
 
 ### Phase 2: Framework Installation
 - [x] LlamaIndex installation
-- [ ] Data folder analysis and EXTENSIONS.md creation
-- [ ] Required loader libraries installation
+- [x] Data folder analysis and EXTENSIONS.md creation
+- [x] Required loader libraries installation
 
 ### Phase 3: Vector Storage Setup
 - [ ] Qdrant library installation
@@ -113,9 +113,9 @@ llamaindex/
 ├── enrichment.py # Document loading and processing (to be created)
 ├── retrieval.py # Search and retrieval functionality (to be created)
 ├── agent.py # Chat agent implementation (to be created)
-├── EXTENSIONS.md # Supported file extensions and loaders (to be created)
+├── EXTENSIONS.md # Supported file extensions and loaders
 ├── .env.dist # Environment variable template
-├── .env # Local environment variables (git-ignored)
+├── .env # Local environment variables
 ├── logs/ # Log files directory
 │   └── dev.log # Main log file with rotation
 └── PLANNING.md # Project planning document
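
Reviewer note: the extension groups that `EXTENSIONS.md` introduces could be applied at load time roughly as sketched below. This is a minimal sketch, not code from this patch: the names `classify`, `scan`, and `load_supported` are hypothetical, and the call into LlamaIndex assumes `SimpleDirectoryReader` from `llama_index.core` with its `input_dir`, `recursive`, and `required_exts` parameters. `.zip` is left out of `required_exts` on the assumption that archives are unpacked in a separate step before loading.

```python
from pathlib import Path

# Extension groups copied from EXTENSIONS.md in this patch.
SUPPORTED_EXTS = {".pdf", ".docx", ".xlsx", ".pptx", ".odt", ".png", ".jpg", ".zip"}
SKIPPED_MEDIA_EXTS = {".m4a", ".mp3", ".mp4", ".ogg"}
IGNORED_NAMES = {".DS_Store", ".gitignore"}


def classify(path: Path) -> str:
    """Return 'ignored', 'skipped', 'supported', or 'unknown' for one file."""
    if path.name in IGNORED_NAMES:
        return "ignored"
    ext = path.suffix.lower()  # case-insensitive match, e.g. ".PDF"
    if ext in SKIPPED_MEDIA_EXTS:
        return "skipped"
    if ext in SUPPORTED_EXTS:
        return "supported"
    return "unknown"


def scan(data_dir: Path) -> dict[str, list[Path]]:
    """Walk data_dir recursively and group every file by its classification."""
    groups: dict[str, list[Path]] = {
        "supported": [], "skipped": [], "ignored": [], "unknown": [],
    }
    for p in sorted(data_dir.rglob("*")):
        if p.is_file():
            groups[classify(p)].append(p)
    return groups


def load_supported(data_dir: Path):
    """Hypothetical helper: load non-archive supported files with LlamaIndex."""
    # Imported lazily so the helpers above work without LlamaIndex installed.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(
        input_dir=str(data_dir),
        recursive=True,
        # Assumption: archives are extracted separately, so .zip is excluded.
        required_exts=sorted(SUPPORTED_EXTS - {".zip"}),
    )
    return reader.load_data()
```

Running `scan(Path("./../../../data"))` first would make it easy to log which files fall into the "unknown" group before anything is handed to a reader.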