File extensions and libraries for llamaindex

2026-02-04 01:02:21 +03:00
parent fa26d77520
commit c37aec1d99
3 changed files with 52 additions and 6 deletions


@@ -0,0 +1,46 @@
# Supported File Extensions and LlamaIndex Loaders
This document lists the file extensions found in the `./../../../data` directory and the corresponding LlamaIndex loaders that can be used to process them.
## Document Formats
| Extension | File Type | LlamaIndex Loader | Installation Package |
|-----------|-----------|-------------------|---------------------|
| `.pdf` | Portable Document Format | `PDFReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.docx` | Microsoft Word Document | `DocxReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.xlsx` | Microsoft Excel Spreadsheet | `PandasExcelReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.pptx` | Microsoft PowerPoint Presentation | `PptxReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.odt` | OpenDocument Text | `UnstructuredReader`, passed to `SimpleDirectoryReader` via `file_extractor` | `llama-index-readers-file`, `unstructured` |
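
As a rough illustration of the table above, a single file can be routed to a format-specific reader by its extension. This is a minimal sketch, not the project's implementation: the helper names `reader_name_for` and `load_file` are hypothetical, and it assumes the listed reader classes are importable from `llama_index.readers.file` in the installed version.

```python
from pathlib import Path

# Extension -> reader class name, copied from the table above.
READER_NAMES = {
    ".pdf": "PDFReader",
    ".docx": "DocxReader",
    ".xlsx": "PandasExcelReader",
    ".pptx": "PptxReader",
}


def reader_name_for(path):
    """Return the reader class name for a path, or None if it is not listed."""
    return READER_NAMES.get(Path(path).suffix.lower())


def load_file(path):
    """Load one file with its format-specific reader (hypothetical helper)."""
    # Deferred import: only needed when a file is actually loaded.
    # Assumption: these classes are exported by llama-index-readers-file.
    import llama_index.readers.file as readers

    name = reader_name_for(path)
    if name is None:
        raise ValueError(f"No reader listed for {path}")
    # Some readers may pull in extra dependencies when instantiated.
    reader = getattr(readers, name)()
    return reader.load_data(file=Path(path))
```

Matching on the lower-cased suffix means `Report.PDF` and `report.pdf` resolve to the same reader.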
## Image Formats
| Extension | File Type | LlamaIndex Loader | Installation Package |
|-----------|-----------|-------------------|---------------------|
| `.png` | Portable Network Graphics | `ImageReader` | `llama-index-readers-file` |
| `.jpg` | JPEG Image | `ImageReader` | `llama-index-readers-file` |
## Archive Formats
| Extension | File Type | LlamaIndex Loader | Installation Package |
|-----------|-----------|-------------------|---------------------|
| `.zip` | ZIP Archive | Extract first (e.g. with `patool`), then load the contents with `SimpleDirectoryReader` | `patool`, `llama-index-readers-file` |
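
Archives have to be unpacked before their contents can be loaded. The sketch below uses Python's standard `zipfile` module instead of `patool` so that it has no external dependencies; `extract_zip` and `load_extracted` are hypothetical helper names, and the loading step assumes `llama-index-core` is installed.

```python
import zipfile
from pathlib import Path


def extract_zip(archive_path, out_dir):
    """Unpack a ZIP archive into out_dir and return the extracted files, sorted."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(out)
    return sorted(p for p in out.rglob("*") if p.is_file())


def load_extracted(out_dir):
    """Load everything under out_dir (hypothetical helper)."""
    # Deferred import so that extraction works without LlamaIndex installed.
    from llama_index.core import SimpleDirectoryReader

    return SimpleDirectoryReader(input_dir=str(out_dir), recursive=True).load_data()
```

For formats other than ZIP, `patoolib.extract_archive(archive, outdir=...)` could take the place of `extract_zip`, provided the matching backend tools are available on the machine.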
## System/Special Files (Ignored)
- `.DS_Store` - macOS system file
- `.gitignore` - Git ignore rules file
## Audio/Video Formats (Skipped as per requirements)
- `.m4a` - Audio file
- `.mp3` - Audio file
- `.mp4` - Video file
- `.ogg` - Audio/Video file
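
The categories above can be reproduced with a short scan of the data folder. This is an illustrative sketch using only the standard library; the function names `classify` and `summarize` are hypothetical, and the extension sets simply restate the lists in this document.

```python
from collections import Counter
from pathlib import Path

# Sets restating the tables and lists in this document.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".pptx", ".odt", ".png", ".jpg", ".zip"}
SKIPPED_MEDIA = {".m4a", ".mp3", ".mp4", ".ogg"}
IGNORED_NAMES = {".DS_Store", ".gitignore"}


def classify(path):
    """Return 'ignored', 'skipped', 'supported', or 'unknown' for a file path."""
    p = Path(path)
    # Dotfiles have an empty suffix, so they are matched by name.
    if p.name in IGNORED_NAMES:
        return "ignored"
    ext = p.suffix.lower()
    if ext in SKIPPED_MEDIA:
        return "skipped"
    if ext in SUPPORTED:
        return "supported"
    return "unknown"


def summarize(data_dir):
    """Count the files under data_dir per category."""
    counts = Counter()
    for p in Path(data_dir).rglob("*"):
        if p.is_file():
            counts[classify(p)] += 1
    return counts
```

An `unknown` count above zero would indicate an extension that this document does not cover yet.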
## Notes
1. `SimpleDirectoryReader` can load many of these file types directly: it selects a reader for each file based on its extension.
2. For advanced document parsing, instantiating a format-specific reader (for example `PDFReader` or `DocxReader`) directly allows its options to be configured.
3. The required dependencies are installed via `llama-index-readers-file`, plus `patool` for archive support.
4. No external API keys are required for the supported file types, as we're using local processing solutions.
5. The system prioritizes local processing over cloud services as per requirements.
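
A minimal sketch of the directory-level loading described in note 1, assuming `llama-index-core` and `llama-index-readers-file` are installed. The helper names `reader_kwargs` and `load_documents` are hypothetical, and the extension list is an illustrative subset limited to document formats that `SimpleDirectoryReader` is expected to handle without a custom `file_extractor`.

```python
# Illustrative subset: document formats only (no media, archives, or .odt).
DOCUMENT_EXTS = [".pdf", ".docx", ".xlsx", ".pptx"]


def reader_kwargs(data_dir):
    """Build the keyword arguments passed to SimpleDirectoryReader."""
    return {
        "input_dir": data_dir,
        "recursive": True,               # descend into subfolders
        "required_exts": DOCUMENT_EXTS,  # everything else is skipped
        "exclude_hidden": True,          # skips dotfiles such as .DS_Store
    }


def load_documents(data_dir):
    """Load all matching files under data_dir (hypothetical helper)."""
    # Deferred import so the configuration can be inspected without LlamaIndex.
    from llama_index.core import SimpleDirectoryReader

    return SimpleDirectoryReader(**reader_kwargs(data_dir)).load_data()
```

Restricting `required_exts` keeps audio and video files out of the pipeline without having to list them explicitly.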


@@ -16,8 +16,8 @@ Chosen data folder: relative ./../../../data - from the current folder
# Phase 2 (installation of base framework for RAG solution and preparation for data loading)
- [x] Install llamaindex as base framework for RAG solution.
-- [ ] Analyze the upper `data` folder (./../../../data), to learn all the possible file extensions of the files there. Then, create a file in the current directory `EXTENSIONS.md` with the list of extensions - and loader/loaders for chosen framework (this can be learned online - search for the info), that is needed to load the data in the provided extension. Prioritize libraries that work without external services that require API keys or paid subscriptions. Important: skip streaming media file extensions (audio, video). We are not going to load them now.
-- [ ] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)
+- [x] Analyze the upper `data` folder (./../../../data), to learn all the possible file extensions of the files there. Then, create a file in the current directory `EXTENSIONS.md` with the list of extensions - and loader/loaders for chosen framework (this can be learned online - search for the info), that is needed to load the data in the provided extension. Prioritize libraries that work without external services that require API keys or paid subscriptions. Important: skip streaming media file extensions (audio, video). We are not going to load them now.
+- [x] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)
# Phase 3 (preparation for storing data in the vector storage + embeddings)
- [ ] Install needed library for using Qdrant connection as vector storage. Ensure ports are used (which are needed in the chosen framework): Rest Api: 6333, gRPC Api: 6334. Database available and running on localhost.


@@ -80,8 +80,8 @@ This is a Retrieval Augmented Generation (RAG) solution built using LlamaIndex a
### Phase 2: Framework Installation
- [x] LlamaIndex installation
-- [ ] Data folder analysis and EXTENSIONS.md creation
-- [ ] Required loader libraries installation
+- [x] Data folder analysis and EXTENSIONS.md creation
+- [x] Required loader libraries installation
### Phase 3: Vector Storage Setup
- [ ] Qdrant library installation
@@ -113,9 +113,9 @@ llamaindex/
├── enrichment.py # Document loading and processing (to be created)
├── retrieval.py # Search and retrieval functionality (to be created)
├── agent.py # Chat agent implementation (to be created)
-├── EXTENSIONS.md # Supported file extensions and loaders (to be created)
+├── EXTENSIONS.md # Supported file extensions and loaders
├── .env.dist # Environment variable template
-├── .env # Local environment variables (git-ignored)
+├── .env # Local environment variables
├── logs/ # Log files directory
│ └── dev.log # Main log file with rotation
└── PLANNING.md # Project planning document