File extensions and libraries for llamaindex
services/rag/llamaindex/EXTENSIONS.md
@@ -0,0 +1,46 @@
# Supported File Extensions and LlamaIndex Loaders

This document lists the file extensions found in the `./../../../data` directory and the corresponding LlamaIndex loaders that can be used to process them.

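The inventory of extensions described above could be gathered with a short script. The following is a minimal sketch using only the standard library; the function name `count_extensions` is hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_extensions(data_dir: str) -> Counter:
    """Count files under data_dir, keyed by lowercased extension."""
    counts: Counter = Counter()
    for path in Path(data_dir).rglob("*"):
        if path.is_file():
            # Dotfiles such as .gitignore have an empty suffix,
            # so fall back to the full file name for them.
            key = path.suffix.lower() or path.name
            counts[key] += 1
    return counts
```

Calling `count_extensions("./../../../data").most_common()` would list each extension with its file count, most frequent first.
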
## Document Formats

| Extension | File Type | LlamaIndex Loader | Installation Package |
|-----------|-----------|-------------------|----------------------|
| `.pdf` | Portable Document Format | `PDFReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.docx` | Microsoft Word Document | `DocxReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.xlsx` | Microsoft Excel Spreadsheet | `PandasExcelReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.pptx` | Microsoft PowerPoint Presentation | `PptxReader` or `SimpleDirectoryReader` | `llama-index-readers-file` |
| `.odt` | OpenDocument Text | `SimpleDirectoryReader` with `UnstructuredReader` | `llama-index-readers-file` |

## Image Formats

| Extension | File Type | LlamaIndex Loader | Installation Package |
|-----------|-----------|-------------------|----------------------|
| `.png` | Portable Network Graphics | `ImageReader` | `llama-index-readers-file` |
| `.jpg` | JPEG Image | `ImageReader` | `llama-index-readers-file` |

## Archive Formats

| Extension | File Type | LlamaIndex Loader | Installation Package |
|-----------|-----------|-------------------|----------------------|
| `.zip` | ZIP Archive | `SimpleDirectoryReader` with archive support | `llama-index-readers-file` |

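The table does not spell out how archive support is wired up. One possible approach, assuming archives are unpacked before the directory reader runs, is sketched below. It uses the standard-library `zipfile` module rather than `patool`, and the helper name `extract_archive` is hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List, Optional


def extract_archive(zip_path: str, dest_dir: Optional[str] = None) -> List[Path]:
    """Extract a ZIP archive and return the paths of the extracted files."""
    # Assumption: a fresh temporary directory is acceptable when no target is given.
    target = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp(prefix="rag_zip_"))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return files only, in a stable order, so they can be handed to a loader.
    return sorted(p for p in target.rglob("*") if p.is_file())
```

The returned paths (or the target directory itself) could then be passed to whichever loader matches each extracted file's extension.
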
## System/Special Files (Ignored)

- `.DS_Store` - macOS system file
- `.gitignore` - Git ignore rules file

## Audio/Video Formats (Skipped as per requirements)

- `.m4a` - Audio file
- `.mp3` - Audio file
- `.mp4` - Video file
- `.ogg` - Audio/Video file

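The ignored and skipped lists above could be applied as a filter before loading. This is a minimal sketch under that assumption; the set names and the `classify` function are hypothetical, and the extension sets simply mirror the lists and tables in this document.

```python
from pathlib import Path

SUPPORTED_EXTS = {".pdf", ".docx", ".xlsx", ".pptx", ".odt", ".png", ".jpg", ".zip"}
SKIPPED_MEDIA_EXTS = {".m4a", ".mp3", ".mp4", ".ogg"}
IGNORED_NAMES = {".DS_Store", ".gitignore"}


def classify(path: Path) -> str:
    """Return 'ignored', 'skipped', 'supported', or 'unknown' for a file path."""
    # Dotfiles have an empty Path.suffix, so they are matched by full name.
    if path.name in IGNORED_NAMES:
        return "ignored"
    ext = path.suffix.lower()
    if ext in SKIPPED_MEDIA_EXTS:
        return "skipped"
    if ext in SUPPORTED_EXTS:
        return "supported"
    return "unknown"
```

Matching system files by name rather than by extension matters here because `Path(".gitignore").suffix` is an empty string.
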
## Notes

1. Many file types can be loaded with `SimpleDirectoryReader`, which picks a reader automatically based on each file's extension.
2. For advanced document parsing, a dedicated reader (for example `PDFReader` or `DocxReader`) might offer better performance or more features.
3. All required dependencies have been installed via `llama-index-readers-file`, plus `patool` for archive support.
4. No external API keys are required for the supported file types because all processing runs locally.
5. The system prioritizes local processing over cloud services, as per requirements.

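Notes 1 and 2 can be combined: `SimpleDirectoryReader` handles discovery while dedicated readers are supplied for selected extensions. The following is a sketch only. `LOADABLE_EXTS` and `load_documents` are hypothetical names, `.odt` is left out because the table pairs it with `UnstructuredReader`, and the code assumes `llama-index-core` and `llama-index-readers-file` are installed.

```python
# Extensions from the Document and Image tables, minus .odt (see lead-in).
LOADABLE_EXTS = [".pdf", ".docx", ".xlsx", ".pptx", ".png", ".jpg"]


def load_documents(data_dir: str = "./../../../data"):
    """Load supported files, overriding the reader for two extensions."""
    # Imported lazily so this module can be imported without llama-index installed.
    from llama_index.core import SimpleDirectoryReader
    from llama_index.readers.file import DocxReader, PDFReader

    reader = SimpleDirectoryReader(
        input_dir=data_dir,
        recursive=True,               # descend into subfolders
        required_exts=LOADABLE_EXTS,  # other extensions are not read
        file_extractor={".pdf": PDFReader(), ".docx": DocxReader()},
    )
    return reader.load_data()
```

Extensions without an entry in `file_extractor` fall back to whatever reader `SimpleDirectoryReader` selects by default.
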
@@ -16,8 +16,8 @@ Chosen data folder: relative ./../../../data - from the current folder
 # Phase 2 (installation of base framework for RAG solution and preparation for data loading)
 
 - [x] Install llamaindex as base framework for RAG solution.
-- [ ] Analyze the upper `data` folder (./../../../data), to learn all the possible files extensions of files there. Then, create file in the current directory `EXTENSIONS.md` with the list of extensions - and loader/loaders for chosen framework (this can be learned online - search for the info), that is needed to load the data in the provided extension. Prioritize libraries that work without external service that require API keys or paid subscriptions. Important: skip stream media files extensions (audio, video). We are not going to load them now.
+- [x] Analyze the upper `data` folder (./../../../data), to learn all the possible files extensions of files there. Then, create file in the current directory `EXTENSIONS.md` with the list of extensions - and loader/loaders for chosen framework (this can be learned online - search for the info), that is needed to load the data in the provided extension. Prioritize libraries that work without external service that require API keys or paid subscriptions. Important: skip stream media files extensions (audio, video). We are not going to load them now.
-- [ ] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)
+- [x] Install all needed libraries for loaders, mentioned in the `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist)
 
 # Phase 3 (preparation for storing data in the vector storage + embeddings)
 - [ ] Install needed library for using Qdrant connection as vector storage. Ensure ports are used (which are needed in the chosen framework): Rest Api: 6333, gRPC Api: 6334. Database available and running on localhost.
@@ -80,8 +80,8 @@ This is a Retrieval Augmented Generation (RAG) solution built using LlamaIndex a
 
 ### Phase 2: Framework Installation
 - [x] LlamaIndex installation
-- [ ] Data folder analysis and EXTENSIONS.md creation
+- [x] Data folder analysis and EXTENSIONS.md creation
-- [ ] Required loader libraries installation
+- [x] Required loader libraries installation
 
 ### Phase 3: Vector Storage Setup
 - [ ] Qdrant library installation
@@ -113,9 +113,9 @@ llamaindex/
 ├── enrichment.py # Document loading and processing (to be created)
 ├── retrieval.py # Search and retrieval functionality (to be created)
 ├── agent.py # Chat agent implementation (to be created)
-├── EXTENSIONS.md # Supported file extensions and loaders (to be created)
+├── EXTENSIONS.md # Supported file extensions and loaders
 ├── .env.dist # Environment variable template
-├── .env # Local environment variables (git-ignored)
+├── .env # Local environment variables
 ├── logs/ # Log files directory
 │ └── dev.log # Main log file with rotation
 └── PLANNING.md # Project planning document