Enrichment for llamaindex. Embedding with a local model takes a long time, so prefer an external model over a local one for EMBEDDING.
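
A minimal sketch of pointing llama-index at a hosted embedding model instead of a local one, assuming llama-index >= 0.10 with the OpenAI embeddings package installed; the model name is only an example, not a project setting:

```python
# Sketch only: assumes llama-index >= 0.10 and the llama-index-embeddings-openai
# package; the model name below is illustrative, not a confirmed project choice.
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Route all embedding calls to a hosted model instead of a slow local one.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
```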
@@ -27,9 +27,9 @@ Chosen data folder: relative ./../../../data - from the current folder
 
 # Phase 4 (creating module for loading documents from the folder)
 
-- [ ] Create file `enrichment.py` with a function that loads data from the data folder into the chosen vector storage, using the data loaders configured for each file extension. Remember to set default embedding metadata properties such as filename, paragraph, page, and section wherever possible (documents can have pages, sections, paragraphs, etc.). Use the text splitters of the chosen RAG framework according to the documents being loaded; which chunking/text-splitting strategies the framework offers can be learned online.
-- [ ] Use the framework's built-in strategy for tracking which documents have been loaded (if such a mechanism exists) and which have not, to avoid re-reading and re-enriching the vector storage with existing data. If there is no built-in mechanism of this type, install a sqlite library and use the local sqlite database file `document_tracking.db` to store this information. Important: mark documents as read and processed ONLY after they have been stored in the vector storage, so that documents are never skipped while they have not actually been inserted and processed.
-- [ ] Add activation of this function in the CLI entrypoint, as a command.
+- [x] Create file `enrichment.py` with a function that loads data from the data folder into the chosen vector storage, using the data loaders configured for each file extension. Remember to set default embedding metadata properties such as filename, paragraph, page, and section wherever possible (documents can have pages, sections, paragraphs, etc.). Use the text splitters of the chosen RAG framework according to the documents being loaded; which chunking/text-splitting strategies the framework offers can be learned online.
+- [x] Use the framework's built-in strategy for tracking which documents have been loaded (if such a mechanism exists) and which have not, to avoid re-reading and re-enriching the vector storage with existing data. If there is no built-in mechanism of this type, install a sqlite library and use the local sqlite database file `document_tracking.db` to store this information. Important: mark documents as read and processed ONLY after they have been stored in the vector storage, so that documents are never skipped while they have not actually been inserted and processed.
+- [x] Add activation of this function in the CLI entrypoint, as a command.
 
 # Phase 5 (preparation for the retrieval feature)
 
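
The checked-off loading task could look roughly like the sketch below, assuming llama-index >= 0.10; `DATA_DIR`, the chunk sizes, and the `enrich` function name are placeholders rather than settled project decisions. `SimpleDirectoryReader` attaches metadata such as `file_name` (and `page_label` for paginated formats), which covers the metadata requirement in the task:

```python
# Hypothetical sketch of the Phase 4 loading function, assuming llama-index >= 0.10.
# DATA_DIR, the chunk sizes, and the function name are illustrative placeholders.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

DATA_DIR = "./../../../data"  # the data folder chosen earlier in the plan

def enrich() -> VectorStoreIndex:
    # SimpleDirectoryReader picks a loader per file extension and attaches
    # metadata such as file_name (and page_label for paginated formats).
    documents = SimpleDirectoryReader(input_dir=DATA_DIR, recursive=True).load_data()
    # SentenceSplitter is one of the framework's built-in text splitters.
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
    # Builds the index with the embedding model configured in Settings.
    return VectorStoreIndex.from_documents(documents, transformations=[splitter])
```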
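For the document-tracking task, llama-index's ingestion pipeline can deduplicate already-ingested documents against an attached docstore; if that route is not used, a minimal sqlite fallback along the lines described in the task might look like this. Only `document_tracking.db` comes from the plan; the table and column names are assumptions:

```python
# Fallback tracking sketch using only the standard library; the table and
# column names are assumptions, only document_tracking.db comes from the plan.
import sqlite3

DB_PATH = "document_tracking.db"

def _connect() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_documents (path TEXT PRIMARY KEY)"
    )
    return conn

def is_processed(path: str) -> bool:
    with _connect() as conn:
        return conn.execute(
            "SELECT 1 FROM processed_documents WHERE path = ?", (path,)
        ).fetchone() is not None

def mark_processed(path: str) -> None:
    # Call this only after the document's chunks are stored in the vector
    # storage, so files are never marked as done before they are inserted.
    with _connect() as conn:
        conn.execute(
            "INSERT OR IGNORE INTO processed_documents (path) VALUES (?)", (path,)
        )
```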
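Wiring the enrichment function into the CLI entrypoint as a command could follow the argparse sketch below; the real entrypoint may use a different CLI library, and `enrichment.enrich` refers to the hypothetical function sketched above, not confirmed project code:

```python
# Hypothetical CLI wiring using argparse; the actual project may use another
# CLI library and different command names.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="rag")
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser(
        "enrich", help="Load documents from the data folder into the vector storage"
    )
    args = parser.parse_args()
    if args.command == "enrich":
        from enrichment import enrich  # the module created in Phase 4
        enrich()

if __name__ == "__main__":
    main()
```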