rag-solution/services/rag/llamaindex/PLANNING.md

# Requirements

Libraries should be installed into the local virtual environment, which is defined in the `venv` folder.
If some libraries are not installed, check online which are best and install them.
Where possible, log each step using the `loguru` library. Write logs to `logs/dev.log` with log rotation, and also log to stdout.

Chosen RAG framework: LlamaIndex
Chosen Vector Storage: Qdrant
Chosen data folder: `./../../../data`, relative to the current folder

# Phase 1 (cli entrypoint)

- [x] Create virtual env in the `venv` folder in the current directory.
- [x] Create a `cli.py` file using the `click` Python library. Make "ping" the default command; it should print "pong".

# Phase 2 (installation of base framework for RAG solution and preparation for data loading)

- [x] Install LlamaIndex as the base framework for the RAG solution.
- [x] Analyze the upper `data` folder (`./../../../data`) to learn all the file extensions present there. Then create `EXTENSIONS.md` in the current directory, listing each extension together with the loader or loaders of the chosen framework needed to load it (this can be learned online). Prioritize libraries that work without external services requiring API keys or paid subscriptions. Important: skip streaming media extensions (audio, video); we are not going to load them now.
- [x] Install all libraries needed for the loaders listed in `EXTENSIONS.md`. If some libraries require API keys for external services, add those keys to the `.env` file (create it if it does not exist).
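
The extension survey described in this phase could be sketched roughly as follows. This is a minimal illustration using only the standard library; `MEDIA_EXTENSIONS` and `collect_extensions` are hypothetical names, and the list of media extensions is an assumption, not the project's actual list.

```python
from collections import Counter
from pathlib import Path

# Assumption: an illustrative set of streaming-media extensions to skip.
MEDIA_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".mp4", ".avi", ".mkv", ".mov", ".webm"}


def collect_extensions(data_dir: Path) -> Counter:
    """Count files per lowercase extension, skipping audio/video files."""
    counts: Counter = Counter()
    for path in data_dir.rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        # Files without an extension are ignored in this sketch.
        if ext and ext not in MEDIA_EXTENSIONS:
            counts[ext] += 1
    return counts
```

Calling `collect_extensions(Path("./../../../data"))` would give the counts from which `EXTENSIONS.md` can be written by hand or by a small script.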

# Phase 3 (preparation for storing data in the vector storage + embeddings)

- [x] Install the library needed to use Qdrant as the vector storage. Make sure the ports required by the chosen framework are used: REST API on 6333, gRPC API on 6334. The database is available and running on localhost.
- [x] Create a file called `vector_storage.py` containing the vector storage initialization, so that other modules can import the initialized storage. If the chosen RAG framework needs it, add embedding model initialization to the same file. Use Ollama, with the model name defined in the `.env` file as `OLLAMA_EMBEDDING_MODEL`. Ollama is available on the default local port, 11434.
- [x] Add a strategy for creating the collection for this project (name: "documents_llamaindex") if it does not exist.
- [x] Just in case, add the possibility of connecting via OpenAI embeddings using an OpenRouter API key. Comment this section out so it can be used in the future.

# Phase 4 (creating module for loading documents from the folder)

- [x] Create a file `enrichment.py` with a function that loads data from the data folder into the chosen vector storage, using the data loaders configured for each extension. Remember to set default embedding metadata properties, such as filename, paragraph, page, and section, wherever possible (documents can have pages, sections, paragraphs, etc.). Use the chosen RAG framework's text splitters appropriate to the documents being loaded. The chunking/text-splitting strategies the framework offers can be learned online.
- [x] Use the built-in strategy (if such a mechanism exists) for marking which documents have been loaded and which have not, to avoid re-reading documents and re-enriching the vector storage with existing data. If there is no built-in mechanism of this type, install the sqlite library and use a local sqlite database file, `document_tracking.db`, to store this information. Important: mark documents as read and processed ONLY after they have been stored in the vector storage, so that a marked document is never skipped when it has not actually been inserted and processed.
- [x] Add a command to the CLI entrypoint that runs this function.
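
The "mark only after storing" rule for the sqlite fallback could look roughly like this. It is a sketch, not the actual `enrichment.py`: the table name `processed_documents`, the helper names, and the `store` callback (standing in for the real insertion into the vector storage) are all hypothetical.

```python
import sqlite3
from typing import Callable


def open_tracking_db(db_path: str = "document_tracking.db") -> sqlite3.Connection:
    """Open the tracking database and create the table if it is missing."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_documents ("
        "path TEXT PRIMARY KEY, processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn


def is_processed(conn: sqlite3.Connection, path: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM processed_documents WHERE path = ?", (path,)
    ).fetchone()
    return row is not None


def process_once(conn: sqlite3.Connection, path: str, store: Callable[[str], None]) -> bool:
    """Store a document unless it is already tracked.

    Returns True if the document was stored now, False if it was skipped.
    """
    if is_processed(conn, path):
        return False
    # If store() raises, the INSERT below never runs, so the document
    # stays unmarked and will be retried on the next run.
    store(path)
    conn.execute("INSERT OR IGNORE INTO processed_documents (path) VALUES (?)", (path,))
    conn.commit()
    return True
```

Keeping the insert strictly after the `store` call is what guarantees that a failed upload is never recorded as done.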

# Phase 5 (preparation for the retrieval feature)

- [x] Create a file `retrieval.py` with the configuration for the chosen RAG framework that retrieves data from the vector storage based on a query. Use a retrieval library/plugin that supports the chosen vector storage within the chosen RAG framework. The retrieval function should take the query text as an argument, search for it, and return the found information together with the stored metadata, such as paragraph, section, and page. Important: if the chosen RAG framework does not require separating search from retrieval from the vector storage, this step may be skipped and marked done.

# Phase 6 (models strategy, loading env and update on using openai models)

- [x] Add `CHAT_STRATEGY` and `EMBEDDING_STRATEGY` fields to `.env`; possible values are "openai" or "ollama".
- [x] Add `OPENAI_CHAT_URL`, `OPENAI_CHAT_KEY`, `OPENAI_EMBEDDING_MODEL`, `OPENAI_EMBEDDING_BASE_URL`, `OPENAI_EMBEDDING_API_KEY` to `.env.dist` and to `.env`, both with dummy values.
- [x] Load the `.env` file everywhere in the code where its variables are needed.
- [x] Create a reusable function that returns the configuration for models. It checks `CHAT_STRATEGY`, loads environment variables accordingly, and returns the config for use.
- [x] Use this function everywhere in the codebase where chat or embedding model configuration is needed.
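
One possible shape for the reusable configuration function is sketched below. It assumes the `.env` file has already been loaded into the process environment. The function name `get_model_config`, the returned dictionary layout, the "ollama" fallback, and the `OLLAMA_CHAT_MODEL` variable are assumptions made for illustration; the other variable names come from this plan.

```python
import os
from typing import Mapping, Optional

OLLAMA_DEFAULT_URL = "http://localhost:11434"  # default local Ollama port from Phase 3
VALID_STRATEGIES = {"openai", "ollama"}


def _strategy(env: Mapping[str, str], name: str) -> str:
    # Assumption: fall back to "ollama" when the variable is unset.
    value = env.get(name, "ollama").strip().lower()
    if value not in VALID_STRATEGIES:
        raise ValueError(f"{name} must be 'openai' or 'ollama', got {value!r}")
    return value


def get_model_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return chat and embedding settings chosen by the strategy variables."""
    env = os.environ if env is None else env

    if _strategy(env, "CHAT_STRATEGY") == "openai":
        chat = {
            "strategy": "openai",
            "base_url": env.get("OPENAI_CHAT_URL"),
            "api_key": env.get("OPENAI_CHAT_KEY"),
        }
    else:
        chat = {
            "strategy": "ollama",
            "base_url": OLLAMA_DEFAULT_URL,
            "model": env.get("OLLAMA_CHAT_MODEL"),  # hypothetical variable name
        }

    if _strategy(env, "EMBEDDING_STRATEGY") == "openai":
        embedding = {
            "strategy": "openai",
            "model": env.get("OPENAI_EMBEDDING_MODEL"),
            "base_url": env.get("OPENAI_EMBEDDING_BASE_URL"),
            "api_key": env.get("OPENAI_EMBEDDING_API_KEY"),
        }
    else:
        embedding = {
            "strategy": "ollama",
            "base_url": OLLAMA_DEFAULT_URL,
            "model": env.get("OLLAMA_EMBEDDING_MODEL"),
        }

    return {"chat": chat, "embedding": embedding}
```

Accepting an optional mapping instead of reading `os.environ` directly keeps the function easy to check by hand with dummy values.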

# Phase 7 (explicit logging and progressbar)

- [x] Log how many files are being processed during enrichment. Each time a new document is processed, we need to see the total number to process and how many have been processed so far. If possible, also add a progress bar showing the percentage and those numbers on top of the logs.
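
As a rough illustration of the counters and percentage requested here, a plain-text progress line can be built without extra libraries. The function name `progress_line` and the bar format are assumptions; the real implementation may well use a dedicated progress bar library instead.

```python
def progress_line(done: int, total: int, width: int = 20) -> str:
    """Render a text progress bar with percentage and processed/total counts."""
    fraction = done / total if total else 1.0
    filled = int(width * fraction)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {fraction:.0%} ({done}/{total})"
```

For example, `progress_line(1, 4, width=8)` yields `[##------] 25% (1/4)`, which can be logged each time a document finishes.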

# Phase 8 (comment unsupported formats for now)

- [x] For now, remove image formats/extensions of any kind and archives of any kind from the enrichment processes/functions, and add possible text and document formats, such as .txt, .xlsx, etc.

# Phase 9 (integration of Prefect client, for creating flow and tasks on remote Prefect server)

- [x] Install Prefect client library.
- [x] Add the `.env` variable `PREFECT_API_URL`, which will be used to connect the client to the Prefect server.
- [x] Create a Prefect client file at `prefect/01_yadisk_predefined_enrich.py`. This file will first load `./../../../yadisk_files.json` into an array of paths. The array is then filtered so that only paths with extensions supported by enrichment remain. The code then iterates over each path in the filtered array, uses the yadisk library to download the file, processes it for enrichment, and then removes it after processing. Runtime statistics are required: a progress bar showing how many files have been processed and how many are left, with an error counter next to it. If an error occurs, it should be swallowed, even inside a thread or async function.
- [x] For Yandex Disk integration, use the yadisk library. The `.env` file should contain a `YADISK_TOKEN` variable for accessing the connection.
- [x] Review the loading code and make it asynchronous, with as many simultaneous tasks as possible. Use the yadisk async integration (async features are described here: https://pypi.org/project/yadisk/).
- [x] Do not write tests for this phase; all testing will be done manually, because loading documents can take too long for an automated test.
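
The filter, concurrency, and error-swallowing requirements of this phase could be combined along these lines. This is a sketch with the standard library only: `SUPPORTED_EXTENSIONS` is an illustrative assumption, and `handle` is a hypothetical coroutine standing in for "download with yadisk, enrich, delete the local file". No yadisk or Prefect calls are shown.

```python
import asyncio
from pathlib import PurePosixPath
from typing import Awaitable, Callable, Dict, Iterable, List

# Assumption: an illustrative subset of the extensions supported by enrichment.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx", ".xlsx"}


def filter_supported(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension is supported by enrichment."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in SUPPORTED_EXTENSIONS]


async def process_all(
    paths: List[str],
    handle: Callable[[str], Awaitable[None]],
    max_concurrency: int = 8,
) -> Dict[str, int]:
    """Run handle() for every path concurrently, counting instead of raising errors."""
    stats = {"total": len(paths), "finished": 0, "errors": 0}
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> None:
        async with semaphore:
            try:
                await handle(path)
            except Exception:
                # Swallow the error as required; only the counter records it.
                stats["errors"] += 1
            finally:
                stats["finished"] += 1

    await asyncio.gather(*(worker(p) for p in paths))
    return stats
```

The semaphore caps how many downloads run at once, and the `stats` dictionary provides the numbers a progress bar with an error counter would display.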

# Phase 10 (qdrant connection credentials in .env)

- [x] Add Qdrant connection variables to the .env file: QDRANT_HOST, QDRANT_REST_PORT, QDRANT_GRPC_PORT
- [x] Replace hardcoded values everywhere a Qdrant connection is used with the Qdrant `.env` variables.
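
Reading these variables in one place might look like the sketch below. The `QdrantSettings` class and `load_qdrant_settings` function are hypothetical names; the fallback values mirror the localhost and 6333/6334 ports stated in Phase 3, and the `.env` file is assumed to be loaded into the environment beforehand.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class QdrantSettings:
    host: str
    rest_port: int
    grpc_port: int


def load_qdrant_settings(env: Optional[Mapping[str, str]] = None) -> QdrantSettings:
    """Build Qdrant connection settings from environment variables."""
    env = os.environ if env is None else env
    return QdrantSettings(
        host=env.get("QDRANT_HOST", "localhost"),
        rest_port=int(env.get("QDRANT_REST_PORT", "6333")),
        grpc_port=int(env.get("QDRANT_GRPC_PORT", "6334")),
    )
```

Modules that open a Qdrant connection can then take a `QdrantSettings` value instead of repeating literals.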

# Phase 11 (http endpoint to retrieve data from the vector storage by query)

- [x] Create a file `server.py` using a web framework, for example FastAPI.
- [x] Add a POST endpoint "/api/test-query" that uses the agent to retrieve a response for the query sent in JSON format in the "query" field.

# Phase 12 (upgrade from simple retrieval to agent-like chat in LlamaIndex)

- [x] Revisit Phase 5 assumption ("simple retrieval only") and explicitly allow agent/chat orchestration in LlamaIndex for QA over documents.
- [x] Create new module for chat orchestration (for example `agent.py` or `chat_engine.py`) that separates:
  1) retrieval of source nodes
  2) answer synthesis with explicit prompt
  3) response formatting with sources/metadata
- [x] Implement a LlamaIndex-based chat feature (agent-like behavior) using framework-native primitives (chat engine / agent workflow / tool-calling approach supported by installed version), so the model can iteratively query retrieval tools when needed.
- [x] Add a retrieval tool wrapper for document search that returns structured snippets (`filename`, `file_path`, `page_label/page`, `chunk_number`, content preview, score) instead of raw text only.
- [x] Add a grounded answer prompt/template for the LlamaIndex chat path with rules:
  - answer only from retrieved context
  - if information is missing, say so directly
  - prefer exact dates/years and quote filenames/pages where possible
  - avoid generic claims not supported by sources
- [x] Add response mode that returns both:
  - final answer text
  - list of retrieved sources (content snippet + metadata + score)
- [x] Add post-processing for retrieved nodes before synthesis:
  - deduplicate near-identical chunks
  - drop empty / near-empty chunks
  - optionally filter low-information chunks (headers/footers)
- [x] Add optional metadata-aware retrieval improvements (years/events/keywords) for parity with the LangChain approach (in the folder next to the current one), if feasible with the chosen LlamaIndex primitives.
- [x] Update `server.py` endpoint to use the new agent-like chat path (keep simple retrieval endpoint available as fallback or debug mode).
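
The post-processing step listed in this phase could be approximated as shown below. It works on plain dictionaries rather than LlamaIndex node objects, and it treats chunks as "near-identical" only when they match after whitespace and case normalization, which is a deliberate simplification. The function name, the `text` key, and the `min_chars` threshold are assumptions.

```python
import re
from typing import Dict, List


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase the text for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def postprocess_chunks(chunks: List[Dict], min_chars: int = 20) -> List[Dict]:
    """Drop empty or near-empty chunks and duplicates, keeping original order."""
    seen = set()
    kept = []
    for chunk in chunks:
        key = _normalize(chunk.get("text", ""))
        if len(key) < min_chars:
            continue  # empty or low-information chunk such as a bare page header
        if key in seen:
            continue  # same content already kept
        seen.add(key)
        kept.append(chunk)
    return kept
```

A fuzzier comparison (for example, a similarity ratio) could replace the exact match on normalized text if retrieved chunks differ by more than spacing and case.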