ragflow in the repository, with codex-created yandex disk plugin JUST IN CASE, also llamaindex enrichment with yandex disk predefined data

2026-02-25 11:28:29 +03:00
parent c29928cc89
commit 2c7ab06b3f
12 changed files with 98507 additions and 132 deletions
--- a/services/rag/llamaindex/PLANNING.md
+++ b/services/rag/llamaindex/PLANNING.md
@@ -49,10 +49,23 @@ Chosen data folder: relative ./../../../data - from the current folder

 # Phase 8 (comment unsupported formats for now)

- [ ] For now, remove image and archive formats/extensions of any kind, and add text and document formats such as .txt, .xlsx, etc.
+- [x] For now, remove image and archive formats/extensions of any kind, and add text and document formats such as .txt, .xlsx, etc. in the enrichment processes/functions.

 # Phase 9 (integration of Prefect client, for creating flow and tasks on remote Prefect server)

- [ ] Install Prefect client library.
- [ ] Add the .env variable PREFECT_API_URL, which the client will use to connect to the Prefect server
- [ ] Create
+- [x] Install Prefect client library.
+- [x] Add the .env variable PREFECT_API_URL, which the client will use to connect to the Prefect server
+- [x] Create the Prefect client file `prefect/01_yadisk_predefined_enrich.py`. The file first loads ./../../../yadisk_files.json into an array of paths, then filters that array so that only extensions supported by enrichment remain. The code then iterates over the filtered paths: it downloads each file with the yadisk library, processes it for enrichment, and removes it after processing. Runtime statistics are required: a progress bar showing how many files have been processed and how many are left, with an error counter next to it. Errors must be swallowed, even when they occur inside a thread or an async function.
+- [x] Use the yadisk library for the Yandex Disk integration. The .env file should contain the variable YADISK_TOKEN, used to access the connection
+- [x] Review the loading code, then make it asynchronous, with as many simultaneous tasks as possible. Use the yadisk async integration (async features are described here: https://pypi.org/project/yadisk/)
+- [x] Do not write tests for the code in this phase; all testing will be manual, because loading documents can take too long for an automated test.
+
+# Phase 10 (qdrant connection credentials in .env)
+
+- [x] Add Qdrant connection variables to the .env file: QDRANT_HOST, QDRANT_REST_PORT, QDRANT_GRPC_PORT
+- [x] Replace hardcoded values everywhere a Qdrant connection is used with the Qdrant .env variables
+
+# Phase 11 (http endpoint to retrieve data from the vector storage by query)
+
+- [ ] Create the file `server.py` using a web framework, for example FastAPI
+- [ ] Add a POST endpoint "/api/test-query" that uses the agent to retrieve a response for the query sent in JSON format in the field "query"
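
The Phase 9 flow in the plan above (filter paths by extension, process files concurrently, swallow and count errors, report progress) could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this commit: `SUPPORTED_EXTENSIONS`, `filter_supported`, `enrich_all`, and `fake_handle_file` are hypothetical names, the extension set is an assumed sample, and the actual yadisk download, enrichment, and cleanup steps are replaced by a stand-in coroutine.

```python
import asyncio
from pathlib import PurePosixPath

# Assumption: an illustrative sample; the real list lives in the enrichment code.
SUPPORTED_EXTENSIONS = {".txt", ".xlsx"}


def filter_supported(paths):
    """Keep only paths whose extension is supported (case-insensitive)."""
    return [
        p for p in paths
        if PurePosixPath(p).suffix.lower() in SUPPORTED_EXTENSIONS
    ]


async def enrich_all(paths, handle_file, max_concurrency=8):
    """Run handle_file(path) for every path concurrently, swallowing errors.

    Returns counters: total paths, attempts finished ("done"), and errors.
    """
    stats = {"total": len(paths), "done": 0, "errors": 0}
    # The semaphore caps how many files are handled at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path):
        async with semaphore:
            try:
                await handle_file(path)
            except Exception:
                # Swallow the error, as the plan requires, but count it.
                stats["errors"] += 1
            finally:
                stats["done"] += 1
                # Plain-text progress line with the error counter beside it.
                print(
                    f"\r{stats['done']}/{stats['total']} processed, "
                    f"{stats['errors']} errors",
                    end="",
                    flush=True,
                )

    await asyncio.gather(*(worker(p) for p in paths))
    print()
    return stats


async def fake_handle_file(path):
    """Hypothetical stand-in for: download, enrich, then delete the file."""
    await asyncio.sleep(0)
    if "broken" in path:
        raise RuntimeError(f"cannot process {path}")


if __name__ == "__main__":
    demo_paths = filter_supported(["a.txt", "b.PNG", "broken.xlsx", "c.zip"])
    print(asyncio.run(enrich_all(demo_paths, fake_handle_file)))
```

Catching the exception inside each worker, rather than around `asyncio.gather`, keeps one failing file from cancelling the rest of the batch, which appears to be the intent of the "errors should be swallowed" requirement.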