langchain uploading new way of predefined paths from yandex disk

2026-02-26 00:01:47 +03:00
parent 2c7ab06b3f
commit 3e29ea70ed
7 changed files with 365 additions and 17 deletions
--- a/services/rag/langchain/PLANNING.md
+++ b/services/rag/langchain/PLANNING.md
@@ -85,7 +85,6 @@ During enrichment, we should use adaptive collection from the helpers, for loadi
 - [x] We still will need filetypes that we will need to skip, so while iterating over files we need to check their extension and skip them.
 - [x] Adaptive files has filename in them, so it should be used when extracting metadata

-
 # Phase 13 (async processing of files)

 During this Phase we create asynchronous process of enrichment, utilizing async/await
@@ -111,3 +110,18 @@ During this Phase we create asynchronous process of enrichment, utilizing async/
 - [x] Create prefect task "iterate_yadisk_folder_and_store_file_paths" that will connect to yandex disk with yadisk library, analyze everything inside folder `Общая` recursively and store file paths in the ./../../../yadisk_files.json, in array of strings.
 - [x] In our pefect file add function for flow to serve, as per prefect documentation on serving flows
 - [x] Tests will be done manually by hand, by executing this script and checking prefect dashboard. No automatical tests needed for this phase.
+
+# Phase 15 (prefect enrichment process for langchain, with predefined values, also removal of non-documet formats)
+
+- [x] Remove for now formats, extensions for images of any kind, archives of any kind, and add possible text documents, documents formats, like .txt, .xlsx, etc. in enrichment processes/functions.
+- [x] Create prefect client file in `prefect/02_yadisk_predefined_enrich.py`. This file will firt load file from ./../../../yadisk_files.json into array of paths. After that, array of paths will be filtered, and only supported in enrichment extensions will be left. After that, code will iterate through each path in this filtered array, use yadisk library to download file, process it for enrichment, and the remove it after processing. There should be statistics for this, at runtime, with progressbar that shows how many files processed out of how many left. Also, near the progressbar there should be counter of errors. Yes, if there is an error, it should be swallowed, even if it is inside thred or async function.
+- [x] For yandex disk integration use library yadisk. In .env file there should be variable YADISK_TOKEN for accessing the needed connection
+- [x] Code for loading should be reflected upon, and then made it so it would be done in async way, with as much as possible simulatenous tasks. yadisk async integration should be used (async features can be checked here: https://pypi.org/project/yadisk/)
+- [x] No tests for code should be done at this phase, all tests will be done manually, because loading of documents can take a long time for automated test.
+
+# Phase 16 (making demo ui scalable)
+
+- [ ] Make demo-ui window containable and reusable part of html + js. This part will be used for creating multi-windowed demo ui.
+- [ ] Make tabbed UI with top level tabs. First tab exists and is selected. Each tab should have copy of demo ui, meaning the chat window with ability to specify the api url
+- [ ] At the end of the tabs there should be button with plus sign, which will add new tab. Tabs to be called by numbers.
+- [ ] There should predefined 3 tabs opened. First one should have predefined api url "https://rag.langchain.overwatch.su/api/test-query", second "https://rag.llamaindex.overwatch.su/api/test-query", third "https://rag.haystack.overwatch.su/api/test-query"