Adaptive Collection, and Phase 11 WIP

2026-02-10 20:12:43 +03:00
parent 447ecaba39
commit 63c3e2c5c7
4 changed files with 56 additions and 2 deletions

BIN
.DS_Store vendored

Binary file not shown.


@@ -6,3 +6,4 @@ CHAT_MODEL_STRATEGY=ollama
QDRANT_HOST=HOST
QDRANT_REST_PORT=PORT
QDRANT_GRPC_PORT=PORT
YADISK_TOKEN=TOKEN
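
A minimal sketch of how the new variable might be read at startup. The function name `read_yadisk_token` is hypothetical and not part of this commit. Treating the literal placeholder `TOKEN` as "not configured" is also an assumption.

```python
import os
from typing import Mapping, Optional


def read_yadisk_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Yandex Disk token from the environment (hypothetical helper)."""
    # Allow a mapping to be injected so the function can be tested without
    # touching the real process environment.
    source = os.environ if env is None else env
    token = source.get("YADISK_TOKEN", "").strip()
    # Assumption: the sample value "TOKEN" means the variable was never filled in.
    if not token or token == "TOKEN":
        raise RuntimeError("YADISK_TOKEN is not set; fill it in the .env file")
    return token
```

Failing early with a clear message keeps a missing token from surfacing later as an opaque authorization error.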


@@ -65,3 +65,11 @@ Chosen data folder: relative ./../../../data - from the current folder
- [x] Create a heuristic regex function in the helpers module for extracting event names from Russian text. It should use regex and the words that typically appear before and after an event name.
- [x] During enrichment of the vector storage, try to extract event names from each chunk and save them in the metadata field "events", a list of strings with the possible events. Using the helper function is advised.
- [x] In VectorStoreRetriever._get_relevant_documents, add a similarity search on the event name when one is present in the query. Use the helper function to try to extract the event name from the query.
# Phase 11 (adaptive collection, to attach different filesystems in the future)
- [x] Create an adaptive collection class and an adaptive file class in the helpers module, both abstract, covering iteration over files and working with a file locally
- [x] Write the local filesystem implementation of the adaptive collection
- [ ] Write tests for the local filesystem implementation, using the test/samples folder filled with files and directories to test iteration and recursion
- [ ] Create the Yandex Disk implementation of the adaptive collection. The constructor should require a TOKEN for Yandex Disk.
- [ ] Write tests for the Yandex Disk implementation, using the folder "Общая/Информация". .env has the YADISK_TOKEN variable for connecting. While testing, log the files found during iteration. If the test fails at this step, leave it for manual fixing; the step can still be marked as done.
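
The remote implementation planned above could follow the shape below. This is only a sketch under stated assumptions: `RemoteClient`, `RemoteAdaptiveFile`, `RemoteAdaptiveCollection`, and `InMemoryClient` are hypothetical names, and the three client methods are an invented interface that a real Yandex Disk client would have to be adapted to. No actual Yandex Disk API is used here.

```python
import os
import tempfile
from pathlib import PurePosixPath
from typing import Callable, Dict, Iterator, List, Protocol


class RemoteClient(Protocol):
    """Hypothetical interface; not the API of any real Yandex Disk library."""

    def list_dir(self, path: str) -> List[str]: ...
    def is_dir(self, path: str) -> bool: ...
    def download(self, remote_path: str, local_path: str) -> None: ...


class RemoteAdaptiveFile:
    def __init__(self, client: RemoteClient, remote_path: str):
        self.client = client
        self.remote_path = remote_path
        self.extension = PurePosixPath(remote_path).suffix  # Format: .jpg

    def work_with_file_locally(self, func: Callable[[str], None]) -> None:
        # Download into a temporary file, run the callback, then always clean up.
        fd, local_path = tempfile.mkstemp(suffix=self.extension)
        os.close(fd)
        try:
            self.client.download(self.remote_path, local_path)
            func(local_path)
        finally:
            os.remove(local_path)


class RemoteAdaptiveCollection:
    def __init__(self, client: RemoteClient, base_dir: str):
        self.client = client
        self.base_dir = base_dir

    def iterate(self, recursive: bool) -> Iterator[RemoteAdaptiveFile]:
        pending = [self.base_dir]
        while pending:
            current = pending.pop()
            for path in self.client.list_dir(current):
                if self.client.is_dir(path):
                    # Subdirectories are only visited when recursion is requested.
                    if recursive:
                        pending.append(path)
                else:
                    yield RemoteAdaptiveFile(self.client, path)


class InMemoryClient:
    """Stand-in used only to exercise the sketch without network access."""

    def __init__(self, files: Dict[str, bytes]):
        self.files = files

    def list_dir(self, path: str) -> List[str]:
        prefix = path.rstrip("/") + "/"
        children = {
            prefix + name[len(prefix):].split("/")[0]
            for name in self.files
            if name.startswith(prefix)
        }
        return sorted(children)

    def is_dir(self, path: str) -> bool:
        return path not in self.files

    def download(self, remote_path: str, local_path: str) -> None:
        with open(local_path, "wb") as handle:
            handle.write(self.files[remote_path])
```

Injecting the client keeps the collection testable offline, which fits the note above that the live Yandex Disk test may need manual fixing.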


@@ -1,8 +1,10 @@
"""Helper utilities for metadata extraction from Russian text."""
import os
import re
from typing import List
from abc import abstractmethod
from pathlib import Path
from typing import Callable, List
_YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2}|2100)(?!\d)")
@@ -105,3 +107,46 @@ def extract_russian_event_names(text: str) -> List[str]:
seen.add(quoted)
return events


class _AdaptiveFile:
    extension: str  # Format: .jpg
    local_path: str

    def __init__(self, extension: str, local_path: str):
        self.extension = extension
        self.local_path = local_path

    # Lets the caller work with the file locally through the provided callback.
    # It is a separate method so that an implementation can download the file
    # first, if needed, and clean up after the work is done.
    # The callback's first argument is a local path.
    @abstractmethod
    def work_with_file_locally(self, func: Callable[[str], None]):
        pass


class _AdaptiveCollection:
    # Generator method: implementations yield _AdaptiveFile instances
    @abstractmethod
    def iterate(self, recursive: bool):
        pass


class LocalFilesystemAdaptiveFile(_AdaptiveFile):
    def work_with_file_locally(self, func: Callable[[str], None]):
        func(self.local_path)


class LocalFilesystemAdaptiveCollection(_AdaptiveCollection):
    base_dir: str

    def __init__(self, base_dir: str):
        super().__init__()
        self.base_dir = base_dir

    def iterate(self, recursive: bool):
        for root, dirs, files in os.walk(self.base_dir):
            for file in files:
                full_path = os.path.join(root, file)
                # Yield the concrete class so work_with_file_locally does real work.
                yield LocalFilesystemAdaptiveFile(Path(full_path).suffix, full_path)
            if not recursive:
                # os.walk visits base_dir first; stop there for a flat listing.
                break
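
The classes in this hunk use `@abstractmethod` without inheriting from `abc.ABC`, so Python does not stop anyone from instantiating the base classes directly. The following is a self-contained sketch of an ABC-based variant together with a temporary-directory check of flat versus recursive iteration. The names `AdaptiveFile`, `AdaptiveCollection`, `LocalFile`, `LocalCollection`, `collect_names`, and `demo` are hypothetical stand-ins, not the commit's classes.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Iterator, List


class AdaptiveFile(ABC):
    def __init__(self, extension: str, local_path: str):
        self.extension = extension  # Format: .jpg
        self.local_path = local_path

    @abstractmethod
    def work_with_file_locally(self, func: Callable[[str], None]) -> None:
        """Call func with a path that is valid on the local machine."""


class AdaptiveCollection(ABC):
    @abstractmethod
    def iterate(self, recursive: bool) -> Iterator[AdaptiveFile]:
        """Yield the files of the collection."""


class LocalFile(AdaptiveFile):
    def work_with_file_locally(self, func: Callable[[str], None]) -> None:
        func(self.local_path)


class LocalCollection(AdaptiveCollection):
    def __init__(self, base_dir: str):
        self.base_dir = base_dir

    def iterate(self, recursive: bool) -> Iterator[AdaptiveFile]:
        for root, _dirs, files in os.walk(self.base_dir):
            for name in files:
                full_path = os.path.join(root, name)
                yield LocalFile(Path(full_path).suffix, full_path)
            if not recursive:
                break  # the first os.walk step is base_dir itself


def collect_names(base_dir: str, recursive: bool) -> List[str]:
    names: List[str] = []
    for item in LocalCollection(base_dir).iterate(recursive=recursive):
        item.work_with_file_locally(lambda p: names.append(os.path.basename(p)))
    return sorted(names)


def demo():
    # Build a tiny tree: one top-level file and one file in a subdirectory.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text("top")
        Path(tmp, "nested").mkdir()
        Path(tmp, "nested", "b.md").write_text("deep")
        return collect_names(tmp, False), collect_names(tmp, True)
```

With this tree, `demo()` returns `(["a.txt"], ["a.txt", "b.md"])`, and calling `AdaptiveCollection()` directly raises `TypeError`. A similar layout under test/samples could serve the pending local filesystem tests.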