Adaptive Collection, and Phase 11 WIP
@@ -6,3 +6,4 @@ CHAT_MODEL_STRATEGY=ollama
QDRANT_HOST=HOST
QDRANT_REST_PORT=PORT
QDRANT_GRPC_PORT=PORT
YADISK_TOKEN=TOKEN

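The new `YADISK_TOKEN` entry is only a placeholder in the example env file. A minimal sketch of how such a variable might be read at startup, assuming it ends up in the process environment; the helper name `read_yadisk_token` is hypothetical and not part of this commit:

```python
import os
from typing import Mapping, Optional


def read_yadisk_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Yandex Disk token, or raise if it is missing or still a placeholder."""
    # Assumption: the token is exposed as an environment variable named YADISK_TOKEN.
    env = os.environ if env is None else env
    token = env.get("YADISK_TOKEN", "").strip()
    if not token or token == "TOKEN":
        raise RuntimeError("YADISK_TOKEN is not set; fill it in before connecting")
    return token
```

Failing early on the literal placeholder `TOKEN` keeps a misconfigured environment from surfacing later as an opaque authentication error.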
@@ -65,3 +65,11 @@ Chosen data folder: ./../../../data, relative to the current folder
- [x] Create a regex-based heuristic function in the helpers module for extracting event names from Russian text. It should rely on regular expressions and on words that typically appear before or after an event name.
- [x] While enriching the vector storage, try to extract event names from each chunk and save them in the metadata field "events", a list of strings with the candidate events. Using the helper function is advised.
- [x] In VectorStoreRetriever._get_relevant_documents, add a similarity search on the event name when one is present in the query. Use the helper function to try to extract the event name.

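The heuristic described in the items above can be sketched roughly as follows. This is not the `extract_russian_event_names` implementation from the helpers module, which is only partly visible in this diff. The trigger-word list, the pattern and the function name `sketch_event_names` are all assumptions made for illustration:

```python
import re
from typing import List

# Hypothetical trigger words that often precede an event name in Russian text.
_TRIGGERS = r"(?:фестиваль|конференция|форум|выставка|концерт)"
# A trigger word followed by a name in «guillemets» or "straight quotes".
_EVENT_PATTERN = re.compile(
    _TRIGGERS + r'\s+(?:«([^»]+)»|"([^"]+)")',
    re.IGNORECASE,
)


def sketch_event_names(text: str) -> List[str]:
    """Return quoted names that follow a trigger word, de-duplicated in order."""
    events: List[str] = []
    seen = set()
    for match in _EVENT_PATTERN.finditer(text):
        quoted = (match.group(1) or match.group(2)).strip()
        key = quoted.lower()  # compare case-insensitively when de-duplicating
        if quoted and key not in seen:
            events.append(quoted)
            seen.add(key)
    return events
```

The resulting list is the kind of value that could be stored under the "events" metadata key, and the same function could be applied to a query before the similarity search.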
# Phase 11 (adaptive collection, to attach different filesystems in the future)

- [x] Create an adaptive collection class and an adaptive file class in the helpers module. Both are abstract classes that encapsulate iterating over files and working with them locally.
- [x] Write a local filesystem implementation of the adaptive collection.
- [ ] Write tests for the local filesystem implementation, using the test/samples folder filled with files and directories to test iteration and recursion.
- [ ] Create a Yandex Disk implementation of the adaptive collection. The constructor should require a TOKEN for Yandex Disk.
- [ ] Write tests for the Yandex Disk implementation, using the folder "Общая/Информация". The .env file has a YADISK_TOKEN variable for connecting. While testing, log the files found during iteration. If the test fails at this step, leave it for manual fixing; this step can then be marked as done.

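The pending local filesystem test could look roughly like the sketch below. It swaps in a temporary directory for the test/samples folder and a stand-in function `iter_paths` for the collection's `iterate` method, so both are assumptions rather than the project's actual test setup:

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator


def iter_paths(base_dir: str, recursive: bool) -> Iterator[str]:
    """Stand-in for the collection's iterate(): yield file paths under base_dir."""
    for root, dirs, files in os.walk(base_dir):
        if not recursive:
            dirs.clear()  # stop os.walk from descending into subdirectories
        for name in sorted(files):
            yield os.path.join(root, name)


def test_iteration_respects_recursive_flag() -> None:
    with tempfile.TemporaryDirectory() as base:
        # One file at the top level and one inside a nested directory.
        Path(base, "a.txt").write_text("top", encoding="utf-8")
        Path(base, "nested").mkdir()
        Path(base, "nested", "b.md").write_text("deep", encoding="utf-8")

        flat = {Path(p).name for p in iter_paths(base, recursive=False)}
        deep = {Path(p).name for p in iter_paths(base, recursive=True)}

        assert flat == {"a.txt"}
        assert deep == {"a.txt", "b.md"}
```

Comparing sets of file names keeps the test independent of the order in which `os.walk` visits directories.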
@@ -1,8 +1,10 @@
"""Helper utilities for metadata extraction from Russian text."""

import os
import re
from abc import abstractmethod
from pathlib import Path
from typing import Callable, List

_YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2}|2100)(?!\d)")

@@ -105,3 +107,46 @@ def extract_russian_event_names(text: str) -> List[str]:
            seen.add(quoted)

    return events


class _AdaptiveFile:
    extension: str  # Format: .jpg
    local_path: str

    def __init__(self, extension: str, local_path: str):
        self.extension = extension
        self.local_path = local_path

    # Runs the given callable against a local copy of the file.
    # It is a separate method so that implementations can download the file
    # first, if needed, and clean up after the work is done.
    # The callable receives the local path as its first argument.
    @abstractmethod
    def work_with_file_locally(self, func: Callable[[str], None]):
        pass


class _AdaptiveCollection:
    # Generator method with yield
    @abstractmethod
    def iterate(self, recursive: bool):
        pass


class LocalFilesystemAdaptiveFile(_AdaptiveFile):
    def work_with_file_locally(self, func: Callable[[str], None]):
        func(self.local_path)


class LocalFilesystemAdaptiveCollection(_AdaptiveCollection):
    base_dir: str

    def __init__(self, base_dir: str):
        super().__init__()

        self.base_dir = base_dir

    def iterate(self, recursive: bool):
        for root, dirs, files in os.walk(self.base_dir):
            if not recursive:
                # Prune subdirectories so os.walk stays in base_dir.
                dirs.clear()
            for file in files:
                full_path = os.path.join(root, file)
                # Yield the concrete class, whose work_with_file_locally is implemented.
                yield LocalFilesystemAdaptiveFile(Path(full_path).suffix, full_path)
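The base classes in this diff use `@abstractmethod` without inheriting from `ABC`. Python enforces abstract methods only on classes whose metaclass is `ABCMeta`, so such a base can still be instantiated directly. A minimal sketch of the difference, independent of this commit; the names `PlainBase`, `StrictBase`, `LocalFile` and `can_instantiate` are illustrative and do not appear in the diff:

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class PlainBase:
    """No ABC in the bases: @abstractmethod is not enforced."""

    @abstractmethod
    def work_with_file_locally(self, func: Callable[[str], None]) -> None:
        pass


class StrictBase(ABC):
    """Inherits from ABC: instantiation fails until the method is overridden."""

    def __init__(self, local_path: str) -> None:
        self.local_path = local_path

    @abstractmethod
    def work_with_file_locally(self, func: Callable[[str], None]) -> None:
        ...


class LocalFile(StrictBase):
    def work_with_file_locally(self, func: Callable[[str], None]) -> None:
        func(self.local_path)  # the file is already local: no download, no cleanup


def can_instantiate(cls: type, *args: object) -> bool:
    """Return True if cls(*args) succeeds, False if it raises TypeError."""
    try:
        cls(*args)
    except TypeError:
        return False
    return True


# Usage: the callable receives the local path as its only argument.
seen_paths: List[str] = []
LocalFile("notes.txt").work_with_file_locally(seen_paths.append)
```

With an `ABC` base, accidentally constructing the abstract file class inside a collection's iterator would fail immediately with a `TypeError` instead of yielding objects whose `work_with_file_locally` does nothing.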