- [x] Analyze the `data` folder higher up the tree (./../../../data) to learn all the file extensions present there. Then create a file `EXTENSIONS.md` in the current directory with the list of extensions and the loader(s) in the chosen framework needed to load data with each extension (this can be learned online - search for the info). Prioritize libraries that work without external services requiring API keys or paid subscriptions. Important: skip streaming media extensions (audio, video); we are not going to load them now.
- [x] Install all libraries needed for the loaders mentioned in `EXTENSIONS.md`. If some libraries require API keys for external services, add them to the `.env` file (create it if it does not exist).
- [x] Install the library needed to use a Qdrant connection as the vector storage. Ensure the ports the chosen framework needs are used: REST API: 6333, gRPC API: 6334. The database is available and running on localhost.
- [x] Create a file called `vector_storage.py` containing the vector storage initialization, importable by other modules once initialized. If the chosen RAG framework requires it, add embedding model initialization in the same file. Use Ollama with the model name defined in `.env` as OLLAMA_EMBEDDING_MODEL; Ollama is available on the default local port 11434. A sketch of this file follows after this list.
- [x] Add a strategy for creating the collection for this project (name: "documents_llamaindex") if it does not exist.
- [x] Just in case, add the possibility of connecting via OpenAI embeddings using an OpenRouter API key. Comment this section out so it can be enabled in the future.
- [x] Create a file `enrichment.py` with a function that loads data from the data folder into the chosen vector storage using the data loaders configured per extension (a sketch follows after this list). Remember to attach the default embedding metadata properties, such as filename, paragraph, page, and section, wherever possible (documents can have pages, sections, paragraphs, etc.). Use the text splitters of the chosen RAG framework appropriate to the documents being loaded; which chunking/text-splitting strategies the framework offers can be learned online.
- [x] Use the framework's built-in mechanism for marking which documents have been loaded (if one exists), to avoid re-reading and re-enriching the vector storage with existing data. If there is no built-in mechanism of this type, install a sqlite library and use a local sqlite database file `document_tracking.db` to store this information. Important: mark documents as read and processed ONLY once they have been stored in the vector storage, so that documents are never skipped as already marked when they have in fact not yet been inserted and processed.
- [x] Expose this function as a command in the CLI entrypoint.
- [x] Create a file `retrieval.py` with the configuration for the chosen RAG framework that retrieves data from the vector storage based on a query (a sketch follows after this list). Use a retrieval library/plugin that supports the chosen vector storage within the chosen RAG framework. The retrieval configuration should search for the text passed as the function's query argument and return the found information together with the stored metadata, such as paragraph, section, page, etc. Important: if the chosen RAG framework does not require separating search/retrieval from the vector storage itself, this step may be skipped and marked done.
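A minimal sketch of what `vector_storage.py` could look like with LlamaIndex's Qdrant and Ollama integrations. The vector size is an assumption (it depends on whatever model OLLAMA_EMBEDDING_MODEL names), so `EMBED_DIM` must be adjusted accordingly:

```python
# vector_storage.py -- sketch: Qdrant vector store + Ollama embeddings (LlamaIndex)
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

load_dotenv()

COLLECTION_NAME = "documents_llamaindex"
EMBED_DIM = 768  # assumption: depends on the model behind OLLAMA_EMBEDDING_MODEL

# Qdrant running locally: REST on 6333, gRPC on 6334
client = QdrantClient(host="localhost", port=6333, grpc_port=6334)

# Create the project collection only if it does not exist yet
if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.COSINE),
    )

embed_model = OllamaEmbedding(
    model_name=os.environ["OLLAMA_EMBEDDING_MODEL"],
    base_url="http://localhost:11434",
)

# Alternative: OpenAI-compatible embeddings via an OpenRouter API key.
# Kept commented out so it can be enabled in the future.
# from llama_index.embeddings.openai import OpenAIEmbedding
# embed_model = OpenAIEmbedding(
#     model=os.environ["OPENAI_EMBEDDING_MODEL"],
#     api_key=os.environ["OPENAI_EMBEDDING_API_KEY"],
#     api_base=os.environ["OPENAI_EMBEDDING_BASE_URL"],
# )

vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)
```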
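A sketch of `enrichment.py` along the same lines. LlamaIndex's `IngestionPipeline` with an attached docstore is its built-in deduplication mechanism: document hashes are tracked, unchanged documents are skipped, and tracking is persisted only after the pipeline has written to the vector store. `SimpleDirectoryReader` and `SentenceSplitter` stand in here for the per-extension loaders and chunking strategies recorded in `EXTENSIONS.md`, and `DOCSTORE_PATH` is an assumed local persistence file:

```python
# enrichment.py -- sketch: load, split, embed, and upsert documents
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

from vector_storage import embed_model, vector_store

DOCSTORE_PATH = "docstore.json"  # assumption: local tracking file name


def enrich(data_dir: str = "./../../../data") -> None:
    # SimpleDirectoryReader attaches metadata such as file_name and,
    # for paginated formats, page_label to every document it loads.
    documents = SimpleDirectoryReader(
        data_dir,
        recursive=True,
        exclude=["*.mp3", "*.mp4"],  # skip streaming media
    ).load_data()

    docstore = (
        SimpleDocumentStore.from_persist_path(DOCSTORE_PATH)
        if os.path.exists(DOCSTORE_PATH)
        else SimpleDocumentStore()
    )

    # The docstore tracks document hashes, so re-running the pipeline
    # does not re-enrich the vector store with data it already holds.
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=512, chunk_overlap=64),
            embed_model,
        ],
        docstore=docstore,
        docstore_strategy=DocstoreStrategy.UPSERTS,
        vector_store=vector_store,
    )
    pipeline.run(documents=documents)

    # Persist tracking only after the vector store has been written,
    # so documents are never marked processed before insertion.
    docstore.persist(DOCSTORE_PATH)
```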
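And a sketch of `retrieval.py`: with LlamaIndex, a `VectorStoreIndex` can be rebuilt directly over the existing Qdrant store, and the retrieved nodes carry their stored metadata:

```python
# retrieval.py -- sketch: query the Qdrant-backed index, return text + metadata
from llama_index.core import VectorStoreIndex

from vector_storage import embed_model, vector_store

index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)


def retrieve(query: str, top_k: int = 5) -> list[dict]:
    retriever = index.as_retriever(similarity_top_k=top_k)
    return [
        {
            "text": node.get_content(),
            "score": node.score,
            # metadata carries file_name, page_label, section, ... when available
            "metadata": node.metadata,
        }
        for node in retriever.retrieve(query)
    ]
```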
# Phase 6 (model strategy, .env loading, and support for OpenAI models)
- [ ] Add `CHAT_STRATEGY` and `EMBEDDING_STRATEGY` fields to `.env`; possible values are "openai" or "ollama".
- [ ] Add `OPENAI_CHAT_URL`, `OPENAI_CHAT_KEY`, `OPENAI_EMBEDDING_MODEL`, `OPENAI_EMBEDDING_BASE_URL`, and `OPENAI_EMBEDDING_API_KEY` with dummy values to both `.env.dist` and `.env`.
- [ ] Load the `.env` file for its variables in every place in the code that depends on environment configuration.
- [ ] Create a reusable function that returns the model configuration. It should check `CHAT_STRATEGY` (and `EMBEDDING_STRATEGY`), load the corresponding environment variables, and return a config ready for use; a sketch follows after this list.
- [ ] Use this function everywhere in the codebase where chat or embedding model configuration is needed.
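One possible shape for that function, assuming `python-dotenv` plus LlamaIndex's `Ollama` and `OpenAILike` integrations (`OpenAILike` accepts arbitrary model names, which an OpenRouter-style endpoint needs). `OPENAI_CHAT_MODEL` is an assumed extra key not listed above:

```python
# models_config.py -- sketch: strategy-based chat/embedding model selection
import os

from dotenv import load_dotenv

load_dotenv()  # make .env variables available wherever this module is imported


def get_chat_model():
    """Return an LLM instance according to CHAT_STRATEGY ("ollama" or "openai")."""
    if os.getenv("CHAT_STRATEGY", "ollama") == "openai":
        from llama_index.llms.openai_like import OpenAILike

        return OpenAILike(
            model=os.environ["OPENAI_CHAT_MODEL"],  # assumption: extra .env key
            api_base=os.environ["OPENAI_CHAT_URL"],
            api_key=os.environ["OPENAI_CHAT_KEY"],
            is_chat_model=True,
        )
    from llama_index.llms.ollama import Ollama

    return Ollama(
        model=os.environ["OLLAMA_CHAT_MODEL"],
        base_url="http://localhost:11434",
    )


def get_embedding_model():
    """Return an embedding model according to EMBEDDING_STRATEGY."""
    if os.getenv("EMBEDDING_STRATEGY", "ollama") == "openai":
        from llama_index.embeddings.openai import OpenAIEmbedding

        return OpenAIEmbedding(
            model=os.environ["OPENAI_EMBEDDING_MODEL"],
            api_base=os.environ["OPENAI_EMBEDDING_BASE_URL"],
            api_key=os.environ["OPENAI_EMBEDDING_API_KEY"],
        )
    from llama_index.embeddings.ollama import OllamaEmbedding

    return OllamaEmbedding(
        model_name=os.environ["OLLAMA_EMBEDDING_MODEL"],
        base_url="http://localhost:11434",
    )
```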
# Phase 7 (chat feature, as an agent, for use in the CLI)
- [ ] Create a file `agent.py`, which will contain an agent powered by the chat model. It should use the Ollama integration with the model specified in `.env` under OLLAMA_CHAT_MODEL.
- [ ] Integrate this agent with the existing retrieval solution in `retrieval.py`.
- [ ] Integrate this agent with the CLI as a command that starts a chat with the agent. If the framework has a built-in solution for console communication with the agent, launch it from the CLI command; see the sketch below.
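A sketch of `agent.py` under those constraints, assuming the classic `ReActAgent` API (newer LlamaIndex releases also ship a workflow-based agent). The tool name and description are illustrative, and `chat_repl()` is the framework's built-in console loop:

```python
# agent.py -- sketch: ReAct agent over the retrieval layer, with a console REPL
from llama_index.core import VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

from models_config import get_chat_model, get_embedding_model
from vector_storage import vector_store

llm = get_chat_model()
index = VectorStoreIndex.from_vector_store(
    vector_store, embed_model=get_embedding_model()
)

# Expose retrieval to the agent as a tool (name/description are illustrative)
search_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(llm=llm),
    name="search_documents",
    description="Searches the ingested documents and answers with stored metadata.",
)

agent = ReActAgent.from_tools([search_tool], llm=llm, verbose=True)


def chat() -> None:
    """CLI entrypoint: start the agent's built-in console chat loop."""
    agent.chat_repl()
```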