AI-Trend-Scout/docs/ADR_001_Hybrid_Search_Architecture.md
Artur Mukhamadiev 65fccbc614 feat(storage): implement hybrid search and fix async chroma i/o
- Add ADR 001 for Hybrid Search Architecture
- Implement Phase 1 (Exact Match) and Phase 2 (Semantic Fallback) in ChromaStore
- Wrap blocking ChromaDB calls in asyncio.to_thread
- Update IVectorStore interface to support category filtering and thresholds
- Add comprehensive tests for hybrid search logic
2026-03-16 00:11:07 +03:00


ADR 001: Architecture Design for Enhanced Semantic & Hybrid Search

1. Context and Problem Statement

The "Trend-Scout AI" bot currently uses a basic synchronous ChromaDB implementation to serve both categorical retrieval (/latest) and free-text queries (/search). Three major issues have severely impacted the user experience:

  1. Incorrect Categories in /latest: The system performs a dense vector search using the requested category name (e.g., "AI") rather than a deterministic exact match. This returns semantically related news regardless of their actual assigned category, yielding false positives.
  2. Poor Semantic Matches in /search:
    • The default English-centric embedding model (e.g., all-MiniLM-L6-v2) handles Russian summaries and specialized technical acronyms poorly.
    • Pure vector search ignores exact keyword matches, leading to frustrated user expectations when searching for specific entities (e.g., "OpenAI o1" or specific version numbers).
  3. Blocking I/O operations: The ChromaStore executes blocking synchronous operations within async def wrappers, potentially starving the asyncio event loop and violating asynchronous data flow requirements.

2. Decision Drivers

  • Accuracy & Relevance: Strict categorization and high recall for exact keywords + conceptual similarity.
  • Multilingual Support: Strong performance on both English source texts and Russian summaries.
  • Performance & Concurrency: Fully non-blocking (async) operations.
  • Adherence to SOLID: Maintain strict interface boundaries, dependency inversion, and existing Domain Transfer Objects (DTOs).
  • Alignment with Agent Architecture: Ensure the Vector Storage Agent focuses strictly on storage/retrieval coordination without leaking AI processing duties.

3. Proposed Architecture

3.1. Asynchronous Data Flow (I/O)

  • Decision: Migrate the local ChromaDB calls to run in a thread pool executor. Alternatively, if ChromaDB is hosted as a standalone server, utilize chromadb.AsyncHttpClient.
  • Implementation: Encapsulate blocking calls like self.collection.upsert() and self.collection.query() inside asyncio.to_thread() to prevent blocking the Telegram bot's main event loop.
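A minimal sketch of this pattern (the `BlockingCollection` stand-in and method names are illustrative, not ChromaDB's actual client API):

```python
import asyncio

class BlockingCollection:
    """Illustrative stand-in for a chromadb collection with blocking methods."""
    def __init__(self):
        self._docs = {}

    def upsert(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc

    def get(self, ids):
        return {"ids": ids, "documents": [self._docs[i] for i in ids]}

class ChromaStore:
    def __init__(self, collection):
        self._collection = collection

    async def store(self, doc_id, text):
        # Run the blocking call in the default thread pool so the
        # asyncio event loop stays free to serve other bot handlers.
        await asyncio.to_thread(
            self._collection.upsert, ids=[doc_id], documents=[text]
        )

    async def fetch(self, doc_id):
        return await asyncio.to_thread(self._collection.get, ids=[doc_id])

async def demo():
    store = ChromaStore(BlockingCollection())
    await store.store("n1", "OpenAI o1 announced")
    result = await store.fetch("n1")
    print(result["documents"][0])  # prints "OpenAI o1 announced"

if __name__ == "__main__":
    asyncio.run(demo())
```

`asyncio.to_thread` accepts keyword arguments and forwards them to the wrapped callable, so existing call sites translate mechanically.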

3.2. Interface Segregation (ISP) for Storage

The current IVectorStore interface conflates generic vector searching, exact categorical retrieval, and database administration.

  • Action: Segregate the interfaces to adhere to ISP.
  • Refactored Interfaces:
    from abc import ABC, abstractmethod
    from typing import List, Optional
    
    # EnrichedNewsItemDTO is imported from the project's existing DTO module.
    
    class IStoreCommand(ABC):
        @abstractmethod
        async def store(self, item: EnrichedNewsItemDTO) -> None: ...
    
    class IStoreQuery(ABC):
        @abstractmethod
        async def search_hybrid(self, query: str, limit: int = 5) -> List[EnrichedNewsItemDTO]: ...
    
        @abstractmethod
        async def get_latest_by_category(self, category: Optional[str], limit: int = 10) -> List[EnrichedNewsItemDTO]: ...
    
        @abstractmethod
        async def get_top_ranked(self, limit: int = 10) -> List[EnrichedNewsItemDTO]: ...
    

3.3. Strict Metadata Filtering for /latest

  • Mechanism: The /latest command must completely bypass vector similarity search. Instead, it will use ChromaDB's .get() method coupled with a strict where metadata filter: where={"category": {"$eq": category}}.
  • Sorting Architecture: Because ChromaDB does not natively support sorting results by a metadata field (like timestamp), the get_latest_by_category method will over-fetch (e.g., fetch up to 100 recent items using the metadata filter) and perform a fast, deterministic in-memory sort by timestamp descending before slicing to the requested limit.
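The over-fetch-then-sort step can be sketched as follows, operating on plain dicts in place of real query results (the `latest_by_category` helper is hypothetical; field names follow this ADR):

```python
from typing import Any, Dict, List, Optional

def latest_by_category(
    records: List[Dict[str, Any]],
    category: Optional[str],
    limit: int = 10,
    overfetch: int = 100,
) -> List[Dict[str, Any]]:
    # 1. Strict metadata filter: exact equality, no vector similarity.
    #    (With ChromaDB this maps to collection.get(where={"category": {"$eq": category}}).)
    matched = [r for r in records if category is None or r["category"] == category]
    # 2. Over-fetch a bounded window, then sort deterministically in memory.
    window = matched[:overfetch]
    window.sort(key=lambda r: r["timestamp"], reverse=True)
    return window[:limit]

items = [
    {"id": "a", "category": "AI", "timestamp": 100},
    {"id": "b", "category": "Robotics", "timestamp": 300},
    {"id": "c", "category": "AI", "timestamp": 200},
]
print([r["id"] for r in latest_by_category(items, "AI", limit=2)])  # → ['c', 'a']
```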

3.4. Hybrid Search Architecture (Keyword + Vector)

  • Mechanism: Implement a Hybrid Search Strategy utilizing Reciprocal Rank Fusion (RRF).
  • Sparse Retrieval (Keyword): Integrate a lightweight keyword index alongside ChromaDB. Given the bot's scale, SQLite FTS5 (Full-Text Search) is the optimal choice. It provides persistent, fast token matching without the overhead of Elasticsearch.
  • Dense Retrieval (Vector): ChromaDB semantic search.
  • Fusion Strategy:
    1. The new HybridSearchStrategy issues queries to both the SQLite FTS index and ChromaDB concurrently using asyncio.gather.
    2. The results are normalized using the RRF formula: Score = 1 / (k + rank_sparse) + 1 / (k + rank_dense) (where k is typically 60).
    3. The combined list of DTOs is sorted by the fused score and returned.
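The fusion step above can be sketched as a pure function over the two ranked ID lists (`rrf_fuse` is a hypothetical helper, not existing project code):

```python
from typing import Dict, List

def rrf_fuse(sparse: List[str], dense: List[str], k: int = 60) -> List[str]:
    """Fuse two ranked ID lists with Reciprocal Rank Fusion.

    Score(d) = sum over rankings of 1 / (k + rank), with rank starting at 1.
    Documents appearing in both rankings accumulate both contributions.
    """
    scores: Dict[str, float] = {}
    for ranking in (sparse, dense):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" and "c" appear in both rankings, so they outrank the single-list hits.
print(rrf_fuse(["a", "b", "c"], ["b", "c", "d"])[:2])  # → ['b', 'c']
```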

3.5. Embedding Model Evaluation & Upgrade

  • Decision: Replace the default ChromaDB embedding function with a dedicated, explicitly configured multilingual model.
  • Recommendation: Utilize intfloat/multilingual-e5-small (for lightweight CPU environments) or sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. Both provide excellent English-Russian cross-lingual semantic alignment.
  • Integration (DIP): Apply the Dependency Inversion Principle by injecting the embedding function (or an IEmbeddingProvider interface) into the ChromaStore constructor. This allows for seamless A/B testing of embedding models without touching the core storage logic.
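A minimal sketch of that injection seam (the `IEmbeddingProvider` name follows this ADR; `HashEmbedding` is a toy stand-in, not a real model):

```python
from abc import ABC, abstractmethod
from typing import List

class IEmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, texts: List[str]) -> List[List[float]]: ...

class HashEmbedding(IEmbeddingProvider):
    """Toy deterministic stand-in for multilingual-e5-small; not semantic."""
    def embed(self, texts):
        return [[float(ord(c) % 7) for c in t[:4]] for t in texts]

class ChromaStore:
    # The provider is injected, so swapping e5-small for
    # paraphrase-multilingual-MiniLM requires no storage-layer changes.
    def __init__(self, embedder: IEmbeddingProvider):
        self._embedder = embedder

    def vectors_for(self, texts: List[str]) -> List[List[float]]:
        return self._embedder.embed(texts)

store = ChromaStore(HashEmbedding())
print(len(store.vectors_for(["test"])[0]))  # one 4-dim toy vector per text
```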

4. Application to the Agent Architecture

  • Vector Storage Agent (Database): This agent's responsibility shifts from "pure vector storage" to "Hybrid Storage Management." It coordinates the ChromaStore (Dense) and SQLiteStore (Sparse) implementations.
  • AI Processor Agent: To maintain Single Responsibility (SRP), embedding generation can be shifted from the storage layer to the AI Processor Agent. The AI Processor generates the vector using an Ollama hosted embedding model and attaches it directly to the EnrichedNewsItemDTO. The Storage Agent simply stores the pre-calculated vector, drastically reducing the dependency weight of the storage module.
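The hand-off can be sketched with a simplified, hypothetical version of the DTO (the real EnrichedNewsItemDTO carries more fields):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class EnrichedNewsItemDTO:
    item_id: str
    summary: str
    category: str
    embedding: Optional[List[float]] = None  # filled by the AI Processor Agent

def attach_embedding(
    item: EnrichedNewsItemDTO,
    embed_fn: Callable[[str], List[float]],
) -> EnrichedNewsItemDTO:
    # The AI Processor computes the vector once; the Storage Agent
    # later persists it without ever loading an embedding model.
    item.embedding = embed_fn(item.summary)
    return item
```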

5. Next Steps for Implementation

  1. Add sqlite3 FTS5 table initialization to the project scaffolding.
  2. Refactor src/storage/base.py to segregate IStoreQuery and IStoreCommand.
  3. Update ChromaStore to accept pre-calculated embeddings and utilize asyncio.to_thread.
  4. Implement the RRF sorting algorithm in a new search_hybrid pipeline.
  5. Update src/bot/handlers.py to route /latest through get_latest_by_category.
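Step 1 might look like the following (table and column names are illustrative, not the project's actual schema; this requires an SQLite build with FTS5 enabled, which standard CPython distributions include):

```python
import sqlite3

def init_fts(conn: sqlite3.Connection) -> None:
    # UNINDEXED keeps the ID out of the full-text index while still
    # storing it alongside each row.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS news_fts "
        "USING fts5(item_id UNINDEXED, title, summary)"
    )

def keyword_search(conn: sqlite3.Connection, query: str, limit: int = 5):
    # FTS5 exposes a built-in relevance ranking via ORDER BY rank.
    cur = conn.execute(
        "SELECT item_id FROM news_fts WHERE news_fts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
init_fts(conn)
conn.execute("INSERT INTO news_fts VALUES (?, ?, ?)",
             ("n1", "OpenAI o1 release", "OpenAI ships the o1 model"))
conn.execute("INSERT INTO news_fts VALUES (?, ?, ?)",
             ("n2", "Quantum update", "New qubit record"))
print(keyword_search(conn, "OpenAI"))  # → ['n1']
```

The returned `item_id` list is exactly the sparse ranking that the RRF fusion step consumes.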