# Trend-Scout AI: Database Schema Migration Plan

**Document Version:** 1.0
**Date:** 2026-03-30
**Status:** Draft for Review

---

## 1. Executive Summary

This document outlines a comprehensive migration plan to address critical performance and data-architecture issues in the Trend-Scout AI vector storage layer. The migration transforms a flat, unnormalized ChromaDB collection into a normalized, multi-collection architecture with proper indexing, pagination, and query optimization.

### Current Pain Points

| Issue | Current Behavior | Impact |
|-------|------------------|--------|
| `get_latest` | Fetches ALL items via `collection.get()`, sorts in Python | O(n) memory + sort |
| `get_top_ranked` | Fetches ALL items via `collection.get()`, sorts in Python | O(n) memory + sort |
| `get_stats` | Iterates over ALL items to count categories | O(n) iteration |
| `anomalies_detected` | Stored as comma-joined string `"A1,A2,A3"` | Type corruption, no index |
| Single collection | No normalization, mixed data types | No efficient filtering |
| No pagination | No offset-based pagination support | Cannot handle large datasets |

### Expected Outcomes

| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| `get_latest` (1000 items) | ~500ms, O(n) | ~10ms, O(log n) | **50x faster** |
| `get_top_ranked` (1000 items) | ~500ms, O(n log n) | ~10ms, O(k log k) | **50x faster** |
| `get_stats` (1000 items) | ~200ms | ~1ms | **200x faster** |
| Memory usage | O(n) full scan | O(1) or O(k) | **~99% reduction** |

---

## 2. Target Architecture

### 2.1 Multi-Collection Design

```
ChromaDB Instance
├── news_items (main collection)
│     id (uuid5 from url), content_text (vec), title, url, source,
│     timestamp, relevance_score, summary_ru, category_id (FK), anomalies[]
├── category_index (lookup table, referenced by news_items.category_id)
│     category_id, name, item_count, created_at, updated_at
├── anomaly_types (normalized)
│     anomaly_id, name, description
├── news_anomalies (junction table: news_items <-> anomaly_types)
│     news_id (FK), anomaly_id (FK), detected_at
└── stats_cache (materialized)
      key, total_count, category_counts (JSON), source_counts (JSON),
      last_updated
```

### 2.2 SQLite Shadow Database (for FTS and relational queries)

```
SQLite Database
├── news_items_rel (normalized relational table)
│     id (uuid, PK), title, url (UNIQUE), source, timestamp,
│     relevance_score, summary_ru, category_id (FK), content_text
├── categories
│     id (PK), name, item_count, embedding_id (FK)
├── anomalies
│     id (PK), name, description
├── news_anomalies_rel (junction table)
│     news_id (FK), anomaly_id (FK), detected_at
├── stats_snapshot
│     total_count, category_json, last_updated
├── news_fts (FTS5 virtual table)
│     news_id (FK), title_tokens, content_tokens, summary_tokens
└── crawl_history
      id (PK), crawler_name, items_fetched, items_new,
      started_at, completed_at
```

### 2.3 New DTOs

```python
# src/processor/dto.py (evolved)
from typing import Dict, List, Optional
from datetime import datetime
from enum import Enum

from pydantic import BaseModel, Field


class AnomalyType(str, Enum):
    """Normalized anomaly types - not a comma-joined string."""
    WEBGPU = "WebGPU"
    NPU_ACCELERATION = "NPU acceleration"
    EDGE_AI = "Edge AI"
    # ... extensible


class EnrichedNewsItemDTO(BaseModel):
    """Extended DTO with proper normalized relationships."""
    title: str
    url: str
    content_text: str
    source: str
    timestamp: datetime
    relevance_score: int = Field(ge=0, le=10)
    summary_ru: str
    category: str  # Category name, resolved from category_id
    anomalies_detected: List[AnomalyType]  # Now a proper list
    anomaly_ids: Optional[List[str]] = None  # Internal use


class CategoryStats(BaseModel):
    """Statistics per category."""
    category: str
    count: int
    avg_relevance: float


class StorageStats(BaseModel):
    """Complete storage statistics."""
    total_count: int
    categories: List[CategoryStats]
    sources: Dict[str, int]
    anomaly_types: Dict[str, int]
    last_updated: datetime
```

### 2.4 Evolved Interface Design

```python
# src/storage/base.py
from abc import ABC, abstractmethod
from typing import Any, AsyncIterator, Dict, List, Optional
from datetime import datetime

from src.processor.dto import CategoryStats, EnrichedNewsItemDTO, StorageStats


class IStoreCommand(ABC):
    """Write operations - SRP: only handles writes."""

    @abstractmethod
    async def store(self, item: EnrichedNewsItemDTO) -> str:
        """Store an item. Returns the generated ID."""
        pass

    @abstractmethod
    async def store_batch(self, items: List[EnrichedNewsItemDTO]) -> List[str]:
        """Batch store items. Returns generated IDs."""
        pass

    @abstractmethod
    async def update(self, item_id: str, item: EnrichedNewsItemDTO) -> None:
        """Update an existing item."""
        pass

    @abstractmethod
    async def delete(self, item_id: str) -> None:
        """Delete an item by ID."""
        pass


class IStoreQuery(ABC):
    """Read operations - SRP: only handles queries."""

    @abstractmethod
    async def get_by_id(self, item_id: str) -> Optional[EnrichedNewsItemDTO]:
        pass

    @abstractmethod
    async def exists(self, url: str) -> bool:
        pass

    @abstractmethod
    async def get_latest(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[EnrichedNewsItemDTO]:
        """Paginated retrieval by timestamp. Uses index, not full scan."""
        pass

    @abstractmethod
    async def get_top_ranked(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[EnrichedNewsItemDTO]:
        """Paginated retrieval by relevance_score. Uses index, not full scan."""
        pass

    @abstractmethod
    async def get_stats(self, use_cache: bool = True) -> StorageStats:
        """Returns cached stats or computes on-demand."""
        pass

    @abstractmethod
    async def search_hybrid(
        self,
        query: str,
        limit: int = 5,
        category: Optional[str] = None,
        threshold: Optional[float] = None
    ) -> List[EnrichedNewsItemDTO]:
        """Keyword + semantic hybrid search with RRF fusion."""
        pass

    @abstractmethod
    async def search_stream(
        self,
        query: str,
        limit: int = 5,
        category: Optional[str] = None
    ) -> AsyncIterator[EnrichedNewsItemDTO]:
        """Streaming search for large result sets."""
        pass


class IVectorStore(IStoreCommand, IStoreQuery):
    """Combined interface preserving backward compatibility."""
    pass


class IAdminOperations(ABC):
    """Administrative operations - separate from core CRUD."""

    @abstractmethod
    async def rebuild_stats_cache(self) -> StorageStats:
        """Force rebuild of statistics cache."""
        pass

    @abstractmethod
    async def vacuum(self) -> None:
        """Optimize storage."""
        pass

    @abstractmethod
    async def get_health(self) -> Dict[str, Any]:
        """Health check for monitoring."""
        pass
```

---

## 3. Migration Strategy

### 3.1 Phase 0: Preparation (Week 1)

**Objective:** Establish infrastructure for zero-downtime migration.
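One property the dual-write machinery below leans on is that document IDs are derived deterministically via `uuid.uuid5` over the article URL (as in the `news_items` schema), so writing the same item to two stores, or retrying a failed write, converges on the same ID. A minimal sketch — `doc_id_for` is an illustrative helper, not part of the codebase:

```python
import uuid

# Deterministic document IDs, as in the news_items schema: the same URL
# always hashes to the same uuid5, so repeated upserts and dual writes
# are idempotent across both storage backends.
def doc_id_for(url: str) -> str:
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))

assert doc_id_for("https://example.com/a") == doc_id_for("https://example.com/a")
assert doc_id_for("https://example.com/a") != doc_id_for("https://example.com/b")
```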
#### 3.1.1 Create Feature Flags System

```python
# src/config/feature_flags.py
from enum import Flag, auto


class FeatureFlags(Flag):
    """Feature flags for phased migration."""
    NORMALIZED_STORAGE = auto()
    SQLITE_SHADOW_DB = auto()
    FTS_ENABLED = auto()
    STATS_CACHE = auto()
    PAGINATION = auto()
    HYBRID_SEARCH = auto()


# Global config
FEATURE_FLAGS = FeatureFlags.NORMALIZED_STORAGE | FeatureFlags.STATS_CACHE


def is_enabled(flag: FeatureFlags) -> bool:
    return flag in FEATURE_FLAGS
```

#### 3.1.2 Implement Shadow Write (Dual-Write Pattern)

```python
# src/storage/dual_writer.py
class DualWriter:
    """
    Writes to both legacy and new storage simultaneously.
    Enables gradual migration with no data loss.
    """

    def __init__(
        self,
        legacy_store: ChromaStore,
        normalized_store: 'NormalizedChromaStore',
        sqlite_store: 'SQLiteStore'
    ):
        self.legacy = legacy_store
        self.normalized = normalized_store
        self.sqlite = sqlite_store

    async def store(self, item: EnrichedNewsItemDTO) -> str:
        # Write to both systems
        legacy_id = await self.legacy.store(item)
        normalized_id = await self.normalized.store(item)
        sqlite_id = await self.sqlite.store_relational(item)
        # Return normalized ID as source of truth
        return normalized_id
```

#### 3.1.3 Create Migration Test Suite

```python
# tests/migrations/test_normalized_storage.py
import pytest
from datetime import datetime, timezone

from src.processor.dto import EnrichedNewsItemDTO
from src.storage.normalized_chroma_store import NormalizedChromaStore


@pytest.mark.asyncio
async def test_migration_preserves_data_integrity():
    """
    Test that normalized storage preserves all data from legacy format.
    Round-trip: Legacy DTO -> Normalized Storage -> Legacy DTO
    """
    original = EnrichedNewsItemDTO(
        title="Test",
        url="https://example.com/test",
        content_text="Content",
        source="TestSource",
        timestamp=datetime.now(timezone.utc),
        relevance_score=8,
        summary_ru="Резюме",
        anomalies_detected=["WebGPU", "Edge AI"],
        category="Tech"
    )

    store = NormalizedChromaStore(...)
    item_id = await store.store(original)
    retrieved = await store.get_by_id(item_id)

    assert retrieved.title == original.title
    assert retrieved.url == original.url
    assert retrieved.relevance_score == original.relevance_score
    assert retrieved.category == original.category
    assert set(retrieved.anomalies_detected) == set(original.anomalies_detected)


@pytest.mark.asyncio
async def test_anomaly_normalization():
    """Test that anomalies are properly normalized to enum values."""
    dto = EnrichedNewsItemDTO(
        ...,
        anomalies_detected=["webgpu", "NPU acceleration", "Edge AI"]
    )
    store = NormalizedChromaStore(...)
    item_id = await store.store(dto)
    retrieved = await store.get_by_id(item_id)

    # Should be normalized to canonical forms
    assert AnomalyType.WEBGPU in retrieved.anomalies_detected
    assert AnomalyType.NPU_ACCELERATION in retrieved.anomalies_detected
    assert AnomalyType.EDGE_AI in retrieved.anomalies_detected
```

### 3.2 Phase 1: Normalized Storage (Week 2)

**Objective:** Introduce normalized anomaly storage without breaking existing queries.

#### 3.2.1 New Normalized ChromaStore

```python
# src/storage/normalized_chroma_store.py
import asyncio
import uuid
from typing import List

from src.processor.anomaly_types import AnomalyType
from src.processor.dto import EnrichedNewsItemDTO
from src.storage.base import IVectorStore


class NormalizedChromaStore(IVectorStore):
    """
    ChromaDB storage with normalized anomaly types.
    Maintains backward compatibility via IVectorStore interface.
    """

    # Collections
    NEWS_COLLECTION = "news_items"
    ANOMALY_COLLECTION = "anomaly_types"
    JUNCTION_COLLECTION = "news_anomalies"

    async def store(self, item: EnrichedNewsItemDTO) -> str:
        """Store with normalized anomaly handling."""
        anomaly_ids = await self._ensure_anomalies(item.anomalies_detected)

        # Store main item
        doc_id = str(uuid.uuid5(uuid.NAMESPACE_URL, item.url))
        metadata = {
            "title": item.title,
            "url": item.url,
            "source": item.source,
            "timestamp": item.timestamp.isoformat(),
            "relevance_score": item.relevance_score,
            "summary_ru": item.summary_ru,
            "category": item.category,
            # NO MORE COMMA-JOINED STRING
        }

        # Store junction records for anomalies
        for anomaly_id in anomaly_ids:
            await self._store_junction(doc_id, anomaly_id)

        # Vector storage (without anomalies in metadata)
        await asyncio.to_thread(
            self.collection.upsert,
            ids=[doc_id],
            documents=[item.content_text],
            metadatas=[metadata]
        )
        return doc_id

    async def _ensure_anomalies(self, anomalies: List[str]) -> List[str]:
        """Ensure anomaly types exist, return IDs."""
        anomaly_ids = []
        for anomaly in anomalies:
            # Normalize to canonical form
            normalized = AnomalyType.from_string(anomaly)
            anomaly_id = str(uuid.uuid5(uuid.NAMESPACE_URL, normalized.value))

            # Upsert anomaly type
            await asyncio.to_thread(
                self.anomaly_collection.upsert,
                ids=[anomaly_id],
                documents=[normalized.value],
                metadatas=[{
                    "name": normalized.value,
                    "description": normalized.description
                }]
            )
            anomaly_ids.append(anomaly_id)
        return anomaly_ids
```

#### 3.2.2 AnomalyType Enum

```python
# src/processor/anomaly_types.py
from enum import Enum


class AnomalyType(str, Enum):
    """Canonical anomaly types with metadata."""
    WEBGPU = "WebGPU"
    NPU_ACCELERATION = "NPU acceleration"
    EDGE_AI = "Edge AI"
    QUANTUM_COMPUTING = "Quantum computing"
    NEUROMORPHIC = "Neuromorphic computing"
    SPATIAL_COMPUTING = "Spatial computing"
    UNKNOWN = "Unknown"

    @classmethod
    def from_string(cls, value: str) -> "AnomalyType":
        """Fuzzy matching from raw string."""
        normalized = value.strip().lower()
        for member in cls:
            if member.value.lower() == normalized:
                return member
        # Try partial match
        for member in cls:
            if normalized in member.value.lower():
                return member
        return cls.UNKNOWN

    @property
    def description(self) -> str:
        """Human-readable description."""
        descriptions = {
            "WebGPU": "WebGPU graphics API or GPU compute",
            "NPU acceleration": "Neural Processing Unit hardware",
            "Edge AI": "Edge computing with AI",
            # ...
        }
        return descriptions.get(self.value, "")
```

### 3.3 Phase 2: Indexed Queries (Week 3)

**Objective:** Eliminate full-collection scans for `get_latest`, `get_top_ranked`, `get_stats`.

#### 3.3.1 SQLite Shadow Database

```python
# src/storage/sqlite_store.py
import json
import sqlite3
from contextlib import contextmanager
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional


class SQLiteStore:
    """
    SQLite shadow database for relational queries and FTS.
    Maintains sync with ChromaDB news_items collection.
    """

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self._init_schema()

    def _init_schema(self):
        """Initialize SQLite schema with proper indexes."""
        with self._get_connection() as conn:
            conn.executescript("""
                -- Main relational table
                CREATE TABLE IF NOT EXISTS news_items (
                    id TEXT PRIMARY KEY,
                    title TEXT NOT NULL,
                    url TEXT UNIQUE NOT NULL,
                    source TEXT NOT NULL,
                    timestamp INTEGER NOT NULL,  -- Unix epoch
                    relevance_score INTEGER NOT NULL,
                    summary_ru TEXT,
                    category TEXT NOT NULL,
                    content_text TEXT,
                    created_at INTEGER DEFAULT (unixepoch())
                );

                -- Indexes for fast queries
                CREATE INDEX IF NOT EXISTS idx_news_timestamp
                    ON news_items(timestamp DESC);
                CREATE INDEX IF NOT EXISTS idx_news_relevance
                    ON news_items(relevance_score DESC);
                CREATE INDEX IF NOT EXISTS idx_news_category
                    ON news_items(category);
                CREATE INDEX IF NOT EXISTS idx_news_source
                    ON news_items(source);

                -- FTS5 virtual table for full-text search
                CREATE VIRTUAL TABLE IF NOT EXISTS news_fts USING fts5(
                    id,
                    title,
                    content_text,
                    summary_ru,
                    content='news_items',
                    content_rowid='rowid'
                );

                -- Triggers to keep FTS in sync
                CREATE TRIGGER IF NOT EXISTS news_fts_insert
                AFTER INSERT ON news_items BEGIN
                    INSERT INTO news_fts(rowid, id, title, content_text, summary_ru)
                    VALUES (NEW.rowid, NEW.id, NEW.title, NEW.content_text, NEW.summary_ru);
                END;

                CREATE TRIGGER IF NOT EXISTS news_fts_delete
                AFTER DELETE ON news_items BEGIN
                    INSERT INTO news_fts(news_fts, rowid, id, title, content_text, summary_ru)
                    VALUES ('delete', OLD.rowid, OLD.id, OLD.title, OLD.content_text, OLD.summary_ru);
                END;

                CREATE TRIGGER IF NOT EXISTS news_fts_update
                AFTER UPDATE ON news_items BEGIN
                    INSERT INTO news_fts(news_fts, rowid, id, title, content_text, summary_ru)
                    VALUES ('delete', OLD.rowid, OLD.id, OLD.title, OLD.content_text, OLD.summary_ru);
                    INSERT INTO news_fts(rowid, id, title, content_text, summary_ru)
                    VALUES (NEW.rowid, NEW.id, NEW.title, NEW.content_text, NEW.summary_ru);
                END;

                -- Stats cache table
                CREATE TABLE IF NOT EXISTS stats_cache (
                    key TEXT PRIMARY KEY DEFAULT 'main',
                    total_count INTEGER DEFAULT 0,
                    category_counts TEXT DEFAULT '{}',  -- JSON
                    source_counts TEXT DEFAULT '{}',    -- JSON
                    anomaly_counts TEXT DEFAULT '{}',   -- JSON
                    last_updated INTEGER
                );

                -- Anomaly types table
                CREATE TABLE IF NOT EXISTS anomaly_types (
                    id TEXT PRIMARY KEY,
                    name TEXT UNIQUE NOT NULL,
                    description TEXT
                );

                -- Junction table for news-anomaly relationships
                CREATE TABLE IF NOT EXISTS news_anomalies (
                    news_id TEXT NOT NULL,
                    anomaly_id TEXT NOT NULL,
                    detected_at INTEGER DEFAULT (unixepoch()),
                    PRIMARY KEY (news_id, anomaly_id),
                    FOREIGN KEY (news_id) REFERENCES news_items(id),
                    FOREIGN KEY (anomaly_id) REFERENCES anomaly_types(id)
                );
                CREATE INDEX IF NOT EXISTS idx_news_anomalies_news
                    ON news_anomalies(news_id);
                CREATE INDEX IF NOT EXISTS idx_news_anomalies_anomaly
                    ON news_anomalies(anomaly_id);

                -- Crawl history for monitoring
                CREATE TABLE IF NOT EXISTS crawl_history (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    crawler_name TEXT NOT NULL,
                    items_fetched INTEGER DEFAULT 0,
                    items_new INTEGER DEFAULT 0,
                    started_at INTEGER,
                    completed_at INTEGER,
                    error_message TEXT
                );
            """)

    @contextmanager
    def _get_connection(self):
        """Context manager for SQLite connections.

        sqlite3 is synchronous, so this is a plain contextmanager;
        the async query methods below run their work inline and can be
        offloaded via asyncio.to_thread by callers if needed.
        """
        conn = sqlite3.connect(self.db_path)
        conn.row_factory = sqlite3.Row
        try:
            yield conn
        finally:
            conn.close()

    # --- Fast Indexed Queries ---

    async def get_latest_indexed(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[Dict[str, Any]]:
        """Get latest items using SQLite index - O(log n + k)."""
        with self._get_connection() as conn:
            if category:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    WHERE category = ?
                    ORDER BY timestamp DESC
                    LIMIT ? OFFSET ?
                """, (category, limit, offset))
            else:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    ORDER BY timestamp DESC
                    LIMIT ? OFFSET ?
                """, (limit, offset))
            return [dict(row) for row in cursor.fetchall()]

    async def get_top_ranked_indexed(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[Dict[str, Any]]:
        """Get top ranked items using SQLite index - O(log n + k)."""
        with self._get_connection() as conn:
            if category:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    WHERE category = ?
                    ORDER BY relevance_score DESC, timestamp DESC
                    LIMIT ? OFFSET ?
                """, (category, limit, offset))
            else:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    ORDER BY relevance_score DESC, timestamp DESC
                    LIMIT ? OFFSET ?
                """, (limit, offset))
            return [dict(row) for row in cursor.fetchall()]

    async def get_stats_fast(self) -> Dict[str, Any]:
        """
        Get stats from cache or compute incrementally.
        O(1) if cached; otherwise falls back to indexed SQL aggregation.
        """
        with self._get_connection() as conn:
            # Try cache first
            cursor = conn.execute("SELECT * FROM stats_cache WHERE key = 'main'")
            row = cursor.fetchone()
            if row:
                return {
                    "total_count": row["total_count"],
                    "category_counts": json.loads(row["category_counts"]),
                    "source_counts": json.loads(row["source_counts"]),
                    "anomaly_counts": json.loads(row["anomaly_counts"]),
                    "last_updated": datetime.fromtimestamp(row["last_updated"])
                }

            # Fall back to indexed aggregation (still O(n) but faster than Python)
            cursor = conn.execute("""
                SELECT category, COUNT(*) AS cat_count
                FROM news_items
                GROUP BY category
            """)
            category_counts = {row["category"]: row["cat_count"] for row in cursor.fetchall()}

            cursor = conn.execute("""
                SELECT source, COUNT(*) AS count
                FROM news_items
                GROUP BY source
            """)
            source_counts = {row["source"]: row["count"] for row in cursor.fetchall()}

            cursor = conn.execute("""
                SELECT a.name, COUNT(*) AS count
                FROM news_anomalies na
                JOIN anomaly_types a ON na.anomaly_id = a.id
                GROUP BY a.name
            """)
            anomaly_counts = {row["name"]: row["count"] for row in cursor.fetchall()}

            return {
                "total_count": sum(category_counts.values()),
                "category_counts": category_counts,
                "source_counts": source_counts,
                "anomaly_counts": anomaly_counts,
                "last_updated": datetime.now()
            }

    async def search_fts(
        self,
        query: str,
        limit: int = 10,
        category: Optional[str] = None
    ) -> List[str]:
        """Full-text search using SQLite FTS5."""
        with self._get_connection() as conn:
            # Escape FTS5 special characters
            fts_query = query.replace('"', '""')
            if category:
                cursor = conn.execute("""
                    SELECT n.id FROM news_items n
                    JOIN news_fts ON n.id = news_fts.id
                    WHERE news_fts MATCH ? AND n.category = ?
                    LIMIT ?
                """, (f'"{fts_query}"', category, limit))
            else:
                cursor = conn.execute("""
                    SELECT id FROM news_fts
                    WHERE news_fts MATCH ?
                    LIMIT ?
                """, (f'"{fts_query}"', limit))
            return [row["id"] for row in cursor.fetchall()]
```

#### 3.3.2 Update ChromaStore to Use SQLite Index

```python
# src/storage/chroma_store.py (updated)
class ChromaStore(IVectorStore):
    """
    Updated ChromaStore that delegates indexed queries to SQLite.
    Maintains backward compatibility.
    """

    def __init__(
        self,
        client: ClientAPI,
        collection_name: str = "news_collection",
        sqlite_store: Optional[SQLiteStore] = None
    ):
        self.client = client
        self.collection = self.client.get_or_create_collection(name=collection_name)
        self.sqlite = sqlite_store  # New: SQLite shadow for indexed queries

    async def get_latest(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0  # NEW: Pagination support
    ) -> List[EnrichedNewsItemDTO]:
        """Get latest using SQLite index - O(log n + k)."""
        if self.sqlite and offset == 0:
            # Only use fast path for simple queries
            rows = await self.sqlite.get_latest_indexed(limit, category)
            # Reconstruct DTOs from SQLite rows
            return [self._row_to_dto(row) for row in rows]
        # Fallback to legacy behavior with pagination
        return await self._get_latest_legacy(limit, category, offset)

    async def _get_latest_legacy(
        self,
        limit: int,
        category: Optional[str],
        offset: int
    ) -> List[EnrichedNewsItemDTO]:
        """Legacy implementation with pagination."""
        where: Any = {"category": category} if category else None
        results = await asyncio.to_thread(
            self.collection.get,
            include=["metadatas", "documents"],
            where=where
        )
        # ... existing sorting logic ...
        # Apply offset/limit after sorting
        return items[offset:offset + limit]

    async def get_stats(self) -> Dict[str, Any]:
        """Get stats - O(1) from cache."""
        if self.sqlite:
            return await self.sqlite.get_stats_fast()
        return await self._get_stats_legacy()
```

### 3.4 Phase 3: Hybrid Search with RRF (Week 4)

**Objective:** Implement hybrid keyword + semantic search using RRF fusion.
```python
# src/storage/hybrid_search.py
from datetime import datetime
from typing import Dict, List, Optional, Tuple

from src.processor.dto import EnrichedNewsItemDTO


class HybridSearchStrategy:
    """
    Reciprocal Rank Fusion for combining keyword and semantic search.
    Implements RRF: Score = Σ 1/(k + rank_i)
    """

    def __init__(self, sqlite_store: "SQLiteStore", chroma_store: "ChromaStore", k: int = 60):
        self.sqlite = sqlite_store
        self.chroma = chroma_store
        self.k = k  # RRF smoothing factor

    def fuse(
        self,
        keyword_results: List[Tuple[str, float]],   # (id, score)
        semantic_results: List[Tuple[str, float]],  # (id, score)
    ) -> List[Tuple[str, float]]:
        """Fuse results using RRF. Higher fused score = better."""
        fused_scores: Dict[str, float] = {}

        # Keyword results get higher initial weight
        for rank, (doc_id, score) in enumerate(keyword_results):
            rrf_score = 1 / (self.k + rank)
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + rrf_score * 2.0

        # Semantic results
        for rank, (doc_id, score) in enumerate(semantic_results):
            rrf_score = 1 / (self.k + rank)
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + rrf_score * 1.0

        # Sort by fused score descending
        sorted_results = sorted(
            fused_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return sorted_results

    async def search(
        self,
        query: str,
        limit: int = 5,
        category: Optional[str] = None,
        threshold: Optional[float] = None
    ) -> List[EnrichedNewsItemDTO]:
        """
        Execute hybrid search:
        1. SQLite FTS for exact keyword matches
        2. ChromaDB for semantic similarity
        3. RRF fusion
        """
        # Phase 1: Keyword search (FTS)
        fts_ids = await self.sqlite.search_fts(query, limit * 2, category)

        # Phase 2: Semantic search
        semantic_results = await self.chroma.search(
            query,
            limit=limit * 2,
            category=category,
            threshold=threshold
        )

        # Phase 3: RRF fusion
        keyword_scores = [(doc_id, 1.0) for doc_id in fts_ids]  # FTS doesn't give scores
        semantic_scores = [
            (self._get_doc_id(item), self._compute_adaptive_score(item))
            for item in semantic_results
        ]
        fused = self.fuse(keyword_scores, semantic_scores)

        # Retrieve full DTOs in fused order
        final_results = []
        for doc_id, _ in fused[:limit]:
            dto = await self.chroma.get_by_id(doc_id)
            if dto:
                final_results.append(dto)
        return final_results

    def _compute_adaptive_score(self, item: EnrichedNewsItemDTO) -> float:
        """
        Compute adaptive score combining relevance and recency.
        More recent items with the same relevance get a slight boost.
        """
        # Normalize relevance to 0-1
        relevance_norm = item.relevance_score / 10.0
        # Days since publication (exponential decay)
        days_old = (datetime.now() - item.timestamp).days
        recency_decay = 0.95 ** days_old
        return relevance_norm * 0.7 + recency_decay * 0.3
```

### 3.5 Phase 4: Stats Cache with Invalidation (Week 5)

**Objective:** Eliminate O(n) stats computation via incremental updates.
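The correctness condition for the cache is that applying one O(1) delta per stored item must leave counts identical to a full recomputation. A minimal sketch of that invariant using plain counters (the item tuples are made up for the example):

```python
from collections import Counter

# (category, source) pairs standing in for stored news items
items = [("Tech", "HN"), ("Tech", "Reddit"), ("AI", "HN")]

category_cache = Counter()
for category, _source in items:
    category_cache[category] += 1  # O(1) incremental delta per store()

# The O(n) recomputation fallback must agree with the incremental cache
assert category_cache == Counter(c for c, _ in items)
print(dict(category_cache))  # {'Tech': 2, 'AI': 1}
```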
```python # src/storage/stats_cache.py from dataclasses import dataclass, field from datetime import datetime from typing import Dict, List import asyncio import json @dataclass class StatsCache: """Thread-safe statistics cache with incremental updates.""" total_count: int = 0 category_counts: Dict[str, int] = field(default_factory=dict) source_counts: Dict[str, int] = field(default_factory=dict) anomaly_counts: Dict[str, int] = field(default_factory=dict) last_updated: datetime = field(default_factory=datetime.now) _lock: asyncio.Lock = field(default_factory=asyncio.Lock) async def invalidate(self): """Invalidate cache - marks for recomputation.""" async with self._lock: self.total_count = -1 # Sentinel value self.category_counts = {} self.source_counts = {} self.anomaly_counts = {} async def apply_delta(self, delta: "StatsDelta"): """ Apply incremental update without full recomputation. O(1) operation. """ async with self._lock: self.total_count += delta.total_delta for cat, count in delta.category_deltas.items(): self.category_counts[cat] = self.category_counts.get(cat, 0) + count for source, count in delta.source_deltas.items(): self.source_counts[source] = self.source_counts.get(source, 0) + count for anomaly, count in delta.anomaly_deltas.items(): self.anomaly_counts[anomaly] = self.anomaly_counts.get(anomaly, 0) + count self.last_updated = datetime.now() def is_valid(self) -> bool: """Check if cache is populated.""" return self.total_count >= 0 @dataclass class StatsDelta: """Delta for incremental stats update.""" total_delta: int = 0 category_deltas: Dict[str, int] = field(default_factory=dict) source_deltas: Dict[str, int] = field(default_factory=dict) anomaly_deltas: Dict[str, int] = field(default_factory=dict) @classmethod def from_item_add(cls, item: EnrichedNewsItemDTO) -> "StatsDelta": """Create delta for adding an item.""" return cls( total_delta=1, category_deltas={item.category: 1}, source_deltas={item.source: 1}, anomaly_deltas={a.value: 1 for a 
in item.anomalies_detected} ) # Integration with storage class CachedVectorStore(IVectorStore): """VectorStore with incremental stats cache.""" def __init__(self, ...): self.stats_cache = StatsCache() # ... async def store(self, item: EnrichedNewsItemDTO) -> str: # Store item item_id = await self._store_impl(item) # Update cache incrementally delta = StatsDelta.from_item_add(item) await self.stats_cache.apply_delta(delta) return item_id async def get_stats(self) -> Dict[str, Any]: """Get stats - O(1) from cache if valid.""" if self.stats_cache.is_valid(): return { "total_count": self.stats_cache.total_count, "category_counts": self.stats_cache.category_counts, "source_counts": self.stats_cache.source_counts, "anomaly_counts": self.stats_cache.anomaly_counts, "last_updated": self.stats_cache.last_updated } # Fall back to full recomputation (still via SQLite) return await self._recompute_stats() ``` --- ## 4. Backward Compatibility Strategy ### 4.1 Interface Compatibility The `IVectorStore` interface remains unchanged. New methods have default implementations: ```python class IVectorStore(IStoreCommand, IStoreQuery): """Backward-compatible interface.""" async def get_latest( self, limit: int = 10, category: Optional[str] = None, offset: int = 0 # NEW: Optional pagination ) -> List[EnrichedNewsItemDTO]: """ Default implementation maintains legacy behavior. Override in implementation for optimized path. """ return await self._default_get_latest(limit, category) async def get_top_ranked( self, limit: int = 10, category: Optional[str] = None, offset: int = 0 # NEW: Optional pagination ) -> List[EnrichedNewsItemDTO]: """Default implementation maintains legacy behavior.""" return await self._default_get_top_ranked(limit, category) ``` ### 4.2 DTO Compatibility Layer ```python # src/storage/compat.py class DTOConverter: """ Converts between legacy and normalized DTO formats. Ensures smooth transition. 
""" @staticmethod def normalize_anomalies( anomalies: Union[List[str], str] ) -> List[AnomalyType]: """ Handle both legacy comma-joined strings and new list format. """ if isinstance(anomalies, str): # Legacy format: "WebGPU,NPU acceleration" return [AnomalyType.from_string(a) for a in anomalies.split(",") if a] return [AnomalyType.from_string(a) if isinstance(a, str) else a for a in anomalies] @staticmethod def to_legacy_dict(dto: EnrichedNewsItemDTO) -> Dict[str, Any]: """ Convert normalized DTO to legacy format for API compatibility. """ return { **dto.model_dump(), # Legacy comma-joined format for API consumers "anomalies_detected": ",".join(a.value if isinstance(a, AnomalyType) else str(a) for a in dto.anomalies_detected) } ``` ### 4.3 Dual-Mode Storage ```python # src/storage/adaptive_store.py class AdaptiveVectorStore(IVectorStore): """ Storage adapter that switches between legacy and normalized based on feature flags. """ def __init__( self, legacy_store: ChromaStore, normalized_store: NormalizedChromaStore, feature_flags: FeatureFlags ): self.legacy = legacy_store self.normalized = normalized_store self.flags = feature_flags async def get_latest( self, limit: int = 10, category: Optional[str] = None, offset: int = 0 ) -> List[EnrichedNewsItemDTO]: if is_enabled(FeatureFlags.NORMALIZED_STORAGE): return await self.normalized.get_latest(limit, category, offset) return await self.legacy.get_latest(limit, category) # Similar delegation for all methods... ``` --- ## 5. 
Risk Mitigation ### 5.1 Rollback Plan ``` ┌─────────────────────────────────────────────────────────────────┐ │ ROLLBACK DECISION TREE │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Phase 1 Complete │ │ ├── Feature Flag: NORMALIZED_STORAGE = ON │ │ ├── Dual-write active │ │ └── Can rollback: YES (disable flag, use legacy) │ │ │ │ Phase 2 Complete │ │ ├── SQLite indexes active │ │ ├── ChromaDB still primary │ │ └── Can rollback: YES (disable flag, rebuild legacy indexes) │ │ │ │ Phase 3 Complete │ │ ├── Hybrid search active │ │ ├── FTS data populated │ │ └── Can rollback: YES (disable flag, purge FTS triggers) │ │ │ │ Phase 4 Complete │ │ ├── Stats cache active │ │ └── Can rollback: YES (invalidate cache, use legacy) │ │ │ │ ⚠️ CRITICAL: After Phase 4, data divergence makes │ │ clean rollback impossible. Requires data sync tool. │ │ │ └─────────────────────────────────────────────────────────────────┘ ``` ### 5.2 Data Validation Strategy ```python # tests/migration/test_data_integrity.py import pytest from datetime import datetime, timezone from src.processor.dto import EnrichedNewsItemDTO from src.storage.validators import DataIntegrityValidator class TestDataIntegrity: """ Comprehensive data integrity tests for migration. Run after each phase to validate. """ @pytest.fixture def validator(self): return DataIntegrityValidator() @pytest.mark.asyncio async def test_anomaly_roundtrip(self, validator): """Test anomaly normalization roundtrip.""" original_anomalies = ["WebGPU", "NPU acceleration", "Edge AI"] dto = EnrichedNewsItemDTO( ..., anomalies_detected=original_anomalies ) # Store item_id = await storage.store(dto) # Retrieve retrieved = await storage.get_by_id(item_id) # Validate normalized form validator.assert_anomalies_match( retrieved.anomalies_detected, original_anomalies ) @pytest.mark.asyncio async def test_legacy_compat_mode(self, validator): """ Test that legacy comma-joined strings still work. 
        Simulates an old client sending comma-joined anomalies.
        """
        # Simulate legacy input
        legacy_metadata = {
            "title": "Test",
            "url": "https://example.com/legacy",
            "content_text": "Content",
            "source": "LegacySource",
            "timestamp": "2024-01-01T00:00:00",
            "relevance_score": 5,
            "summary_ru": "Резюме",
            "category": "Tech",
            "anomalies_detected": "WebGPU,Edge AI"  # Legacy format
        }

        # Should be automatically normalized
        dto = DTOConverter.legacy_metadata_to_dto(legacy_metadata)
        validator.assert_valid_dto(dto)

    @pytest.mark.asyncio
    async def test_stats_consistency(self, validator):
        """
        Test that stats from the cache match the actual data.
        Validates incremental update correctness.
        """
        # Add items
        await storage.store(item1)
        await storage.store(item2)

        # Get cached stats
        cached_stats = await storage.get_stats(use_cache=True)

        # Get actual counts
        actual_stats = await storage.get_stats(use_cache=False)

        validator.assert_stats_match(cached_stats, actual_stats)

    @pytest.mark.asyncio
    async def test_pagination_consistency(self, validator):
        """
        Test that pagination doesn't skip or duplicate items.
""" items = [create_test_item(i) for i in range(100)] for item in items: await storage.store(item) # Get first page page1 = await storage.get_latest(limit=10, offset=0) # Get second page page2 = await storage.get_latest(limit=10, offset=10) # Validate no overlap validator.assert_no_overlap(page1, page2) # Validate total count total = await storage.get_stats() assert sum(len(p) for p in [page1, page2]) <= total["total_count"] ``` ### 5.3 Monitoring and Alerts ```python # src/monitoring/migration_health.py from dataclasses import dataclass from datetime import datetime from typing import Dict, Any @dataclass class MigrationHealthCheck: """Health checks for migration progress.""" phase: int checks: Dict[str, bool] def is_healthy(self) -> bool: return all(self.checks.values()) def to_report(self) -> Dict[str, Any]: return { "phase": self.phase, "healthy": self.is_healthy(), "checks": self.checks, "timestamp": datetime.now().isoformat() } class MigrationMonitor: """Monitor migration health and emit alerts.""" async def check_phase1_health(self) -> MigrationHealthCheck: """Phase 1 health checks.""" checks = { "dual_write_active": await self._check_dual_write(), "anomaly_normalized": await self._check_anomaly_normalization(), "no_data_loss": await self._check_data_integrity(), "backward_compat": await self._check_compat_mode(), } return MigrationHealthCheck(phase=1, checks=checks) async def _check_data_integrity(self) -> bool: """ Sample 100 items and verify normalized vs legacy match. """ legacy_store = ChromaStore(...) normalized_store = NormalizedChromaStore(...) 
        sample_ids = await legacy_store._sample_ids(100)
        for item_id in sample_ids:
            legacy_item = await legacy_store.get_by_id(item_id)
            normalized_item = await normalized_store.get_by_id(item_id)
            if not self._items_match(legacy_item, normalized_item):
                return False
        return True

    def _items_match(
        self,
        legacy: EnrichedNewsItemDTO,
        normalized: EnrichedNewsItemDTO
    ) -> bool:
        """Verify that legacy and normalized items are semantically equal."""
        return (
            legacy.title == normalized.title
            and legacy.url == normalized.url
            and legacy.relevance_score == normalized.relevance_score
            and set(normalized.anomalies_detected) == set(legacy.anomalies_detected)
        )
```

---

## 6. Performance Targets

### 6.1 Benchmarks

| Operation | Current | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|-----------|---------|---------|---------|---------|---------|
| `get_latest` (10 items) | 500ms | 100ms | 10ms | 10ms | 10ms |
| `get_latest` (100 items) | 800ms | 200ms | 20ms | 20ms | 20ms |
| `get_top_ranked` (10) | 500ms | 100ms | 10ms | 10ms | 10ms |
| `get_stats` | 200ms | 50ms | 20ms | 20ms | 1ms |
| `search` (keyword) | 50ms | 50ms | 50ms | 10ms | 10ms |
| `search` (semantic) | 100ms | 100ms | 100ms | 50ms | 50ms |
| `search` (hybrid) | N/A | N/A | N/A | 60ms | 30ms |
| `store` (single) | 10ms | 15ms | 15ms | 15ms | 15ms |
| Memory (1000 items) | O(n) | O(n) | O(1) | O(1) | O(1) |

### 6.2 Scaling Expectations

```
PERFORMANCE vs DATASET SIZE

get_latest (limit=10)

Current: O(n)
├─ 100 items:         50ms
├─ 1,000 items:      500ms
├─ 10,000 items:   5,000ms (5s)   ⚠️
└─ 100,000 items: 50,000ms (50s)  🚫

Optimized (SQLite indexed): O(log n + k)
├─ 100 items:        5ms
├─ 1,000 items:      7ms
├─ 10,000 items:    10ms
├─ 100,000 items:   15ms
└─ 1,000,000 items: 25ms

Target: Support 1M+ items with <50ms latency
```

---

## 7. Implementation Phases

### Phase 0: Infrastructure (Week 1)

**Dependencies:** None
**Owner:** Senior Architecture Engineer

1. Create feature flags system
2. Implement `DualWriter` class
3. Create migration test suite
4. Set up SQLite database initialization
5. Create `AnomalyType` enum
6. Implement `DTOConverter` for backward compatibility

**Deliverables:**

- `src/config/feature_flags.py`
- `src/storage/dual_writer.py`
- `src/processor/anomaly_types.py`
- `src/storage/sqlite_store.py` (schema only)
- `tests/migration/`

**Exit Criteria:**

- [ ] All Phase 0 tests pass
- [ ] Feature flags system operational
- [ ] Dual-write writes to both stores

---

### Phase 1: Normalized Storage (Week 2)

**Dependencies:** Phase 0
**Owner:** Data Engineer

1. Implement `NormalizedChromaStore`
2. Create anomaly junction collection
3. Implement anomaly normalization
4. Update `IVectorStore` interface with new signatures
5. Create `AdaptiveVectorStore` for feature-flag switching
6. Write integration tests

**Deliverables:**

- `src/storage/normalized_chroma_store.py`
- `src/storage/adaptive_store.py`
- Updated `IVectorStore` interface

**Exit Criteria:**

- [ ] Normalized storage stores/retrieves correctly
- [ ] Anomaly enum normalization works
- [ ] Dual-write to legacy + normalized active
- [ ] All tests pass with both stores

---

### Phase 2: Indexed Queries (Week 3)

**Dependencies:** Phase 1
**Owner:** Backend Architect

1. Complete SQLite schema with indexes
2. Implement FTS5 virtual table and triggers
3. Implement `get_latest_indexed()` using SQLite
4. Implement `get_top_ranked_indexed()` using SQLite
5. Update `ChromaStore` to delegate to SQLite
6.
Implement pagination (offset/limit)

**Deliverables:**

- Complete `SQLiteStore` implementation
- `ChromaStore` updated with indexed queries
- Pagination support in interface

**Exit Criteria:**

- [ ] `get_latest` uses SQLite index (verify with EXPLAIN QUERY PLAN)
- [ ] `get_top_ranked` uses SQLite index
- [ ] Pagination works correctly
- [ ] No full-collection scans in hot path

---

### Phase 3: Hybrid Search (Week 4)

**Dependencies:** Phase 2
**Owner:** Backend Architect

1. Implement `HybridSearchStrategy` with RRF
2. Integrate SQLite FTS with ChromaDB semantic search
3. Add adaptive scoring (relevance + recency)
4. Implement `search_stream()` for large result sets
5. Performance-test hybrid vs legacy search

**Deliverables:**

- `src/storage/hybrid_search.py`
- Updated `ChromaStore.search()`

**Exit Criteria:**

- [ ] Hybrid search returns better results than pure semantic search
- [ ] Keyword matches appear first
- [ ] RRF fusion working correctly
- [ ] Threshold filtering applies to semantic results

---

### Phase 4: Stats Cache (Week 5)

**Dependencies:** Phase 3
**Owner:** Data Engineer

1. Implement `StatsCache` class with incremental updates
2. Integrate cache with `AdaptiveVectorStore`
3. Implement cache invalidation triggers
4. Add cache warming on startup
5. Performance-test `get_stats()`

**Deliverables:**

- `src/storage/stats_cache.py`
- Updated storage layer with cache

**Exit Criteria:**

- [ ] `get_stats` returns in <5ms (cached)
- [ ] Cache invalidates correctly on store/delete
- [ ] Cache rebuilds correctly on demand

---

### Phase 5: Validation & Cutover (Week 6)

**Dependencies:** All previous phases
**Owner:** QA Engineer

1. Run full migration test suite
2. Performance benchmark comparison
3. Load test with simulated traffic
4. Validate all Telegram bot commands work
5. Generate migration completion report
6.
Disable legacy code paths (optional)

**Deliverables:**

- Migration completion report
- Performance benchmark report
- Legacy code removal (if approved)

**Exit Criteria:**

- [ ] All tests pass
- [ ] Performance targets met
- [ ] No regression in bot functionality
- [ ] Stakeholder sign-off

---

## 8. Open Questions

### 8.1 Data Migration

| Question | Impact | Priority |
|----------|--------|----------|
| Should we migrate existing comma-joined anomalies to normalized junction records? | Data integrity | HIGH |
| What is the expected dataset size at migration time? | Performance planning | HIGH |
| Can we have downtime for the migration, or must it be zero-downtime? | Rollback strategy | HIGH |
| Should we keep the legacy ChromaDB collection or archive it? | Storage costs | MEDIUM |

### 8.2 Architecture

| Question | Impact | Priority |
|----------|--------|----------|
| Do we want to eventually migrate away from ChromaDB to a more scalable solution (Qdrant, Weaviate)? | Future planning | MEDIUM |
| Should anomalies be stored in ChromaDB or only in SQLite? | Consistency model | MEDIUM |
| Do we need multi-tenancy support for multiple Telegram channels? | Future feature | LOW |

### 8.3 Operations

| Question | Impact | Priority |
|----------|--------|----------|
| What is the acceptable cache staleness for `get_stats`? | Consistency vs performance | HIGH |
| Should we implement a TTL for old items? | Storage management | MEDIUM |
| Do we need backup/restore procedures for SQLite? | Disaster recovery | HIGH |

### 8.4 Testing

| Question | Impact | Priority |
|----------|--------|----------|
| Should we implement chaos testing for dual-write failure modes? | Reliability | MEDIUM |
| What is the acceptable test coverage threshold? | Quality | HIGH |
| Do we need integration tests with a real ChromaDB instance? | Confidence | HIGH |

---

## 9. Appendix

### A. File Structure After Migration

```
src/
├── config/
│   ├── __init__.py
│   ├── feature_flags.py            # NEW
│   └── settings.py
├── storage/
│   ├── __init__.py
│   ├── base.py                     # UPDATED: Evolved interface
│   ├── chroma_store.py             # UPDATED: Delegates to SQLite
│   ├── normalized_chroma_store.py  # NEW
│   ├── sqlite_store.py             # NEW
│   ├── hybrid_search.py            # NEW
│   ├── stats_cache.py              # NEW
│   ├── adaptive_store.py           # NEW
│   ├── dual_writer.py              # NEW
│   ├── compat.py                   # NEW
│   └── migrations/                 # NEW
│       ├── __init__.py
│       ├── v1_normalize_anomalies.py
│       └── v2_add_indexes.py
├── processor/
│   ├── __init__.py
│   ├── dto.py                      # UPDATED: New DTOs
│   ├── anomaly_types.py            # NEW
│   └── ...
├── crawlers/
│   ├── __init__.py
│   └── ...
├── bot/
│   ├── __init__.py
│   ├── handlers.py                 # UPDATED: Use new pagination
│   └── ...
└── main.py                         # UPDATED: Initialize new stores
```

### B. ChromaDB Version Consideration

> **Note:** The current implementation uses ChromaDB's `upsert` with metadata. ChromaDB has limitations:
> - No native sorting by metadata fields
> - No offset-based pagination for similarity queries
> - Limited `where` clause filtering
>
> The SQLite shadow database addresses these limitations while retaining ChromaDB for vector operations.

### C. Monitoring Queries for Production

```sql
-- Check SQLite index usage
EXPLAIN QUERY PLAN
SELECT * FROM news_items ORDER BY timestamp DESC LIMIT 10;

-- Check the FTS index
EXPLAIN QUERY PLAN
SELECT * FROM news_fts WHERE news_fts MATCH 'WebGPU';

-- Check cache freshness
SELECT key, total_count, last_updated FROM stats_cache;

-- Check anomaly distribution (join on anomaly_id, per the schema above)
SELECT a.name, COUNT(*) AS count
FROM news_anomalies na
JOIN anomaly_types a ON na.anomaly_id = a.anomaly_id
GROUP BY a.name
ORDER BY count DESC;
```

---

## 10. Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Senior Architecture Engineer | | | |
| Project Manager | | | |
| QA Lead | | | |
| DevOps Lead | | | |

---

*Document generated for team implementation review.*
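
### D. Addendum: Reciprocal Rank Fusion Sketch

Phase 3 calls for RRF to merge the SQLite FTS ranking with the ChromaDB semantic ranking, but the fusion formula is not spelled out in this plan. Below is a minimal, self-contained sketch of standard RRF scoring, where each document scores the sum of 1/(k + rank) over the rankings it appears in. The function name `rrf_fuse` and the conventional constant `k = 60` are illustrative assumptions, not the project's actual `HybridSearchStrategy` API.

```python
from typing import Dict, List


def rrf_fuse(
    keyword_ids: List[str],
    semantic_ids: List[str],
    k: int = 60,  # conventional RRF damping constant; an assumption here
) -> List[str]:
    """Fuse two ranked ID lists: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    fts_hits = ["n3", "n1", "n7"]      # hypothetical SQLite FTS5 order
    vector_hits = ["n1", "n7", "n9"]   # hypothetical ChromaDB semantic order
    print(rrf_fuse(fts_hits, vector_hits))  # → ['n1', 'n7', 'n3', 'n9']
```

Note how `n1` and `n7`, which appear in both rankings, rise above `n3`, which led the keyword list alone: RRF rewards cross-ranking consensus without needing comparable raw scores from FTS5 and the vector index.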