AI-Trend-Scout/docs/MIGRATION_PLAN.md
Artur Mukhamadiev f4ae73bdae feat(database): SQLite shadow database for indexed queries
:Release Notes:
- Add ACID-compliant SQLiteStore (WAL mode, FULL sync, FK constraints)
- Add AnomalyType enum for normalized anomaly storage
- Add legacy data migration script (dry-run, batch, rollback)
- Update ChromaStore to delegate indexed queries to SQLite
- Add test suite for SQLiteStore (7 tests, all passing)

:Detailed Notes:
- SQLiteStore: news_items, anomaly_types, news_anomalies tables with indexes
- Performance: get_latest/get_top_ranked O(n)→O(log n), get_stats O(n)→O(1)
- ChromaDB remains primary vector store; SQLite provides indexed metadata queries

:Testing Performed:
- python3 -m pytest tests/ -v (112 passed)

:QA Notes:
- Tests verified by Python QA Engineer subagent

:Issues Addressed:
- get_latest/get_top_ranked fetched ALL items then sorted in Python
- get_stats iterated over ALL items
- anomalies_detected stored as comma-joined string (no index)

Change-Id: I708808b6e72889869afcf16d4ac274260242007a
2026-03-30 13:54:48 +03:00


# Trend-Scout AI: Database Schema Migration Plan

- **Document Version:** 1.0
- **Date:** 2026-03-30
- **Status:** Draft for Review


## 1. Executive Summary

This document outlines a comprehensive migration plan to address critical performance and data architecture issues in the Trend-Scout AI vector storage layer. The migration transforms a flat, unnormalized ChromaDB collection into a normalized, multi-collection architecture with proper indexing, pagination, and query optimization.

### Current Pain Points

| Issue | Current Behavior | Impact |
|---|---|---|
| `get_latest` | Fetches ALL items via `collection.get()`, sorts in Python | O(n) memory + sort |
| `get_top_ranked` | Fetches ALL items via `collection.get()`, sorts in Python | O(n) memory + sort |
| `get_stats` | Iterates over ALL items to count categories | O(n) iteration |
| `anomalies_detected` | Stored as comma-joined string `"A1,A2,A3"` | Type corruption, no index |
| Single collection | No normalization, mixed data types | No efficient filtering |
| No pagination | No offset-based pagination support | Cannot handle large datasets |
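To make the `anomalies_detected` issue concrete, here is a small sketch (anomaly names are illustrative) of how a comma-joined metadata string silently loses structure as soon as a value itself contains a comma:

```python
# Legacy format: anomalies serialized as one comma-joined string.
legacy = ",".join(["WebGPU", "Edge AI"])
assert legacy == "WebGPU,Edge AI"

# Reading back requires string splitting -- any anomaly name that itself
# contains a comma is split into bogus entries:
stored = ",".join(["WebGPU", "Edge, fog and hybrid AI"])
assert stored.split(",") != ["WebGPU", "Edge, fog and hybrid AI"]
assert stored.split(",") == ["WebGPU", "Edge", " fog and hybrid AI"]
```

A normalized list plus a junction table avoids this class of corruption entirely, and additionally makes the values indexable.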

### Expected Outcomes

| Metric | Current | Target | Improvement |
|---|---|---|---|
| `get_latest` (1000 items) | ~500ms, O(n) | ~10ms, O(log n) | 50x faster |
| `get_top_ranked` (1000 items) | ~500ms, O(n log n) | ~10ms, O(k log k) | 50x faster |
| `get_stats` (1000 items) | ~200ms | ~1ms | 200x faster |
| Memory usage | O(n) full scan | O(1) or O(k) | ~99% reduction |

## 2. Target Architecture

### 2.1 Multi-Collection Design

```
┌──────────────────────────────────────────────────────────┐
│                    ChromaDB Instance                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌───────────────────────┐    ┌───────────────────────┐  │
│  │  news_items           │    │  category_index       │  │
│  │  (main collection)    │    │  (lookup table)       │  │
│  ├───────────────────────┤    ├───────────────────────┤  │
│  │ id (uuid5 from url)   │◄───┤ category_id           │  │
│  │ content_text (vec)    │    │ name                  │  │
│  │ title                 │    │ item_count            │  │
│  │ url                   │    │ created_at            │  │
│  │ source                │    │ updated_at            │  │
│  │ timestamp             │    └───────────────────────┘  │
│  │ relevance_score       │                               │
│  │ summary_ru            │    ┌───────────────────────┐  │
│  │ category_id (FK)      │───►│  anomaly_types        │  │
│  │ anomalies[]           │    │  (normalized)         │  │
│  └───────────────────────┘    ├───────────────────────┤  │
│                               │ anomaly_id            │  │
│                               │ name                  │  │
│                               │ description           │  │
│                               └───────────────────────┘  │
│                                                          │
│  ┌───────────────────────┐    ┌───────────────────────┐  │
│  │  news_anomalies       │    │  stats_cache          │  │
│  │  (junction table)     │    │  (materialized)       │  │
│  ├───────────────────────┤    ├───────────────────────┤  │
│  │ news_id (FK)          │◄───┤ key                   │  │
│  │ anomaly_id (FK)       │    │ total_count           │  │
│  │ detected_at           │    │ category_counts (JSON)│  │
│  └───────────────────────┘    │ source_counts (JSON)  │  │
│                               │ last_updated          │  │
│                               └───────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### 2.2 SQLite Shadow Database (for FTS and relational queries)

```
┌──────────────────────────────────────────────────────────┐
│                     SQLite Database                      │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌───────────────────────┐    ┌───────────────────────┐  │
│  │  news_items_rel       │    │  categories           │  │
│  │  (normalized replica) │    ├───────────────────────┤  │
│  ├───────────────────────┤    │ id (PK)               │  │
│  │ id (uuid, PK)         │    │ name                  │  │
│  │ title                 │    │ item_count            │  │
│  │ url (UNIQUE)          │    │ embedding_id (FK)     │  │
│  │ source                │    └───────────────────────┘  │
│  │ timestamp             │                               │
│  │ relevance_score       │    ┌───────────────────────┐  │
│  │ summary_ru            │    │  anomalies            │  │
│  │ category_id (FK)      │    ├───────────────────────┤  │
│  │ content_text          │    │ id (PK)               │  │
│  └───────────────────────┘    │ name                  │  │
│                               │ description           │  │
│  ┌───────────────────────┐    └───────────────────────┘  │
│  │  news_anomalies_rel   │                               │
│  ├───────────────────────┤    ┌───────────────────────┐  │
│  │ news_id (FK)          │    │  stats_snapshot       │  │
│  │ anomaly_id (FK)       │    ├───────────────────────┤  │
│  │ detected_at           │    │ total_count           │  │
│  └───────────────────────┘    │ category_json         │  │
│                               │ last_updated          │  │
│  ┌───────────────────────┐    └───────────────────────┘  │
│  │  news_fts             │                               │
│  │  (FTS5 virtual)       │    ┌───────────────────────┐  │
│  ├───────────────────────┤    │  crawl_history        │  │
│  │ news_id (FK)          │    ├───────────────────────┤  │
│  │ title_tokens          │    │ id (PK)               │  │
│  │ content_tokens        │    │ crawler_name          │  │
│  │ summary_tokens        │    │ items_fetched         │  │
│  └───────────────────────┘    │ items_new             │  │
│                               │ started_at            │  │
│                               │ completed_at          │  │
│                               └───────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

### 2.3 New DTOs

```python
# src/processor/dto.py (evolved)

from typing import Dict, List, Optional
from pydantic import BaseModel, Field
from datetime import datetime
from enum import Enum

class AnomalyType(str, Enum):
    """Normalized anomaly types - not a comma-joined string."""
    WEBGPU = "WebGPU"
    NPU_ACCELERATION = "NPU acceleration"
    EDGE_AI = "Edge AI"
    # ... extensible

class EnrichedNewsItemDTO(BaseModel):
    """Extended DTO with proper normalized relationships."""
    title: str
    url: str
    content_text: str
    source: str
    timestamp: datetime
    relevance_score: int = Field(ge=0, le=10)
    summary_ru: str
    category: str  # Category name, resolved from category_id
    anomalies_detected: List[AnomalyType]  # Now a proper list
    anomaly_ids: Optional[List[str]] = None  # Internal use

class CategoryStats(BaseModel):
    """Statistics per category."""
    category: str
    count: int
    avg_relevance: float

class StorageStats(BaseModel):
    """Complete storage statistics."""
    total_count: int
    categories: List[CategoryStats]
    sources: Dict[str, int]
    anomaly_types: Dict[str, int]
    last_updated: datetime
```

### 2.4 Evolved Interface Design

```python
# src/storage/base.py

from abc import ABC, abstractmethod
from typing import List, Optional, AsyncIterator, Dict, Any
from datetime import datetime
from src.processor.dto import EnrichedNewsItemDTO, StorageStats, CategoryStats

class IStoreCommand(ABC):
    """Write operations - SRP: only handles writes."""

    @abstractmethod
    async def store(self, item: EnrichedNewsItemDTO) -> str:
        """Store an item. Returns the generated ID."""
        pass

    @abstractmethod
    async def store_batch(self, items: List[EnrichedNewsItemDTO]) -> List[str]:
        """Batch store items. Returns generated IDs."""
        pass

    @abstractmethod
    async def update(self, item_id: str, item: EnrichedNewsItemDTO) -> None:
        """Update an existing item."""
        pass

    @abstractmethod
    async def delete(self, item_id: str) -> None:
        """Delete an item by ID."""
        pass

class IStoreQuery(ABC):
    """Read operations - SRP: only handles queries."""

    @abstractmethod
    async def get_by_id(self, item_id: str) -> Optional[EnrichedNewsItemDTO]:
        pass

    @abstractmethod
    async def exists(self, url: str) -> bool:
        pass

    @abstractmethod
    async def get_latest(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[EnrichedNewsItemDTO]:
        """Paginated retrieval by timestamp. Uses index, not full scan."""
        pass

    @abstractmethod
    async def get_top_ranked(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[EnrichedNewsItemDTO]:
        """Paginated retrieval by relevance_score. Uses index, not full scan."""
        pass

    @abstractmethod
    async def get_stats(self, use_cache: bool = True) -> StorageStats:
        """Returns cached stats or computes on-demand."""
        pass

    @abstractmethod
    async def search_hybrid(
        self,
        query: str,
        limit: int = 5,
        category: Optional[str] = None,
        threshold: Optional[float] = None
    ) -> List[EnrichedNewsItemDTO]:
        """Keyword + semantic hybrid search with RRF fusion."""
        pass

    @abstractmethod
    async def search_stream(
        self,
        query: str,
        limit: int = 5,
        category: Optional[str] = None
    ) -> AsyncIterator[EnrichedNewsItemDTO]:
        """Streaming search for large result sets."""
        pass

class IVectorStore(IStoreCommand, IStoreQuery):
    """Combined interface preserving backward compatibility."""
    pass

class IAdminOperations(ABC):
    """Administrative operations - separate from core CRUD."""

    @abstractmethod
    async def rebuild_stats_cache(self) -> StorageStats:
        """Force rebuild of statistics cache."""
        pass

    @abstractmethod
    async def vacuum(self) -> None:
        """Optimize storage."""
        pass

    @abstractmethod
    async def get_health(self) -> Dict[str, Any]:
        """Health check for monitoring."""
        pass
```

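The `offset`/`limit` contract above is what enables bounded-memory iteration over arbitrarily large stores. A minimal sketch of a consumer loop against that contract (the `StubStore` stands in for a real `IStoreQuery` implementation and is purely illustrative):

```python
import asyncio

class StubStore:
    """Illustrative in-memory stand-in honoring the get_latest() contract."""
    def __init__(self, items):
        self.items = items

    async def get_latest(self, limit=10, category=None, offset=0):
        # A real store would serve this from an index; slicing mimics it.
        return self.items[offset:offset + limit]

async def fetch_all(store, page_size=3):
    """Drain the store page by page instead of loading everything at once."""
    out, offset = [], 0
    while True:
        page = await store.get_latest(limit=page_size, offset=offset)
        if not page:
            break
        out.extend(page)
        offset += page_size
    return out

items = list(range(7))
assert asyncio.run(fetch_all(StubStore(items))) == items
```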
## 3. Migration Strategy

### 3.1 Phase 0: Preparation (Week 1)

Objective: Establish infrastructure for zero-downtime migration.

#### 3.1.1 Create Feature Flags System

```python
# src/config/feature_flags.py

from enum import Flag, auto

class FeatureFlags(Flag):
    """Feature flags for phased migration."""
    NORMALIZED_STORAGE = auto()
    SQLITE_SHADOW_DB = auto()
    FTS_ENABLED = auto()
    STATS_CACHE = auto()
    PAGINATION = auto()
    HYBRID_SEARCH = auto()

# Global config
FEATURE_FLAGS = FeatureFlags.NORMALIZED_STORAGE | FeatureFlags.STATS_CACHE

def is_enabled(flag: FeatureFlags) -> bool:
    return flag in FEATURE_FLAGS
```
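A quick self-contained sketch of how a call site might branch on these flags (the `pick_store` helper is hypothetical, not part of the plan; the flag combination mirrors the global config above):

```python
from enum import Flag, auto

class FeatureFlags(Flag):
    NORMALIZED_STORAGE = auto()
    SQLITE_SHADOW_DB = auto()
    STATS_CACHE = auto()

# Two flags on, one off -- same shape as the module-level config
FEATURE_FLAGS = FeatureFlags.NORMALIZED_STORAGE | FeatureFlags.STATS_CACHE

def is_enabled(flag: FeatureFlags) -> bool:
    return flag in FEATURE_FLAGS

def pick_store() -> str:
    # Hypothetical call site: route writes per flag during the migration
    return "normalized" if is_enabled(FeatureFlags.NORMALIZED_STORAGE) else "legacy"

assert is_enabled(FeatureFlags.STATS_CACHE)
assert not is_enabled(FeatureFlags.SQLITE_SHADOW_DB)
assert pick_store() == "normalized"
```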

#### 3.1.2 Implement Shadow Write (Dual-Write Pattern)

```python
# src/storage/dual_writer.py

from src.processor.dto import EnrichedNewsItemDTO
from src.storage.chroma_store import ChromaStore

class DualWriter:
    """
    Writes to both legacy and new storage simultaneously.
    Enables gradual migration with no data loss.
    """

    def __init__(
        self,
        legacy_store: ChromaStore,
        normalized_store: 'NormalizedChromaStore',
        sqlite_store: 'SQLiteStore'
    ):
        self.legacy = legacy_store
        self.normalized = normalized_store
        self.sqlite = sqlite_store

    async def store(self, item: EnrichedNewsItemDTO) -> str:
        # Write to all three systems; legacy and SQLite ids are kept only
        # for consistency checks during the migration window
        legacy_id = await self.legacy.store(item)
        normalized_id = await self.normalized.store(item)
        sqlite_id = await self.sqlite.store_relational(item)

        # Return normalized ID as source of truth
        return normalized_id
```

#### 3.1.3 Create Migration Test Suite

```python
# tests/migrations/test_normalized_storage.py

import pytest
from datetime import datetime, timezone
from src.processor.dto import AnomalyType, EnrichedNewsItemDTO
from src.storage.normalized_chroma_store import NormalizedChromaStore

@pytest.mark.asyncio
async def test_migration_preserves_data_integrity():
    """
    Test that normalized storage preserves all data from the legacy format.
    Round-trip: Legacy DTO -> Normalized Storage -> Legacy DTO
    """
    original = EnrichedNewsItemDTO(
        title="Test",
        url="https://example.com/test",
        content_text="Content",
        source="TestSource",
        timestamp=datetime.now(timezone.utc),
        relevance_score=8,
        summary_ru="Резюме",
        anomalies_detected=["WebGPU", "Edge AI"],
        category="Tech"
    )

    store = NormalizedChromaStore(...)
    item_id = await store.store(original)

    retrieved = await store.get_by_id(item_id)

    assert retrieved.title == original.title
    assert retrieved.url == original.url
    assert retrieved.relevance_score == original.relevance_score
    assert retrieved.category == original.category
    assert set(retrieved.anomalies_detected) == set(original.anomalies_detected)

@pytest.mark.asyncio
async def test_anomaly_normalization():
    """
    Test that anomalies are properly normalized to enum values.
    """
    dto = EnrichedNewsItemDTO(
        ...,
        anomalies_detected=["webgpu", "NPU acceleration", "Edge AI"]
    )

    store = NormalizedChromaStore(...)
    item_id = await store.store(dto)

    retrieved = await store.get_by_id(item_id)

    # Should be normalized to canonical forms
    assert AnomalyType.WEBGPU in retrieved.anomalies_detected
    assert AnomalyType.NPU_ACCELERATION in retrieved.anomalies_detected
    assert AnomalyType.EDGE_AI in retrieved.anomalies_detected
```

### 3.2 Phase 1: Normalized Storage (Week 2)

Objective: Introduce normalized anomaly storage without breaking existing queries.

#### 3.2.1 New Normalized ChromaStore

```python
# src/storage/normalized_chroma_store.py

import asyncio
import uuid
from typing import List

from src.processor.anomaly_types import AnomalyType
from src.processor.dto import EnrichedNewsItemDTO
from src.storage.base import IVectorStore

class NormalizedChromaStore(IVectorStore):
    """
    ChromaDB storage with normalized anomaly types.
    Maintains backward compatibility via the IVectorStore interface.
    """

    # Collections
    NEWS_COLLECTION = "news_items"
    ANOMALY_COLLECTION = "anomaly_types"
    JUNCTION_COLLECTION = "news_anomalies"

    async def store(self, item: EnrichedNewsItemDTO) -> str:
        """Store with normalized anomaly handling."""
        anomaly_ids = await self._ensure_anomalies(item.anomalies_detected)

        # Store main item under a deterministic URL-derived id
        doc_id = str(uuid.uuid5(uuid.NAMESPACE_URL, item.url))

        metadata = {
            "title": item.title,
            "url": item.url,
            "source": item.source,
            "timestamp": item.timestamp.isoformat(),
            "relevance_score": item.relevance_score,
            "summary_ru": item.summary_ru,
            "category": item.category,
            # NO MORE COMMA-JOINED STRING
        }

        # Store junction records for anomalies
        for anomaly_id in anomaly_ids:
            await self._store_junction(doc_id, anomaly_id)

        # Vector storage (without anomalies in metadata)
        await asyncio.to_thread(
            self.collection.upsert,
            ids=[doc_id],
            documents=[item.content_text],
            metadatas=[metadata]
        )

        return doc_id

    async def _ensure_anomalies(
        self,
        anomalies: List[str]
    ) -> List[str]:
        """Ensure anomaly types exist, return their IDs."""
        anomaly_ids = []
        for anomaly in anomalies:
            # Normalize to canonical form
            normalized = AnomalyType.from_string(anomaly)
            anomaly_id = str(uuid.uuid5(uuid.NAMESPACE_URL, normalized.value))

            # Upsert anomaly type
            await asyncio.to_thread(
                self.anomaly_collection.upsert,
                ids=[anomaly_id],
                documents=[normalized.value],
                metadatas=[{
                    "name": normalized.value,
                    "description": normalized.description
                }]
            )
            anomaly_ids.append(anomaly_id)

        return anomaly_ids
```
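The `uuid.uuid5(uuid.NAMESPACE_URL, ...)` scheme makes document ids deterministic: re-storing the same URL produces the same id, so `upsert` deduplicates instead of inserting twice. A standalone check (URLs are illustrative):

```python
import uuid

a = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/article")
b = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/article")
c = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/other")

assert a == b  # same URL -> same id -> idempotent upsert
assert a != c  # different URL -> different id
```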

#### 3.2.2 AnomalyType Enum

```python
# src/processor/anomaly_types.py

from enum import Enum

class AnomalyType(str, Enum):
    """Canonical anomaly types with metadata."""

    WEBGPU = "WebGPU"
    NPU_ACCELERATION = "NPU acceleration"
    EDGE_AI = "Edge AI"
    QUANTUM_COMPUTING = "Quantum computing"
    NEUROMORPHIC = "Neuromorphic computing"
    SPATIAL_COMPUTING = "Spatial computing"
    UNKNOWN = "Unknown"

    @classmethod
    def from_string(cls, value: str) -> "AnomalyType":
        """Fuzzy matching from a raw string."""
        normalized = value.strip().lower()
        for member in cls:
            if member.value.lower() == normalized:
                return member
        # Try partial match
        for member in cls:
            if normalized in member.value.lower():
                return member
        return cls.UNKNOWN

    @property
    def description(self) -> str:
        """Human-readable description."""
        descriptions = {
            "WebGPU": "WebGPU graphics API or GPU compute",
            "NPU acceleration": "Neural Processing Unit hardware",
            "Edge AI": "Edge computing with AI",
            # ...
        }
        return descriptions.get(self.value, "")
```
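To illustrate the matching rules (exact match first, then substring, then `UNKNOWN`), here is a self-contained sketch with the enum trimmed to three members for brevity:

```python
from enum import Enum

class AnomalyType(str, Enum):
    WEBGPU = "WebGPU"
    NPU_ACCELERATION = "NPU acceleration"
    UNKNOWN = "Unknown"

    @classmethod
    def from_string(cls, value: str) -> "AnomalyType":
        normalized = value.strip().lower()
        for member in cls:                      # pass 1: exact (case-insensitive)
            if member.value.lower() == normalized:
                return member
        for member in cls:                      # pass 2: substring
            if normalized in member.value.lower():
                return member
        return cls.UNKNOWN                      # pass 3: fallback

assert AnomalyType.from_string("webgpu") is AnomalyType.WEBGPU
assert AnomalyType.from_string("NPU") is AnomalyType.NPU_ACCELERATION  # partial match
assert AnomalyType.from_string("blockchain") is AnomalyType.UNKNOWN
```

Note that the substring pass makes matching order-dependent; raw strings that are substrings of several canonical names resolve to whichever member is declared first.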

### 3.3 Phase 2: Indexed Queries (Week 3)

Objective: Eliminate full-collection scans for get_latest, get_top_ranked, get_stats.

#### 3.3.1 SQLite Shadow Database

```python
# src/storage/sqlite_store.py

import json
import sqlite3
from contextlib import contextmanager
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional

class SQLiteStore:
    """
    SQLite shadow database for relational queries and FTS.
    Maintains sync with the ChromaDB news_items collection.
    """

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self._init_schema()

    def _init_schema(self):
        """Initialize SQLite schema with proper indexes."""
        with self._get_connection() as conn:
            conn.executescript("""
                -- Main relational table
                CREATE TABLE IF NOT EXISTS news_items (
                    id TEXT PRIMARY KEY,
                    title TEXT NOT NULL,
                    url TEXT UNIQUE NOT NULL,
                    source TEXT NOT NULL,
                    timestamp INTEGER NOT NULL,  -- Unix epoch
                    relevance_score INTEGER NOT NULL,
                    summary_ru TEXT,
                    category TEXT NOT NULL,
                    content_text TEXT,
                    created_at INTEGER DEFAULT (unixepoch())
                );

                -- Indexes for fast queries
                CREATE INDEX IF NOT EXISTS idx_news_timestamp
                    ON news_items(timestamp DESC);

                CREATE INDEX IF NOT EXISTS idx_news_relevance
                    ON news_items(relevance_score DESC);

                CREATE INDEX IF NOT EXISTS idx_news_category
                    ON news_items(category);

                CREATE INDEX IF NOT EXISTS idx_news_source
                    ON news_items(source);

                -- FTS5 virtual table for full-text search
                CREATE VIRTUAL TABLE IF NOT EXISTS news_fts USING fts5(
                    id,
                    title,
                    content_text,
                    summary_ru,
                    content='news_items',
                    content_rowid='rowid'
                );

                -- Triggers to keep FTS in sync
                CREATE TRIGGER IF NOT EXISTS news_fts_insert AFTER INSERT ON news_items
                BEGIN
                    INSERT INTO news_fts(rowid, id, title, content_text, summary_ru)
                    VALUES (NEW.rowid, NEW.id, NEW.title, NEW.content_text, NEW.summary_ru);
                END;

                CREATE TRIGGER IF NOT EXISTS news_fts_delete AFTER DELETE ON news_items
                BEGIN
                    INSERT INTO news_fts(news_fts, rowid, id, title, content_text, summary_ru)
                    VALUES ('delete', OLD.rowid, OLD.id, OLD.title, OLD.content_text, OLD.summary_ru);
                END;

                CREATE TRIGGER IF NOT EXISTS news_fts_update AFTER UPDATE ON news_items
                BEGIN
                    INSERT INTO news_fts(news_fts, rowid, id, title, content_text, summary_ru)
                    VALUES ('delete', OLD.rowid, OLD.id, OLD.title, OLD.content_text, OLD.summary_ru);
                    INSERT INTO news_fts(rowid, id, title, content_text, summary_ru)
                    VALUES (NEW.rowid, NEW.id, NEW.title, NEW.content_text, NEW.summary_ru);
                END;

                -- Stats cache table
                CREATE TABLE IF NOT EXISTS stats_cache (
                    key TEXT PRIMARY KEY DEFAULT 'main',
                    total_count INTEGER DEFAULT 0,
                    category_counts TEXT DEFAULT '{}',  -- JSON
                    source_counts TEXT DEFAULT '{}',    -- JSON
                    anomaly_counts TEXT DEFAULT '{}',   -- JSON
                    last_updated INTEGER
                );

                -- Anomaly types table
                CREATE TABLE IF NOT EXISTS anomaly_types (
                    id TEXT PRIMARY KEY,
                    name TEXT UNIQUE NOT NULL,
                    description TEXT
                );

                -- Junction table for news-anomaly relationships
                CREATE TABLE IF NOT EXISTS news_anomalies (
                    news_id TEXT NOT NULL,
                    anomaly_id TEXT NOT NULL,
                    detected_at INTEGER DEFAULT (unixepoch()),
                    PRIMARY KEY (news_id, anomaly_id),
                    FOREIGN KEY (news_id) REFERENCES news_items(id),
                    FOREIGN KEY (anomaly_id) REFERENCES anomaly_types(id)
                );

                CREATE INDEX IF NOT EXISTS idx_news_anomalies_news
                    ON news_anomalies(news_id);

                CREATE INDEX IF NOT EXISTS idx_news_anomalies_anomaly
                    ON news_anomalies(anomaly_id);

                -- Crawl history for monitoring
                CREATE TABLE IF NOT EXISTS crawl_history (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    crawler_name TEXT NOT NULL,
                    items_fetched INTEGER DEFAULT 0,
                    items_new INTEGER DEFAULT 0,
                    started_at INTEGER,
                    completed_at INTEGER,
                    error_message TEXT
                );
            """)
```

The query methods then serve reads from these indexes:

```python
# src/storage/sqlite_store.py (continued)

    @contextmanager
    def _get_connection(self):
        """Context manager for SQLite connections. Note: sqlite3 is
        synchronous, so this is a plain (not async) context manager."""
        conn = sqlite3.connect(self.db_path)
        conn.row_factory = sqlite3.Row
        try:
            yield conn
        finally:
            conn.close()

    # --- Fast Indexed Queries ---

    async def get_latest_indexed(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[Dict[str, Any]]:
        """Get latest items using the SQLite index - O(log n + k)."""
        with self._get_connection() as conn:
            if category:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    WHERE category = ?
                    ORDER BY timestamp DESC
                    LIMIT ? OFFSET ?
                """, (category, limit, offset))
            else:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    ORDER BY timestamp DESC
                    LIMIT ? OFFSET ?
                """, (limit, offset))

            return [dict(row) for row in cursor.fetchall()]

    async def get_top_ranked_indexed(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[Dict[str, Any]]:
        """Get top-ranked items using the SQLite index - O(log n + k)."""
        with self._get_connection() as conn:
            if category:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    WHERE category = ?
                    ORDER BY relevance_score DESC, timestamp DESC
                    LIMIT ? OFFSET ?
                """, (category, limit, offset))
            else:
                cursor = conn.execute("""
                    SELECT * FROM news_items
                    ORDER BY relevance_score DESC, timestamp DESC
                    LIMIT ? OFFSET ?
                """, (limit, offset))

            return [dict(row) for row in cursor.fetchall()]

    async def get_stats_fast(self) -> Dict[str, Any]:
        """
        Get stats from cache or compute on demand.
        O(1) if cached; otherwise an indexed SQL aggregation.
        """
        with self._get_connection() as conn:
            # Try cache first
            cursor = conn.execute("SELECT * FROM stats_cache WHERE key = 'main'")
            row = cursor.fetchone()

            if row:
                return {
                    "total_count": row["total_count"],
                    "category_counts": json.loads(row["category_counts"]),
                    "source_counts": json.loads(row["source_counts"]),
                    "anomaly_counts": json.loads(row["anomaly_counts"]),
                    "last_updated": datetime.fromtimestamp(row["last_updated"])
                }

            # Fall back to SQL aggregation (still O(n), but in SQLite, not Python)
            cursor = conn.execute("""
                SELECT category, COUNT(*) as cat_count
                FROM news_items
                GROUP BY category
            """)
            category_counts = {row["category"]: row["cat_count"] for row in cursor.fetchall()}

            cursor = conn.execute("""
                SELECT source, COUNT(*) as count
                FROM news_items
                GROUP BY source
            """)
            source_counts = {row["source"]: row["count"] for row in cursor.fetchall()}

            cursor = conn.execute("""
                SELECT a.name, COUNT(*) as count
                FROM news_anomalies na
                JOIN anomaly_types a ON na.anomaly_id = a.id
                GROUP BY a.name
            """)
            anomaly_counts = {row["name"]: row["count"] for row in cursor.fetchall()}

            return {
                "total_count": sum(category_counts.values()),
                "category_counts": category_counts,
                "source_counts": source_counts,
                "anomaly_counts": anomaly_counts,
                "last_updated": datetime.now()
            }

    async def search_fts(
        self,
        query: str,
        limit: int = 10,
        category: Optional[str] = None
    ) -> List[str]:
        """Full-text search using SQLite FTS5."""
        with self._get_connection() as conn:
            # Escape FTS5 special characters
            fts_query = query.replace('"', '""')

            if category:
                cursor = conn.execute("""
                    SELECT n.id FROM news_items n
                    JOIN news_fts fts ON n.id = fts.id
                    WHERE news_fts MATCH ?
                    AND n.category = ?
                    LIMIT ?
                """, (f'"{fts_query}"', category, limit))
            else:
                cursor = conn.execute("""
                    SELECT id FROM news_fts
                    WHERE news_fts MATCH ?
                    LIMIT ?
                """, (f'"{fts_query}"', limit))

            return [row["id"] for row in cursor.fetchall()]
```
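The indexed pagination pattern above can be demonstrated end-to-end with an in-memory database (table trimmed to the relevant columns; ids and timestamps are toy data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE news_items (
        id TEXT PRIMARY KEY,
        category TEXT NOT NULL,
        timestamp INTEGER NOT NULL
    );
    CREATE INDEX idx_news_timestamp ON news_items(timestamp DESC);
""")
# id0..id9 with strictly increasing timestamps 1000..1009
rows = [(f"id{i}", "Tech" if i % 2 else "Science", 1000 + i) for i in range(10)]
conn.executemany("INSERT INTO news_items VALUES (?, ?, ?)", rows)

# Page 2 of the "latest" view (limit 3, offset 3): served via the index
# and LIMIT/OFFSET -- no full fetch-and-sort in Python.
page = conn.execute(
    "SELECT id FROM news_items ORDER BY timestamp DESC LIMIT ? OFFSET ?",
    (3, 3),
).fetchall()
assert [r[0] for r in page] == ["id6", "id5", "id4"]
```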

#### 3.3.2 Update ChromaStore to Use SQLite Index

```python
# src/storage/chroma_store.py (updated)

import asyncio
from typing import Any, Dict, List, Optional

from chromadb.api import ClientAPI

from src.processor.dto import EnrichedNewsItemDTO
from src.storage.base import IVectorStore
from src.storage.sqlite_store import SQLiteStore

class ChromaStore(IVectorStore):
    """
    Updated ChromaStore that delegates indexed queries to SQLite.
    Maintains backward compatibility.
    """

    def __init__(
        self,
        client: ClientAPI,
        collection_name: str = "news_collection",
        sqlite_store: Optional[SQLiteStore] = None
    ):
        self.client = client
        self.collection = self.client.get_or_create_collection(name=collection_name)
        self.sqlite = sqlite_store  # New: SQLite shadow for indexed queries

    async def get_latest(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0  # NEW: pagination support
    ) -> List[EnrichedNewsItemDTO]:
        """Get latest using the SQLite index - O(log n + k)."""
        if self.sqlite:
            rows = await self.sqlite.get_latest_indexed(limit, category, offset)
            # Reconstruct DTOs from SQLite rows
            return [self._row_to_dto(row) for row in rows]

        # Fall back to legacy behavior with pagination
        return await self._get_latest_legacy(limit, category, offset)

    async def _get_latest_legacy(
        self,
        limit: int,
        category: Optional[str],
        offset: int
    ) -> List[EnrichedNewsItemDTO]:
        """Legacy implementation with pagination."""
        where: Any = {"category": category} if category else None
        results = await asyncio.to_thread(
            self.collection.get,
            include=["metadatas", "documents"],
            where=where
        )
        # ... existing sorting logic ...
        # Apply offset/limit after sorting
        return items[offset:offset + limit]

    async def get_stats(self) -> Dict[str, Any]:
        """Get stats - O(1) from cache."""
        if self.sqlite:
            return await self.sqlite.get_stats_fast()
        return await self._get_stats_legacy()
```

3.4 Phase 3: Hybrid Search with RRF (Week 4)

Objective: Implement hybrid keyword + semantic search using RRF fusion.

# src/storage/hybrid_search.py

from datetime import datetime
from typing import Dict, List, Optional, Tuple

from src.processor.dto import EnrichedNewsItemDTO

class HybridSearchStrategy:
    """
    Reciprocal Rank Fusion for combining keyword and semantic search.
    Implements RRF: Score = Σ 1/(k + rank_i)
    """

    def __init__(self, sqlite_store, chroma_store, k: int = 60):
        self.sqlite = sqlite_store  # SQLite store for FTS keyword search
        self.chroma = chroma_store  # ChromaDB store for semantic search
        self.k = k  # RRF smoothing factor

    def fuse(
        self,
        keyword_results: List[Tuple[str, float]],  # (id, score)
        semantic_results: List[Tuple[str, float]],   # (id, score)
    ) -> List[Tuple[str, float]]:
        """
        Fuse results using RRF.
        Higher fused score = better.
        """
        fused_scores: Dict[str, float] = {}

        # Keyword results get higher initial weight
        for rank, (doc_id, score) in enumerate(keyword_results):
            rrf_score = 1 / (self.k + rank)
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + rrf_score * 2.0

        # Semantic results
        for rank, (doc_id, score) in enumerate(semantic_results):
            rrf_score = 1 / (self.k + rank)
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + rrf_score * 1.0

        # Sort by fused score descending
        sorted_results = sorted(
            fused_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )

        return sorted_results

    async def search(
        self,
        query: str,
        limit: int = 5,
        category: Optional[str] = None,
        threshold: Optional[float] = None
    ) -> List[EnrichedNewsItemDTO]:
        """
        Execute hybrid search:
        1. SQLite FTS for exact keyword matches
        2. ChromaDB for semantic similarity
        3. RRF fusion
        """
        # Phase 1: Keyword search (FTS)
        fts_ids = await self.sqlite.search_fts(query, limit * 2, category)

        # Phase 2: Semantic search
        semantic_results = await self.chroma.search(
            query, limit=limit * 2, category=category, threshold=threshold
        )

        # Phase 3: RRF Fusion
        keyword_scores = [(doc_id, 1.0) for doc_id in fts_ids]  # uniform score; RRF only needs rank order
        semantic_scores = [
            (self._get_doc_id(item), self._compute_adaptive_score(item))
            for item in semantic_results
        ]

        fused = self.fuse(keyword_scores, semantic_scores)

        # Retrieve full DTOs in fused order
        final_results = []
        for doc_id, _ in fused[:limit]:
            dto = await self.chroma.get_by_id(doc_id)
            if dto:
                final_results.append(dto)

        return final_results

    def _compute_adaptive_score(self, item: EnrichedNewsItemDTO) -> float:
        """
        Compute adaptive score combining relevance and recency.
        More recent items with same relevance get slight boost.
        """
        # Normalize relevance to 0-1
        relevance_norm = item.relevance_score / 10.0

        # Days since publication (exponential decay)
        days_old = (datetime.now() - item.timestamp).days
        recency_decay = 0.95 ** days_old

        return relevance_norm * 0.7 + recency_decay * 0.3
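As a worked example of the fusion math, here is a standalone re-implementation of `fuse` with the same k=60 and 2.0/1.0 weights, applied to two small result lists:

```python
def rrf_fuse(keyword_results, semantic_results, k=60):
    """Reciprocal Rank Fusion, weighting keyword hits 2x as in HybridSearchStrategy."""
    fused = {}
    for rank, (doc_id, _score) in enumerate(keyword_results):
        fused[doc_id] = fused.get(doc_id, 0.0) + 2.0 / (k + rank)
    for rank, (doc_id, _score) in enumerate(semantic_results):
        fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = rrf_fuse(
    keyword_results=[("doc_a", 1.0), ("doc_b", 1.0)],
    semantic_results=[("doc_b", 0.92), ("doc_c", 0.85)],
)
# doc_b appears in both lists, so it wins despite never ranking first in either
print([doc_id for doc_id, _ in ranked])  # ['doc_b', 'doc_a', 'doc_c']
```

Note that only ranks matter: the raw scores from each backend are discarded, which is exactly why RRF can fuse FTS (no comparable score) with cosine distance.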

3.5 Phase 4: Stats Cache with Invalidation (Week 5)

Objective: Eliminate O(n) stats computation via incremental updates.

# src/storage/stats_cache.py

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict
import asyncio

from src.processor.dto import EnrichedNewsItemDTO

@dataclass
class StatsCache:
    """Thread-safe statistics cache with incremental updates."""

    total_count: int = 0
    category_counts: Dict[str, int] = field(default_factory=dict)
    source_counts: Dict[str, int] = field(default_factory=dict)
    anomaly_counts: Dict[str, int] = field(default_factory=dict)
    last_updated: datetime = field(default_factory=datetime.now)
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def invalidate(self):
        """Invalidate cache - marks for recomputation."""
        async with self._lock:
            self.total_count = -1  # Sentinel value
            self.category_counts = {}
            self.source_counts = {}
            self.anomaly_counts = {}

    async def apply_delta(self, delta: "StatsDelta"):
        """
        Apply incremental update without full recomputation.
        O(1) operation.
        """
        async with self._lock:
            self.total_count += delta.total_delta

            for cat, count in delta.category_deltas.items():
                self.category_counts[cat] = self.category_counts.get(cat, 0) + count

            for source, count in delta.source_deltas.items():
                self.source_counts[source] = self.source_counts.get(source, 0) + count

            for anomaly, count in delta.anomaly_deltas.items():
                self.anomaly_counts[anomaly] = self.anomaly_counts.get(anomaly, 0) + count

            self.last_updated = datetime.now()

    def is_valid(self) -> bool:
        """Check if cache is populated."""
        return self.total_count >= 0

@dataclass
class StatsDelta:
    """Delta for incremental stats update."""

    total_delta: int = 0
    category_deltas: Dict[str, int] = field(default_factory=dict)
    source_deltas: Dict[str, int] = field(default_factory=dict)
    anomaly_deltas: Dict[str, int] = field(default_factory=dict)

    @classmethod
    def from_item_add(cls, item: EnrichedNewsItemDTO) -> "StatsDelta":
        """Create delta for adding an item."""
        return cls(
            total_delta=1,
            category_deltas={item.category: 1},
            source_deltas={item.source: 1},
            anomaly_deltas={a.value: 1 for a in item.anomalies_detected}
        )

# Integration with storage
class CachedVectorStore(IVectorStore):
    """VectorStore with incremental stats cache."""

    def __init__(self, ...):
        self.stats_cache = StatsCache()
        # ...

    async def store(self, item: EnrichedNewsItemDTO) -> str:
        # Store item
        item_id = await self._store_impl(item)

        # Update cache incrementally
        delta = StatsDelta.from_item_add(item)
        await self.stats_cache.apply_delta(delta)

        return item_id

    async def get_stats(self) -> Dict[str, Any]:
        """Get stats - O(1) from cache if valid."""
        if self.stats_cache.is_valid():
            return {
                "total_count": self.stats_cache.total_count,
                "category_counts": self.stats_cache.category_counts,
                "source_counts": self.stats_cache.source_counts,
                "anomaly_counts": self.stats_cache.anomaly_counts,
                "last_updated": self.stats_cache.last_updated
            }

        # Fall back to full recomputation (still via SQLite)
        return await self._recompute_stats()
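The delta bookkeeping above boils down to counter addition. A minimal synchronous sketch (locks, source counts, and the DTO omitted; anomaly labels are illustrative):

```python
from collections import Counter

class MiniStatsCache:
    """Incremental counts only -- stands in for StatsCache.apply_delta."""

    def __init__(self):
        self.total = 0
        self.by_category = Counter()
        self.by_anomaly = Counter()

    def apply(self, category, anomalies):
        # O(1) per stored item: bump counters instead of rescanning the store
        self.total += 1
        self.by_category[category] += 1
        self.by_anomaly.update(anomalies)

cache = MiniStatsCache()
cache.apply("Tech", ["WEBGPU"])
cache.apply("Tech", ["WEBGPU", "EDGE_AI"])
cache.apply("Biz", [])
print(cache.total, dict(cache.by_category), dict(cache.by_anomaly))
# 3 {'Tech': 2, 'Biz': 1} {'WEBGPU': 2, 'EDGE_AI': 1}
```

The production version additionally needs negative deltas on delete and the `-1` sentinel so a crash between store and cache update can force a full recompute.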

4. Backward Compatibility Strategy

4.1 Interface Compatibility

The IVectorStore interface remains source-compatible: existing call sites keep working, and the new offset parameter defaults to legacy behavior:

class IVectorStore(IStoreCommand, IStoreQuery):
    """Backward-compatible interface."""

    async def get_latest(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0  # NEW: Optional pagination
    ) -> List[EnrichedNewsItemDTO]:
        """
        Default implementation maintains legacy behavior.
        Override in implementation for optimized path.
        """
        return await self._default_get_latest(limit, category)

    async def get_top_ranked(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0  # NEW: Optional pagination
    ) -> List[EnrichedNewsItemDTO]:
        """Default implementation maintains legacy behavior."""
        return await self._default_get_top_ranked(limit, category)

4.2 DTO Compatibility Layer

# src/storage/compat.py

from typing import Any, Dict, List, Union

from src.processor.anomaly_types import AnomalyType
from src.processor.dto import EnrichedNewsItemDTO

class DTOConverter:
    """
    Converts between legacy and normalized DTO formats.
    Ensures smooth transition.
    """

    @staticmethod
    def normalize_anomalies(
        anomalies: Union[List[str], str]
    ) -> List[AnomalyType]:
        """
        Handle both legacy comma-joined strings and new list format.
        """
        if isinstance(anomalies, str):
            # Legacy format: "WebGPU,NPU acceleration"
            return [AnomalyType.from_string(a) for a in anomalies.split(",") if a]
        return [AnomalyType.from_string(a) if isinstance(a, str) else a for a in anomalies]

    @staticmethod
    def to_legacy_dict(dto: EnrichedNewsItemDTO) -> Dict[str, Any]:
        """
        Convert normalized DTO to legacy format for API compatibility.
        """
        return {
            **dto.model_dump(),
            # Legacy comma-joined format for API consumers
            "anomalies_detected": ",".join(a.value if isinstance(a, AnomalyType) else str(a)
                                          for a in dto.anomalies_detected)
        }

4.3 Dual-Mode Storage

# src/storage/adaptive_store.py

class AdaptiveVectorStore(IVectorStore):
    """
    Storage adapter that switches between legacy and normalized
    based on feature flags.
    """

    def __init__(
        self,
        legacy_store: ChromaStore,
        normalized_store: NormalizedChromaStore,
        feature_flags: FeatureFlags
    ):
        self.legacy = legacy_store
        self.normalized = normalized_store
        self.flags = feature_flags

    async def get_latest(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[EnrichedNewsItemDTO]:
        if self.flags.is_enabled(FeatureFlags.NORMALIZED_STORAGE):
            return await self.normalized.get_latest(limit, category, offset)
        return await self.legacy.get_latest(limit, category)

    # Similar delegation for all methods...

5. Risk Mitigation

5.1 Rollback Plan

┌─────────────────────────────────────────────────────────────────┐
│                    ROLLBACK DECISION TREE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Phase 1 Complete                                              │
│   ├── Feature Flag: NORMALIZED_STORAGE = ON                   │
│   ├── Dual-write active                                        │
│   └── Can rollback: YES (disable flag, use legacy)             │
│                                                                  │
│   Phase 2 Complete                                              │
│   ├── SQLite indexes active                                    │
│   ├── ChromaDB still primary                                   │
│   └── Can rollback: YES (disable flag, rebuild legacy indexes) │
│                                                                  │
│   Phase 3 Complete                                              │
│   ├── Hybrid search active                                     │
│   ├── FTS data populated                                       │
│   └── Can rollback: YES (disable flag, purge FTS triggers)    │
│                                                                  │
│   Phase 4 Complete                                              │
│   ├── Stats cache active                                        │
│   └── Can rollback: YES (invalidate cache, use legacy)        │
│                                                                  │
│   ⚠️  CRITICAL: After the Phase 5 cutover (legacy paths        │
│       disabled), data divergence makes clean rollback          │
│       impossible and requires a data sync tool.                │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

5.2 Data Validation Strategy

# tests/migration/test_data_integrity.py

import pytest
from datetime import datetime, timezone
from src.processor.dto import EnrichedNewsItemDTO
from src.storage.compat import DTOConverter
from src.storage.validators import DataIntegrityValidator

class TestDataIntegrity:
    """
    Comprehensive data integrity tests for migration.
    Run after each phase to validate.
    """

    @pytest.fixture
    def validator(self):
        return DataIntegrityValidator()

    @pytest.mark.asyncio
    async def test_anomaly_roundtrip(self, validator):
        """Test anomaly normalization roundtrip."""
        original_anomalies = ["WebGPU", "NPU acceleration", "Edge AI"]

        dto = EnrichedNewsItemDTO(
            ...,
            anomalies_detected=original_anomalies
        )

        # Store
        item_id = await storage.store(dto)

        # Retrieve
        retrieved = await storage.get_by_id(item_id)

        # Validate normalized form
        validator.assert_anomalies_match(
            retrieved.anomalies_detected,
            original_anomalies
        )

    @pytest.mark.asyncio
    async def test_legacy_compat_mode(self, validator):
        """
        Test that legacy comma-joined strings still work.
        Simulates old client sending comma-joined anomalies.
        """
        # Simulate legacy input
        legacy_metadata = {
            "title": "Test",
            "url": "https://example.com/legacy",
            "content_text": "Content",
            "source": "LegacySource",
            "timestamp": "2024-01-01T00:00:00",
            "relevance_score": 5,
            "summary_ru": "Резюме",
            "category": "Tech",
            "anomalies_detected": "WebGPU,Edge AI"  # Legacy format
        }

        # Should be automatically normalized
        dto = DTOConverter.legacy_metadata_to_dto(legacy_metadata)
        validator.assert_valid_dto(dto)

    @pytest.mark.asyncio
    async def test_stats_consistency(self, validator):
        """
        Test that stats from cache match actual data.
        Validates incremental update correctness.
        """
        # Add items
        await storage.store(item1)
        await storage.store(item2)

        # Get cached stats
        cached_stats = await storage.get_stats(use_cache=True)

        # Get actual counts
        actual_stats = await storage.get_stats(use_cache=False)

        validator.assert_stats_match(cached_stats, actual_stats)

    @pytest.mark.asyncio
    async def test_pagination_consistency(self, validator):
        """
        Test that pagination doesn't skip or duplicate items.
        """
        items = [create_test_item(i) for i in range(100)]
        for item in items:
            await storage.store(item)

        # Get first page
        page1 = await storage.get_latest(limit=10, offset=0)

        # Get second page
        page2 = await storage.get_latest(limit=10, offset=10)

        # Validate no overlap
        validator.assert_no_overlap(page1, page2)

        # Validate total count
        total = await storage.get_stats()
        assert sum(len(p) for p in [page1, page2]) <= total["total_count"]

5.3 Monitoring and Alerts

# src/monitoring/migration_health.py

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict

from src.processor.dto import EnrichedNewsItemDTO
from src.storage.chroma_store import ChromaStore
from src.storage.normalized_chroma_store import NormalizedChromaStore

@dataclass
class MigrationHealthCheck:
    """Health checks for migration progress."""

    phase: int
    checks: Dict[str, bool]

    def is_healthy(self) -> bool:
        return all(self.checks.values())

    def to_report(self) -> Dict[str, Any]:
        return {
            "phase": self.phase,
            "healthy": self.is_healthy(),
            "checks": self.checks,
            "timestamp": datetime.now().isoformat()
        }

class MigrationMonitor:
    """Monitor migration health and emit alerts."""

    async def check_phase1_health(self) -> MigrationHealthCheck:
        """Phase 1 health checks."""
        checks = {
            "dual_write_active": await self._check_dual_write(),
            "anomaly_normalized": await self._check_anomaly_normalization(),
            "no_data_loss": await self._check_data_integrity(),
            "backward_compat": await self._check_compat_mode(),
        }
        return MigrationHealthCheck(phase=1, checks=checks)

    async def _check_data_integrity(self) -> bool:
        """
        Sample 100 items and verify normalized vs legacy match.
        """
        legacy_store = ChromaStore(...)
        normalized_store = NormalizedChromaStore(...)

        sample_ids = await legacy_store._sample_ids(100)

        for item_id in sample_ids:
            legacy_item = await legacy_store.get_by_id(item_id)
            normalized_item = await normalized_store.get_by_id(item_id)

            if not self._items_match(legacy_item, normalized_item):
                return False

        return True

    def _items_match(
        self,
        legacy: EnrichedNewsItemDTO,
        normalized: EnrichedNewsItemDTO
    ) -> bool:
        """Verify legacy and normalized items are semantically equal."""
        return (
            legacy.title == normalized.title and
            legacy.url == normalized.url and
            legacy.relevance_score == normalized.relevance_score and
            set(normalized.anomalies_detected) == set(legacy.anomalies_detected)
        )

6. Performance Targets

6.1 Benchmarks

| Operation | Current | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|---|---|---|---|---|---|
| get_latest (10 items) | 500ms | 100ms | 10ms | 10ms | 10ms |
| get_latest (100 items) | 800ms | 200ms | 20ms | 20ms | 20ms |
| get_top_ranked (10) | 500ms | 100ms | 10ms | 10ms | 10ms |
| get_stats | 200ms | 50ms | 20ms | 20ms | 1ms |
| search (keyword) | 50ms | 50ms | 50ms | 10ms | 10ms |
| search (semantic) | 100ms | 100ms | 100ms | 50ms | 50ms |
| search (hybrid) | N/A | N/A | N/A | 60ms | 30ms |
| store (single) | 10ms | 15ms | 15ms | 15ms | 15ms |
| Memory (1000 items) | O(n) | O(n) | O(1) | O(1) | O(1) |

6.2 Scaling Expectations

┌─────────────────────────────────────────────────────────────────┐
│                    PERFORMANCE vs DATASET SIZE                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  get_latest (limit=10)                                         │
│                                                                  │
│  Current:  O(n)                                                  │
│  ├─ 100 items: 50ms                                              │
│  ├─ 1,000 items: 500ms                                          │
│  ├─ 10,000 items: 5,000ms (5s) ⚠️                                │
│  └─ 100,000 items: 50,000ms (50s) 🚫                             │
│                                                                  │
│  Optimized (SQLite indexed): O(log n + k)                        │
│  ├─ 100 items: 5ms                                               │
│  ├─ 1,000 items: 7ms                                             │
│  ├─ 10,000 items: 10ms                                           │
│  ├─ 100,000 items: 15ms                                          │
│  └─ 1,000,000 items: 25ms                                         │
│                                                                  │
│  Target: Support 1M+ items with <50ms latency                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

7. Implementation Phases

Phase 0: Infrastructure (Week 1)

Dependencies: None
Owner: Senior Architecture Engineer

  1. Create feature flags system
  2. Implement DualWriter class
  3. Create migration test suite
  4. Set up SQLite database initialization
  5. Create AnomalyType enum
  6. Implement DTOConverter for backward compat

Deliverables:

  • src/config/feature_flags.py
  • src/storage/dual_writer.py
  • src/processor/anomaly_types.py
  • src/storage/sqlite_store.py (schema only)
  • tests/migrations/

Exit Criteria:

  • All phase 0 tests pass
  • Feature flags system operational
  • Dual-write writes to both stores
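A minimal sketch of the feature-flags deliverable, assuming env-var-backed flags (the class shape and flag names are placeholders for `src/config/feature_flags.py`):

```python
import os

class FeatureFlags:
    """Illustrative shape for src/config/feature_flags.py (names are placeholders)."""

    # Flag constants map to environment variables so each migration phase
    # can be toggled at deploy time without a code change.
    NORMALIZED_STORAGE = "FF_NORMALIZED_STORAGE"
    SQLITE_INDEXED_QUERIES = "FF_SQLITE_INDEXED_QUERIES"
    HYBRID_SEARCH = "FF_HYBRID_SEARCH"
    STATS_CACHE = "FF_STATS_CACHE"

    def is_enabled(self, flag: str) -> bool:
        return os.environ.get(flag, "0") == "1"

os.environ["FF_NORMALIZED_STORAGE"] = "1"
flags = FeatureFlags()
print(flags.is_enabled(FeatureFlags.NORMALIZED_STORAGE))  # True
print(flags.is_enabled(FeatureFlags.HYBRID_SEARCH))       # False
```

Defaulting every flag to off keeps the legacy path authoritative until each phase's exit criteria are met, which is what makes the rollback tree in section 5.1 workable.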

Phase 1: Normalized Storage (Week 2)

Dependencies: Phase 0
Owner: Data Engineer

  1. Implement NormalizedChromaStore
  2. Create anomaly junction collection
  3. Implement anomaly normalization
  4. Update IVectorStore interface with new signatures
  5. Create AdaptiveVectorStore for feature-flag switching
  6. Write integration tests

Deliverables:

  • src/storage/normalized_chroma_store.py
  • src/storage/adaptive_store.py
  • Updated IVectorStore interface

Exit Criteria:

  • Normalized storage stores/retrieves correctly
  • Anomaly enum normalization works
  • Dual-write to legacy + normalized active
  • All tests pass with both stores

Phase 2: Indexed Queries (Week 3)

Dependencies: Phase 1
Owner: Backend Architect

  1. Complete SQLite schema with indexes
  2. Implement FTS5 virtual table and triggers
  3. Implement get_latest_indexed() using SQLite
  4. Implement get_top_ranked_indexed() using SQLite
  5. Update ChromaStore to delegate to SQLite
  6. Implement pagination (offset/limit)

Deliverables:

  • Complete SQLiteStore implementation
  • ChromaStore updated with indexed queries
  • Pagination support in interface

Exit Criteria:

  • get_latest uses SQLite index (verify with EXPLAIN)
  • get_top_ranked uses SQLite index
  • Pagination works correctly
  • No full-collection scans in hot path
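The FTS5 deliverable from steps 1-2 can be exercised in isolation. A sketch with illustrative column names, assuming the bundled SQLite is built with FTS5 (the default in CPython's `sqlite3`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE news_items (id INTEGER PRIMARY KEY, title TEXT, content_text TEXT);

    -- External-content FTS5 table kept in sync by an insert trigger
    -- (production also needs AFTER UPDATE / AFTER DELETE triggers)
    CREATE VIRTUAL TABLE news_fts USING fts5(
        title, content_text, content='news_items', content_rowid='id'
    );

    CREATE TRIGGER news_ai AFTER INSERT ON news_items BEGIN
        INSERT INTO news_fts(rowid, title, content_text)
        VALUES (new.id, new.title, new.content_text);
    END;
""")
conn.execute(
    "INSERT INTO news_items(title, content_text) VALUES (?, ?)",
    ("WebGPU ships", "Chrome enables WebGPU by default"),
)
# Default unicode61 tokenizer is case-insensitive, so 'webgpu' matches "WebGPU"
hits = conn.execute(
    "SELECT rowid FROM news_fts WHERE news_fts MATCH 'webgpu'"
).fetchall()
print(hits)  # [(1,)]
```

Because the FTS table is external-content, it stores only the index, not a second copy of the documents; the returned rowids join back to `news_items` for the full records.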

Phase 3: Hybrid Search (Week 4)

Dependencies: Phase 2
Owner: Backend Architect

  1. Implement HybridSearchStrategy with RRF
  2. Integrate SQLite FTS with ChromaDB semantic
  3. Add adaptive scoring (relevance + recency)
  4. Implement search_stream() for large results
  5. Performance test hybrid vs legacy

Deliverables:

  • src/storage/hybrid_search.py
  • Updated ChromaStore.search()

Exit Criteria:

  • Hybrid search returns better results than pure semantic
  • Keyword matches appear first
  • RRF fusion working correctly
  • Threshold filtering applies to semantic results

Phase 4: Stats Cache (Week 5)

Dependencies: Phase 3
Owner: Data Engineer

  1. Implement StatsCache class with incremental updates
  2. Integrate cache with AdaptiveVectorStore
  3. Implement cache invalidation triggers
  4. Add cache warming on startup
  5. Performance test get_stats()

Deliverables:

  • src/storage/stats_cache.py
  • Updated storage layer with cache

Exit Criteria:

  • get_stats returns in <5ms (cached)
  • Cache invalidates correctly on store/delete
  • Cache rebuilds correctly on demand

Phase 5: Validation & Cutover (Week 6)

Dependencies: All previous phases
Owner: QA Engineer

  1. Run full migration test suite
  2. Performance benchmark comparison
  3. Load test with simulated traffic
  4. Validate all Telegram bot commands work
  5. Generate migration completion report
  6. Disable legacy code paths (optional)

Deliverables:

  • Migration completion report
  • Performance benchmark report
  • Legacy code removal (if approved)

Exit Criteria:

  • All tests pass
  • Performance targets met
  • No regression in bot functionality
  • Stakeholder sign-off

8. Open Questions

8.1 Data Migration

| Question | Impact | Priority |
|---|---|---|
| Should we migrate existing comma-joined anomalies to normalized junction records? | Data integrity | HIGH |
| What is the expected dataset size at migration time? | Performance planning | HIGH |
| Can we have downtime for the migration, or must it be zero-downtime? | Rollback strategy | HIGH |
| Should we keep legacy ChromaDB collection or archive it? | Storage costs | MEDIUM |

8.2 Architecture

| Question | Impact | Priority |
|---|---|---|
| Do we want to eventually migrate away from ChromaDB to a more scalable solution (Qdrant, Weaviate)? | Future planning | MEDIUM |
| Should anomalies be stored in ChromaDB or only in SQLite? | Consistency model | MEDIUM |
| Do we need multi-tenancy support for multiple Telegram channels? | Future feature | LOW |

8.3 Operations

| Question | Impact | Priority |
|---|---|---|
| What is the acceptable cache staleness for get_stats? | Consistency vs performance | HIGH |
| Should we implement TTL for old items? | Storage management | MEDIUM |
| Do we need backup/restore procedures for SQLite? | Disaster recovery | HIGH |

8.4 Testing

| Question | Impact | Priority |
|---|---|---|
| Should we implement chaos testing for dual-write failure modes? | Reliability | MEDIUM |
| What is the acceptable test coverage threshold? | Quality | HIGH |
| Do we need integration tests with real ChromaDB instance? | Confidence | HIGH |

9. Appendix

A. File Structure After Migration

src/
├── config/
│   ├── __init__.py
│   ├── feature_flags.py       # NEW
│   └── settings.py
├── storage/
│   ├── __init__.py
│   ├── base.py                 # UPDATED: Evolved interface
│   ├── chroma_store.py          # UPDATED: Delegates to SQLite
│   ├── normalized_chroma_store.py  # NEW
│   ├── sqlite_store.py         # NEW
│   ├── hybrid_search.py        # NEW
│   ├── stats_cache.py          # NEW
│   ├── adaptive_store.py       # NEW
│   ├── dual_writer.py          # NEW
│   ├── compat.py               # NEW
│   └── migrations/             # NEW
│       ├── __init__.py
│       ├── v1_normalize_anomalies.py
│       └── v2_add_indexes.py
├── processor/
│   ├── __init__.py
│   ├── dto.py                  # UPDATED: New DTOs
│   ├── anomaly_types.py        # NEW
│   └── ...
├── crawlers/
│   ├── __init__.py
│   └── ...
├── bot/
│   ├── __init__.py
│   ├── handlers.py             # UPDATED: Use new pagination
│   └── ...
└── main.py                     # UPDATED: Initialize new stores

B. ChromaDB Version Consideration

Note: The current implementation uses ChromaDB's upsert with metadata. ChromaDB has limitations:

  • No native sorting by metadata fields
  • No ordered pagination (get() accepts offset/limit, but without a sort there is no stable page order)
  • where clause filtering is limited

The SQLite shadow database addresses these limitations while maintaining ChromaDB for vector operations.

C. Monitoring Queries for Production

-- Check SQLite index usage
EXPLAIN QUERY PLAN
SELECT * FROM news_items
ORDER BY timestamp DESC
LIMIT 10;

-- Check FTS index
EXPLAIN QUERY PLAN
SELECT * FROM news_fts
WHERE news_fts MATCH 'WebGPU';

-- Check cache freshness
SELECT key, total_count, last_updated
FROM stats_cache;

-- Check anomaly distribution
SELECT a.name, COUNT(*) as count
FROM news_anomalies na
JOIN anomaly_types a ON na.anomaly_id = a.id
GROUP BY a.name
ORDER BY count DESC;

10. Approval

| Role | Name | Date | Signature |
|---|---|---|---|
| Senior Architecture Engineer | | | |
| Project Manager | | | |
| QA Lead | | | |
| DevOps Lead | | | |

Document generated for team implementation review.