AI-Trend-Scout/docs/MIGRATION_PLAN.md
Artur Mukhamadiev f4ae73bdae feat(database): SQLite shadow database for indexed queries
:Release Notes:
- Add ACID-compliant SQLiteStore (WAL mode, FULL sync, FK constraints)
- Add AnomalyType enum for normalized anomaly storage
- Add legacy data migration script (dry-run, batch, rollback)
- Update ChromaStore to delegate indexed queries to SQLite
- Add test suite for SQLiteStore (7 tests, all passing)

:Detailed Notes:
- SQLiteStore: news_items, anomaly_types, news_anomalies tables with indexes
- Performance: get_latest/get_top_ranked O(n)→O(log n), get_stats O(n)→O(1)
- ChromaDB remains primary vector store; SQLite provides indexed metadata queries

:Testing Performed:
- python3 -m pytest tests/ -v (112 passed)

:QA Notes:
- Tests verified by Python QA Engineer subagent

:Issues Addressed:
- get_latest/get_top_ranked fetched ALL items then sorted in Python
- get_stats iterated over ALL items
- anomalies_detected stored as comma-joined string (no index)

Change-Id: I708808b6e72889869afcf16d4ac274260242007a
2026-03-30 13:54:48 +03:00


# Trend-Scout AI: Database Schema Migration Plan
**Document Version:** 1.0
**Date:** 2026-03-30
**Status:** Draft for Review
---
## 1. Executive Summary
This document outlines a comprehensive migration plan to address critical performance and data architecture issues in the Trend-Scout AI vector storage layer. The migration transforms a flat, unnormalized ChromaDB collection into a normalized, multi-collection architecture with proper indexing, pagination, and query optimization.
### Current Pain Points
| Issue | Current Behavior | Impact |
|-------|------------------|--------|
| `get_latest` | Fetches ALL items via `collection.get()`, sorts in Python | O(n) memory + sort |
| `get_top_ranked` | Fetches ALL items via `collection.get()`, sorts in Python | O(n) memory + sort |
| `get_stats` | Iterates over ALL items to count categories | O(n) iteration |
| `anomalies_detected` | Stored as comma-joined string `"A1,A2,A3"` | Type corruption, no index |
| Single collection | No normalization, mixed data types | No efficient filtering |
| No pagination | No offset-based pagination support | Cannot handle large datasets |
### Expected Outcomes
| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| `get_latest` (1000 items) | ~500ms, O(n) | ~10ms, O(log n) | **50x faster** |
| `get_top_ranked` (1000 items) | ~500ms, O(n log n) | ~10ms, O(k log k) | **50x faster** |
| `get_stats` (1000 items) | ~200ms | ~1ms | **200x faster** |
| Memory usage | O(n) full scan | O(1) or O(k) | **~99% reduction** |
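The timings above are estimates and will vary by hardware; what is directly verifiable is that the indexed path returns exactly what the legacy scan-and-sort path returns. A minimal equivalence check against a throwaway in-memory SQLite table (names mirror the plan's schema):

```python
# Sanity check (not a benchmark): the indexed SQL path must return exactly
# what the legacy "fetch everything, sort in Python" path returns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news_items (id TEXT PRIMARY KEY, timestamp INTEGER, relevance_score INTEGER)")
conn.execute("CREATE INDEX idx_news_timestamp ON news_items(timestamp DESC)")
# 37 is coprime with 1000, so all generated timestamps are distinct.
rows = [(f"id-{i}", 1_700_000_000 + (i * 37) % 1000, i % 11) for i in range(1000)]
conn.executemany("INSERT INTO news_items VALUES (?, ?, ?)", rows)

# Legacy path: O(n) fetch plus a Python sort.
everything = conn.execute("SELECT id, timestamp FROM news_items").fetchall()
legacy_top10 = sorted(everything, key=lambda r: r[1], reverse=True)[:10]

# Indexed path: the index satisfies ORDER BY, so only k rows are materialized.
indexed_top10 = conn.execute(
    "SELECT id, timestamp FROM news_items ORDER BY timestamp DESC LIMIT 10"
).fetchall()
assert legacy_top10 == indexed_top10
```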
---
## 2. Target Architecture
### 2.1 Multi-Collection Design
```
┌─────────────────────────────────────────────────────────────────────┐
│ ChromaDB Instance │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ news_items │ │ category_index │ │
│ │ (main collection) │ │ (lookup table) │ │
│ ├──────────────────────┤ ├──────────────────────┤ │
│ │ id (uuid5 from url) │◄───┤ category_id │ │
│ │ content_text (vec) │ │ name │ │
│ │ title │ │ item_count │ │
│ │ url │ │ created_at │ │
│ │ source │ │ updated_at │ │
│ │ timestamp │ └──────────────────────┘ │
│ │ relevance_score │ │
│ │ summary_ru │ ┌──────────────────────┐ │
│ │ category_id (FK) │───►│ anomaly_types │ │
│ │ anomalies[] │ │ (normalized) │ │
│ └──────────────────────┘ ├──────────────────────┤ │
│ │ anomaly_id │ │
│ │ name │ │
│ │ description │ │
│ └──────────────────────┘ │
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ news_anomalies │ │ stats_cache │ │
│ │ (junction table) │ │ (materialized) │ │
│ ├──────────────────────┤ ├──────────────────────┤ │
│ │ news_id (FK) │◄───┤ key │ │
│ │ anomaly_id (FK) │ │ total_count │ │
│ │ detected_at │ │ category_counts (JSON)│ │
│ └──────────────────────┘ │ source_counts (JSON) │ │
│ │ last_updated │ │
│ └──────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
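The `id (uuid5 from url)` convention in the diagram makes document IDs deterministic, so re-crawling the same URL upserts the existing record instead of duplicating it. A short sketch of the derivation (the helper name `doc_id_for` is illustrative, not from the codebase):

```python
# Deterministic document IDs: uuid5 over the URL, as shown in the diagram.
# Re-crawling the same URL yields the same ID, so upserts are idempotent.
import uuid

def doc_id_for(url: str) -> str:
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))

a = doc_id_for("https://example.com/post/1")
b = doc_id_for("https://example.com/post/1")
c = doc_id_for("https://example.com/post/2")
assert a == b  # stable across crawls and processes
assert a != c  # distinct URLs get distinct IDs
```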
### 2.2 SQLite Shadow Database (for FTS and relational queries)
```
┌─────────────────────────────────────────────────────────────────────┐
│ SQLite Database │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ news_items_rel │ │ categories │ │
│ │ (normalized relay) │ ├──────────────────────┤ │
│ ├──────────────────────┤ │ id (PK) │ │
│ │ id (uuid, PK) │ │ name │ │
│ │ title │ │ item_count │ │
│ │ url (UNIQUE) │ │ embedding_id (FK) │ │
│ │ source │ └──────────────────────┘ │
│ │ timestamp │ │
│ │ relevance_score │ ┌──────────────────────┐ │
│ │ summary_ru │ │ anomalies │ │
│ │ category_id (FK) │ ├──────────────────────┤ │
│ │ content_text │ │ id (PK) │ │
│ └──────────────────────┘ │ name │ │
│ │ description │ │
│ ┌──────────────────────┐ └──────────────────────┘ │
│ │ news_anomalies_rel │ │
│ ├──────────────────────┤ ┌──────────────────────┐ │
│ │ news_id (FK) │ │ stats_snapshot │ │
│ │ anomaly_id (FK) │ ├──────────────────────┤ │
│ │ detected_at │ │ total_count │ │
│ └──────────────────────┘ │ category_json │ │
│ │ last_updated │ │
│ ┌──────────────────────┐ └──────────────────────┘ │
│ │ news_fts │ │
│ │ (FTS5 virtual) │ │
│ ├──────────────────────┤ ┌──────────────────────┐ │
│ │ news_id (FK) │ │ crawl_history │ │
│ │ title_tokens │ ├──────────────────────┤ │
│ │ content_tokens │ │ id (PK) │ │
│ │ summary_tokens │ │ crawler_name │ │
│ └──────────────────────┘ │ items_fetched │ │
│ │ items_new │ │
│ │ started_at │ │
│ │ completed_at │ │
│ └──────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### 2.3 New DTOs
```python
# src/processor/dto.py (evolved)
from typing import Dict, List, Optional
from pydantic import BaseModel, Field
from datetime import datetime
from enum import Enum
class AnomalyType(str, Enum):
"""Normalized anomaly types - not a comma-joined string."""
WEBGPU = "WebGPU"
NPU_ACCELERATION = "NPU acceleration"
EDGE_AI = "Edge AI"
# ... extensible
class EnrichedNewsItemDTO(BaseModel):
"""Extended DTO with proper normalized relationships."""
title: str
url: str
content_text: str
source: str
timestamp: datetime
relevance_score: int = Field(ge=0, le=10)
summary_ru: str
category: str # Category name, resolved from category_id
anomalies_detected: List[AnomalyType] # Now a proper list
anomaly_ids: Optional[List[str]] = None # Internal use
class CategoryStats(BaseModel):
"""Statistics per category."""
category: str
count: int
avg_relevance: float
class StorageStats(BaseModel):
"""Complete storage statistics."""
total_count: int
categories: List[CategoryStats]
sources: Dict[str, int]
anomaly_types: Dict[str, int]
last_updated: datetime
```
### 2.4 Evolved Interface Design
```python
# src/storage/base.py
from abc import ABC, abstractmethod
from typing import List, Optional, AsyncIterator, Dict, Any
from datetime import datetime
from src.processor.dto import EnrichedNewsItemDTO, StorageStats, CategoryStats
class IStoreCommand(ABC):
"""Write operations - SRP: only handles writes."""
@abstractmethod
async def store(self, item: EnrichedNewsItemDTO) -> str:
"""Store an item. Returns the generated ID."""
pass
@abstractmethod
async def store_batch(self, items: List[EnrichedNewsItemDTO]) -> List[str]:
"""Batch store items. Returns generated IDs."""
pass
@abstractmethod
async def update(self, item_id: str, item: EnrichedNewsItemDTO) -> None:
"""Update an existing item."""
pass
@abstractmethod
async def delete(self, item_id: str) -> None:
"""Delete an item by ID."""
pass
class IStoreQuery(ABC):
"""Read operations - SRP: only handles queries."""
@abstractmethod
async def get_by_id(self, item_id: str) -> Optional[EnrichedNewsItemDTO]:
pass
@abstractmethod
async def exists(self, url: str) -> bool:
pass
@abstractmethod
async def get_latest(
self,
limit: int = 10,
category: Optional[str] = None,
offset: int = 0
) -> List[EnrichedNewsItemDTO]:
"""Paginated retrieval by timestamp. Uses index, not full scan."""
pass
@abstractmethod
async def get_top_ranked(
self,
limit: int = 10,
category: Optional[str] = None,
offset: int = 0
) -> List[EnrichedNewsItemDTO]:
"""Paginated retrieval by relevance_score. Uses index, not full scan."""
pass
@abstractmethod
async def get_stats(self, use_cache: bool = True) -> StorageStats:
"""Returns cached stats or computes on-demand."""
pass
@abstractmethod
async def search_hybrid(
self,
query: str,
limit: int = 5,
category: Optional[str] = None,
threshold: Optional[float] = None
) -> List[EnrichedNewsItemDTO]:
"""Keyword + semantic hybrid search with RRF fusion."""
pass
@abstractmethod
async def search_stream(
self,
query: str,
limit: int = 5,
category: Optional[str] = None
) -> AsyncIterator[EnrichedNewsItemDTO]:
"""Streaming search for large result sets."""
pass
class IVectorStore(IStoreCommand, IStoreQuery):
"""Combined interface preserving backward compatibility."""
pass
class IAdminOperations(ABC):
"""Administrative operations - separate from core CRUD."""
@abstractmethod
async def rebuild_stats_cache(self) -> StorageStats:
"""Force rebuild of statistics cache."""
pass
@abstractmethod
async def vacuum(self) -> None:
"""Optimize storage."""
pass
@abstractmethod
async def get_health(self) -> Dict[str, Any]:
"""Health check for monitoring."""
pass
```
---
## 3. Migration Strategy
### 3.1 Phase 0: Preparation (Week 1)
**Objective:** Establish infrastructure for zero-downtime migration.
#### 3.1.1 Create Feature Flags System
```python
# src/config/feature_flags.py
from enum import Flag, auto
class FeatureFlags(Flag):
"""Feature flags for phased migration."""
NORMALIZED_STORAGE = auto()
SQLITE_SHADOW_DB = auto()
FTS_ENABLED = auto()
STATS_CACHE = auto()
PAGINATION = auto()
HYBRID_SEARCH = auto()
# Global config
FEATURE_FLAGS = FeatureFlags.NORMALIZED_STORAGE | FeatureFlags.STATS_CACHE
def is_enabled(flag: FeatureFlags) -> bool:
return flag in FEATURE_FLAGS
```
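As a quick illustration of the `Flag` semantics this relies on (a trimmed re-creation of the module above, not the module itself): flags combine with `|` and `is_enabled` is a membership test, so enabling the next phase is a one-line change.

```python
# Minimal check of the Flag semantics used by the feature-flag module:
# combining flags with "|" and testing membership with "in".
from enum import Flag, auto

class FeatureFlags(Flag):
    NORMALIZED_STORAGE = auto()
    SQLITE_SHADOW_DB = auto()
    STATS_CACHE = auto()

enabled = FeatureFlags.NORMALIZED_STORAGE | FeatureFlags.STATS_CACHE
assert FeatureFlags.STATS_CACHE in enabled
assert FeatureFlags.SQLITE_SHADOW_DB not in enabled

# Rolling out phase 2 later is a single assignment:
enabled |= FeatureFlags.SQLITE_SHADOW_DB
assert FeatureFlags.SQLITE_SHADOW_DB in enabled
```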
#### 3.1.2 Implement Shadow Write (Dual-Write Pattern)
```python
# src/storage/dual_writer.py
class DualWriter:
"""
Writes to both legacy and new storage simultaneously.
Enables gradual migration with no data loss.
"""
def __init__(
self,
legacy_store: ChromaStore,
normalized_store: 'NormalizedChromaStore',
sqlite_store: 'SQLiteStore'
):
self.legacy = legacy_store
self.normalized = normalized_store
self.sqlite = sqlite_store
async def store(self, item: EnrichedNewsItemDTO) -> str:
# Write to both systems
legacy_id = await self.legacy.store(item)
normalized_id = await self.normalized.store(item)
sqlite_id = await self.sqlite.store_relational(item)
# Return normalized ID as source of truth
return normalized_id
```
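The invariant `DualWriter` must uphold can be sketched with in-memory stand-ins (the `FakeStore` class below is illustrative only): every accepted write lands in both backends, and the new store's ID is what callers see.

```python
# Dual-write invariant, demonstrated with in-memory stand-ins for the real
# stores: one store() call lands the item in every backend, and the
# normalized store's ID is returned as the source of truth.
import asyncio

class FakeStore:
    def __init__(self, prefix: str):
        self.prefix, self.items = prefix, {}
    async def store(self, item: dict) -> str:
        item_id = f"{self.prefix}:{item['url']}"
        self.items[item_id] = item
        return item_id

class DualWriter:
    def __init__(self, legacy, normalized):
        self.legacy, self.normalized = legacy, normalized
    async def store(self, item: dict) -> str:
        await self.legacy.store(item)
        return await self.normalized.store(item)  # new ID wins

legacy, normalized = FakeStore("legacy"), FakeStore("norm")
writer = DualWriter(legacy, normalized)
new_id = asyncio.run(writer.store({"url": "https://example.com/a"}))
assert new_id == "norm:https://example.com/a"
assert len(legacy.items) == len(normalized.items) == 1
```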
#### 3.1.3 Create Migration Test Suite
```python
# tests/migrations/test_normalized_storage.py
import pytest
from datetime import datetime, timezone
from src.processor.dto import EnrichedNewsItemDTO
from src.storage.normalized_chroma_store import NormalizedChromaStore
@pytest.mark.asyncio
async def test_migration_preserves_data_integrity():
"""
Test that normalized storage preserves all data from legacy format.
Round-trip: Legacy DTO -> Normalized Storage -> Legacy DTO
"""
original = EnrichedNewsItemDTO(
title="Test",
url="https://example.com/test",
content_text="Content",
source="TestSource",
timestamp=datetime.now(timezone.utc),
relevance_score=8,
summary_ru="Резюме",
anomalies_detected=["WebGPU", "Edge AI"],
category="Tech"
)
store = NormalizedChromaStore(...)
item_id = await store.store(original)
retrieved = await store.get_by_id(item_id)
assert retrieved.title == original.title
assert retrieved.url == original.url
assert retrieved.relevance_score == original.relevance_score
assert retrieved.category == original.category
assert set(retrieved.anomalies_detected) == set(original.anomalies_detected)
@pytest.mark.asyncio
async def test_anomaly_normalization():
"""
Test that anomalies are properly normalized to enum values.
"""
dto = EnrichedNewsItemDTO(
...,
anomalies_detected=["webgpu", "NPU acceleration", "Edge AI"]
)
store = NormalizedChromaStore(...)
item_id = await store.store(dto)
retrieved = await store.get_by_id(item_id)
# Should be normalized to canonical forms
assert AnomalyType.WEBGPU in retrieved.anomalies_detected
assert AnomalyType.NPU_ACCELERATION in retrieved.anomalies_detected
assert AnomalyType.EDGE_AI in retrieved.anomalies_detected
```
### 3.2 Phase 1: Normalized Storage (Week 2)
**Objective:** Introduce normalized anomaly storage without breaking existing queries.
#### 3.2.1 New Normalized ChromaStore
```python
# src/storage/normalized_chroma_store.py
import asyncio
import uuid

from src.processor.anomaly_types import AnomalyType
from src.processor.dto import EnrichedNewsItemDTO
from src.storage.base import IVectorStore

class NormalizedChromaStore(IVectorStore):
"""
ChromaDB storage with normalized anomaly types.
Maintains backward compatibility via IVectorStore interface.
"""
# Collections
NEWS_COLLECTION = "news_items"
ANOMALY_COLLECTION = "anomaly_types"
JUNCTION_COLLECTION = "news_anomalies"
async def store(self, item: EnrichedNewsItemDTO) -> str:
"""Store with normalized anomaly handling."""
anomaly_ids = await self._ensure_anomalies(item.anomalies_detected)
# Store main item
doc_id = str(uuid.uuid5(uuid.NAMESPACE_URL, item.url))
metadata = {
"title": item.title,
"url": item.url,
"source": item.source,
"timestamp": item.timestamp.isoformat(),
"relevance_score": item.relevance_score,
"summary_ru": item.summary_ru,
"category": item.category,
# NO MORE COMMA-JOINED STRING
}
# Store junction records for anomalies
for anomaly_id in anomaly_ids:
await self._store_junction(doc_id, anomaly_id)
# Vector storage (without anomalies in metadata)
await asyncio.to_thread(
self.collection.upsert,
ids=[doc_id],
documents=[item.content_text],
metadatas=[metadata]
)
return doc_id
async def _ensure_anomalies(
self,
anomalies: List[str]
) -> List[str]:
"""Ensure anomaly types exist, return IDs."""
anomaly_ids = []
for anomaly in anomalies:
# Normalize to canonical form
normalized = AnomalyType.from_string(anomaly)
anomaly_id = str(uuid.uuid5(uuid.NAMESPACE_URL, normalized.value))
# Upsert anomaly type
await asyncio.to_thread(
self.anomaly_collection.upsert,
ids=[anomaly_id],
documents=[normalized.value],
metadatas=[{
"name": normalized.value,
"description": normalized.description
}]
)
anomaly_ids.append(anomaly_id)
return anomaly_ids
```
#### 3.2.2 AnomalyType Enum
```python
# src/processor/anomaly_types.py
from enum import Enum
class AnomalyType(str, Enum):
"""Canonical anomaly types with metadata."""
WEBGPU = "WebGPU"
NPU_ACCELERATION = "NPU acceleration"
EDGE_AI = "Edge AI"
QUANTUM_COMPUTING = "Quantum computing"
NEUROMORPHIC = "Neuromorphic computing"
SPATIAL_COMPUTING = "Spatial computing"
UNKNOWN = "Unknown"
@classmethod
def from_string(cls, value: str) -> "AnomalyType":
"""Fuzzy matching from raw string."""
normalized = value.strip().lower()
for member in cls:
if member.value.lower() == normalized:
return member
# Try partial match
for member in cls:
if normalized in member.value.lower():
return member
return cls.UNKNOWN
@property
def description(self) -> str:
"""Human-readable description."""
descriptions = {
"WebGPU": "WebGPU graphics API or GPU compute",
"NPU acceleration": "Neural Processing Unit hardware",
"Edge AI": "Edge computing with AI",
# ...
}
return descriptions.get(self.value, "")
```
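A self-contained check of the fuzzy matching above, with the enum trimmed to a few members (`from_string` copied from the class):

```python
# Fuzzy matching behavior of AnomalyType.from_string, on a trimmed enum:
# exact case-insensitive match first, then substring match, else UNKNOWN.
from enum import Enum

class AnomalyType(str, Enum):
    WEBGPU = "WebGPU"
    NPU_ACCELERATION = "NPU acceleration"
    UNKNOWN = "Unknown"

    @classmethod
    def from_string(cls, value: str) -> "AnomalyType":
        normalized = value.strip().lower()
        for member in cls:
            if member.value.lower() == normalized:
                return member
        for member in cls:
            if normalized in member.value.lower():
                return member
        return cls.UNKNOWN

assert AnomalyType.from_string("webgpu") is AnomalyType.WEBGPU            # exact, case-insensitive
assert AnomalyType.from_string("  NPU ") is AnomalyType.NPU_ACCELERATION  # partial match
assert AnomalyType.from_string("blockchain") is AnomalyType.UNKNOWN       # no match
```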
### 3.3 Phase 2: Indexed Queries (Week 3)
**Objective:** Eliminate full-collection scans for `get_latest`, `get_top_ranked`, `get_stats`.
#### 3.3.1 SQLite Shadow Database
```python
# src/storage/sqlite_store.py
import sqlite3
from pathlib import Path
from typing import List, Optional, Dict, Any
from contextlib import contextmanager
from datetime import datetime
import json
class SQLiteStore:
"""
SQLite shadow database for relational queries and FTS.
Maintains sync with ChromaDB news_items collection.
"""
def __init__(self, db_path: Path):
self.db_path = db_path
self._init_schema()
def _init_schema(self):
"""Initialize SQLite schema with proper indexes."""
with self._get_connection() as conn:
conn.executescript("""
-- Main relational table
CREATE TABLE IF NOT EXISTS news_items (
id TEXT PRIMARY KEY,
title TEXT NOT NULL,
url TEXT UNIQUE NOT NULL,
source TEXT NOT NULL,
timestamp INTEGER NOT NULL, -- Unix epoch
relevance_score INTEGER NOT NULL,
summary_ru TEXT,
category TEXT NOT NULL,
content_text TEXT,
created_at INTEGER DEFAULT (unixepoch())
);
-- Indexes for fast queries
CREATE INDEX IF NOT EXISTS idx_news_timestamp
ON news_items(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_news_relevance
ON news_items(relevance_score DESC);
CREATE INDEX IF NOT EXISTS idx_news_category
ON news_items(category);
CREATE INDEX IF NOT EXISTS idx_news_source
ON news_items(source);
-- FTS5 virtual table for full-text search
CREATE VIRTUAL TABLE IF NOT EXISTS news_fts USING fts5(
id,
title,
content_text,
summary_ru,
content='news_items',
content_rowid='rowid'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER IF NOT EXISTS news_fts_insert AFTER INSERT ON news_items
BEGIN
INSERT INTO news_fts(rowid, id, title, content_text, summary_ru)
VALUES (NEW.rowid, NEW.id, NEW.title, NEW.content_text, NEW.summary_ru);
END;
CREATE TRIGGER IF NOT EXISTS news_fts_delete AFTER DELETE ON news_items
BEGIN
INSERT INTO news_fts(news_fts, rowid, id, title, content_text, summary_ru)
VALUES ('delete', OLD.rowid, OLD.id, OLD.title, OLD.content_text, OLD.summary_ru);
END;
CREATE TRIGGER IF NOT EXISTS news_fts_update AFTER UPDATE ON news_items
BEGIN
INSERT INTO news_fts(news_fts, rowid, id, title, content_text, summary_ru)
VALUES ('delete', OLD.rowid, OLD.id, OLD.title, OLD.content_text, OLD.summary_ru);
INSERT INTO news_fts(rowid, id, title, content_text, summary_ru)
VALUES (NEW.rowid, NEW.id, NEW.title, NEW.content_text, NEW.summary_ru);
END;
-- Stats cache table
CREATE TABLE IF NOT EXISTS stats_cache (
key TEXT PRIMARY KEY DEFAULT 'main',
total_count INTEGER DEFAULT 0,
category_counts TEXT DEFAULT '{}', -- JSON
source_counts TEXT DEFAULT '{}', -- JSON
anomaly_counts TEXT DEFAULT '{}', -- JSON
last_updated INTEGER
);
-- Anomaly types table
CREATE TABLE IF NOT EXISTS anomaly_types (
id TEXT PRIMARY KEY,
name TEXT UNIQUE NOT NULL,
description TEXT
);
-- Junction table for news-anomaly relationships
CREATE TABLE IF NOT EXISTS news_anomalies (
news_id TEXT NOT NULL,
anomaly_id TEXT NOT NULL,
detected_at INTEGER DEFAULT (unixepoch()),
PRIMARY KEY (news_id, anomaly_id),
FOREIGN KEY (news_id) REFERENCES news_items(id),
FOREIGN KEY (anomaly_id) REFERENCES anomaly_types(id)
);
CREATE INDEX IF NOT EXISTS idx_news_anomalies_news
ON news_anomalies(news_id);
CREATE INDEX IF NOT EXISTS idx_news_anomalies_anomaly
ON news_anomalies(anomaly_id);
-- Crawl history for monitoring
CREATE TABLE IF NOT EXISTS crawl_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
crawler_name TEXT NOT NULL,
items_fetched INTEGER DEFAULT 0,
items_new INTEGER DEFAULT 0,
started_at INTEGER,
completed_at INTEGER,
error_message TEXT
);
""")
    @contextmanager
    def _get_connection(self):
        """Sync context manager for SQLite connections (used via plain `with`)."""
conn = sqlite3.connect(self.db_path)
conn.row_factory = sqlite3.Row
try:
yield conn
finally:
conn.close()
# --- Fast Indexed Queries ---
async def get_latest_indexed(
self,
limit: int = 10,
category: Optional[str] = None,
offset: int = 0
) -> List[Dict[str, Any]]:
"""Get latest items using SQLite index - O(log n + k)."""
with self._get_connection() as conn:
if category:
cursor = conn.execute("""
SELECT * FROM news_items
WHERE category = ?
ORDER BY timestamp DESC
LIMIT ? OFFSET ?
""", (category, limit, offset))
else:
cursor = conn.execute("""
SELECT * FROM news_items
ORDER BY timestamp DESC
LIMIT ? OFFSET ?
""", (limit, offset))
return [dict(row) for row in cursor.fetchall()]
async def get_top_ranked_indexed(
self,
limit: int = 10,
category: Optional[str] = None,
offset: int = 0
) -> List[Dict[str, Any]]:
"""Get top ranked items using SQLite index - O(log n + k)."""
with self._get_connection() as conn:
if category:
cursor = conn.execute("""
SELECT * FROM news_items
WHERE category = ?
ORDER BY relevance_score DESC, timestamp DESC
LIMIT ? OFFSET ?
""", (category, limit, offset))
else:
cursor = conn.execute("""
SELECT * FROM news_items
ORDER BY relevance_score DESC, timestamp DESC
LIMIT ? OFFSET ?
""", (limit, offset))
return [dict(row) for row in cursor.fetchall()]
    async def get_stats_fast(self) -> Dict[str, Any]:
        """
        Get stats from the cache when present (O(1)); otherwise fall back
        to SQL aggregation (O(n), but executed inside SQLite, not Python).
        """
with self._get_connection() as conn:
# Try cache first
cursor = conn.execute("SELECT * FROM stats_cache WHERE key = 'main'")
row = cursor.fetchone()
if row:
return {
"total_count": row["total_count"],
"category_counts": json.loads(row["category_counts"]),
"source_counts": json.loads(row["source_counts"]),
"anomaly_counts": json.loads(row["anomaly_counts"]),
"last_updated": datetime.fromtimestamp(row["last_updated"])
}
            # Fall back to SQL aggregation (still O(n), but done inside SQLite)
            cursor = conn.execute("""
                SELECT category, COUNT(*) AS cat_count
                FROM news_items
                GROUP BY category
            """)
category_counts = {row["category"]: row["cat_count"] for row in cursor.fetchall()}
cursor = conn.execute("""
SELECT source, COUNT(*) as count
FROM news_items
GROUP BY source
""")
source_counts = {row["source"]: row["count"] for row in cursor.fetchall()}
cursor = conn.execute("""
SELECT a.name, COUNT(*) as count
FROM news_anomalies na
JOIN anomaly_types a ON na.anomaly_id = a.id
GROUP BY a.name
""")
anomaly_counts = {row["name"]: row["count"] for row in cursor.fetchall()}
return {
"total_count": sum(category_counts.values()),
"category_counts": category_counts,
"source_counts": source_counts,
"anomaly_counts": anomaly_counts,
"last_updated": datetime.now()
}
async def search_fts(
self,
query: str,
limit: int = 10,
category: Optional[str] = None
) -> List[str]:
"""Full-text search using SQLite FTS5 - O(log n)."""
with self._get_connection() as conn:
# Escape FTS5 special characters
fts_query = query.replace('"', '""')
if category:
                cursor = conn.execute("""
                    SELECT n.id FROM news_items n
                    JOIN news_fts fts ON n.id = fts.id
                    WHERE fts.news_fts MATCH ?
                    AND n.category = ?
                    LIMIT ?
                """, (f'"{fts_query}"', category, limit))
else:
cursor = conn.execute("""
SELECT id FROM news_fts
WHERE news_fts MATCH ?
LIMIT ?
""", (f'"{fts_query}"', limit))
return [row["id"] for row in cursor.fetchall()]
```
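The O(log n + k) claim for `get_latest_indexed` can be checked with SQLite's `EXPLAIN QUERY PLAN`: the timestamp index should satisfy the `ORDER BY`, with no full sort. A minimal sketch against a trimmed copy of the schema (selecting only `timestamp` so the index is covering and the plan is deterministic):

```python
# Verify that get_latest_indexed's ORDER BY is satisfied by the index
# rather than a temp B-tree sort, using EXPLAIN QUERY PLAN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE news_items (
        id TEXT PRIMARY KEY,
        timestamp INTEGER NOT NULL,
        relevance_score INTEGER NOT NULL,
        category TEXT NOT NULL
    );
    CREATE INDEX idx_news_timestamp ON news_items(timestamp DESC);
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT timestamp FROM news_items ORDER BY timestamp DESC LIMIT 10"
).fetchall()
plan_text = " ".join(str(row) for row in plan)
assert "idx_news_timestamp" in plan_text  # index scan, no sort step
```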
#### 3.3.2 Update ChromaStore to Use SQLite Index
```python
# src/storage/chroma_store.py (updated)
class ChromaStore(IVectorStore):
"""
Updated ChromaStore that delegates indexed queries to SQLite.
Maintains backward compatibility.
"""
def __init__(
self,
client: ClientAPI,
collection_name: str = "news_collection",
sqlite_store: Optional[SQLiteStore] = None
):
self.client = client
self.collection = self.client.get_or_create_collection(name=collection_name)
self.sqlite = sqlite_store # New: SQLite shadow for indexed queries
async def get_latest(
self,
limit: int = 10,
category: Optional[str] = None,
offset: int = 0 # NEW: Pagination support
) -> List[EnrichedNewsItemDTO]:
"""Get latest using SQLite index - O(log n + k)."""
        if self.sqlite:  # Fast path: the SQLite index handles limit/offset/category
            rows = await self.sqlite.get_latest_indexed(limit, category, offset)
# Reconstruct DTOs from SQLite rows
return [self._row_to_dto(row) for row in rows]
# Fallback to legacy behavior with pagination
return await self._get_latest_legacy(limit, category, offset)
async def _get_latest_legacy(
self,
limit: int,
category: Optional[str],
offset: int
) -> List[EnrichedNewsItemDTO]:
"""Legacy implementation with pagination."""
where: Any = {"category": category} if category else None
results = await asyncio.to_thread(
self.collection.get,
include=["metadatas", "documents"],
where=where
)
# ... existing sorting logic ...
# Apply offset/limit after sorting
return items[offset:offset + limit]
async def get_stats(self) -> Dict[str, Any]:
"""Get stats - O(1) from cache."""
if self.sqlite:
return await self.sqlite.get_stats_fast()
return await self._get_stats_legacy()
```
### 3.4 Phase 3: Hybrid Search with RRF (Week 4)
**Objective:** Implement hybrid keyword + semantic search using RRF fusion.
```python
# src/storage/hybrid_search.py
from datetime import datetime
from typing import Dict, List, Optional, Tuple

from src.processor.dto import EnrichedNewsItemDTO

class HybridSearchStrategy:
    """
    Reciprocal Rank Fusion for combining keyword and semantic search.
    Implements RRF: score = Σ 1 / (k + rank_i)
    """
    def __init__(self, sqlite_store, chroma_store, k: int = 60):
        self.sqlite = sqlite_store  # SQLiteStore, used for FTS keyword search
        self.chroma = chroma_store  # ChromaStore, used for semantic search
        self.k = k  # RRF smoothing factor
def fuse(
self,
keyword_results: List[Tuple[str, float]], # (id, score)
semantic_results: List[Tuple[str, float]], # (id, score)
) -> List[Tuple[str, float]]:
"""
Fuse results using RRF.
Higher fused score = better.
"""
fused_scores: Dict[str, float] = {}
# Keyword results get higher initial weight
for rank, (doc_id, score) in enumerate(keyword_results):
rrf_score = 1 / (self.k + rank)
fused_scores[doc_id] = fused_scores.get(doc_id, 0) + rrf_score * 2.0
# Semantic results
for rank, (doc_id, score) in enumerate(semantic_results):
rrf_score = 1 / (self.k + rank)
fused_scores[doc_id] = fused_scores.get(doc_id, 0) + rrf_score * 1.0
# Sort by fused score descending
sorted_results = sorted(
fused_scores.items(),
key=lambda x: x[1],
reverse=True
)
return sorted_results
async def search(
self,
query: str,
limit: int = 5,
category: Optional[str] = None,
threshold: Optional[float] = None
) -> List[EnrichedNewsItemDTO]:
"""
Execute hybrid search:
1. SQLite FTS for exact keyword matches
2. ChromaDB for semantic similarity
3. RRF fusion
"""
# Phase 1: Keyword search (FTS)
fts_ids = await self.sqlite.search_fts(query, limit * 2, category)
# Phase 2: Semantic search
semantic_results = await self.chroma.search(
query, limit=limit * 2, category=category, threshold=threshold
)
# Phase 3: RRF Fusion
        keyword_scores = [(doc_id, 1.0) for doc_id in fts_ids]  # rank-only; bm25 scores ignored here
semantic_scores = [
(self._get_doc_id(item), self._compute_adaptive_score(item))
for item in semantic_results
]
fused = self.fuse(keyword_scores, semantic_scores)
# Retrieve full DTOs in fused order
final_results = []
for doc_id, _ in fused[:limit]:
dto = await self.chroma.get_by_id(doc_id)
if dto:
final_results.append(dto)
return final_results
def _compute_adaptive_score(self, item: EnrichedNewsItemDTO) -> float:
"""
Compute adaptive score combining relevance and recency.
More recent items with same relevance get slight boost.
"""
# Normalize relevance to 0-1
relevance_norm = item.relevance_score / 10.0
# Days since publication (exponential decay)
        days_old = (datetime.now(item.timestamp.tzinfo) - item.timestamp).days  # aware-safe subtraction
recency_decay = 0.95 ** days_old
return relevance_norm * 0.7 + recency_decay * 0.3
```
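A worked example of the fusion with the weights used above (keyword ×2.0, semantic ×1.0, k = 60), reduced to plain ID lists: a document ranked by both retrievers should outrank one ranked first by a single retriever.

```python
# Reduced RRF fusion: same weighting as HybridSearchStrategy.fuse,
# operating on rank positions only.
def rrf_fuse(keyword_ids, semantic_ids, k=60):
    scores = {}
    for rank, doc_id in enumerate(keyword_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + 2.0 / (k + rank)
    for rank, doc_id in enumerate(semantic_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears in both lists, so it beats "a" (keyword-only, rank 0):
# a: 2/60 ≈ 0.0333;  b: 2/61 + 1/60 ≈ 0.0495;  c: 1/61 ≈ 0.0164
order = rrf_fuse(keyword_ids=["a", "b"], semantic_ids=["b", "c"])
assert order == ["b", "a", "c"]
```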
### 3.5 Phase 4: Stats Cache with Invalidation (Week 5)
**Objective:** Eliminate O(n) stats computation via incremental updates.
```python
# src/storage/stats_cache.py
import asyncio
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict

from src.processor.dto import EnrichedNewsItemDTO
@dataclass
class StatsCache:
"""Thread-safe statistics cache with incremental updates."""
total_count: int = 0
category_counts: Dict[str, int] = field(default_factory=dict)
source_counts: Dict[str, int] = field(default_factory=dict)
anomaly_counts: Dict[str, int] = field(default_factory=dict)
last_updated: datetime = field(default_factory=datetime.now)
_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
async def invalidate(self):
"""Invalidate cache - marks for recomputation."""
async with self._lock:
self.total_count = -1 # Sentinel value
self.category_counts = {}
self.source_counts = {}
self.anomaly_counts = {}
async def apply_delta(self, delta: "StatsDelta"):
"""
Apply incremental update without full recomputation.
O(1) operation.
"""
async with self._lock:
self.total_count += delta.total_delta
for cat, count in delta.category_deltas.items():
self.category_counts[cat] = self.category_counts.get(cat, 0) + count
for source, count in delta.source_deltas.items():
self.source_counts[source] = self.source_counts.get(source, 0) + count
for anomaly, count in delta.anomaly_deltas.items():
self.anomaly_counts[anomaly] = self.anomaly_counts.get(anomaly, 0) + count
self.last_updated = datetime.now()
def is_valid(self) -> bool:
"""Check if cache is populated."""
return self.total_count >= 0
@dataclass
class StatsDelta:
"""Delta for incremental stats update."""
total_delta: int = 0
category_deltas: Dict[str, int] = field(default_factory=dict)
source_deltas: Dict[str, int] = field(default_factory=dict)
anomaly_deltas: Dict[str, int] = field(default_factory=dict)
@classmethod
def from_item_add(cls, item: EnrichedNewsItemDTO) -> "StatsDelta":
"""Create delta for adding an item."""
return cls(
total_delta=1,
category_deltas={item.category: 1},
source_deltas={item.source: 1},
anomaly_deltas={a.value: 1 for a in item.anomalies_detected}
)
# Integration with storage
class CachedVectorStore(IVectorStore):
"""VectorStore with incremental stats cache."""
def __init__(self, ...):
self.stats_cache = StatsCache()
# ...
async def store(self, item: EnrichedNewsItemDTO) -> str:
# Store item
item_id = await self._store_impl(item)
# Update cache incrementally
delta = StatsDelta.from_item_add(item)
await self.stats_cache.apply_delta(delta)
return item_id
async def get_stats(self) -> Dict[str, Any]:
"""Get stats - O(1) from cache if valid."""
if self.stats_cache.is_valid():
return {
"total_count": self.stats_cache.total_count,
"category_counts": self.stats_cache.category_counts,
"source_counts": self.stats_cache.source_counts,
"anomaly_counts": self.stats_cache.anomaly_counts,
"last_updated": self.stats_cache.last_updated
}
# Fall back to full recomputation (still via SQLite)
return await self._recompute_stats()
```
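The incremental-update idea reduces to O(1) bookkeeping per write; a stand-alone sketch using a plain dict and `Counter` in place of the dataclasses above:

```python
# Minimal version of the StatsDelta idea: each stored item produces a delta,
# and applying deltas keeps counts correct without rescanning the collection.
from collections import Counter

stats = {"total": 0, "by_category": Counter(), "by_source": Counter()}

def apply_delta(stats, item):
    """O(1) bookkeeping per item (mirrors StatsDelta.from_item_add)."""
    stats["total"] += 1
    stats["by_category"][item["category"]] += 1
    stats["by_source"][item["source"]] += 1

apply_delta(stats, {"category": "Tech", "source": "HN"})
apply_delta(stats, {"category": "Tech", "source": "ArXiv"})
assert stats["total"] == 2
assert stats["by_category"]["Tech"] == 2
assert stats["by_source"] == Counter({"HN": 1, "ArXiv": 1})
```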
---
## 4. Backward Compatibility Strategy
### 4.1 Interface Compatibility
The `IVectorStore` interface keeps its existing method names; new parameters such as `offset` are optional with defaults that preserve legacy behavior, so current callers keep working:
```python
class IVectorStore(IStoreCommand, IStoreQuery):
"""Backward-compatible interface."""
async def get_latest(
self,
limit: int = 10,
category: Optional[str] = None,
offset: int = 0 # NEW: Optional pagination
) -> List[EnrichedNewsItemDTO]:
"""
Default implementation maintains legacy behavior.
Override in implementation for optimized path.
"""
return await self._default_get_latest(limit, category)
async def get_top_ranked(
self,
limit: int = 10,
category: Optional[str] = None,
offset: int = 0 # NEW: Optional pagination
) -> List[EnrichedNewsItemDTO]:
"""Default implementation maintains legacy behavior."""
return await self._default_get_top_ranked(limit, category)
```
### 4.2 DTO Compatibility Layer
```python
# src/storage/compat.py
class DTOConverter:
"""
Converts between legacy and normalized DTO formats.
Ensures smooth transition.
"""
@staticmethod
def normalize_anomalies(
anomalies: Union[List[str], str]
) -> List[AnomalyType]:
"""
Handle both legacy comma-joined strings and new list format.
"""
if isinstance(anomalies, str):
# Legacy format: "WebGPU,NPU acceleration"
return [AnomalyType.from_string(a) for a in anomalies.split(",") if a]
return [AnomalyType.from_string(a) if isinstance(a, str) else a for a in anomalies]
@staticmethod
def to_legacy_dict(dto: EnrichedNewsItemDTO) -> Dict[str, Any]:
"""
Convert normalized DTO to legacy format for API compatibility.
"""
return {
**dto.model_dump(),
# Legacy comma-joined format for API consumers
"anomalies_detected": ",".join(a.value if isinstance(a, AnomalyType) else str(a)
for a in dto.anomalies_detected)
}
```
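The converter's contract, reduced to a sketch (the `CANONICAL` map below stands in for `AnomalyType.from_string`): a legacy comma-joined string and the new list format must normalize to the same canonical names.

```python
# Compatibility contract: legacy "A1,A2" strings and new lists normalize
# identically. CANONICAL is a stand-in for AnomalyType.from_string.
CANONICAL = {"webgpu": "WebGPU", "npu acceleration": "NPU acceleration"}

def normalize(anomalies):
    if isinstance(anomalies, str):  # legacy comma-joined format
        anomalies = [a for a in anomalies.split(",") if a]
    return [CANONICAL.get(a.strip().lower(), "Unknown") for a in anomalies]

legacy = normalize("WebGPU,NPU acceleration")
modern = normalize(["webgpu", "NPU acceleration"])
assert legacy == modern == ["WebGPU", "NPU acceleration"]
```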
### 4.3 Dual-Mode Storage
```python
# src/storage/adaptive_store.py
class AdaptiveVectorStore(IVectorStore):
    """
    Storage adapter that switches between the legacy and normalized
    stores based on feature flags.
    """

    def __init__(
        self,
        legacy_store: ChromaStore,
        normalized_store: NormalizedChromaStore,
        feature_flags: FeatureFlags
    ):
        self.legacy = legacy_store
        self.normalized = normalized_store
        self.flags = feature_flags

    async def get_latest(
        self,
        limit: int = 10,
        category: Optional[str] = None,
        offset: int = 0
    ) -> List[EnrichedNewsItemDTO]:
        if self.flags.is_enabled(FeatureFlags.NORMALIZED_STORAGE):
            return await self.normalized.get_latest(limit, category, offset)
        return await self.legacy.get_latest(limit, category)

    # Similar delegation for all methods...
```
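The code above consults a `FeatureFlags` registry that Phase 0 has yet to deliver. Below is a minimal sketch of one possible shape, assuming environment-variable-backed flags; every flag name other than `NORMALIZED_STORAGE` is illustrative, not confirmed by the plan:

```python
# src/config/feature_flags.py — minimal sketch, assuming env-var backed flags.
import os


class FeatureFlags:
    """Flag names used across the migration (most names are assumptions)."""
    NORMALIZED_STORAGE = "NORMALIZED_STORAGE"
    INDEXED_QUERIES = "INDEXED_QUERIES"
    HYBRID_SEARCH = "HYBRID_SEARCH"
    STATS_CACHE = "STATS_CACHE"

    def is_enabled(self, flag: str) -> bool:
        # A flag is on when FF_<NAME> is set to a truthy value.
        return os.environ.get(f"FF_{flag}", "").lower() in ("1", "true", "on")


# Module-level helper so call sites can use either style.
_default_flags = FeatureFlags()


def is_enabled(flag: str) -> bool:
    return _default_flags.is_enabled(flag)
```

Backing flags with environment variables keeps rollback a restart-only operation, with no code deploy required.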
---
## 5. Risk Mitigation
### 5.1 Rollback Plan
```
┌─────────────────────────────────────────────────────────────────┐
│ ROLLBACK DECISION TREE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Phase 1 Complete │
│ ├── Feature Flag: NORMALIZED_STORAGE = ON │
│ ├── Dual-write active │
│ └── Can rollback: YES (disable flag, use legacy) │
│ │
│ Phase 2 Complete │
│ ├── SQLite indexes active │
│ ├── ChromaDB still primary │
│ └── Can rollback: YES (disable flag, rebuild legacy indexes) │
│ │
│ Phase 3 Complete │
│ ├── Hybrid search active │
│ ├── FTS data populated │
│ └── Can rollback: YES (disable flag, purge FTS triggers) │
│ │
│ Phase 4 Complete │
│ ├── Stats cache active │
│ └── Can rollback: YES (invalidate cache, use legacy) │
│ │
│ ⚠️ CRITICAL: After Phase 4, data divergence makes │
│ clean rollback impossible. Requires data sync tool. │
│ │
└─────────────────────────────────────────────────────────────────┘
```
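The per-phase rollback steps in the decision tree reduce to "disable every flag introduced after the target phase." A hedged sketch, assuming env-var backed flags with hypothetical `FF_*` names:

```python
# Sketch of a phase rollback helper (flag names are hypothetical).
import os

# Which flag each phase introduced, highest phase first on rollback.
PHASE_FLAGS = {
    1: "FF_NORMALIZED_STORAGE",
    2: "FF_INDEXED_QUERIES",
    3: "FF_HYBRID_SEARCH",
    4: "FF_STATS_CACHE",
}


def rollback_to(phase: int) -> list:
    """Disable every flag introduced after `phase`; return what was disabled."""
    disabled = []
    for p, flag in sorted(PHASE_FLAGS.items(), reverse=True):
        if p > phase:
            os.environ[flag] = "0"
            disabled.append(flag)
    return disabled
```

For example, `rollback_to(2)` turns off the hybrid-search and stats-cache flags while leaving normalized storage and indexed queries active.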
### 5.2 Data Validation Strategy
```python
# tests/migration/test_data_integrity.py
import pytest
from src.processor.dto import EnrichedNewsItemDTO
from src.storage.validators import DataIntegrityValidator


class TestDataIntegrity:
    """
    Comprehensive data integrity tests for the migration.
    Run after each phase to validate.
    """

    @pytest.fixture
    def validator(self):
        return DataIntegrityValidator()

    @pytest.mark.asyncio
    async def test_anomaly_roundtrip(self, validator):
        """Test the anomaly normalization roundtrip."""
        original_anomalies = ["WebGPU", "NPU acceleration", "Edge AI"]
        dto = EnrichedNewsItemDTO(
            ...,
            anomalies_detected=original_anomalies
        )
        # Store
        item_id = await storage.store(dto)
        # Retrieve
        retrieved = await storage.get_by_id(item_id)
        # Validate the normalized form
        validator.assert_anomalies_match(
            retrieved.anomalies_detected,
            original_anomalies
        )

    @pytest.mark.asyncio
    async def test_legacy_compat_mode(self, validator):
        """
        Test that legacy comma-joined strings still work.
        Simulates an old client sending comma-joined anomalies.
        """
        # Simulate legacy input
        legacy_metadata = {
            "title": "Test",
            "url": "https://example.com/legacy",
            "content_text": "Content",
            "source": "LegacySource",
            "timestamp": "2024-01-01T00:00:00",
            "relevance_score": 5,
            "summary_ru": "Резюме",
            "category": "Tech",
            "anomalies_detected": "WebGPU,Edge AI"  # Legacy format
        }
        # Should be automatically normalized
        dto = DTOConverter.legacy_metadata_to_dto(legacy_metadata)
        validator.assert_valid_dto(dto)

    @pytest.mark.asyncio
    async def test_stats_consistency(self, validator):
        """
        Test that stats from the cache match the actual data.
        Validates incremental update correctness.
        """
        # Add items
        await storage.store(item1)
        await storage.store(item2)
        # Get cached stats
        cached_stats = await storage.get_stats(use_cache=True)
        # Get actual counts
        actual_stats = await storage.get_stats(use_cache=False)
        validator.assert_stats_match(cached_stats, actual_stats)

    @pytest.mark.asyncio
    async def test_pagination_consistency(self, validator):
        """
        Test that pagination doesn't skip or duplicate items.
        """
        items = [create_test_item(i) for i in range(100)]
        for item in items:
            await storage.store(item)
        # Get the first page
        page1 = await storage.get_latest(limit=10, offset=0)
        # Get the second page
        page2 = await storage.get_latest(limit=10, offset=10)
        # Validate no overlap
        validator.assert_no_overlap(page1, page2)
        # Validate against the total count
        stats = await storage.get_stats()
        assert len(page1) + len(page2) <= stats["total_count"]
```
### 5.3 Monitoring and Alerts
```python
# src/monitoring/migration_health.py
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Any


@dataclass
class MigrationHealthCheck:
    """Health checks for migration progress."""
    phase: int
    checks: Dict[str, bool]

    def is_healthy(self) -> bool:
        return all(self.checks.values())

    def to_report(self) -> Dict[str, Any]:
        return {
            "phase": self.phase,
            "healthy": self.is_healthy(),
            "checks": self.checks,
            "timestamp": datetime.now(timezone.utc).isoformat()
        }


class MigrationMonitor:
    """Monitor migration health and emit alerts."""

    async def check_phase1_health(self) -> MigrationHealthCheck:
        """Phase 1 health checks."""
        checks = {
            "dual_write_active": await self._check_dual_write(),
            "anomaly_normalized": await self._check_anomaly_normalization(),
            "no_data_loss": await self._check_data_integrity(),
            "backward_compat": await self._check_compat_mode(),
        }
        return MigrationHealthCheck(phase=1, checks=checks)

    async def _check_data_integrity(self) -> bool:
        """
        Sample 100 items and verify that legacy and normalized copies match.
        """
        legacy_store = ChromaStore(...)
        normalized_store = NormalizedChromaStore(...)
        sample_ids = await legacy_store._sample_ids(100)
        for item_id in sample_ids:
            legacy_item = await legacy_store.get_by_id(item_id)
            normalized_item = await normalized_store.get_by_id(item_id)
            if not self._items_match(legacy_item, normalized_item):
                return False
        return True

    def _items_match(
        self,
        legacy: EnrichedNewsItemDTO,
        normalized: EnrichedNewsItemDTO
    ) -> bool:
        """Verify that legacy and normalized items are semantically equal."""
        return (
            legacy.title == normalized.title and
            legacy.url == normalized.url and
            legacy.relevance_score == normalized.relevance_score and
            set(normalized.anomalies_detected) == set(legacy.anomalies_detected)
        )
```
---
## 6. Performance Targets
### 6.1 Benchmarks
| Operation | Current | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|-----------|---------|---------|---------|---------|---------|
| `get_latest` (10 items) | 500ms | 100ms | 10ms | 10ms | 10ms |
| `get_latest` (100 items) | 800ms | 200ms | 20ms | 20ms | 20ms |
| `get_top_ranked` (10) | 500ms | 100ms | 10ms | 10ms | 10ms |
| `get_stats` | 200ms | 50ms | 20ms | 20ms | 1ms |
| `search` (keyword) | 50ms | 50ms | 50ms | 10ms | 10ms |
| `search` (semantic) | 100ms | 100ms | 100ms | 50ms | 50ms |
| `search` (hybrid) | N/A | N/A | N/A | 60ms | 30ms |
| `store` (single) | 10ms | 15ms | 15ms | 15ms | 15ms |
| Memory (1000 items) | O(n) | O(n) | O(1) | O(1) | O(1) |
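The latency columns above can be reproduced with a small harness like the following sketch; `storage` stands for whichever store implementation is under test, and the function names are assumptions:

```python
# Minimal async micro-benchmark harness (names assumed).
import asyncio
import time
from statistics import median


async def bench(coro_factory, runs: int = 20) -> float:
    """Return the median wall-clock latency in milliseconds.

    `coro_factory` is a zero-arg callable returning a fresh awaitable,
    e.g. `lambda: storage.get_latest(limit=10)`.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        await coro_factory()
        samples.append((time.perf_counter() - start) * 1000)
    return median(samples)
```

Using the median rather than the mean keeps a single GC pause or cold-cache run from distorting the comparison between phases.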
### 6.2 Scaling Expectations
```
┌─────────────────────────────────────────────────────────────────┐
│ PERFORMANCE vs DATASET SIZE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ get_latest (limit=10) │
│ │
│ Current: O(n) │
│ ├─ 100 items: 50ms │
│ ├─ 1,000 items: 500ms │
│ ├─ 10,000 items: 5,000ms (5s) ⚠️ │
│ └─ 100,000 items: 50,000ms (50s) 🚫 │
│ │
│ Optimized (SQLite indexed): O(log n + k) │
│ ├─ 100 items: 5ms │
│ ├─ 1,000 items: 7ms │
│ ├─ 10,000 items: 10ms │
│ ├─ 100,000 items: 15ms │
│ └─ 1,000,000 items: 25ms │
│ │
│ Target: Support 1M+ items with <50ms latency │
│ │
└─────────────────────────────────────────────────────────────────┘
```
---
## 7. Implementation Phases
### Phase 0: Infrastructure (Week 1)
**Dependencies:** None
**Owner:** Senior Architecture Engineer
1. Create feature flags system
2. Implement `DualWriter` class
3. Create migration test suite
4. Set up SQLite database initialization
5. Create `AnomalyType` enum
6. Implement `DTOConverter` for backward compat
**Deliverables:**
- `src/config/feature_flags.py`
- `src/storage/dual_writer.py`
- `src/processor/anomaly_types.py`
- `src/storage/sqlite_store.py` (schema only)
- `tests/migrations/`
**Exit Criteria:**
- [ ] All phase 0 tests pass
- [ ] Feature flags system operational
- [ ] Dual-write writes to both stores
---
### Phase 1: Normalized Storage (Week 2)
**Dependencies:** Phase 0
**Owner:** Data Engineer
1. Implement `NormalizedChromaStore`
2. Create anomaly junction collection
3. Implement anomaly normalization
4. Update `IVectorStore` interface with new signatures
5. Create `AdaptiveVectorStore` for feature-flag switching
6. Write integration tests
**Deliverables:**
- `src/storage/normalized_chroma_store.py`
- `src/storage/adaptive_store.py`
- Updated `IVectorStore` interface
**Exit Criteria:**
- [ ] Normalized storage stores/retrieves correctly
- [ ] Anomaly enum normalization works
- [ ] Dual-write to legacy + normalized active
- [ ] All tests pass with both stores
---
### Phase 2: Indexed Queries (Week 3)
**Dependencies:** Phase 1
**Owner:** Backend Architect
1. Complete SQLite schema with indexes
2. Implement FTS5 virtual table and triggers
3. Implement `get_latest_indexed()` using SQLite
4. Implement `get_top_ranked_indexed()` using SQLite
5. Update `ChromaStore` to delegate to SQLite
6. Implement pagination (offset/limit)
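Steps 1–4 above can be sketched against the stdlib `sqlite3` module. The table and column names below follow this plan's appendix, but the index and trigger names are illustrative, and the real `SQLiteStore` schema has more columns:

```python
# Sketch of the Phase 2 schema: indexed news_items plus an FTS5 mirror
# kept in sync by an insert trigger (index/trigger names are assumptions).
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS news_items (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    content_text TEXT NOT NULL,
    category TEXT,
    relevance_score INTEGER NOT NULL,
    timestamp TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_news_timestamp ON news_items(timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_news_score ON news_items(relevance_score DESC);

CREATE VIRTUAL TABLE IF NOT EXISTS news_fts USING fts5(
    title, content_text, content='news_items', content_rowid='rowid'
);
CREATE TRIGGER IF NOT EXISTS news_ai AFTER INSERT ON news_items BEGIN
    INSERT INTO news_fts(rowid, title, content_text)
    VALUES (new.rowid, new.title, new.content_text);
END;
"""


def get_latest_indexed(conn: sqlite3.Connection, limit: int = 10, offset: int = 0):
    """O(log n + k) via idx_news_timestamp; no full-collection scan."""
    return conn.execute(
        "SELECT id, title FROM news_items ORDER BY timestamp DESC LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
```

The `ORDER BY ... LIMIT ... OFFSET` form is exactly what `EXPLAIN QUERY PLAN` should show using the index in the Phase 2 exit criteria check.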
**Deliverables:**
- Complete `SQLiteStore` implementation
- `ChromaStore` updated with indexed queries
- Pagination support in interface
**Exit Criteria:**
- [ ] `get_latest` uses SQLite index (verify with EXPLAIN)
- [ ] `get_top_ranked` uses SQLite index
- [ ] Pagination works correctly
- [ ] No full-collection scans in hot path
---
### Phase 3: Hybrid Search (Week 4)
**Dependencies:** Phase 2
**Owner:** Backend Architect
1. Implement `HybridSearchStrategy` with RRF
2. Integrate SQLite FTS with ChromaDB semantic
3. Add adaptive scoring (relevance + recency)
4. Implement `search_stream()` for large results
5. Performance test hybrid vs legacy
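Step 1's RRF fusion is small enough to sketch in full. `k = 60` is the conventional constant from the original RRF formulation; the function name and signature here are assumptions:

```python
# Sketch of Reciprocal Rank Fusion over two ranked id lists.
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(keyword_ids: List[str], semantic_ids: List[str], k: int = 60) -> List[str]:
    """score(d) = sum over rankings of 1 / (k + rank_d); higher is better."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents appearing in both rankings accumulate score from each, which is why exact keyword matches that also rank semantically tend to surface first.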
**Deliverables:**
- `src/storage/hybrid_search.py`
- Updated `ChromaStore.search()`
**Exit Criteria:**
- [ ] Hybrid search outperforms pure semantic search on a sampled evaluation query set
- [ ] Keyword matches appear first
- [ ] RRF fusion working correctly
- [ ] Threshold filtering applies to semantic results
---
### Phase 4: Stats Cache (Week 5)
**Dependencies:** Phase 3
**Owner:** Data Engineer
1. Implement `StatsCache` class with incremental updates
2. Integrate cache with `AdaptiveVectorStore`
3. Implement cache invalidation triggers
4. Add cache warming on startup
5. Performance test `get_stats()`
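Steps 1 and 3 amount to keeping counters in memory and updating them on every write instead of rescanning the collection. A minimal sketch (field and method names are assumptions, not the final `StatsCache` API):

```python
# Incremental stats cache sketch: get_stats becomes O(1).
from collections import Counter
from typing import Any, Dict


class StatsCache:
    """Counters updated on store/delete rather than recomputed per query."""

    def __init__(self) -> None:
        self.total_count = 0
        self.by_category: Counter = Counter()

    def on_store(self, category: str) -> None:
        self.total_count += 1
        self.by_category[category] += 1

    def on_delete(self, category: str) -> None:
        self.total_count -= 1
        self.by_category[category] -= 1

    def snapshot(self) -> Dict[str, Any]:
        return {
            "total_count": self.total_count,
            "by_category": dict(self.by_category),
        }
```

Cache warming on startup would replay one full scan into `on_store`, after which every `get_stats` call is a dictionary read.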
**Deliverables:**
- `src/storage/stats_cache.py`
- Updated storage layer with cache
**Exit Criteria:**
- [ ] `get_stats` returns in <5ms (cached)
- [ ] Cache invalidates correctly on store/delete
- [ ] Cache rebuilds correctly on demand
---
### Phase 5: Validation & Cutover (Week 6)
**Dependencies:** All previous phases
**Owner:** QA Engineer
1. Run full migration test suite
2. Performance benchmark comparison
3. Load test with simulated traffic
4. Validate all Telegram bot commands work
5. Generate migration completion report
6. Disable legacy code paths (optional)
**Deliverables:**
- Migration completion report
- Performance benchmark report
- Legacy code removal (if approved)
**Exit Criteria:**
- [ ] All tests pass
- [ ] Performance targets met
- [ ] No regression in bot functionality
- [ ] Stakeholder sign-off
---
## 8. Open Questions
### 8.1 Data Migration
| Question | Impact | Priority |
|----------|--------|----------|
| Should we migrate existing comma-joined anomalies to normalized junction records? | Data integrity | HIGH |
| What is the expected dataset size at migration time? | Performance planning | HIGH |
| Can we have downtime for the migration, or must it be zero-downtime? | Rollback strategy | HIGH |
| Should we keep legacy ChromaDB collection or archive it? | Storage costs | MEDIUM |
### 8.2 Architecture
| Question | Impact | Priority |
|----------|--------|----------|
| Do we want to eventually migrate away from ChromaDB to a more scalable solution (Qdrant, Weaviate)? | Future planning | MEDIUM |
| Should anomalies be stored in ChromaDB or only in SQLite? | Consistency model | MEDIUM |
| Do we need multi-tenancy support for multiple Telegram channels? | Future feature | LOW |
### 8.3 Operations
| Question | Impact | Priority |
|----------|--------|----------|
| What is the acceptable cache staleness for `get_stats`? | Consistency vs performance | HIGH |
| Should we implement TTL for old items? | Storage management | MEDIUM |
| Do we need backup/restore procedures for SQLite? | Disaster recovery | HIGH |
### 8.4 Testing
| Question | Impact | Priority |
|----------|--------|----------|
| Should we implement chaos testing for dual-write failure modes? | Reliability | MEDIUM |
| What is the acceptable test coverage threshold? | Quality | HIGH |
| Do we need integration tests with real ChromaDB instance? | Confidence | HIGH |
---
## 9. Appendix
### A. File Structure After Migration
```
src/
├── config/
│ ├── __init__.py
│ ├── feature_flags.py # NEW
│ └── settings.py
├── storage/
│ ├── __init__.py
│ ├── base.py # UPDATED: Evolved interface
│ ├── chroma_store.py # UPDATED: Delegates to SQLite
│ ├── normalized_chroma_store.py # NEW
│ ├── sqlite_store.py # NEW
│ ├── hybrid_search.py # NEW
│ ├── stats_cache.py # NEW
│ ├── adaptive_store.py # NEW
│ ├── dual_writer.py # NEW
│ ├── compat.py # NEW
│ └── migrations/ # NEW
│ ├── __init__.py
│ ├── v1_normalize_anomalies.py
│ └── v2_add_indexes.py
├── processor/
│ ├── __init__.py
│ ├── dto.py # UPDATED: New DTOs
│ ├── anomaly_types.py # NEW
│ └── ...
├── crawlers/
│ ├── __init__.py
│ └── ...
├── bot/
│ ├── __init__.py
│ ├── handlers.py # UPDATED: Use new pagination
│ └── ...
└── main.py # UPDATED: Initialize new stores
```
### B. ChromaDB Version Consideration
> **Note:** The current implementation uses ChromaDB's `upsert` with metadata. ChromaDB has limitations:
> - No native sorting by metadata fields
> - No native pagination (offset/limit)
> - `where` clause filtering is limited
>
> The SQLite shadow database addresses these limitations while maintaining ChromaDB for vector operations.
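The delegation pattern this note describes might look like the following sketch; the method names `get_latest_ids` and `get_by_ids` are assumptions, not the real store API:

```python
# Sketch of the shadow-database pattern: SQLite answers the ordered,
# paginated query; ChromaDB remains the source for full records/vectors.
from typing import List


class ShadowQueryMixin:
    """Mixin for a ChromaStore whose indexed queries go through SQLite.

    Assumes `self.sqlite` and `self.chroma` adapters (hypothetical names).
    """

    async def get_latest(self, limit: int = 10, offset: int = 0) -> List[dict]:
        # 1) Ordered ids come from the SQLite index (fast, paginated).
        ids = await self.sqlite.get_latest_ids(limit=limit, offset=offset)
        # 2) Full records are hydrated from ChromaDB by id.
        return await self.chroma.get_by_ids(ids)
```

ChromaDB's `get` by explicit ids is cheap, so the SQLite round trip adds little latency while eliminating the full-collection scan.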
### C. Monitoring Queries for Production
```sql
-- Check SQLite index usage
EXPLAIN QUERY PLAN
SELECT * FROM news_items
ORDER BY timestamp DESC
LIMIT 10;
-- Check FTS index
EXPLAIN QUERY PLAN
SELECT * FROM news_fts
WHERE news_fts MATCH 'WebGPU';
-- Check cache freshness
SELECT key, total_count, last_updated
FROM stats_cache;
-- Check anomaly distribution
SELECT a.name, COUNT(*) as count
FROM news_anomalies na
JOIN anomaly_types a ON na.anomaly_id = a.id
GROUP BY a.name
ORDER BY count DESC;
```
---
## 10. Approval
| Role | Name | Date | Signature |
|------|------|------|-----------|
| Senior Architecture Engineer | | | |
| Project Manager | | | |
| QA Lead | | | |
| DevOps Lead | | | |
---
*Document generated for team implementation review.*