Artur Mukhamadiev ef3faec7f8 #Feature: GitHub Trending Scouting

:Release Notes:
- Added a new GitHub Trending crawler that scouts for trending repositories across monthly, weekly, and daily timeframes.

:Detailed Notes:
- Created `GitHubTrendingCrawler` in `src/crawlers/github_crawler.py` to parse github.com/trending HTML.
- Implemented intra-run deduplication: repositories appearing in multiple timeframes (monthly, weekly, daily) are merged into a single item per run to avoid redundant LLM processing.
- Registered the new crawler in `src/crawlers/factory.py` and added it to the configuration file `src/crawlers.yml`.
- Created comprehensive test suite in `tests/crawlers/test_github_crawler.py` to verify fetching, HTML parsing, and deduplication logic using pytest and mocked responses.
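
The intra-run deduplication described above could look roughly like the sketch below. It is illustrative only: `TrendingRepo`, `merge_by_url`, and the sample URLs are hypothetical names, not the contents of `src/crawlers/github_crawler.py`, and the real crawler produces `NewsItemDTO` objects.

```python
from dataclasses import dataclass, field


@dataclass
class TrendingRepo:
    """Hypothetical stand-in for the project's NewsItemDTO."""
    url: str
    title: str
    timeframes: list[str] = field(default_factory=list)


def merge_by_url(scraped: dict[str, list[tuple[str, str]]]) -> list[TrendingRepo]:
    """Merge repositories seen in several timeframes into one item per URL.

    `scraped` maps a timeframe name to the (url, title) pairs parsed for it.
    """
    merged: dict[str, TrendingRepo] = {}
    for timeframe, repos in scraped.items():
        for url, title in repos:
            # Reuse the existing item if this URL was already seen in this run.
            repo = merged.setdefault(url, TrendingRepo(url=url, title=title))
            if timeframe not in repo.timeframes:
                repo.timeframes.append(timeframe)
    return list(merged.values())


# Made-up sample data: two repositories spread over three timeframes.
scraped = {
    "monthly": [("https://github.com/example/alpha", "example/alpha")],
    "weekly": [
        ("https://github.com/example/alpha", "example/alpha"),
        ("https://github.com/example/beta", "example/beta"),
    ],
    "daily": [("https://github.com/example/beta", "example/beta")],
}
items = merge_by_url(scraped)
```

Keying on the repository URL means each repository reaches the LLM stage once per run, while the `timeframes` list still records where it was trending.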

:Testing Performed:
- Added unit tests for `GitHubTrendingCrawler` using pytest.
- Verified all tests pass successfully.
- Ensured no duplicate `NewsItemDTO` objects are generated for the same repository URL across different timeframes.

:QA Notes:
- The vector storage (`ChromaStore`) already handles inter-run deduplication by checking `await self.storage.exists(item.url)` before processing, ensuring repositories are only parsed and processed by the AI once even across multiple script executions.
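
A minimal sketch of that inter-run check, assuming only that the storage exposes an async `exists(url)` method as quoted above. `InMemoryStore`, its `add` method, and `process_new` are hypothetical stand-ins, not the `ChromaStore` API.

```python
import asyncio


class InMemoryStore:
    """Hypothetical stand-in for ChromaStore that remembers processed URLs."""

    def __init__(self) -> None:
        self._urls: set[str] = set()

    async def exists(self, url: str) -> bool:
        return url in self._urls

    async def add(self, url: str) -> None:
        self._urls.add(url)


async def process_new(store: InMemoryStore, urls: list[str]) -> list[str]:
    """Return only the URLs that had not been stored by an earlier run."""
    processed = []
    for url in urls:
        if await store.exists(url):
            continue  # already handled in a previous run, skip the LLM step
        processed.append(url)  # placeholder for the actual AI processing
        await store.add(url)
    return processed


async def demo() -> tuple[list[str], list[str]]:
    store = InMemoryStore()
    first = await process_new(store, ["https://github.com/example/alpha"])
    second = await process_new(
        store,
        ["https://github.com/example/alpha", "https://github.com/example/beta"],
    )
    return first, second


first_run, second_run = asyncio.run(demo())
```

In this sketch the second call skips `example/alpha` because the first call already stored it.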

:Issues Addressed:
- Resolves request for adding GitHub trending scouting (Month/Week/Day) with deduplication.

Change-Id: Ifbcde830263264576e4fadb70f09a6e2e12e3016

2026-03-19 21:35:51 +03:00

Name | Last commit | Date
.opencode/agents | opencode agents prompts | 2026-03-16 14:20:26 +03:00
ai/memory-bank/tasks | feat(storage): implement hybrid search and fix async chroma i/o | 2026-03-16 00:11:07 +03:00
docs | feat(storage): implement hybrid search and fix async chroma i/o | 2026-03-16 00:11:07 +03:00
src | #Feature: GitHub Trending Scouting | 2026-03-19 21:35:51 +03:00
tests | #Feature: GitHub Trending Scouting | 2026-03-19 21:35:51 +03:00
.gitignore | opencode agents prompts | 2026-03-16 14:20:26 +03:00
AGENTS.md | [ai] mvp generated by gemini | 2026-03-13 11:48:37 +03:00
README.md | feat(ai): optimize processor for academic content | 2026-03-16 00:11:19 +03:00
requirements.txt | feat(storage): implement hybrid search and fix async chroma i/o | 2026-03-16 00:11:07 +03:00
update_chroma_store.py | feat(storage): implement hybrid search and fix async chroma i/o | 2026-03-16 00:11:07 +03:00
README.md

Trend-Scout AI

Trend-Scout AI is an intelligent Telegram bot designed for automated monitoring, analysis, and summarization of technological trends. It was developed to support R&D activities (specifically within the context of LG Electronics R&D Lab in St. Petersburg) by scanning the environment for emerging technologies, competitive benchmarks, and scientific breakthroughs.

🚀 Key Features

Automated Multi-Source Crawling: Monitors RSS feeds, scientific journals (Nature, Science), IT conferences (CES, CVPR), and corporate newsrooms using Playwright and Scrapy.
AI-Powered Analysis: Utilizes LLMs (via Ollama API) to evaluate the relevance of news articles based on specific R&D landscapes (e.g., WebOS, Chromium, Edge AI).
Russian Summarization: Automatically generates concise summaries in Russian for quick review.
Anomaly Detection: Alerts users when there is a significant surge in mentions of specific technologies (e.g., "WebGPU", "NPU acceleration").
Semantic Search: Employs a vector database (ChromaDB) to allow searching for trends and news by meaning rather than just keywords.
Telegram Interface: Simple and effective interaction via Telegram for receiving alerts and querying the latest trends.

🏗 Architecture

The project follows a modular, agent-based architecture designed around SOLID principles and asynchronous I/O:

Crawler Agent: Responsible for fetching and parsing data from various sources into standardized DTOs.
AI Processor Agent: Enriches data by scoring relevance, summarizing content, and detecting technological anomalies using LLMs.
Vector Storage Agent: Manages persistent storage and semantic retrieval using ChromaDB.
Telegram Bot Agent: Handles user interaction, command processing (/start, /latest, /help), and notification delivery.
Orchestrator: Coordinates the flow between crawling, processing, and storage in periodic background iterations.
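
The flow between these components might be sketched as follows. This is a rough illustration under stated assumptions: every class and function name here (`NewsItem`, `StubCrawler`, `StubProcessor`, `StubStorage`, `run_iteration`, `run_periodically`) is hypothetical, and the stubs replace real crawling, LLM calls, and ChromaDB so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class NewsItem:
    """Hypothetical stand-in for the project's standardized DTO."""
    url: str
    title: str
    relevance: int = 0


class StubCrawler:
    """Pretends to fetch and parse one source."""

    async def fetch(self) -> list[NewsItem]:
        return [
            NewsItem("https://example.com/webgpu", "WebGPU lands in browsers"),
            NewsItem("https://example.com/other", "Unrelated announcement"),
        ]


class StubProcessor:
    """Pretends to score relevance; a real processor would call an LLM."""

    async def enrich(self, item: NewsItem) -> NewsItem:
        item.relevance = 8 if "WebGPU" in item.title else 2
        return item


class StubStorage:
    """Keeps items in a dict keyed by URL instead of a vector database."""

    def __init__(self) -> None:
        self.items: dict[str, NewsItem] = {}

    async def store(self, item: NewsItem) -> None:
        self.items[item.url] = item


async def run_iteration(crawlers, processor, storage) -> int:
    """One crawl -> process -> store pass; returns the number of items handled."""
    handled = 0
    for crawler in crawlers:
        for item in await crawler.fetch():
            await storage.store(await processor.enrich(item))
            handled += 1
    return handled


async def run_periodically(crawlers, processor, storage,
                           iterations: int, interval_seconds: float) -> list[int]:
    """Repeat the pass a fixed number of times, sleeping between passes."""
    counts = []
    for _ in range(iterations):
        counts.append(await run_iteration(crawlers, processor, storage))
        await asyncio.sleep(interval_seconds)
    return counts


storage = StubStorage()
counts = asyncio.run(
    run_periodically([StubCrawler()], StubProcessor(), storage,
                     iterations=2, interval_seconds=0)
)
```

Because the components are passed in as parameters, each one can be swapped for a stub in tests, in line with the modular design described above.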

🛠 Tech Stack

Language: Python 3.12+
Frameworks: aiogram (Telegram Bot), playwright (Web Crawling), pydantic (Data Validation)
Database: ChromaDB (Vector Store)
AI/LLM: Ollama (local or cloud models)
Testing: pytest, pytest-asyncio
Environment: Docker-ready, .env for configuration

📋 Prerequisites

Python 3.12 or higher
Ollama installed and running (for AI processing)
Playwright browsers installed (playwright install chromium)

⚙️ Installation & Setup

Clone the repository:

git clone https://github.com/your-repo/trend-scout-ai.git
cd trend-scout-ai

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt
playwright install chromium

Configure environment variables: Create a .env file in the root directory:

TELEGRAM_BOT_TOKEN=your_bot_token_here
TELEGRAM_CHAT_ID=your_chat_id_here
OLLAMA_API_URL=http://localhost:11434/api/generate
CHROMA_DB_PATH=./chroma_db

🏃 Usage

Start the Bot and Background Crawler

To run the full system (bot + periodic crawler):

python -m src.main

Run Manual Update

To trigger a manual crawl and update of the vector store:

python update_chroma_store.py

🧪 Testing

The project maintains high test coverage and follows TDD principles.

Run all tests:

pytest

Run specific test categories:

pytest tests/crawlers/
pytest tests/processor/
pytest tests/storage/

📂 Project Structure

src/: Core application logic.
  - bot/: Telegram bot handlers and setup.
  - crawlers/: Web scraping modules and factory.
  - processor/: LLM integration and prompt logic.
  - storage/: Vector database operations.
  - orchestrator/: Main service coordination.
tests/: Comprehensive test suite.
docs/: Architecture Decision Records (ADR) and methodology.
chroma_db/: Persistent vector storage (local).
requirements.txt: Python dependencies.