Artur Mukhamadiev ef3faec7f8 #Feature: GitHub Trending Scouting
:Release Notes:
- Added a new GitHub Trending crawler that scouts for trending repositories across monthly, weekly, and daily timeframes.

:Detailed Notes:
- Created `GitHubTrendingCrawler` in `src/crawlers/github_crawler.py` to parse github.com/trending HTML.
- Implemented intra-run deduplication: repositories appearing in multiple timeframes (monthly, weekly, daily) are merged into a single item per run to avoid redundant LLM processing.
- Registered the new crawler in `src/crawlers/factory.py` and added it to the configuration file `src/crawlers.yml`.
- Created a comprehensive test suite in `tests/crawlers/test_github_crawler.py` that uses pytest and mocked responses to verify fetching, HTML parsing, and deduplication logic.
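
The intra-run deduplication described above could look roughly like the sketch below. It is not the code from `src/crawlers/github_crawler.py`: `TrendingRepo` is a hypothetical stand-in for `NewsItemDTO`, and `merge_by_url` and the `timeframes` field are names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrendingRepo:
    """Hypothetical stand-in for the project's NewsItemDTO."""
    url: str
    title: str
    timeframes: list[str] = field(default_factory=list)


def merge_by_url(batches: dict[str, list[TrendingRepo]]) -> list[TrendingRepo]:
    """Merge per-timeframe results into a single item per repository URL."""
    merged: dict[str, TrendingRepo] = {}
    for timeframe, repos in batches.items():
        for repo in repos:
            # The first sighting of a URL creates the item; later sightings
            # only record the additional timeframe.
            item = merged.setdefault(repo.url, TrendingRepo(repo.url, repo.title))
            if timeframe not in item.timeframes:
                item.timeframes.append(timeframe)
    return list(merged.values())


batches = {
    "monthly": [TrendingRepo("https://github.com/a/x", "a/x")],
    "weekly": [
        TrendingRepo("https://github.com/a/x", "a/x"),
        TrendingRepo("https://github.com/b/y", "b/y"),
    ],
    "daily": [TrendingRepo("https://github.com/b/y", "b/y")],
}
unique = merge_by_url(batches)
for repo in unique:
    print(repo.url, repo.timeframes)
```

Keying on the URL means a repository that trends in all three timeframes reaches the LLM step once per run, while the list of timeframes it appeared in is still available.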

:Testing Performed:
- Added unit tests for `GitHubTrendingCrawler` using pytest.
- Verified all tests pass successfully.
- Ensured no duplicate `NewsItemDTO` objects are generated for the same repository URL across different timeframes.

:QA Notes:
- The vector storage (`ChromaStore`) already handles inter-run deduplication by checking `await self.storage.exists(item.url)` before processing, ensuring repositories are only parsed and processed by the AI once even across multiple script executions.
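
A minimal sketch of that inter-run check, under stated assumptions: `InMemoryStore` is a hypothetical replacement for `ChromaStore`, only the `exists(url)` coroutine mirrors the call quoted above, and `add` and `filter_new` are illustrative names that do not come from the project.

```python
import asyncio


class InMemoryStore:
    """Hypothetical stand-in for ChromaStore, backed by a plain set."""

    def __init__(self, known_urls: set[str]) -> None:
        self._known = set(known_urls)

    async def exists(self, url: str) -> bool:
        return url in self._known

    async def add(self, url: str) -> None:
        # Assumed helper; the real store persists full items, not bare URLs.
        self._known.add(url)


async def filter_new(urls: list[str], storage: InMemoryStore) -> list[str]:
    """Return only URLs the store has not seen, then mark them as seen."""
    fresh = []
    for url in urls:
        if await storage.exists(url):
            continue  # already processed in an earlier run
        fresh.append(url)
        await storage.add(url)
    return fresh


store = InMemoryStore({"https://github.com/a/x"})
first_run = asyncio.run(
    filter_new(["https://github.com/a/x", "https://github.com/b/y"], store)
)
second_run = asyncio.run(filter_new(["https://github.com/b/y"], store))
print(first_run, second_run)
```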

:Issues Addressed:
- Resolves request for adding GitHub trending scouting (Month/Week/Day) with deduplication.

Change-Id: Ifbcde830263264576e4fadb70f09a6e2e12e3016
2026-03-19 21:35:51 +03:00

Trend-Scout AI

Trend-Scout AI is an intelligent Telegram bot for automated monitoring, analysis, and summarization of technology trends. It was developed to support R&D activities (specifically at the LG Electronics R&D Lab in St. Petersburg) by scanning for emerging technologies, competitive benchmarks, and scientific breakthroughs.

🚀 Key Features

  • Automated Multi-Source Crawling: Monitors RSS feeds, scientific journals (Nature, Science), IT conferences (CES, CVPR), and corporate newsrooms using Playwright and Scrapy.
  • AI-Powered Analysis: Utilizes LLMs (via Ollama API) to evaluate the relevance of news articles based on specific R&D landscapes (e.g., WebOS, Chromium, Edge AI).
  • Russian Summarization: Automatically generates concise summaries in Russian for quick review.
  • Anomaly Detection: Alerts users when there is a significant surge in mentions of specific technologies (e.g., "WebGPU", "NPU acceleration").
  • Semantic Search: Employs a vector database (ChromaDB) to allow searching for trends and news by meaning rather than just keywords.
  • Telegram Interface: Simple and effective interaction via Telegram for receiving alerts and querying the latest trends.
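
As an illustration of the anomaly detection feature, the sketch below flags a keyword when its mention count in the current window is several times its count in the previous window. This is an assumed heuristic, not the project's detector (which the architecture section attributes to the LLM-based processor); `count_mentions`, `detect_surges`, `factor`, and `min_mentions` are hypothetical names.

```python
from collections import Counter


def count_mentions(texts: list[str], keywords: list[str]) -> Counter:
    """Count how many texts mention each keyword (case-insensitive)."""
    counts: Counter = Counter()
    for text in texts:
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


def detect_surges(
    current: Counter,
    baseline: Counter,
    factor: float = 3.0,
    min_mentions: int = 3,
) -> list[str]:
    """Return keywords whose current count is at least `factor` times the baseline."""
    surging = []
    for keyword, count in current.items():
        # A keyword with no history is treated as having a baseline of 1,
        # which avoids dividing by zero.
        previous = max(baseline.get(keyword, 0), 1)
        if count >= min_mentions and count / previous >= factor:
            surging.append(keyword)
    return sorted(surging)


keywords = ["WebGPU", "NPU acceleration"]
last_week = ["WebGPU lands in a browser", "Quarterly results"]
this_week = [
    "WebGPU demo",
    "WebGPU benchmark",
    "webgpu in production",
    "NPU acceleration on laptops",
]
alerts = detect_surges(
    count_mentions(this_week, keywords),
    count_mentions(last_week, keywords),
)
print(alerts)
```

The `min_mentions` floor keeps a single stray article from triggering an alert for a keyword that had no earlier mentions.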

🏗 Architecture

The project follows a modular, agent-based architecture designed around SOLID principles and asynchronous I/O:

  1. Crawler Agent: Responsible for fetching and parsing data from various sources into standardized DTOs.
  2. AI Processor Agent: Enriches data by scoring relevance, summarizing content, and detecting technological anomalies using LLMs.
  3. Vector Storage Agent: Manages persistent storage and semantic retrieval using ChromaDB.
  4. Telegram Bot Agent: Handles user interaction, command processing (/start, /latest, /help), and notification delivery.
  5. Orchestrator: Coordinates the flow between crawling, processing, and storage in periodic background iterations.
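
One iteration of the flow above might be wired together as in the following sketch. The interfaces are assumptions made for illustration: `Item`, the four `Protocol` classes, their method names, and the `threshold` parameter are hypothetical and may differ from the project's actual DTOs and agents.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Item:
    """Hypothetical DTO passed between agents."""
    url: str
    text: str
    score: float = 0.0


class Crawler(Protocol):
    async def fetch(self) -> list[Item]: ...


class Processor(Protocol):
    async def enrich(self, item: Item) -> Item: ...


class Storage(Protocol):
    async def exists(self, url: str) -> bool: ...
    async def save(self, item: Item) -> None: ...


class Notifier(Protocol):
    async def send(self, item: Item) -> None: ...


async def run_iteration(
    crawlers: list[Crawler],
    processor: Processor,
    storage: Storage,
    notifier: Notifier,
    threshold: float = 0.5,
) -> int:
    """Crawl, skip known items, enrich and store the rest, notify on relevant ones."""
    batches = await asyncio.gather(*(crawler.fetch() for crawler in crawlers))
    handled = 0
    for item in (i for batch in batches for i in batch):
        if await storage.exists(item.url):
            continue
        enriched = await processor.enrich(item)
        await storage.save(enriched)
        if enriched.score >= threshold:
            await notifier.send(enriched)
        handled += 1
    return handled


# In-memory fakes, so the sketch runs without a network, an LLM, or Telegram.
class FakeCrawler:
    async def fetch(self) -> list[Item]:
        return [Item("u1", "WebGPU news"), Item("u2", "misc")]


class FakeProcessor:
    async def enrich(self, item: Item) -> Item:
        item.score = 0.9 if "WebGPU" in item.text else 0.1
        return item


class FakeStorage:
    def __init__(self) -> None:
        self.saved: dict[str, Item] = {}

    async def exists(self, url: str) -> bool:
        return url in self.saved

    async def save(self, item: Item) -> None:
        self.saved[item.url] = item


class FakeNotifier:
    def __init__(self) -> None:
        self.sent: list[str] = []

    async def send(self, item: Item) -> None:
        self.sent.append(item.url)


storage, notifier = FakeStorage(), FakeNotifier()
first = asyncio.run(run_iteration([FakeCrawler()], FakeProcessor(), storage, notifier))
second = asyncio.run(run_iteration([FakeCrawler()], FakeProcessor(), storage, notifier))
print(first, second, notifier.sent)
```

Because the agents depend only on small interfaces, each one can be swapped for a fake in tests, which fits the modular design described above.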

🛠 Tech Stack

  • Language: Python 3.12+
  • Frameworks: aiogram (Telegram Bot), playwright (Web Crawling), pydantic (Data Validation)
  • Database: ChromaDB (Vector Store)
  • AI/LLM: Ollama (local or cloud models)
  • Testing: pytest, pytest-asyncio
  • Environment: Docker-ready, .env for configuration

📋 Prerequisites

  • Python 3.12 or higher
  • Ollama installed and running (for AI processing)
  • Playwright browsers installed (playwright install chromium)

⚙️ Installation & Setup

  1. Clone the repository:

    git clone https://github.com/your-repo/trend-scout-ai.git
    cd trend-scout-ai
    
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install dependencies:

    pip install -r requirements.txt
    playwright install chromium
    
  4. Configure environment variables by creating a .env file in the root directory:

    TELEGRAM_BOT_TOKEN=your_bot_token_here
    TELEGRAM_CHAT_ID=your_chat_id_here
    OLLAMA_API_URL=http://localhost:11434/api/generate
    CHROMA_DB_PATH=./chroma_db
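
These variables could be read at startup along the lines of the sketch below. It is an assumption about usage rather than the project's loader: `Settings` and `load_settings` are hypothetical names, and treating the Ollama URL and Chroma path as optional with the defaults shown above is a choice made here for illustration.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    telegram_bot_token: str
    telegram_chat_id: str
    ollama_api_url: str
    chroma_db_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on missing secrets."""
    required = ("TELEGRAM_BOT_TOKEN", "TELEGRAM_CHAT_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        telegram_bot_token=env["TELEGRAM_BOT_TOKEN"],
        telegram_chat_id=env["TELEGRAM_CHAT_ID"],
        # Assumed defaults, copied from the example .env values.
        ollama_api_url=env.get("OLLAMA_API_URL", "http://localhost:11434/api/generate"),
        chroma_db_path=env.get("CHROMA_DB_PATH", "./chroma_db"),
    )


# A plain dict stands in for the process environment in this example.
settings = load_settings({"TELEGRAM_BOT_TOKEN": "token", "TELEGRAM_CHAT_ID": "42"})
print(settings.ollama_api_url, settings.chroma_db_path)
```

Note that `os.environ` does not read a .env file by itself; the file has to be exported by the shell, by Docker, or by a loader library before a function like this runs.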
    

🏃 Usage

Start the Bot and Background Crawler

To run the full system (bot + periodic crawler):

    python -m src.main

Run Manual Update

To trigger a manual crawl and update of the vector store:

    python update_chroma_store.py

🧪 Testing

The project maintains high test coverage and follows TDD principles.

Run all tests:

    pytest

Run specific test categories:

    pytest tests/crawlers/
    pytest tests/processor/
    pytest tests/storage/

📂 Project Structure

  • src/: Core application logic.
    • bot/: Telegram bot handlers and setup.
    • crawlers/: Web scraping modules and factory.
    • processor/: LLM integration and prompt logic.
    • storage/: Vector database operations.
    • orchestrator/: Main service coordination.
  • tests/: Comprehensive test suite.
  • docs/: Architecture Decision Records (ADR) and methodology.
  • chroma_db/: Persistent vector storage (local).
  • requirements.txt: Python dependencies.
Description
Yet another web scraper generated by AI for the needs of the R&D Lab