:Release Notes:
- Added a new GitHub Trending crawler that scouts for trending repositories across monthly, weekly, and daily timeframes.
:Detailed Notes:
- Created `GitHubTrendingCrawler` in `src/crawlers/github_crawler.py` to parse github.com/trending HTML.
- Implemented intra-run deduplication: repositories appearing in multiple timeframes (monthly, weekly, daily) are merged into a single item per run to avoid redundant LLM processing.
- Registered the new crawler in `src/crawlers/factory.py` and added it to the configuration file `src/crawlers.yml`.
- Created a comprehensive test suite in `tests/crawlers/test_github_crawler.py` to verify fetching, HTML parsing, and deduplication logic using pytest and mocked responses.
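
The intra-run deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/crawlers/github_crawler.py`: `TrendingRepo` is a hypothetical stand-in for `NewsItemDTO`, and `merge_by_url` and the `(timeframe, url, title)` row shape are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TrendingRepo:
    # Hypothetical stand-in for the project's NewsItemDTO; field names are assumed.
    url: str
    title: str
    timeframes: list = field(default_factory=list)


def merge_by_url(scraped):
    """Collapse (timeframe, url, title) rows into one item per repository URL."""
    merged = {}
    for timeframe, url, title in scraped:
        # Reuse the existing item if this URL was already seen in another timeframe.
        item = merged.setdefault(url, TrendingRepo(url=url, title=title))
        if timeframe not in item.timeframes:
            item.timeframes.append(timeframe)
    # Dicts preserve insertion order, so items come back in first-seen order.
    return list(merged.values())
```

Keying on the repository URL means a repository that trends in all three timeframes still produces one item, so it would reach the LLM step only once per run.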
:Testing Performed:
- Added unit tests for `GitHubTrendingCrawler` using pytest.
- Verified all tests pass successfully.
- Ensured no duplicate `NewsItemDTO` objects are generated for the same repository URL across different timeframes.
:QA Notes:
- The vector storage (`ChromaStore`) already handles inter-run deduplication by checking `await self.storage.exists(item.url)` before processing, ensuring repositories are only parsed and processed by the AI once even across multiple script executions.
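
A minimal sketch of the inter-run check, assuming a storage object that exposes the `exists` coroutine quoted above. `InMemoryStore`, its `add` method, and `process_new` are hypothetical names used only for illustration; the real `ChromaStore` API may differ.

```python
import asyncio


class InMemoryStore:
    """Hypothetical stand-in for ChromaStore; only `exists` is named in the notes."""

    def __init__(self):
        self._urls = set()

    async def exists(self, url):
        return url in self._urls

    async def add(self, url):  # assumed helper, not taken from the source
        self._urls.add(url)


async def process_new(storage, urls):
    """Return only URLs not already stored, recording each one as it is handled."""
    processed = []
    for url in urls:
        if await storage.exists(url):
            continue  # seen in an earlier run: skip the expensive AI step
        await storage.add(url)
        processed.append(url)
    return processed
```

Calling `asyncio.run(process_new(store, urls))` twice with overlapping URL lists would process the overlap only on the first call.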
:Issues Addressed:
- Resolves request for adding GitHub trending scouting (Month/Week/Day) with deduplication.
Change-Id: Ifbcde830263264576e4fadb70f09a6e2e12e3016
:Release Notes:
- Updated the Google Scholar crawler to automatically filter out results older than 5 years to ensure recent content.
:Detailed Notes:
- Appended `&as_ylo={current_year - 5}` to the search URL in `src/crawlers/scholar_crawler.py` by dynamically calculating the current year via Python's `datetime`.
- Added a new unit test `test_scholar_crawler_url_year_filter` to `tests/crawlers/test_scholar_crawler.py` to verify URL construction.
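
A rough sketch of the URL construction, under stated assumptions: the base URL, the `q` parameter, and the `build_scholar_url` helper are illustrative and not copied from `src/crawlers/scholar_crawler.py`. The optional `now` argument is added here only so the year boundary can be tested deterministically.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import quote_plus

MAX_AGE_YEARS = 5  # results older than this many years are filtered out


def build_scholar_url(query: str, now: Optional[datetime] = None) -> str:
    """Build a Scholar search URL restricted to the last MAX_AGE_YEARS years."""
    current_year = (now or datetime.now()).year
    return (
        "https://scholar.google.com/scholar"  # assumed base URL
        f"?q={quote_plus(query)}&as_ylo={current_year - MAX_AGE_YEARS}"
    )
```

With a fixed date in 2025, the helper would append `&as_ylo=2020`, which is the kind of boundary a test like `test_scholar_crawler_url_year_filter` could assert on.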
:Testing Performed:
- Ran the crawler test suite and confirmed that the expected year boundary is formatted correctly into the request URL.
- All 91 automated pytest cases complete successfully.
:QA Notes:
- Verified that the inserted `as_ylo` parameter makes Google Scholar restrict results by year at the search-engine level.
:Issues Addressed:
- Resolves an issue where Scholar would return outdated sources (e.g., from 2005 and 2008).
Change-Id: I56ae2fd7369d61494d17520238c3ef66e14436c7
:Release Notes:
- Added a new Telegram command `/get_hottest <number> [format]` to export the top `N` trends as a CSV or Markdown file.
:Detailed Notes:
- Created `ITrendExporter` interface and concrete `CsvTrendExporter` and `MarkdownTrendExporter` implementations for formatting DTOs.
- Updated `src/bot/handlers.py` to include `command_get_hottest_handler` mapping to `/get_hottest`.
- Used `BufferedInputFile` to stream generated files asynchronously directly to Telegram without disk I/O.
- Fixed unrelated pipeline test failures regarding `EphemeralClient` usage with ChromaDB.
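
The exporter design above could look roughly like this. It is a sketch, not the committed implementation: the `Trend` DTO, its fields, the `export` method name, and the column layout are assumptions. Only the three class names come from the notes.

```python
import csv
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Trend:
    # Hypothetical DTO; the real field names are not given in these notes.
    title: str
    url: str
    relevance: int


class ITrendExporter(ABC):
    @abstractmethod
    def export(self, trends) -> bytes:
        """Serialize trends into an in-memory file body."""


class CsvTrendExporter(ITrendExporter):
    def export(self, trends) -> bytes:
        buffer = io.StringIO(newline="")
        writer = csv.writer(buffer)  # default dialect ends rows with \r\n
        writer.writerow(["title", "url", "relevance"])
        for t in trends:
            writer.writerow([t.title, t.url, t.relevance])
        return buffer.getvalue().encode("utf-8")


class MarkdownTrendExporter(ITrendExporter):
    def export(self, trends) -> bytes:
        lines = ["| Title | URL | Relevance |", "| --- | --- | --- |"]
        lines += [f"| {t.title} | {t.url} | {t.relevance} |" for t in trends]
        return ("\n".join(lines) + "\n").encode("utf-8")
```

Because each exporter returns `bytes`, the handler can wrap the result in aiogram's `BufferedInputFile(data, filename=...)` and send it without writing to disk.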
:Testing Performed:
- Followed TDD with `pytest`, covering parameter parsing, export logic, and empty-database handling.
- Ran the full test suite (90 tests) which completed successfully.
:QA Notes:
- Fully covered the new handler using `pytest-asyncio` and `aiogram` mocked objects.
:Issues Addressed:
- Resolves request to export high-relevance parsed entries.
Change-Id: I25dd90f1e4491ba298682518d835259bffab4190
- Appended more Google Scholar topics to `crawlers.yml` and removed Habr AI.
- Removed the C++ trends relation from the LLM prompt and changed "web rendering" to "web engine".
- Add specialized prompt branch for research papers and SOTA detection
- Improve Russian summarization quality for technical abstracts
- Update relevance scoring to prioritize NPU/Edge AI breakthroughs
- Add README.md with project overview
- Implement crawlers for Microsoft Research, SciRate, and Google Scholar
- Use Playwright with stealth for Google Scholar anti-bot mitigation
- Update CrawlerFactory to support new research crawler types
- Add unit and integration tests for all academic sources with high coverage
- Added `StaticCrawler` for generic aiohttp+BS4 parsing.
- Added `SkolkovoCrawler` for specialized Next.js parsing of sk.ru.
- Converted ICRA 2025, RSF, CES 2025, and Telegram Addmeto to `static`.
- Converted Horizon Europe to `rss` using its native feed.
- Updated `CrawlerFactory` to support new crawler types.
- Validated changes with unit tests.
- Added CppConfCrawler using aiohttp and regex to parse Next.js JSON data, skipping the Playwright bottleneck.
- Added C++ specific prompts to OllamaProvider for trend analysis (identifying C++26, memory safety, coroutines).
- Created offline pytest fixtures and TDD unit tests for the parser.
- Created end-to-end pipeline test mapping Crawler -> AI Processor -> Vector DB.
- Move hard-coded crawlers from main.py to crawlers.yml
- Use CrawlerFactory to load configuration
- Add 9 new sources: C++ Russia, ICRA 2025, Technoprom, INNOPROM, Hannover Messe, RSF, Skolkovo, Horizon Europe, Addmeto
- Update task list
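
The config-driven loading mentioned in the notes above (crawlers moved from `main.py` into `crawlers.yml` and built through `CrawlerFactory`) can be sketched as a type registry. Everything below is an assumption for illustration: the entry keys (`type`, `name`, `url`), the `from_config` method, and the stub crawler classes are hypothetical, and the entries are passed in as already-parsed dictionaries rather than read from YAML.

```python
class BaseCrawler:
    def __init__(self, name, url):
        self.name = name
        self.url = url


class RssCrawler(BaseCrawler):
    pass  # stub: the real crawler would parse a feed


class StaticCrawler(BaseCrawler):
    pass  # stub: the real crawler would fetch and parse static HTML


class CrawlerFactory:
    # Maps the `type` value from the config to a crawler class.
    _registry = {"rss": RssCrawler, "static": StaticCrawler}

    @classmethod
    def from_config(cls, entries):
        """Build one crawler per config entry; reject unknown types early."""
        crawlers = []
        for entry in entries:
            crawler_cls = cls._registry.get(entry["type"])
            if crawler_cls is None:
                raise ValueError(f"Unknown crawler type: {entry['type']!r}")
            crawlers.append(crawler_cls(entry["name"], entry["url"]))
        return crawlers
```

With a registry like this, supporting a new crawler type means adding one mapping entry, and adding a source means adding one config entry.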