Artur Mukhamadiev ef3faec7f8 #Feature: GitHub Trending Scouting
:Release Notes:
- Added a new GitHub Trending crawler that scouts for trending repositories across monthly, weekly, and daily timeframes.

:Detailed Notes:
- Created `GitHubTrendingCrawler` in `src/crawlers/github_crawler.py` to parse github.com/trending HTML.
- Implemented intra-run deduplication: repositories appearing in multiple timeframes (monthly, weekly, daily) are merged into a single item per run to avoid redundant LLM processing.
- Registered the new crawler in `src/crawlers/factory.py` and added it to the configuration file `src/crawlers.yml`.
- Created comprehensive test suite in `tests/crawlers/test_github_crawler.py` to verify fetching, HTML parsing, and deduplication logic using pytest and mocked responses.

:Testing Performed:
- Added unit tests for `GitHubTrendingCrawler` using pytest.
- Verified all tests pass successfully.
- Ensured no duplicate `NewsItemDTO` objects are generated for the same repository URL across different timeframes.

:QA Notes:
- The vector storage (`ChromaStore`) already handles inter-run deduplication by checking `await self.storage.exists(item.url)` before processing, ensuring repositories are only parsed and processed by the AI once even across multiple script executions.

:Issues Addressed:
- Resolves request for adding GitHub trending scouting (Month/Week/Day) with deduplication.

Change-Id: Ifbcde830263264576e4fadb70f09a6e2e12e3016
2026-03-19 21:35:51 +03:00
..