8 Commits

Author SHA1 Message Date
Artur Mukhamadiev
6d2ac9d0f0 Feature: Filter out sources older than 5 years in Google Scholar Crawler
:Release Notes:
- Updated the Google Scholar crawler to automatically filter out results older than 5 years to ensure recent content.

:Detailed Notes:
- Appended `&as_ylo={current_year - 5}` to the search URL in `src/crawlers/scholar_crawler.py` by dynamically calculating the current year via Python's `datetime`.
- Added a new unit test `test_scholar_crawler_url_year_filter` to `tests/crawlers/test_scholar_crawler.py` to verify URL construction.
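
The URL construction described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/crawlers/scholar_crawler.py`: the function name `build_scholar_url`, the constants, and the optional `now` argument (added so the year boundary can be tested deterministically) are all assumptions.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL and cutoff; the real crawler may build its URL differently.
SCHOLAR_BASE = "https://scholar.google.com/scholar"
MAX_AGE_YEARS = 5


def build_scholar_url(query: str, now: Optional[datetime] = None) -> str:
    """Return a Scholar search URL limited to the last MAX_AGE_YEARS years.

    `as_ylo` is the lower bound on the publication year.
    """
    current_year = (now or datetime.now()).year
    params = {"q": query, "as_ylo": current_year - MAX_AGE_YEARS}
    return f"{SCHOLAR_BASE}?{urlencode(params)}"
```

Passing a fixed `now` is one way a test such as `test_scholar_crawler_url_year_filter` could check the boundary without depending on the wall clock.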

:Testing Performed:
- Ran the crawler test suite and confirmed that the expected year boundary is formatted correctly into the request URL.
- All 91 pytest cases pass.

:QA Notes:
- Verified that the inserted parameter makes Google Scholar apply the year limit on the search-engine side.

:Issues Addressed:
- Resolves an issue where Scholar returned outdated sources (2005, 2008).

Change-Id: I56ae2fd7369d61494d17520238c3ef66e14436c7
2026-03-19 14:57:33 +03:00
a304ae9cd2 feat(crawler): add academic and research sources
- Implement crawlers for Microsoft Research, SciRate, and Google Scholar
- Use Playwright with stealth for Google Scholar anti-bot mitigation
- Update CrawlerFactory to support new research crawler types
- Add unit and integration tests for all academic sources with high coverage
2026-03-16 00:11:15 +03:00
217037f72e feat(crawlers): convert multiple sources from Playwright to Static/RSS
- Added `StaticCrawler` for generic aiohttp+BS4 parsing.
- Added `SkolkovoCrawler` for specialized Next.js parsing of sk.ru.
- Converted ICRA 2025, RSF, CES 2025, and Telegram Addmeto to `static`.
- Converted Horizon Europe to `rss` using its native feed.
- Updated `CrawlerFactory` to support new crawler types.
- Validated changes with unit tests.
2026-03-15 21:21:14 +03:00
a363ca41cf feat(crawlers): implement specialized CppConf crawler and AI analysis
- Added CppConfCrawler using aiohttp and regex to parse Next.js JSON data, skipping the Playwright bottleneck.
- Added C++ specific prompts to OllamaProvider for trend analysis (identifying C++26, memory safety, coroutines).
- Created offline pytest fixtures and TDD unit tests for the parser.
- Created end-to-end pipeline test mapping Crawler -> AI Processor -> Vector DB.
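
A rough sketch of the regex-based extraction mentioned above, for illustration only. It assumes the page embeds its data in a `<script id="__NEXT_DATA__">` tag, as Next.js pages-router sites commonly do; the actual CppConfCrawler may target a different payload. The names `NEXT_DATA_RE` and `extract_next_data` are hypothetical, and the aiohttp fetch step is omitted.

```python
import json
import re
from typing import Any, Dict, Optional

# Assumption: the JSON payload sits inside <script id="__NEXT_DATA__" ...>.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[Dict[str, Any]]:
    """Return the embedded Next.js JSON as a dict, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Because the parser works on an HTML string, it can be exercised against saved pages in offline fixtures without launching a browser.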
2026-03-15 20:34:39 +03:00
87af585e1b Refactor crawlers configuration and add new sources
- Move hard-coded crawlers from main.py to crawlers.yml
- Use CrawlerFactory to load configuration
- Add 9 new sources: C++ Russia, ICRA 2025, Technoprom, INNOPROM, Hannover Messe, RSF, Skolkovo, Horizon Europe, Addmeto
- Update task list
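
One way a config-driven factory could look, as a sketch under stated assumptions. The classes `BaseCrawler`, `RssCrawler`, `PlaywrightCrawler`, the `from_config` method, and the `type`/`name`/`url` keys are all hypothetical; the input is the list of entries as it would look after `crawlers.yml` has been parsed by a YAML loader.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Type


@dataclass
class BaseCrawler:
    """Hypothetical common shape of a crawler: a label and a source URL."""
    name: str
    url: str


class RssCrawler(BaseCrawler):
    """Placeholder for a feed-based crawler."""


class PlaywrightCrawler(BaseCrawler):
    """Placeholder for a browser-based crawler."""


class CrawlerFactory:
    # Maps the `type` field of a config entry to a crawler class.
    _registry: Dict[str, Type[BaseCrawler]] = {
        "rss": RssCrawler,
        "playwright": PlaywrightCrawler,
    }

    @classmethod
    def from_config(cls, entries: List[Dict[str, Any]]) -> List[BaseCrawler]:
        """Instantiate one crawler per config entry; reject unknown types."""
        crawlers: List[BaseCrawler] = []
        for entry in entries:
            kind = entry.get("type")
            crawler_cls = cls._registry.get(kind)
            if crawler_cls is None:
                raise ValueError(f"Unknown crawler type: {kind!r}")
            crawlers.append(crawler_cls(name=entry["name"], url=entry["url"]))
        return crawlers
```

With a registry like this, adding a source becomes a config change, and adding a crawler type means registering one more class.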
2026-03-15 00:45:04 +03:00
9c31977e98 [feat] playwright crawler
:Release Notes:
-

:Detailed Notes:
-

:Testing Performed:
-

:QA Notes:
- As always, AI generated.

:Issues Addressed:
-
2026-03-14 20:13:53 +03:00
4bf7cb4331 [perf] stabilization of previous release
2026-03-13 13:23:30 +03:00
5f093075f7 [ai] mvp generated by gemini
2026-03-13 11:48:37 +03:00