Feature: Filter out sources older than 5 years in Google Scholar Crawler

:Release Notes:
- Updated the Google Scholar crawler to automatically filter out results older than 5 years to ensure recent content.

:Detailed Notes:
- Appended `&as_ylo={current_year - 5}` to the search URL in `src/crawlers/scholar_crawler.py`. The current year is computed with `datetime.now().year` when the crawler is constructed.
- Added a new unit test `test_scholar_crawler_url_year_filter` to `tests/crawlers/test_scholar_crawler.py` to verify URL construction.
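
For illustration only, the URL construction described above can be sketched as a standalone function. This is not the code in this commit: `build_scholar_url`, `SCHOLAR_SEARCH_URL`, and the `max_age_years` and `current_year` parameters are hypothetical names, and the sketch uses `urllib.parse.urlencode` instead of the commit's `query.replace(' ', '+')`, so characters such as `&` or `+` in the query are escaped as well.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode

# Base endpoint taken from the URL used in this commit.
SCHOLAR_SEARCH_URL = "https://scholar.google.com/scholar"


def build_scholar_url(
    query: str,
    max_age_years: int = 5,
    current_year: Optional[int] = None,
) -> str:
    """Hypothetical helper: build a search URL with a lower year bound (as_ylo)."""
    if current_year is None:
        # Same clock source as the commit: the local current year.
        current_year = datetime.now().year
    params = {
        "hl": "en",
        "q": query,
        "as_ylo": current_year - max_age_years,
    }
    # urlencode turns spaces into '+' and percent-escapes reserved characters.
    return f"{SCHOLAR_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_scholar_url("Edge AI", current_year=2026))
```

Passing `current_year` explicitly keeps the function deterministic in tests. When it is omitted, the behavior matches the commit and the bound follows the wall clock.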

:Testing Performed:
- Ran the crawler test suite and confirmed that the expected lower year bound is formatted into the request URL.
- All 91 automated pytest cases pass.
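
As a possible complement to the substring check in the new test, the year bound can also be verified by parsing the query string. The sketch below is an assumption about how such a check might look, not part of this commit: `lower_year_bound` is a hypothetical helper, and the sample URL is written out by hand in the format this commit produces.

```python
from urllib.parse import parse_qs, urlsplit


def lower_year_bound(url: str) -> int:
    """Hypothetical helper: return the as_ylo value from a search URL."""
    # parse_qs maps each parameter name to a list of its values.
    values = parse_qs(urlsplit(url).query).get("as_ylo", [])
    if len(values) != 1:
        raise ValueError(f"expected exactly one as_ylo parameter, got {len(values)}")
    return int(values[0])


if __name__ == "__main__":
    url = "https://scholar.google.com/scholar?hl=en&q=Edge+AI&as_ylo=2021"
    assert lower_year_bound(url) == 2021
```

Parsing the URL also catches a duplicated or malformed `as_ylo` parameter, which a plain `in` check on the string would miss.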

:QA Notes:
- Verified that the `as_ylo` parameter is inserted into the request URL, so the year limit is applied by Google Scholar at the search engine level.

:Issues Addressed:
- Resolves an issue where Scholar returned outdated sources (for example, from 2005 and 2008).

Change-Id: I56ae2fd7369d61494d17520238c3ef66e14436c7
Artur Mukhamadiev 2026-03-19 14:57:33 +03:00
parent e1c7f47f8f
commit 6d2ac9d0f0
2 changed files with 13 additions and 1 deletions


@@ -13,8 +13,9 @@ logger = logging.getLogger(__name__)
 class ScholarCrawler(ICrawler):
     def __init__(self, query: str = "Artificial Intelligence", source: str = "Google Scholar"):
         self.query = query
+        current_year = datetime.now().year
         # Google Scholar query URL
-        self.url = f"https://scholar.google.com/scholar?hl=en&q={query.replace(' ', '+')}"
+        self.url = f"https://scholar.google.com/scholar?hl=en&q={query.replace(' ', '+')}&as_ylo={current_year - 5}"
         self.source = source
     async def fetch_latest(self) -> List[NewsItemDTO]:


@@ -113,3 +113,14 @@ async def test_scholar_crawler_captcha():
     items = await crawler.fetch_latest()
     assert items == []
+@pytest.mark.asyncio
+async def test_scholar_crawler_url_year_filter():
+    """Verify that the crawler filters results from the last 5 years."""
+    current_year = datetime.now().year
+    expected_year = current_year - 5
+    query = "Edge AI"
+    crawler = ScholarCrawler(query=query)
+    # The URL should include the lower year bound filter
+    assert f"&as_ylo={expected_year}" in crawler.url