feat: add Tavily Extract as pluggable web scraper option #1303

Open
tavily-integrations wants to merge 1 commit into khoj-ai:master from Tavily-FDE:feat/tavily-migration/tavily-web-scraper

@tavily-integrations

Summary

  • Added TAVILY to WebScraper.WebScraperType enum in the database model
  • Implemented read_webpage_with_tavily() using the Tavily Extract API (/extract endpoint) to extract content from URLs, returning raw markdown content
  • Added routing branch in scrape_webpage() for the new TAVILY scraper type
  • Added env-var fallback in aget_enabled_webscrapers() adapter so Tavily is auto-discovered when TAVILY_API_KEY is set
  • Added TAVILY_API_KEY and TAVILY_API_URL env var resolution in the model's clean() method (consistent with existing Exa/Firecrawl/Olostep handling)
  • Created Django migration 0100_alter_webscraper_type for the updated choices field
  • Added tavily-python >= 0.5.0 to pyproject.toml dependencies
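
To make the Extract call in the summary concrete, here is a minimal sketch of how the request could be built and the response parsed. It is not the PR's code: the helper names `build_extract_request` and `parse_extract_response` are hypothetical, and the Bearer auth header, the `urls` request field, and the `results[].raw_content` response shape are assumptions about the Tavily Extract API that should be checked against its reference.

```python
from typing import Any, Optional

DEFAULT_TAVILY_API_URL = "https://api.tavily.com"  # default stated in this PR


def build_extract_request(url: str, api_key: str, api_url: str = DEFAULT_TAVILY_API_URL):
    """Return (endpoint, headers, payload) for a single-URL Extract call.

    Hypothetical helper; the header and payload shapes are assumptions.
    """
    endpoint = f"{api_url.rstrip('/')}/extract"
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    payload = {"urls": [url]}  # assumed request field name
    return endpoint, headers, payload


def parse_extract_response(data: dict[str, Any]) -> Optional[str]:
    """Return the raw content of the first result, or None if nothing was extracted."""
    results = data.get("results") or []  # assumed response field name
    if not results:
        return None
    return results[0].get("raw_content")  # assumed response field name


# With aiohttp, as the PR describes, the call would look roughly like:
#   async with aiohttp.ClientSession() as session:
#       async with session.post(endpoint, json=payload, headers=headers) as resp:
#           resp.raise_for_status()
#           content = parse_extract_response(await resp.json())
```

Keeping request construction and response parsing separate from the HTTP call makes both easy to unit test without network access.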

Files changed

  • src/khoj/database/models/__init__.py — Added TAVILY enum value and env var resolution
  • src/khoj/processor/tools/online_search.py — Added read_webpage_with_tavily() and scraper routing
  • src/khoj/database/adapters/__init__.py — Added TAVILY env-var fallback in adapter
  • src/khoj/database/migrations/0100_alter_webscraper_type.py — New migration for updated choices
  • pyproject.toml — Added tavily-python dependency
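
The enum value and routing branch listed above might look roughly like the following. This is a simplified stand-in, not the actual diff: it uses a plain `Enum` rather than the Django choices class, the string values and the `scrape_webpage()` signature are assumptions, and `read_webpage_default` is a hypothetical placeholder for the existing providers.

```python
import asyncio
from enum import Enum
from typing import Optional


class WebScraperType(str, Enum):
    # Stand-in for the model's choices enum; string values are assumed.
    FIRECRAWL = "Firecrawl"
    OLOSTEP = "Olostep"
    EXA = "Exa"
    TAVILY = "Tavily"  # the value this PR adds


async def read_webpage_with_tavily(url: str, api_key: Optional[str], api_url: Optional[str]) -> str:
    # Placeholder body; the real function calls the Tavily Extract API.
    return f"tavily:{url}"


async def read_webpage_default(url: str) -> str:
    # Hypothetical stand-in for the existing scraper branches.
    return f"default:{url}"


async def scrape_webpage(
    url: str,
    scraper_type: WebScraperType,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
) -> str:
    # New routing branch: dispatch to Tavily when the record's type is TAVILY.
    if scraper_type == WebScraperType.TAVILY:
        return await read_webpage_with_tavily(url, api_key, api_url)
    return await read_webpage_default(url)
```

Because the branch is keyed on the enum value alone, records of other types take the same path they did before.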

Environment variable changes

  • TAVILY_API_KEY — Required to use Tavily web scraper (shared with tavily-web-search unit)
  • TAVILY_API_URL — Optional, defaults to https://api.tavily.com
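
As an illustration of how these two variables could be resolved, the sketch below prefers explicitly configured values and falls back to the environment, then to the documented default URL. The helper name `resolve_tavily_config` and its precedence order are assumptions; the PR performs this resolution in the model's `clean()` method and in the adapter.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_TAVILY_API_URL = "https://api.tavily.com"  # default stated in this PR


def resolve_tavily_config(
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], str]:
    """Return (api_key, api_url), filling gaps from the environment.

    Hypothetical helper: explicit values win, then env vars, then the default URL.
    """
    key = api_key or env.get("TAVILY_API_KEY")
    url = api_url or env.get("TAVILY_API_URL") or DEFAULT_TAVILY_API_URL
    return key, url
```

A `None` key in the result signals that Tavily should not be auto-discovered as an enabled scraper.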

Dependency changes

  • Added tavily-python >= 0.5.0 to pyproject.toml

Notes for reviewers

  • The scrape_webpage_with_fallback() function requires no changes as it already iterates over all configured WebScraper DB records by priority
  • The Tavily Extract API is called via raw HTTP (aiohttp), consistent with the existing scraper implementations, rather than through the tavily-python SDK. The SDK dependency is added for potential use by the tavily-web-search unit
  • This is an additive change; existing scraper providers are unaffected

🤖 Generated with Claude Code

Automated Review

  • Passed after 1 attempt(s)
  • Final review: The Tavily web scraper migration is correct and complete. It adds TAVILY to the WebScraperType enum, implements read_webpage_with_tavily() using the Tavily Extract API via raw aiohttp (consistent with how other scrapers are implemented), routes correctly in scrape_webpage(), adds the env-var fallback in aget_enabled_webscrapers(), creates the Django migration, and adds the pyproject.toml dependency. All changes are scoped appropriately and no regressions are introduced. Two minor issues noted below but neither blocks approval.
