feat: add Tavily Extract as pluggable web scraper option by tavily-integrations · Pull Request #1303 · khoj-ai/khoj

tavily-integrations · 2026-04-01T17:20:30Z

Summary

Added TAVILY to WebScraper.WebScraperType enum in the database model
Implemented read_webpage_with_tavily() using the Tavily Extract API (/extract endpoint) to extract content from URLs, returning raw markdown content
Added routing branch in scrape_webpage() for the new TAVILY scraper type
Added env-var fallback in aget_enabled_webscrapers() adapter so Tavily is auto-discovered when TAVILY_API_KEY is set
Added TAVILY_API_KEY and TAVILY_API_URL env var resolution in the model's clean() method (consistent with existing Exa/Firecrawl/Olostep handling)
Created Django migration 0100_alter_webscraper_type for the updated choices field
Added tavily-python >= 0.5.0 to pyproject.toml dependencies
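
As a rough illustration of the extraction step listed above, the sketch below shows how a request to the /extract endpoint might be assembled and how its response might be reduced to page content. The helper names build_extract_request and parse_extract_response are hypothetical and do not come from this PR. The request and response shapes (a `urls` list, a bearer-token header, a `results` list with `raw_content`) are assumptions about the Tavily Extract API, not a copy of read_webpage_with_tavily(). The HTTP call itself, which the PR makes with aiohttp, is left out so the sketch runs without network access.

```python
from typing import Any, Optional


def build_extract_request(
    url: str, api_key: str, api_url: str = "https://api.tavily.com"
) -> dict[str, Any]:
    """Assemble endpoint, headers, and JSON body for one extract call.

    The field names used here are assumptions, not taken from the PR.
    """
    return {
        "endpoint": f"{api_url.rstrip('/')}/extract",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {"urls": [url]},
    }


def parse_extract_response(payload: dict[str, Any]) -> Optional[str]:
    """Return the first non-empty raw content in the response, else None."""
    for result in payload.get("results") or []:
        content = result.get("raw_content")
        if content:
            return content
    return None
```

Keeping request assembly and response parsing separate from the network call makes both easy to check in isolation; an async caller would POST the `json` body to `endpoint` with the given `headers` and hand the decoded response to the parser.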

src/khoj/database/models/__init__.py — Added TAVILY enum value and env var resolution
src/khoj/processor/tools/online_search.py — Added read_webpage_with_tavily() and scraper routing
src/khoj/database/adapters/__init__.py — Added TAVILY env-var fallback in adapter
src/khoj/database/migrations/0100_alter_webscraper_type.py — New migration for updated choices
pyproject.toml — Added tavily-python dependency

TAVILY_API_KEY — Required to use Tavily web scraper (shared with tavily-web-search unit)
TAVILY_API_URL — Optional, defaults to https://api.tavily.com
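
The two variables above could be resolved along these lines. This is a minimal sketch, not the code in the model's clean() method: the function name resolve_tavily_config is hypothetical, and the rule that an explicitly configured value wins over the environment is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_TAVILY_API_URL = "https://api.tavily.com"


def resolve_tavily_config(
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> tuple[Optional[str], str]:
    """Return (api_key, api_url), falling back to env vars and then the default URL."""
    # Assumption: a value stored on the scraper record takes precedence over the environment.
    key = api_key or env.get("TAVILY_API_KEY")
    url = api_url or env.get("TAVILY_API_URL") or DEFAULT_TAVILY_API_URL
    return key, url
```

With neither variable set, the key resolves to None and the URL to https://api.tavily.com, so a caller can treat a missing key as "Tavily not configured".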

The scrape_webpage_with_fallback() function requires no changes, as it already iterates over all configured WebScraper DB records by priority
The Tavily Extract API is called over raw HTTP (aiohttp), consistent with the existing scraper implementations, rather than through the tavily-python SDK; the SDK dependency is added for potential use by the tavily-web-search unit
This is an additive change; existing scraper providers are unaffected
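
To show why a priority-ordered fallback loop needs no changes when a provider is added, here is a small sketch of that pattern. It is not the body of scrape_webpage_with_fallback(): ScraperConfig and scrape_with_fallback are hypothetical names, and treating a lower priority number as "tried first" is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Sequence


@dataclass
class ScraperConfig:
    """Hypothetical stand-in for a configured scraper record."""
    name: str
    priority: int
    read: Callable[[str], Awaitable[Optional[str]]]


async def scrape_with_fallback(
    url: str, scrapers: Sequence[ScraperConfig]
) -> tuple[Optional[str], Optional[str]]:
    """Try each scraper in priority order; return (content, scraper name) or (None, None)."""
    # Assumption: lower priority numbers are tried first.
    for scraper in sorted(scrapers, key=lambda s: s.priority):
        try:
            content = await scraper.read(url)
        except Exception:
            continue  # this provider failed, move on to the next one
        if content:
            return content, scraper.name
    return None, None
```

Because the loop only depends on each record exposing a reader, registering one more provider extends the list it walks without touching the loop itself.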

🤖 Generated with Claude Code

Passed after 1 attempt(s)
Final review: The Tavily web scraper migration is correct and complete. It adds TAVILY to the WebScraperType enum, implements read_webpage_with_tavily() using the Tavily Extract API via raw aiohttp (consistent with how other scrapers are implemented), routes correctly in scrape_webpage(), adds the env-var fallback in aget_enabled_webscrapers(), creates the Django migration, and adds the pyproject.toml dependency. All changes are scoped appropriately and no regressions are introduced. Two minor issues noted below but neither blocks approval.

feat: add Tavily as web scraper type for content extraction

902ca44