Releases: spider-rs/spider

v2.48.13 — CLI authenticate for Spider Cloud

31 Mar 21:58

What's New

spider authenticate command — Store your Spider Cloud API key locally for remote crawls.

Usage

# Authenticate (stores key in ~/.spider/credentials)
spider authenticate sk-your-key
spider auth  # alias, interactive prompt

# Crawl via Spider Cloud (key auto-loaded)
spider crawl -u https://example.com -o

# Choose cloud mode
spider crawl -u https://example.com --spider-cloud-mode smart -o
spider crawl -u https://example.com --spider-cloud-mode unblocker -o

# Remote headless browser via Spider Browser Cloud
spider crawl -u https://example.com --spider-cloud-browser --headless -o

Key Resolution Order

  1. --spider-cloud-key flag
  2. SPIDER_CLOUD_API_KEY env var
  3. Stored credentials from spider authenticate
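
The precedence above can be sketched as a pure function (illustrative only, not the actual CLI code; here `env` stands in for `std::env::var("SPIDER_CLOUD_API_KEY")` and `stored` for the key read from ~/.spider/credentials):

```rust
// Resolve the API key: flag beats env var beats stored credentials.
fn resolve_key(flag: Option<&str>, env: Option<&str>, stored: Option<&str>) -> Option<String> {
    flag.or(env).or(stored).map(str::to_owned)
}
```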

Cloud Modes

Mode       Description
proxy      Transparent proxy (default)
api        POST /crawl API
unblocker  Anti-bot bypass
fallback   Direct fetch first, falls back on 403/429/503
smart      Auto-detects bot protection
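
A minimal sketch of the fallback trigger, assuming the statuses in the table are the only ones that cause a retry through Spider Cloud (function name is illustrative):

```rust
// Fallback mode: retry via Spider Cloud only on these status codes.
fn should_fallback(status: u16) -> bool {
    matches!(status, 403 | 429 | 503)
}
```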

Sign up at https://spider.cloud for an API key.

v2.48.4

25 Mar 21:01

New spider_mcp crate — MCP server for Spider.

cargo install spider_mcp

Setup (Claude Code ~/.claude/settings.json or Claude Desktop config):

{ "mcpServers": { "spider": { "command": "spider-mcp" } } }

Usage examples:

Scrape a page:       "Fetch https://example.com as markdown"
Crawl a site:        "Crawl https://example.com up to 5 pages"
Extract links:       "Get all links from https://example.com"
Transform HTML:      "Convert this HTML to markdown: <h1>Hello</h1>"

Tools: spider_scrape · spider_crawl · spider_links · spider_transform

v2.48.2 — Parallel Crawl Backends

25 Mar 16:23

Race alternative browser engines alongside your primary crawl. Best HTML wins.

use spider::configuration::{BackendEndpoint, BackendEngine, ParallelBackendsConfig};
use spider::website::Website;

let mut website = Website::new("https://example.com");
website.configuration.parallel_backends = Some(ParallelBackendsConfig {
    backends: vec![BackendEndpoint {
        engine: BackendEngine::LightPanda,
        endpoint: Some("ws://127.0.0.1:9222".to_string()),
        binary_path: None,
        protocol: None,
    }],
    ..Default::default()
});
website.crawl().await;
// page.backend_source → "primary" | "lightpanda" | "servo" | "custom"

Features: lightpanda, servo, parallel_backends_full, custom backends via BackendEngine::Custom.

Zero overhead when disabled; anti-fingerprint jitter; auto-disable on failures; custom quality validator; full network-interception parity with Chrome.

v2.47.75

20 Mar 20:55

What's New

  • PageData & Crawler trait abstractions for extensible crawl pipelines
  • Proxy support for LLM HTTP requests (#378)
  • Chrome remote_addr via CDP Network.responseReceived
  • Remote cache for Chrome responses — dump & fallback support

Performance

  • SIMD-accelerated byte scanning (memchr), unrolled FNV hash
  • Trie: Box<str> keys + manual byte-walk + memchr dot scan
  • Bloom filter bitmask addressing + inline early-exit
  • Zero-alloc DNS cache hits via Arc<[SocketAddr]>
  • Skip robots.txt for single-page crawls, TCP keepalive always
  • Batch LRU cache eviction to reduce write-lock hold time
  • cache_mem auto-enables skip_browser for Chrome mode
  • Remove unnecessary Box allocations and redundant string conversions
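
For reference, the FNV hash mentioned above in its scalar FNV-1a form (the release's unrolled variant processes several bytes per iteration; the constants are the standard 64-bit offset basis and prime):

```rust
// FNV-1a, 64-bit: xor each byte into the state, then multiply by the prime.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100_0000_01b3); // FNV prime
    }
    hash
}
```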

Fixes

  • Fall back to HTTP crawl when Chrome is unavailable (#373)
  • Skip redirect-caused net::ERR_ABORTED instead of aborting navigation
  • Deterministic CDP event listener shutdown via watch channel
  • Remote cache fallback + 3s timeout for chrome_remote_cache path
  • Add decentralized stubs for set_url_parsed_direct and links_full

Deps

  • strum 0.28, phf 0.13, gemini-rust 1.7, chromey 2, llm_models_spider 0.1

Full Changelog: v2.47.52...v2.47.75

v2.47.51

19 Mar 15:07

  • NUMA thread pinning for multi-socket servers (numa feature)
  • zerocopy wire parsing for HTTP status lines, cache headers, DNS records (zero_copy feature)
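
A minimal illustration of the zero-copy idea behind the zero_copy feature (this assumes nothing about spider's actual parser): the returned slices borrow from the input buffer, so parsing a status line allocates nothing.

```rust
// Parse "HTTP/1.1 200 OK" into (version, status) without copying bytes.
fn parse_status_line(line: &[u8]) -> Option<(&[u8], u16)> {
    let mut parts = line.splitn(3, |&b| b == b' ');
    let version = parts.next()?;            // borrows from `line`
    let code = parts.next()?;
    let code = std::str::from_utf8(code).ok()?.parse::<u16>().ok()?;
    Some((version, code))
}
```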

v2.47.50

19 Mar 14:26

Zero-copy page passing (bytes::Bytes), mmap+hugepages bloom filter for URL dedup (bloom feature).
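
The release uses bytes::Bytes for this; a std-only sketch of the same idea with Arc<[u8]> (illustrative, not spider's API): cloning the handle copies a pointer, not the page body.

```rust
use std::sync::Arc;

// Turn an owned page body into a cheaply shareable, immutable buffer.
fn share_page(body: Vec<u8>) -> (Arc<[u8]>, Arc<[u8]>) {
    let page: Arc<[u8]> = body.into();
    let reader = Arc::clone(&page); // pointer copy, not a byte copy
    (page, reader)
}
```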

v2.47.24

15 Mar 17:55

io_uring TCP connect + lightweight background runtime

  • io_uring TCP connect: Socket + Connect opcodes for kernel-async TCP connects via the existing uring
    worker
  • Lightweight background runtime: Drops from multi-thread to current-thread tokio executor when
    io_uring is active
  • Public API: uring_fs::tcp_connect(addr), uring_fs::is_uring_enabled()
  • CI fixes: clippy unnecessary_cast, io_other_error, cargo fmt

Full Changelog: v2.47.23...v2.47.24

v2.45.28

02 Mar 15:28

Agent Hardening

  • Cap LLM-controlled durations (Wait, ClickHold, SetViewport, OpenPage)
  • Add js_escape() for safe JS string interpolation in action handlers
  • Wrap Navigate and screenshot calls with timeouts
  • Use PageWaitStrategy::Load for WaitForNavigation instead of fixed sleep
  • Replace eval_with_timeout for Fill/Type/Clear actions with error propagation
  • Improve semaphore and logging diagnostics on error paths
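
A hypothetical sketch of what a js_escape() helper might cover (the actual implementation may escape a different character set): the goal is that interpolated text can never break out of a JS string literal.

```rust
// Escape characters that could terminate a JS string or a <script> block.
fn js_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\'' => out.push_str("\\'"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '<' => out.push_str("\\x3c"), // avoid "</script>" injection
            c => out.push(c),
        }
    }
    out
}
```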

v2.45.24

21 Feb 12:54

What's New

Performance

  • Cache-first fast path — skip browser/HTTP entirely when cache has data (~5-50ms vs 1-3s)
  • Deferred Chrome — process multi-page crawls from cache before launching a browser
  • Work-stealing (hedged requests) — parallel retry for slow crawl requests
  • io_uring — StreamingWriter for high-throughput file I/O on Linux
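
The hedged-request idea can be sketched with std threads (spider's implementation is async; the names and timings here are illustrative): if the primary attempt misses a deadline, a second attempt is launched and whichever finishes first wins.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `attempt`; if it is still running after `hedge_after`, race a second copy.
fn hedged<F>(attempt: F, hedge_after: Duration) -> String
where
    F: Fn() -> String + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel();
    let tx2 = tx.clone();
    let primary = attempt.clone();
    thread::spawn(move || {
        let _ = tx.send(primary());
    });
    if let Ok(v) = rx.recv_timeout(hedge_after) {
        return v; // primary beat the hedge deadline
    }
    // Primary is slow: launch the hedge; the first sender wins.
    thread::spawn(move || {
        let _ = tx2.send(attempt());
    });
    rx.recv().expect("both attempts dropped")
}
```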

Agent

  • Per-round model pool routing — route cheap rounds to fast models, complex rounds to capable ones
  • Comprehensive router tests — 30+ multi-LLM reliability tests

Fixes

  • Chrome mode honors wait_for config (e.g. idle_network0) before HTML extraction
  • Smart mode lifecycle waiting matches Chrome coverage without re-fetch
  • Auto-retry www. URLs on SSL protocol errors
  • Reject empty HTML from all cache paths
  • Default no_sandbox() for headless Chrome (#354)

CLI

  • --wait-for flag for page load strategies (#352)

Housekeeping

  • Clippy warnings and formatting cleanup across workspace
  • Rewritten README, CONTRIBUTING.md, CHANGELOG backfill
  • Issue/PR templates, security policy, CI workflows (lint, audit, release)

v2.45.20

05 Feb 21:04

What's New

Relevance Gate for Remote Multimodal Crawling

Added a relevance_gate config that instructs the LLM to return a "relevant": true|false field in its JSON response. When a page is deemed irrelevant, its wildcard budget credit is refunded so the crawler discovers more relevant content.

New config fields:

  • relevance_gate: bool — enables the feature
  • relevance_prompt: Option<String> — optional custom relevance criteria

How it works:

  1. When enabled, the system prompt instructs the LLM to include "relevant": true|false
  2. If the model returns false, a budget credit is atomically accumulated
  3. Credits are drained in the crawl loop to restore the wildcard budget
  4. Default fallback is true (assume relevant) if the model omits the field
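
Steps 2 and 3 can be sketched with a plain atomic counter (names are illustrative, not spider's internals): refunds accumulate lock-free, and the crawl loop drains them in one swap.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Credits refunded by irrelevant pages, drained back into the wildcard budget.
struct RelevanceCredits(AtomicUsize);

impl RelevanceCredits {
    fn refund(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
    fn drain(&self) -> usize {
        self.0.swap(0, Ordering::Relaxed) // take all credits atomically
    }
}
```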

Example:

let cfgs = RemoteMultimodalConfigs::new(api_url, model)
    .with_relevance_gate(Some("Only pages about Rust programming".into()));

Full Changelog

  • feat(agent): add relevance_gate and relevance_prompt to RemoteMultimodalConfig
  • feat(agent): add atomic relevance_credits counter to RemoteMultimodalConfigs
  • feat(agent): add relevant: Option<bool> to AutomationResult and AutomationResults
  • feat(agent): extend system prompt and extraction with relevance gate instructions
  • feat(spider): add restore_wildcard_budget() for budget refund
  • feat(spider): drain relevance credits in crawl loop dequeue