Releases: spider-rs/spider
v2.48.13 — CLI authenticate for Spider Cloud
What's New
spider authenticate command — Store your Spider Cloud API key locally for remote crawls.
Usage
```sh
# Authenticate (stores key in ~/.spider/credentials)
spider authenticate sk-your-key
spider auth # alias, interactive prompt

# Crawl via Spider Cloud (key auto-loaded)
spider crawl -u https://example.com -o

# Choose cloud mode
spider crawl -u https://example.com --spider-cloud-mode smart -o
spider crawl -u https://example.com --spider-cloud-mode unblocker -o

# Remote headless browser via Spider Browser Cloud
spider crawl -u https://example.com --spider-cloud-browser --headless -o
```
Key Resolution Order
1. `--spider-cloud-key` flag
2. `SPIDER_CLOUD_API_KEY` env var
3. Stored credentials from `spider authenticate`
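In Rust terms, the lookup order reduces to something like this sketch (helper names are hypothetical; this is not the CLI's actual code):

```rust
use std::{env, fs};

// Hypothetical helper mirroring the documented precedence:
// flag > SPIDER_CLOUD_API_KEY > stored credentials.
fn resolve_api_key(flag: Option<String>) -> Option<String> {
    flag.or_else(|| env::var("SPIDER_CLOUD_API_KEY").ok())
        .or_else(read_stored_credentials)
}

// Reads the file written by `spider authenticate` (~/.spider/credentials).
fn read_stored_credentials() -> Option<String> {
    let home = env::var("HOME").ok()?;
    fs::read_to_string(format!("{home}/.spider/credentials"))
        .ok()
        .map(|s| s.trim().to_owned())
}
```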
Cloud Modes
| Mode | Description |
|---|---|
| `proxy` | Transparent proxy (default) |
| `api` | POST `/crawl` API |
| `unblocker` | Anti-bot bypass |
| `fallback` | Direct fetch first, falls back on 403/429/503 |
| `smart` | Auto-detects bot protection |
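For intuition, the `fallback` mode's decision can be sketched like this (illustrative only, using reqwest for brevity; `fetch_via_cloud` is a hypothetical stand-in for the remote path):

```rust
// Try a direct fetch first; route through Spider Cloud only on the
// block-signal status codes listed in the table above.
async fn fetch_with_fallback(url: &str) -> Result<String, reqwest::Error> {
    let resp = reqwest::get(url).await?;
    match resp.status().as_u16() {
        403 | 429 | 503 => fetch_via_cloud(url).await,
        _ => resp.text().await,
    }
}

// Hypothetical stand-in: the real path would POST to the Spider Cloud
// /crawl API with the resolved key.
async fn fetch_via_cloud(_url: &str) -> Result<String, reqwest::Error> {
    todo!("POST /crawl with the stored API key")
}
```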
Sign up at https://spider.cloud for an API key.
v2.48.4
New spider_mcp crate — MCP server for Spider.
```sh
cargo install spider_mcp
```
Setup (Claude Code `~/.claude/settings.json` or Claude Desktop config):
```json
{ "mcpServers": { "spider": { "command": "spider-mcp" } } }
```
Usage examples:
- Scrape a page: "Fetch https://example.com as markdown"
- Crawl a site: "Crawl https://example.com up to 5 pages"
- Extract links: "Get all links from https://example.com"
- Transform HTML: "Convert this HTML to markdown: <h1>Hello</h1>"
Tools: spider_scrape · spider_crawl · spider_links · spider_transform
v2.48.2 — Parallel Crawl Backends
Race alternative browser engines alongside your primary crawl. Best HTML wins.
```rust
use spider::configuration::{BackendEndpoint, BackendEngine, ParallelBackendsConfig};
use spider::website::Website;

let mut website = Website::new("https://example.com");
website.configuration.parallel_backends = Some(ParallelBackendsConfig {
    backends: vec![BackendEndpoint {
        engine: BackendEngine::LightPanda,
        endpoint: Some("ws://127.0.0.1:9222".to_string()),
        binary_path: None,
        protocol: None,
    }],
    ..Default::default()
});
website.crawl().await;
// page.backend_source → "primary" | "lightpanda" | "servo" | "custom"
```
Features: `lightpanda`, `servo`, `parallel_backends_full`, custom backends via `BackendEngine::Custom`.
Zero overhead when disabled, anti-fingerprint jitter, auto-disable on failures, custom quality validator, full network interception parity with Chrome.
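The "best HTML wins" race with a quality validator boils down to something like this generic sketch (not spider's internal types; assumes the futures crate):

```rust
use futures::future::join_all;
use std::{future::Future, pin::Pin};

// Naive quality heuristic: a complete document beats a truncated one.
fn quality(html: &str) -> usize {
    if html.contains("</html>") { html.len() } else { html.len() / 2 }
}

// Run every backend, then keep the response the validator scores highest.
async fn race_backends(
    backends: Vec<Pin<Box<dyn Future<Output = String> + Send>>>,
) -> Option<String> {
    join_all(backends).await.into_iter().max_by_key(|h| quality(h))
}
```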
v2.47.75
What's New
- PageData & Crawler trait abstractions for extensible crawl pipelines (see the sketch below)
- Proxy support for LLM HTTP requests (#378)
- Chrome `remote_addr` via CDP `Network.responseReceived`
- Remote cache for Chrome responses — dump & fallback support
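The release notes don't show the trait definitions, but the general shape of such an abstraction is roughly this (names and signatures are hypothetical):

```rust
// Hypothetical sketch only; spider's actual PageData/Crawler traits
// are not shown in these notes.
trait PageData {
    fn url(&self) -> &str;
    fn html(&self) -> &str;
}

#[allow(async_fn_in_trait)]
trait Crawler {
    type Page: PageData;
    async fn fetch(&self, url: &str) -> Option<Self::Page>;
}
```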
Performance
- SIMD-accelerated byte scanning (memchr), unrolled FNV hash
- Trie: `Box<str>` keys + manual byte-walk + memchr dot scan
- Bloom filter bitmask addressing + inline early-exit
- Zero-alloc DNS cache hits via `Arc<[SocketAddr]>` (sketched after this list)
- Skip robots.txt for single-page crawls, TCP keepalive always enabled
- Batch LRU cache eviction to reduce write-lock hold time
- `cache_mem` auto-enables `skip_browser` for Chrome mode
- Remove unnecessary Box allocations and redundant string conversions
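To make the `Arc<[SocketAddr]>` point concrete, here is an illustrative cache shape (not spider's actual internals): a hit hands out a refcount bump instead of allocating a fresh Vec.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Arc;

struct DnsCache {
    entries: HashMap<Box<str>, Arc<[SocketAddr]>>,
}

impl DnsCache {
    // Hit path: Arc::clone is a refcount bump, no allocation.
    fn get(&self, host: &str) -> Option<Arc<[SocketAddr]>> {
        self.entries.get(host).cloned()
    }

    // Insert path: convert once into Arc<[SocketAddr]> to amortize the allocation.
    fn insert(&mut self, host: &str, addrs: Vec<SocketAddr>) {
        self.entries.insert(host.into(), addrs.into());
    }
}
```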
Fixes
- Fall back to HTTP crawl when Chrome is unavailable (#373)
- Skip redirect-caused `net::ERR_ABORTED` instead of aborting navigation
- Deterministic CDP event listener shutdown via watch channel (see the sketch after this list)
- Remote cache fallback + 3s timeout for the `chrome_remote_cache` path
- Add decentralized stubs for `set_url_parsed_direct` and `links_full`
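The watch-channel shutdown pattern looks roughly like this (illustrative, not spider's exact code):

```rust
use tokio::sync::watch;

// The event loop exits deterministically when the sender flips the flag,
// instead of relying on the task being dropped mid-poll.
async fn run_listener(mut shutdown: watch::Receiver<bool>) {
    loop {
        tokio::select! {
            _ = shutdown.changed() => {
                if *shutdown.borrow() { break; }
            }
            _ = handle_next_event() => {}
        }
    }
}

// Stand-in for consuming one CDP event.
async fn handle_next_event() {
    tokio::task::yield_now().await;
}
```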
Deps
- strum 0.28, phf 0.13, gemini-rust 1.7, chromey 2, llm_models_spider 0.1
Full Changelog: v2.47.52...v2.47.75
v2.47.51
v2.47.50
v2.47.24
io_uring TCP connect + lightweight background runtime
- io_uring TCP connect: Socket + Connect opcodes for kernel-async TCP connects via the existing uring worker
- Lightweight background runtime: drops from multi-thread to current-thread tokio executor when io_uring is active
- Public API: `uring_fs::tcp_connect(addr)`, `uring_fs::is_uring_enabled()` (usage sketched below)
- CI fixes: clippy `unnecessary_cast`, `io_other_error`, cargo fmt
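Usage of the public API named above looks roughly like this (the module path and return type are assumptions; the release notes only name the functions):

```rust
use std::net::SocketAddr;

// Assumed path `spider::uring_fs`; only the function names are confirmed
// by the release notes.
async fn connect_example() {
    let addr: SocketAddr = "127.0.0.1:8080".parse().unwrap();
    if spider::uring_fs::is_uring_enabled() {
        // Kernel-async connect through the existing uring worker (Linux).
        let _stream = spider::uring_fs::tcp_connect(addr).await;
    }
}
```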
Full Changelog: v2.47.23...v2.47.24
v2.45.28
Agent Hardening
- Cap LLM-controlled durations (Wait, ClickHold, SetViewport, OpenPage)
- Add `js_escape()` for safe JS string interpolation in action handlers (sketched after this list)
- Wrap `Navigate` and screenshot calls with timeouts
- Use `PageWaitStrategy::Load` for `WaitForNavigation` instead of a fixed sleep
- Replace `eval_with_timeout` for Fill/Type/Clear actions with error propagation
- Improve semaphore and logging diagnostics on error paths
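A minimal sketch of the idea behind `js_escape()` (spider's actual implementation may differ): escape everything that could break out of a quoted JS string before interpolating untrusted values.

```rust
fn js_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '\'' => out.push_str("\\'"),
            '"' => out.push_str("\\\""),
            '`' => out.push_str("\\`"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            // U+2028/U+2029 terminate lines inside JS string literals.
            '\u{2028}' => out.push_str("\\u2028"),
            '\u{2029}' => out.push_str("\\u2029"),
            _ => out.push(c),
        }
    }
    out
}
```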
v2.45.24
What's New
Performance
- Cache-first fast path — skip browser/HTTP entirely when cache has data (~5-50ms vs 1-3s)
- Deferred Chrome — process multi-page crawls from cache before launching a browser
- Work-stealing (hedged requests) — parallel retry for slow crawl requests (see the sketch after this list)
- io_uring — StreamingWriter for high-throughput file I/O on Linux
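A minimal hedged-request sketch (illustrative, not spider's internals; reqwest and the fixed 500 ms hedge delay are assumptions): if the primary fetch is slow, a second attempt is started and whichever finishes first wins.

```rust
use std::time::Duration;
use tokio::time::sleep;

async fn fetch(url: &str) -> Result<String, reqwest::Error> {
    reqwest::get(url).await?.text().await
}

async fn hedged_fetch(url: &str) -> Result<String, reqwest::Error> {
    let primary = fetch(url);
    tokio::pin!(primary);
    tokio::select! {
        res = &mut primary => res,
        _ = sleep(Duration::from_millis(500)) => {
            // Primary is slow: race it against a fresh attempt.
            tokio::select! {
                res = &mut primary => res,
                res = fetch(url) => res,
            }
        }
    }
}
```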
Agent
- Per-round model pool routing — route cheap rounds to fast models, complex rounds to capable ones
- Comprehensive router tests — 30+ multi-LLM reliability tests
Fixes
- Chrome mode honors `wait_for` config (e.g. `idle_network` 0) before HTML extraction
- Smart mode lifecycle waiting matches Chrome coverage without re-fetch
- Auto-retry www. URLs on SSL protocol errors
- Reject empty HTML from all cache paths
- Default `no_sandbox()` for headless Chrome (#354)
CLI
- `--wait-for` flag for page load strategies (#352)
Housekeeping
- Clippy warnings and formatting cleanup across workspace
- Rewritten README, CONTRIBUTING.md, CHANGELOG backfill
- Issue/PR templates, security policy, CI workflows (lint, audit, release)
v2.45.20
What's New
Relevance Gate for Remote Multimodal Crawling
Added a relevance_gate config that instructs the LLM to return a "relevant": true|false field in its JSON response. When a page is deemed irrelevant, its wildcard budget credit is refunded so the crawler discovers more relevant content.
New config fields:
- `relevance_gate: bool` — enables the feature
- `relevance_prompt: Option<String>` — optional custom relevance criteria
How it works:
- When enabled, the system prompt instructs the LLM to include `"relevant": true|false`
- If the model returns `false`, a budget credit is atomically accumulated (sketched below)
- Credits are drained in the crawl loop to restore the wildcard budget
- Default fallback is `true` (assume relevant) if the model omits the field
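Mechanically, the credit refund reduces to an atomic counter like this illustrative sketch (names mirror the changelog entries below; the real counter lives inside `RemoteMultimodalConfigs`):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static RELEVANCE_CREDITS: AtomicUsize = AtomicUsize::new(0);

// Called when the model returns "relevant": false for a page.
fn refund_credit() {
    RELEVANCE_CREDITS.fetch_add(1, Ordering::Relaxed);
}

// Called from the crawl loop dequeue to restore the wildcard budget.
fn drain_credits() -> usize {
    RELEVANCE_CREDITS.swap(0, Ordering::Relaxed)
}
```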
Example:
```rust
let cfgs = RemoteMultimodalConfigs::new(api_url, model)
    .with_relevance_gate(Some("Only pages about Rust programming".into()));
```
Full Changelog
- feat(agent): add `relevance_gate` and `relevance_prompt` to `RemoteMultimodalConfig`
- feat(agent): add atomic `relevance_credits` counter to `RemoteMultimodalConfigs`
- feat(agent): add `relevant: Option<bool>` to `AutomationResult` and `AutomationResults`
- feat(agent): extend system prompt and extraction with relevance gate instructions
- feat(spider): add `restore_wildcard_budget()` for budget refund
- feat(spider): drain relevance credits in crawl loop dequeue