WebTaskBench
FAILED SITES

Sites That Blocked Us

7 out of 51 attempted sites could not be benchmarked. Transparency about failures is important for a credible benchmark.

Site                 Reason
stackoverflow.com    Anti-bot detection (Cloudflare challenge page)
reddit.com           Anti-bot detection (requires JavaScript rendering)
w3.org               Heavy server-side protection and rate limiting
reuters.com          Anti-bot detection (cookie consent wall + JS challenge)
techcrunch.com       Anti-bot detection (Cloudflare challenge page)
dev.to               Heavy JavaScript rendering required (SPA shell only)
mysql.com            Anti-bot detection (Oracle enterprise bot protection)

Why Sites Fail

The benchmark fetches pages with curl -sL, a plain HTTP client that silently follows redirects but does not execute JavaScript. This mirrors how many AI agent tools fetch web content.
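As a rough stand-in for that curl -sL step, the fetch can be sketched in Python. The helper name and User-Agent string below are illustrative assumptions, not part of the benchmark:

```python
# Sketch of a plain-HTTP fetch analogous to `curl -sL`: urllib follows
# redirects by default (like -L) and, like curl, runs no JavaScript.
# The User-Agent value is an illustrative assumption.
import urllib.request

def fetch(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Return (status, body) for a page fetched without a browser engine."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "WebTaskBench/1.0"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

A fetch like this sees exactly what the benchmark saw: whatever HTML the server returns before any client-side rendering.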

Anti-Bot Detection

Sites like Stack Overflow, Reddit, Reuters, and TechCrunch use services such as Cloudflare to detect and block automated access. The curl request receives a challenge page instead of the actual content. This is the most common failure mode, accounting for five of the seven failures.
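One way such challenge pages can be flagged is a small heuristic over status codes and marker strings. The markers and status list below are common patterns, not an official Cloudflare interface:

```python
# Hedged sketch: heuristics for spotting a bot-challenge page in a
# fetched response. Marker strings and status codes are illustrative
# assumptions based on commonly seen challenge pages.
CHALLENGE_MARKERS = (
    "just a moment",          # Cloudflare interstitial title
    "checking your browser",  # older Cloudflare challenge text
    "cf-chl",                 # challenge script/form identifiers
)

def looks_like_challenge(status: int, body: str) -> bool:
    """True if the response is likely a bot challenge, not real content."""
    if status in (403, 429, 503):  # statuses typical of blocked fetches
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```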

JavaScript-Only Rendering

Sites like dev.to are single-page applications that deliver a minimal HTML shell and render all content via JavaScript. Without a browser engine, the fetched HTML contains almost no usable content.
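A minimal sketch of how an empty SPA shell might be detected, assuming a crude tag-stripping pass and an illustrative character threshold (neither is part of the benchmark spec):

```python
# Sketch: flag an SPA shell by stripping tags and measuring how much
# visible text the fetched HTML actually carries. The 200-character
# threshold is an illustrative assumption.
import re

def visible_text(html: str) -> str:
    """Drop script/style blocks and tags, collapse whitespace."""
    html = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def is_spa_shell(html: str, min_chars: int = 200) -> bool:
    """True when the fetched HTML has almost no rendered text."""
    return len(visible_text(html)) < min_chars
```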

Enterprise Protection

mysql.com uses Oracle's enterprise-grade bot protection, which blocks automated requests. w3.org applies aggressive rate limiting and server-side protection.
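For rate-limited hosts like w3.org, one conventional mitigation is retrying with exponential backoff. The delays, attempt count, and error type below are illustrative assumptions, not benchmark settings:

```python
# Sketch: retry a fetch with exponential backoff between attempts.
# The sleep function is injectable so the policy is testable without
# real delays; OSError stands in for transient network failures.
import time

def fetch_with_backoff(fetch_once, attempts: int = 4, base: float = 1.0,
                       sleep=time.sleep):
    """Call fetch_once(), retrying on failure with doubling delays."""
    for i in range(attempts):
        try:
            return fetch_once()
        except OSError:
            if i == attempts - 1:
                raise
            sleep(base * (2 ** i))  # 1s, 2s, 4s, ... between attempts
```

Backoff helps only with rate limiting; it does not defeat bot-detection services, which block regardless of request pacing.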

Implications for AI Agents

These failures highlight a real challenge for AI agents: not all of the web is accessible via simple HTTP fetching. Agents that need content from these sites require browser-based approaches, which add complexity and latency.

Future benchmark runs may include a browser-based fetching mode to capture these sites. For now, we report them honestly as failures.