## Sites That Blocked Us
7 of the 51 sites we attempted could not be benchmarked. Transparency about these failures is important for a credible benchmark.
| Site | Reason |
|---|---|
| stackoverflow.com | Anti-bot detection (Cloudflare challenge page) |
| reddit.com | Anti-bot detection (requires JavaScript rendering) |
| w3.org | Heavy server-side protection and rate limiting |
| reuters.com | Anti-bot detection (cookie consent wall + JS challenge) |
| techcrunch.com | Anti-bot detection (Cloudflare challenge page) |
| dev.to | Heavy JavaScript rendering required (SPA shell only) |
| mysql.com | Anti-bot detection (Oracle enterprise bot protection) |
## Why Sites Fail
The benchmark fetches pages with `curl -sL`: a plain HTTP client that follows redirects but does not execute JavaScript. This mirrors how many AI agent tools fetch web content.
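The fetch step can be sketched as a thin wrapper around `curl -sL`. This is a minimal illustration, not the benchmark's actual harness; the `fetch` helper name and the 30-second timeout are assumptions:

```python
import subprocess

def fetch(url, timeout=30):
    """Fetch a page the way the benchmark does: silent (-s), follow
    redirects (-L), plain HTTP with no JavaScript execution.

    Returns the response body, or None if curl reports an error.
    (Illustrative wrapper; the real harness may differ.)
    """
    result = subprocess.run(
        ["curl", "-sL", "--max-time", str(timeout), url],
        capture_output=True,
        text=True,
    )
    return result.stdout if result.returncode == 0 else None
```

Because there is no browser engine in the loop, whatever bytes the server returns for a bare HTTP request are all the benchmark ever sees.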
### Anti-Bot Detection
Sites such as Stack Overflow, Reddit, Reuters, and TechCrunch sit behind services like Cloudflare that detect and block automated access. Instead of the actual content, the `curl` request receives a challenge page. This is the most common failure mode.
### JavaScript-Only Rendering
Sites like dev.to are single-page applications that deliver a minimal HTML shell and render all content via JavaScript. Without a browser engine, the fetched HTML contains almost no usable content.
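An SPA shell is easy to spot mechanically: after stripping tags and scripts, almost no visible text remains. The sketch below estimates the share of the payload that is visible text; the 5% threshold and function names are assumptions for illustration, not values from the benchmark:

```python
import re

def visible_text_ratio(html):
    """Rough share of the payload that is visible text after
    removing script/style blocks and all remaining tags."""
    if not html:
        return 0.0
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                      flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = " ".join(text.split())
    return len(text) / len(html)

def looks_like_spa_shell(html, threshold=0.05):
    # Threshold is an illustrative guess, not the benchmark's value.
    return visible_text_ratio(html) < threshold
```

A page like dev.to's shell, which is mostly `<script>` tags and an empty mount point such as `<div id="root">`, scores near zero on this ratio, while a server-rendered article scores much higher.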
### Enterprise Protection
mysql.com uses Oracle's enterprise-grade bot protection, which blocks automated requests. w3.org applies aggressive rate limiting and server-side protection.
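Rate limiting is the one failure mode that a plain HTTP client can sometimes work around by retrying politely. A hedged sketch of exponential backoff, assuming a `fetch_fn` that returns an `(status, body)` pair (both the helper and its signature are hypothetical, not part of the benchmark):

```python
import time

def fetch_with_backoff(fetch_fn, url, retries=3, base_delay=1.0):
    """Retry when the server rate-limits (429) or is temporarily
    unavailable (503), doubling the wait between attempts.

    fetch_fn(url) -> (status_code, body) is an assumed interface.
    """
    status, body = fetch_fn(url)
    for attempt in range(retries):
        if status not in (429, 503):
            break  # success or a non-retryable error
        time.sleep(base_delay * (2 ** attempt))
        status, body = fetch_fn(url)
    return status, body
```

Even with backoff, aggressively protected hosts such as w3.org may keep rejecting requests, which is why they appear in the table above as failures.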
## Implications for AI Agents
These failures highlight a real challenge for AI agents: not all of the web is accessible via simple HTTP fetching. Agents that need content from these sites require browser-based approaches, which add complexity and latency.
Future benchmark runs may include a browser-based fetching mode to capture these sites. For now, we report them honestly as failures.