WebTaskBench
METHODOLOGY

How We Measure

Overview

WebTaskBench measures the token efficiency of SOM (Semantic Object Model) compared to raw HTML when fetching web pages for AI agents. The goal is to quantify how many tokens can be saved by using a structured, semantic representation instead of raw markup.

HTML Baseline

The HTML baseline is captured using curl -sL, which silences progress output (-s), follows redirects (-L), and returns the raw HTTP response body. This represents what an AI agent would receive if it simply fetched a page without a JavaScript-capable browser.

This is intentionally a conservative baseline. Many AI agent tools use similar HTTP-only fetching. Browser-rendered pages would typically contain even more tokens due to dynamically inserted content, making the compression ratio potentially higher in practice.
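The baseline fetch can be mirrored in a few lines of Python; this is an illustrative sketch, not the benchmark's actual script, which may pass additional flags:

```python
import subprocess

def fetch_cmd(url: str) -> list[str]:
    # Mirrors the benchmark's baseline: -s silences progress output,
    # -L follows redirects. (Illustrative; the real script may differ.)
    return ["curl", "-s", "-L", url]

def fetch_html(url: str) -> str:
    # Raw HTTP response body -- no JavaScript execution, matching
    # what an HTTP-only agent tool would see.
    result = subprocess.run(fetch_cmd(url), capture_output=True, text=True)
    return result.stdout
```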

SOM Generation

SOM output is generated by Plasmate v0.3.0, which parses the HTML and produces a semantic object model. The SOM representation strips away presentational markup, scripts, styles, and other noise, retaining only the meaningful content and structure.
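Plasmate's actual SOM format and algorithm are not shown here, but the kind of reduction involved, dropping script and style bodies and presentational markup while keeping visible text, can be roughly illustrated with a minimal parser (hypothetical sketch only):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Hypothetical illustration -- NOT Plasmate's algorithm.
    # Drops <script>/<style> contents and all tags, keeping visible text.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

A real SOM output additionally preserves structure (headings, links, lists) rather than flattening everything to text; this sketch only shows the noise-stripping half of the idea.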

Token Counting

All token counts use the tiktoken cl100k_base tokenizer, the encoding used by GPT-4 and GPT-3.5 Turbo. (GPT-4o uses the newer o200k_base encoding.) This provides a consistent and widely-understood metric for token consumption.

Different models use different tokenizers, so actual token counts will vary slightly. However, the relative compression ratios should remain consistent across tokenizers.
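The counting and ratio computation are straightforward; a sketch using the tiktoken library (the benchmark's own scripts may differ):

```python
def count_tokens(text: str, encoding: str = "cl100k_base") -> int:
    # Requires the tiktoken package; imported lazily so the rest of
    # this sketch works without it installed.
    import tiktoken
    return len(tiktoken.get_encoding(encoding).encode(text))

def compression_ratio(html_tokens: int, som_tokens: int) -> float:
    # e.g. 3.0 means the SOM output is one third the size of the raw HTML.
    return html_tokens / som_tokens
```

Because relative ratios are reported rather than absolute counts, swapping in a different tokenizer should shift both numerator and denominator in roughly the same direction.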

What "Failure" Means

A site is marked as "failed" when the curl-based fetch does not return usable HTML content. This typically happens due to:

  • Anti-bot detection (Cloudflare challenges, CAPTCHA walls)
  • JavaScript-only rendering (SPAs that return empty shells)
  • Server-side rate limiting or IP blocking
  • Cookie consent walls that block content

7 out of 51 attempted sites failed in this benchmark run. These failures are documented transparently on the Failed Sites page.
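A failure check along these lines could be sketched as follows; the thresholds and marker strings are illustrative guesses, not the benchmark's actual rules:

```python
def looks_failed(status: int, body: str) -> bool:
    # Heuristics only -- illustrative, not the benchmark's exact checks.
    if status >= 400:
        return True                       # rate limiting, IP blocking
    if len(body.strip()) < 200:
        return True                       # empty SPA shells
    lowered = body.lower()
    challenge_markers = ("cf-challenge", "captcha", "just a moment")
    return any(m in lowered for m in challenge_markers)  # anti-bot walls
```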

When SOM Is Larger

For 8 out of 44 sites, the SOM representation is actually larger than the raw HTML. This happens with minimal sites (like example.com, which is only 152 tokens of HTML) where the SOM overhead exceeds the content. For these sites, an agent would be better served by the raw HTML.

This is expected and honest. SOM shines on complex, content-rich pages where there is significant markup overhead to strip away. Simple pages have little to compress.
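An agent pipeline could account for this edge case by simply picking whichever representation is smaller (a sketch):

```python
def pick_representation(html_tokens: int, som_tokens: int) -> str:
    # Prefer SOM only when it actually saves tokens; tiny pages like
    # example.com (152 HTML tokens) often fall through to raw HTML.
    return "som" if som_tokens < html_tokens else "html"
```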

Known Limitations

  • No JavaScript rendering: The curl baseline does not execute JavaScript. Sites that load content dynamically will show lower HTML token counts than what a browser would see.
  • Single snapshot: Pages change over time. These results represent a single fetch on April 1, 2026.
  • Geographic variation: Content may vary by region. All fetches were made from a single Linux x86_64 server.
  • Home pages only: Most URLs point to home pages or top-level docs. Sub-pages may have different characteristics.
  • One tokenizer: Only cl100k_base is used. Other tokenizers may yield slightly different ratios.

Benchmark Details

Plasmate version:  0.3.0
HTML baseline:     curl -sL (raw HTTP, no rendering)
Token counter:     tiktoken cl100k_base (GPT-4 tokenizer)
Date:              April 1, 2026
Platform:          Linux x86_64
Sites:             51 attempted, 44 successful, 7 failed
Source:            github.com/plasmate-labs/plasmate-benchmarks

Reproduce This Benchmark

git clone https://github.com/plasmate-labs/plasmate-benchmarks
cd plasmate-benchmarks
./run-benchmark.sh