Benchmark Protocol
A formal methodology for measuring and reporting token efficiency in AI agent web fetching pipelines. Results published using this protocol are comparable and reproducible.
This protocol governs all data published at webtaskbench.com. Third parties may use it to publish comparable results.
1. Purpose
The WebTaskBench Protocol defines a standard methodology for measuring how efficiently different tools represent web pages for language model consumption. It establishes:
- A common baseline (raw HTML via unauthenticated HTTP GET)
- A standard tokenization method
- Required reporting fields
- Data quality thresholds
- A machine-readable result format
Any tool or framework claiming token efficiency improvements can publish results using this protocol. Results are only comparable when produced under the same conditions.
2. Definitions
- Compression ratio: `html_tokens / tool_tokens`. A ratio of 10x means the tool produced output 10 times smaller than the raw HTML.
- HTML baseline: Raw HTML fetched via unauthenticated HTTP GET (equivalent to `curl -sL <url>`), without JavaScript rendering, cookie consent handling, or authentication.
- Token count: The number of tokens produced by the tiktoken `cl100k_base` tokenizer (compatible with GPT-3.5/4/4o).
- Session: A single fetch of one URL. Sessions are independent and stateless.
- Failure: A session where the tool could not produce output within the timeout, or where the tool returned an error page, an anti-bot challenge, or another non-content response.
- Category: A site classification (SaaS & Cloud, News & Media, Dev Tools, General) assigned by the benchmark maintainer.
3. Measurement Methodology
3.1 HTML Baseline
```shell
curl -sL --max-time 30 \
  -H "User-Agent: WebTaskBench/1.0" \
  "<url>"
```

The response body is tokenized directly: no parsing, no rendering, no modification.
3.2 Tool Output
The tool under test fetches the same URL and produces its output in its native format. For Plasmate, this is SOM JSON. For Firecrawl, this is Markdown. For any tool, it is the default output format as shipped.
Output is tokenized using tiktoken cl100k_base.
3.3 Ratio Calculation
```python
ratio = round(html_tokens / tool_tokens, 1)
```

If `tool_tokens > html_tokens`, the ratio is reported as less than 1.0 (the tool made the output larger). This is a valid result.
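A minimal sketch of the calculation, with illustrative function and variable names; the guard against non-positive counts is an assumption (the protocol only defines the ratio for successful sessions):

```python
def compression_ratio(html_tokens: int, tool_tokens: int) -> float:
    """Ratio of baseline HTML tokens to tool output tokens, one decimal place."""
    if tool_tokens <= 0:
        # Only successful sessions have a defined ratio; failed sessions
        # report null instead (see the result schema).
        raise ValueError("tool_tokens must be positive for a successful session")
    return round(html_tokens / tool_tokens, 1)
```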
3.4 Timeout
Sessions time out at 30,000 ms (30 seconds). Timeouts are reported as failures.
3.5 Freshness
Data must be collected within 30 days of publication. Data older than 30 days must be re-collected before being cited.
4. Site Selection
Minimum requirements for a valid benchmark run:
- At least 20 sites
- At least 3 categories represented
- No more than 40% of sites from any single category
- Sites must be publicly accessible without authentication
- Sites must not be controlled by the benchmark runner (no testing only your own sites)
Recommended: 50+ sites across all four categories, balanced between popular and niche.
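The selection rules above can be checked programmatically. A sketch, assuming sites are represented as `(url, category)` pairs; that representation, and the `valid_selection` name, are illustrative rather than mandated by the protocol:

```python
from collections import Counter

def valid_selection(sites: list[tuple[str, str]]) -> bool:
    """Check the minimum site-selection requirements for a benchmark run."""
    if len(sites) < 20:
        return False                       # at least 20 sites
    counts = Counter(category for _, category in sites)
    if len(counts) < 3:
        return False                       # at least 3 categories represented
    # No more than 40% of sites from any single category.
    return max(counts.values()) <= 0.4 * len(sites)
```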
5. Result Format
Benchmark results must be published as a JSON file conforming to the following schema:
```json
{
  "protocol_version": "1.0",
  "published": "2026-04-01T00:00:00Z",
  "tool": {
    "name": "Plasmate",
    "version": "0.4.1",
    "source": "https://github.com/plasmate-labs/plasmate",
    "license": "Apache 2.0"
  },
  "baseline": {
    "method": "curl -sL",
    "user_agent": "WebTaskBench/1.0",
    "timeout_ms": 30000
  },
  "tokenizer": {
    "library": "tiktoken",
    "model": "cl100k_base"
  },
  "environment": {
    "platform": "Linux x86_64",
    "timeout_ms": 30000
  },
  "results": [
    {
      "url": "https://cloud.google.com",
      "category": "SaaS & Cloud",
      "html_tokens": 759234,
      "tool_tokens": 6436,
      "ratio": 117.9,
      "status": "success"
    },
    {
      "url": "https://stackoverflow.com",
      "category": "Dev Tools",
      "html_tokens": null,
      "tool_tokens": null,
      "ratio": null,
      "status": "failure",
      "failure_reason": "Anti-bot detection (Cloudflare challenge)"
    }
  ],
  "summary": {
    "sites_attempted": 51,
    "sites_succeeded": 44,
    "sites_failed": 7,
    "avg_ratio": 17.5,
    "median_ratio": 9.4,
    "peak_ratio": 117.9,
    "peak_url": "https://cloud.google.com",
    "by_category": {
      "SaaS & Cloud": { "n": 12, "avg_ratio": 47.1 },
      "News & Media": { "n": 8, "avg_ratio": 41.3 },
      "Dev Tools": { "n": 18, "avg_ratio": 11.8 },
      "General": { "n": 6, "avg_ratio": 3.9 }
    }
  }
}
```

6. Submission
Third parties may submit benchmark results to webtaskbench.com for inclusion in the registry by:
- Running the benchmark following this protocol
- Producing a result JSON conforming to the schema in Section 5
- Opening a pull request to `plasmate-labs/plasmate-benchmarks` with the result file at `results/third-party/<tool-name>-<date>.json`
- Including a brief methodology note explaining any deviations from the standard
Results are reviewed by the webtaskbench maintainers and listed in the registry if they meet quality thresholds.
7. Quality Thresholds
Results are accepted if:
- At least 20 sites succeed
- The result file validates against the Section 5 schema
- Failure reasons are documented for all failed sites
- Data is less than 30 days old at time of submission
- The tool version is specified and the source is publicly accessible
Results are rejected if:
- Sites were cherry-picked (e.g., only sites where the tool performs well)
- The HTML baseline was modified (e.g., after JavaScript rendering)
- The tokenizer differs from `cl100k_base` without explicit justification
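The acceptance thresholds lend themselves to an automated pre-check before submission. A sketch, assuming `result` is the parsed JSON from Section 5; `meets_thresholds` is an illustrative name, and full schema validation (plus the freshness and cherry-picking checks) is omitted here:

```python
def meets_thresholds(result: dict) -> bool:
    """Pre-check a parsed result file against the mechanical acceptance thresholds."""
    rows = result.get("results", [])
    succeeded = [r for r in rows if r.get("status") == "success"]
    failed = [r for r in rows if r.get("status") == "failure"]
    if len(succeeded) < 20:
        return False                       # at least 20 sites must succeed
    if any(not r.get("failure_reason") for r in failed):
        return False                       # every failure must be documented
    tool = result.get("tool", {})
    # Tool version must be specified and the source publicly accessible.
    return bool(tool.get("version")) and bool(tool.get("source"))
```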
8. Versioning
This protocol is versioned. The current version is 1.0. Breaking changes (changes to measurement methodology) increment the major version. Additive changes (new optional fields) increment the minor version.
Results produced under different major versions are not directly comparable.