Benchmark Protocol
A formal methodology for measuring and reporting token efficiency in AI agent web fetching pipelines. Results published using this protocol are comparable and reproducible.
This protocol governs all data published at webtaskbench.com. Third parties may use it to publish comparable results.
1. Purpose
The WebTaskBench Protocol defines a standard methodology for measuring how efficiently different tools represent web pages for language model consumption. It establishes:
- A common baseline (raw HTML via unauthenticated HTTP GET)
- A standard tokenization method
- Required reporting fields
- Data quality thresholds
- A machine-readable result format
Any tool or framework claiming token efficiency improvements can publish results using this protocol. Results are only comparable when produced under the same conditions.
2. Definitions
- Compression ratio: `html_tokens / tool_tokens`. A ratio of 10x means the tool produced output 10 times smaller than the raw HTML.
- HTML baseline: Raw HTML fetched via unauthenticated HTTP GET (equivalent to `curl -sL <url>`), without JavaScript rendering, cookie consent handling, or authentication.
- Token count: The number of tokens produced by the tiktoken `cl100k_base` tokenizer (compatible with GPT-3.5/4/4o).
- Session: A single fetch of one URL. Sessions are independent and stateless.
- Failure: A session where the tool could not produce output within the timeout, or where the tool returned an error page, an anti-bot challenge, or another non-content response.
- Category: A site classification (SaaS & Cloud, News & Media, Dev Tools, General) assigned by the benchmark maintainer.
3. Measurement Methodology
3.1 HTML Baseline
```shell
curl -sL --max-time 30 \
  -H "User-Agent: WebTaskBench/1.0" \
  "<url>"
```

The response body is tokenized directly: no parsing, no rendering, no modification.
3.2 Tool Output
The tool under test fetches the same URL and produces its output in its native format. For Plasmate, this is SOM JSON. For Firecrawl, this is Markdown. For any tool, it is the default output format as shipped.
Output is tokenized using tiktoken cl100k_base.
3.3 Ratio Calculation
```python
ratio = round(html_tokens / tool_tokens, 1)
```

If `tool_tokens > html_tokens`, the ratio is reported as less than 1.0 (the tool made the output larger). This is a valid result.
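A minimal sketch of the calculation, with illustrative function and variable names; the guard against non-positive counts is an assumption (the protocol only defines the ratio for successful sessions):

```python
def compression_ratio(html_tokens: int, tool_tokens: int) -> float:
    """Ratio of baseline HTML tokens to tool output tokens, one decimal place."""
    if tool_tokens <= 0:
        # Only successful sessions have a defined ratio; failed sessions
        # report null instead (see the result schema).
        raise ValueError("tool_tokens must be positive for a successful session")
    return round(html_tokens / tool_tokens, 1)
```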
3.4 Timeout
Sessions time out at 30,000 ms (30 seconds). Timeouts are reported as failures.
3.5 Freshness
Data must be collected within 30 days of publication. Data older than 30 days must be re-collected before being cited.
4. Site Selection
Minimum requirements for a valid benchmark run:
- At least 20 sites
- At least 3 categories represented
- No more than 40% of sites from any single category
- Sites must be publicly accessible without authentication
- Sites must not be controlled by the benchmark runner (no testing only your own sites)
Recommended: 50+ sites across all four categories, balanced between popular and niche.
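The selection rules above can be checked programmatically. A sketch, assuming sites are represented as `(url, category)` pairs; that representation, and the `valid_selection` name, are illustrative rather than mandated by the protocol:

```python
from collections import Counter

def valid_selection(sites: list[tuple[str, str]]) -> bool:
    """Check the minimum site-selection requirements for a benchmark run."""
    if len(sites) < 20:
        return False                       # at least 20 sites
    counts = Counter(category for _, category in sites)
    if len(counts) < 3:
        return False                       # at least 3 categories represented
    # No more than 40% of sites from any single category.
    return max(counts.values()) <= 0.4 * len(sites)
```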
5. Result Format
Benchmark results must be published as a JSON file conforming to the following schema:
```json
{
  "protocol_version": "1.0",
  "published": "2026-04-01T00:00:00Z",
  "tool": {
    "name": "Plasmate",
    "version": "0.4.1",
    "source": "https://github.com/plasmate-labs/plasmate",
    "license": "Apache 2.0"
  },
  "baseline": {
    "method": "curl -sL",
    "user_agent": "WebTaskBench/1.0",
    "timeout_ms": 30000
  },
  "tokenizer": {
    "library": "tiktoken",
    "model": "cl100k_base"
  },
  "environment": {
    "platform": "Linux x86_64",
    "timeout_ms": 30000
  },
  "results": [
    {
      "url": "https://cloud.google.com",
      "category": "SaaS & Cloud",
      "html_tokens": 759234,
      "tool_tokens": 6436,
      "ratio": 117.9,
      "status": "success"
    },
    {
      "url": "https://stackoverflow.com",
      "category": "Dev Tools",
      "html_tokens": null,
      "tool_tokens": null,
      "ratio": null,
      "status": "failure",
      "failure_reason": "Anti-bot detection (Cloudflare challenge)"
    }
  ],
  "summary": {
    "sites_attempted": 51,
    "sites_succeeded": 44,
    "sites_failed": 7,
    "avg_ratio": 17.5,
    "median_ratio": 9.4,
    "peak_ratio": 117.9,
    "peak_url": "https://cloud.google.com",
    "by_category": {
      "SaaS & Cloud": { "n": 12, "avg_ratio": 47.1 },
      "News & Media": { "n": 8, "avg_ratio": 41.3 },
      "Dev Tools": { "n": 18, "avg_ratio": 11.8 },
      "General": { "n": 6, "avg_ratio": 3.9 }
    }
  }
}
```

6. Submission
Third parties may submit benchmark results to webtaskbench.com for inclusion in the registry by:
- Running the benchmark following this protocol
- Producing a result JSON conforming to the schema in Section 5
- Opening a pull request to `plasmate-labs/plasmate-benchmarks` with the result file at `results/third-party/<tool-name>-<date>.json`
- Including a brief methodology note explaining any deviations from the standard
Results are reviewed by the webtaskbench maintainers and listed in the registry if they meet quality thresholds.
7. Quality Thresholds
Results are accepted if:
- At least 20 sites succeed
- The result file validates against the Section 5 schema
- Failure reasons are documented for all failed sites
- Data is less than 30 days old at time of submission
- The tool version is specified and the source is publicly accessible
Results are rejected if:
- Sites were cherry-picked (e.g., only sites where the tool performs well)
- The HTML baseline was modified (e.g., after JavaScript rendering)
- The tokenizer differs from `cl100k_base` without explicit justification
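The acceptance thresholds lend themselves to an automated pre-check before submission. A sketch, assuming `result` is the parsed JSON from Section 5; `meets_thresholds` is an illustrative name, and full schema validation (plus the freshness and cherry-picking checks) is omitted here:

```python
def meets_thresholds(result: dict) -> bool:
    """Pre-check a parsed result file against the mechanical acceptance thresholds."""
    rows = result.get("results", [])
    succeeded = [r for r in rows if r.get("status") == "success"]
    failed = [r for r in rows if r.get("status") == "failure"]
    if len(succeeded) < 20:
        return False                       # at least 20 sites must succeed
    if any(not r.get("failure_reason") for r in failed):
        return False                       # every failure must be documented
    tool = result.get("tool", {})
    # Tool version must be specified and the source publicly accessible.
    return bool(tool.get("version")) and bool(tool.get("source"))
```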
8. Versioning
This protocol is versioned. The current version is 1.0. Breaking changes (changes to measurement methodology) increment the major version. Additive changes (new optional fields) increment the minor version.
Results produced under different major versions are not directly comparable.