8. Evaluation & Benchmarks

Evaluating AI agents is fundamentally different from evaluating static LLMs. Agents must be assessed on their ability to reason, plan, use tools, recover from errors, and complete multi-step tasks — not just generate text.

8.1 Why Agent Evaluation Matters

Key Metrics

Metric	Description	How to Measure
Success Rate	Percentage of tasks completed correctly	Ground truth comparison
Token Cost	Total tokens consumed per task	Token counting
Latency	Time to complete a task	Wall-clock timing
Tool Accuracy	Correct tool selection and usage	Action log analysis
Error Recovery	Ability to recover from failures	Injected failure tests
Planning Quality	Efficiency of task decomposition	Expert evaluation
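
As a rough illustration of how the first four metrics in the table could be computed from logged runs, here is a minimal sketch. The `Episode` record and its field names are assumptions made for this example, not part of any benchmark's format; Error Recovery and Planning Quality need injected failures and expert review, so they are left out.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run on one task (hypothetical log record)."""
    success: bool            # final result matched the ground truth
    tokens: int              # total tokens consumed for the task
    latency_s: float         # wall-clock seconds to finish
    tool_calls: int          # number of tool calls in the action log
    correct_tool_calls: int  # calls judged correct during log analysis


def summarize(episodes):
    """Aggregate per-episode logs into the headline metrics."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    total_calls = sum(e.tool_calls for e in episodes)
    correct_calls = sum(e.correct_tool_calls for e in episodes)
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "mean_tokens": sum(e.tokens for e in episodes) / n,
        "mean_latency_s": sum(e.latency_s for e in episodes) / n,
        # Undefined when the agent never called a tool.
        "tool_accuracy": correct_calls / total_calls if total_calls else None,
    }
```

Tool accuracy is pooled over all calls rather than averaged per episode, so long episodes weigh more; averaging per episode is an equally reasonable choice.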

8.2 SWE-bench

The primary benchmark for software engineering agents.

Overview

SWE-bench evaluates an agent's ability to resolve real GitHub issues by generating patches.

Variants

Variant	Issues	Description
SWE-bench Full	2,294	Complete benchmark from 12 Python repos
SWE-bench Lite	300	Curated subset for faster evaluation
SWE-bench Verified	500	Human-verified for reliable evaluation

Evaluation Process

Input: Agent receives a GitHub issue (description + repo state)
Execution: Agent explores codebase, identifies bug, generates patch
Evaluation: Patch applied, test suite run
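Result: The issue counts as resolved only if the tests the fix is meant to repair (FAIL_TO_PASS) now pass and the tests that passed before the patch (PASS_TO_PASS) still pass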
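
The pass criterion in the last step can be sketched as follows. SWE-bench task instances list two groups of tests, FAIL_TO_PASS (tests the fix should repair) and PASS_TO_PASS (tests that must keep passing). The function names and the shape of `test_results` are assumptions for this example; the official harness does considerably more (containerized environments, log parsing).

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Return True if every required test passed after applying the patch.

    test_results maps a test id to True (passed) or False (failed).
    A required test that is missing from the results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(test_id, False) for test_id in required)


def resolved_rate(outcomes):
    """Fraction of task instances resolved, given a list of booleans."""
    if not outcomes:
        raise ValueError("no task instances evaluated")
    return sum(outcomes) / len(outcomes)
```

Treating a missing test as a failure is a deliberately strict choice: a patch that prevents the test suite from running should not be scored as a success.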

Progress Over Time

Period	Top Score (Verified)	Key Agent
2024 Q1	~4%	Early SWE-Agent
2024 Q2	~12%	SWE-Agent + GPT-4
2024 Q3	~22%	Agentless, AutoCodeRover
2024 Q4	~33%	OpenAI, Anthropic agents
2025 Q1	~42%	Claude Code, Codex
2025 Q2	~48%	Multi-agent approaches
2026 Q1	~53%	Continued improvement

Official Leaderboard

For the latest scores, see swebench.com

8.3 WebArena & OSWorld

WebArena

Evaluates agents on realistic web interaction tasks.

Key Features:

812 web tasks across 5 web applications
Tasks include: information retrieval, form filling, navigation
Fully reproducible web environment
Realistic, open-ended tasks

Category	Example Tasks
Information Finding	"Find the cheapest 4-star hotel in NYC"
Form Filling	"Submit an expense report for $50 lunch"
Navigation	"Go to Settings → Privacy → Change to Friends Only"
Data Entry	"Create a new contact with the following details"

OSWorld

Evaluates agents on real desktop OS tasks.

Key Features:

369 tasks across Ubuntu, Windows, macOS
Real operating system environments
Multi-application workflows
File, application, and system operations

Metric	WebArena (2025)	OSWorld (2025)
Top Agent Score	~48%	~22%
Human Baseline	~78%	~72%
Gap	30 pp	50 pp
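
The gap row is an absolute difference in percentage points. A short sketch, using the approximate figures from the table, shows that difference next to a relative view (the share of human performance the agent reaches); the function name is made up for this example.

```python
def gap_report(agent_pct, human_pct):
    """Compare an agent score with a human baseline, both in percent."""
    return {
        "gap_pp": human_pct - agent_pct,             # absolute gap, percentage points
        "fraction_of_human": agent_pct / human_pct,  # relative performance
    }


# Approximate 2025 figures from the table above.
for name, (agent, human) in {"WebArena": (48, 78), "OSWorld": (22, 72)}.items():
    report = gap_report(agent, human)
    print(name, report["gap_pp"], round(report["fraction_of_human"], 2))
```

On these numbers the top WebArena agent reaches a bit over 60% of human performance and the top OSWorld agent roughly 30%, which is a starker contrast than the 30-point and 50-point gaps suggest on their own.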

8.4 General Agent Benchmarks

GAIA (General AI Assistants)

Evaluates general AI assistant capabilities across difficulty levels.

Level	Tasks	Description
Level 1	146	Straightforward, few steps, little or no tool use
Level 2	245	Multi-step reasoning, tool use
Level 3	75	Complex, multi-modal, long-horizon

AgentBench

Multi-dimensional agent evaluation across 8 environments.

τ-bench

Evaluates agents on realistic customer service tasks with policy compliance.

Tests following complex policies
Evaluates tool usage accuracy
Measures conversation quality
Includes adversarial user scenarios

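Claw-Eval-Live (2026)

A dynamic agent benchmark built around real-world workflows. Unlike traditional frozen task sets, Claw-Eval-Live's tasks evolve over time.

Key Features:

Live benchmark: the task set is updated continuously to reflect changes in real workflows
End-to-end verification: checks whether the task was actually carried out, not just the final response
Multi-tool coverage: spans software tools, business services, and local workspaces
Extensibility: new workflow scenarios can be added dynamically

This represents a new paradigm for agent evaluation: a shift from static benchmarks to dynamic, evolving evaluation systems.

Source: arXiv 2604.28139 (2026-04-30)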

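Synthetic Computers at Scale (2026)

A method from Microsoft Research for building synthetic computers at scale to simulate long-horizon productivity work. The core idea is to create simulated user computer environments (with folder hierarchies and content-rich documents) and then run agent simulations that span 8+ hours and average 2,000+ turns.

Key Innovations:

1,000 synthetic computer environments: each has a realistic directory structure and documents (PPT, Excel, etc.)
Dual-agent architecture: one agent creates task objectives (simulating a month of human work) and a second agent carries them out
Potential to scale: in theory extensible to millions or even billions of synthetic user worlds

Experimental Results:

Each run averages more than 8 hours of agent runtime
Produced significant performance gains on both in-domain and out-of-domain productivity evaluations
Provides a foundational data layer for agent self-improvement and agentic RL

This opens a new paradigm for agent training: instead of relying on human-annotated trajectory data, learning signals are generated automatically from large-scale synthetic environments.

Source: arXiv 2604.28181 (2026-04-30)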

8.5 LLM-as-a-Judge

Using LLMs to evaluate agent outputs when human evaluation is impractical.

How It Works

Evaluation Approaches

Approach	Description	Best For
Single Score	Rate output on 1-5 scale	Quick assessment
Pairwise Comparison	Compare two outputs	Relative ranking
Reference-Based	Compare to ground truth	Task completion
Multi-Criteria	Score on multiple dimensions	Detailed analysis
Chain-of-Thought Judge	Judge explains reasoning	Reliability
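
For the pairwise approach, a common concern is that a judge may favor whichever output it sees first. The sketch below randomizes presentation order and maps the verdict back to the original labels. The `judge` callable and its return values ("first", "second", "tie") are assumptions for this example; in practice it would wrap a call to a judge model.

```python
import random


def pairwise_compare(judge, task, output_a, output_b, rng=None):
    """Return "A", "B", or "tie", hiding from the judge which output is which."""
    rng = rng or random.Random()
    swapped = rng.random() < 0.5  # show B first half of the time
    first, second = (output_b, output_a) if swapped else (output_a, output_b)

    verdict = judge(task, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")

    picked_first = verdict == "first"
    # If the order was swapped, "first" refers to B; otherwise to A.
    return "B" if picked_first == swapped else "A"
```

A judge that always answers "first" will split its votes between A and B under this scheme, so position bias shows up as noise instead of a systematic advantage for one system.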

Best Practices

Use strong models as judges (GPT-4o, Claude Opus)
Provide clear rubrics with specific criteria
Calibrate with human evaluation on a subset
Use multiple judges for high-stakes evaluation
Randomize presentation order for pairwise comparisons
Track inter-annotator agreement between LLM and human judges
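
One way to track agreement between an LLM judge and human raters on a calibration subset is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a minimal pure-Python sketch for two raters with categorical labels; the choice of kappa is one option among several, and the function name is only illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(human_labels, judge_labels)` on pass/fail labels returns 1.0 for perfect agreement and values near 0 when the judge agrees with humans no more often than chance would predict.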

Example Judge Prompt

You are evaluating an AI agent's response.

Task: {original_task}
Agent Response: {agent_response}

Evaluate on these criteria (1-5 scale each):
1. Task Completion: Did the agent fully complete the task?
2. Accuracy: Is the information correct?
3. Tool Usage: Were tools used appropriately?
4. Efficiency: Was the approach efficient?
5. Clarity: Is the response clear and well-structured?

Provide a score for each criterion and overall assessment.
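
A prompt like this is easier to use in a pipeline if the judge's reply can be parsed mechanically. The sketch below builds the prompt from the five criteria and extracts the scores, assuming the judge is additionally asked to answer with one "<criterion>: <score>" line per criterion. That output format, and the function names, are assumptions of this example, not part of the prompt above.

```python
import re

CRITERIA = ["Task Completion", "Accuracy", "Tool Usage", "Efficiency", "Clarity"]


def build_prompt(original_task, agent_response):
    """Fill in the judge prompt and request a machine-readable reply."""
    lines = [
        "You are evaluating an AI agent's response.",
        "",
        f"Task: {original_task}",
        f"Agent Response: {agent_response}",
        "",
        "Evaluate on these criteria (1-5 scale each):",
    ]
    lines += [f"{i}. {name}" for i, name in enumerate(CRITERIA, start=1)]
    lines.append('Reply with one line per criterion, formatted "<criterion>: <score>".')
    return "\n".join(lines)


def parse_scores(reply):
    """Extract one integer score (1-5) per criterion from the judge's reply."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{re.escape(name)}\s*:\s*([1-5])\b", reply)
        if match is None:
            raise ValueError(f"missing or out-of-range score for {name!r}")
        scores[name] = int(match.group(1))
    return scores
```

Raising on a missing or out-of-range score, instead of defaulting to a neutral value, keeps malformed judge replies from silently skewing averages.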

8.6 Building an Evaluation Pipeline

Architecture

Implementation Checklist

Define clear success criteria for each task type
Create diverse test set covering edge cases
Implement action logging for all agent steps
Set up automated evaluation with ground truth
Add LLM-as-Judge for qualitative assessment
Establish baseline with human performance
Track metrics over time (regression testing)
Include cost and latency metrics
Regular calibration with human evaluators
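
The regression-testing item in the checklist can be sketched as a comparison between a stored baseline and the current run. The metric names, the direction table, and the 5% relative tolerance are all assumptions for this example and would need tuning to the noise level of a real test set.

```python
# Whether a larger value is an improvement for each tracked metric.
HIGHER_IS_BETTER = {
    "success_rate": True,
    "tool_accuracy": True,
    "mean_tokens": False,
    "mean_latency_s": False,
}


def find_regressions(baseline, current, rel_tolerance=0.05):
    """Return {metric: (old, new)} for metrics that got worse beyond the tolerance."""
    regressions = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        if metric not in baseline or metric not in current:
            continue  # metric not tracked in one of the runs
        old, new = baseline[metric], current[metric]
        if old == 0:
            continue  # relative change is undefined
        change = (new - old) / abs(old)
        worsening = -change if higher_is_better else change
        if worsening > rel_tolerance:
            regressions[metric] = (old, new)
    return regressions
```

A check like this can run in CI after each agent change, failing the build when the returned dictionary is non-empty.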

8.7 Evaluation Frameworks & Tools

Tool	Type	Description
LangSmith	Platform	Tracing, evaluation, and testing for LangChain
Promptfoo	CLI	Prompt evaluation and comparison
Ragas	Library	RAG-specific evaluation metrics
DeepEval	Library	LLM evaluation framework
Arize Phoenix	Platform	LLM observability and evaluation
Braintrust	Platform	Evaluation and experiment tracking

8.8 Evaluation Costs: The Emerging Bottleneck (2026)

Evaluation is becoming prohibitively expensive — in some cases rivaling training costs:

Benchmark	Cost	Details
HAL (Holistic Agent Leaderboard)	~$40,000	21,730 agent rollouts across 9 models
GAIA (single frontier model)	$2,829	Per run, before caching
MLE-Bench (full sweep)	~$100,000	75 competitions × 3 seeds × 6 models
HELM (30 models)	~$100,000	Aggregate API + GPU costs
The Well (scientific ML)	960 H100-hours	Single architecture evaluation
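
To put the totals in the table on a per-unit footing, the arithmetic below divides the approximate costs by the number of runs. The figures are taken from the table as given; the helper name is made up for this example.

```python
def cost_per_unit(total_usd, units):
    """Average cost of one run or rollout."""
    return total_usd / units


# MLE-Bench full sweep: 75 competitions x 3 seeds x 6 models.
mle_runs = 75 * 3 * 6
print(mle_runs, round(cost_per_unit(100_000, mle_runs), 2))  # 1350 74.07

# HAL: about $40,000 over 21,730 rollouts.
print(round(cost_per_unit(40_000, 21_730), 2))  # 1.84
```

The per-unit costs differ by a factor of about 40, which suggests the total bill depends on how long and tool-heavy each run is as much as on how many runs there are.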

Why Agent Evals Are Especially Expensive

Scaffold sensitivity: Exgentic's $22K sweep found a 33× cost spread on identical tasks just from scaffold choice
Non-compressibility: Static benchmarks can be subsampled (MMLU shrinks from 14K to 100 items with about 2% error), but agent benchmarks are noisy and scaffold-sensitive, so small subsets are much less reliable
Repeated runs: Adding reliability requires multiple runs, multiplying costs
Inference scaling: UK-AISI scaled agentic steps into millions to study inference-time compute
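
The repeated-runs point can be made concrete with a confidence interval on the success rate. The sketch below uses the Wilson score interval, one standard choice for binomial proportions. It treats runs as independent trials, which is a simplification for agent rollouts that share tasks or seeds.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


for n in (100, 500):
    low, high = wilson_interval(n // 2, n)
    print(n, round(low, 3), round(high, 3))
```

At a 50% success rate, 100 runs leave an interval of roughly ±10 percentage points. Halving that width takes about four times as many runs, which is why adding reliability multiplies cost.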

Cost Reduction Strategies

Flash-HELM approach: Run cheap evaluations first, spend high-resolution compute only on top candidates (100-200× reduction)
Anchor Points: 1-30 examples can rank-order 87 model/prompt pairs on GLUE
tinyBenchmarks: Item Response Theory to select minimal representative subsets
Caching: Reuse rollouts across evaluations where possible
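
The first strategy, cheap evaluation first and expensive evaluation only for the top candidates, can be sketched as a two-stage filter. This is a generic coarse-to-fine illustration and not Flash-HELM's actual procedure; `cheap_eval`, `full_eval`, and `keep` are hypothetical names.

```python
def coarse_to_fine(candidates, cheap_eval, full_eval, keep=2):
    """Score everything cheaply, then run the full evaluation on the top `keep`."""
    cheap_scores = {c: cheap_eval(c) for c in candidates}
    shortlist = sorted(candidates, key=cheap_scores.get, reverse=True)[:keep]
    # Only the shortlisted candidates incur the expensive evaluation.
    return {c: full_eval(c) for c in shortlist}
```

The saving comes from how few candidates reach the second stage. The risk is that a noisy cheap stage drops a candidate that would have ranked well, so `keep` should be generous when the cheap scores are close.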

Industry implication: As models improve, evaluation becomes the bottleneck. Teams must budget for eval compute alongside training compute.

Reference: HuggingFace: AI evals are becoming the new compute bottleneck

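Open Agent Leaderboard (2026)

An open agent evaluation leaderboard launched by HuggingFace together with IBM Research. It aims to provide a standardized, reproducible benchmark of what AI agents can actually do.

Key Features:

Open evaluation: all evaluation tasks are public and transparent, with support for community contributions and reproduction
Multi-dimensional coverage: covers core agent capabilities such as tool calling, reasoning, and multi-turn dialogue
Enterprise participation: maintained jointly by IBM Research and HuggingFace, balancing academic rigor with industrial practicality

This marks a key step in agent evaluation moving from academia toward industry standardization. As agents are deployed widely in production, standardized evaluation will become the basis for selecting and optimizing them.

Source: HuggingFace Blog: The Open Agent Leaderboard (2026-05-18)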

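ITBench-AA: Enterprise IT Agentic Task Benchmark (2026-05)

Artificial Analysis and the IBM Software Innovation Lab released ITBench-AA, the first benchmark series targeting enterprise IT agentic tasks. The first release focuses on Site Reliability Engineering (SRE) tasks.

Key Findings:

All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%
Among open-weight models, GLM-5.1 (Reasoning) leads at 40%, level with Gemini 3.5 Flash (high)
Turn counts differ by nearly 3×, and longer trajectories do not mean higher accuracy: GPT-5.5 averages 31 turns for 46%, while Gemini 3.1 Pro Preview averages 83 turns for only 30%

Scope:

59 SRE tasks (40 public + 19 held out) covering Kubernetes incident response
Agents must read logs, trace dependencies, and identify the root-cause entity in complex infrastructure
Planned extensions to Financial Operations (FinOps) and CISO tasks

This shows that although frontier models perform well on coding benchmarks, there is still substantial room for improvement in real enterprise IT operations.

Source: HuggingFace Blog: ITBench-AA (2026-05-27)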

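OpenSCAD Architectural 3D LLM Benchmark (2026-05)

An OpenSCAD architectural modeling benchmark published by ModelRift that tests how well LLMs can generate complex 3D building models. Entrants include Codex 5.5 High, Claude Sonnet, Claude Opus, Cursor Composer, Google Antigravity, and ModelRift.

Key Features:

Tasks require the LLM to reconstruct complex buildings such as the Pantheon in OpenSCAD code
Evaluation dimensions include geometric accuracy, code executability, and fidelity of architectural detail
Google Antigravity 2.0 ranked first in this benchmark

This extends LLM coding evaluation from pure text/code generation into 3D spatial understanding and parametric modeling, and places higher demands on an agent's spatial reasoning.

Source: ModelRift: OpenSCAD LLM Benchmark (2026-05)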

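HarnessAPI: Unified Streaming API and MCP Tools (2026-05)

arXiv paper 2605.22733 proposes the HarnessAPI framework, which addresses a double-maintenance problem: a Python function often needs both an HTTP endpoint (for humans and CI) and an MCP tool registration (for agent runtimes).

Core idea: a Skill-First design paradigm, where each Python function is defined once and automatically exposed as both a streaming HTTP API and an MCP tool.

Source: arXiv:2605.22733 (2026-05-21)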
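
The "define once, expose twice" idea can be illustrated with a small registry that derives two views from one function signature. This is not HarnessAPI's API: the `skill` decorator, the route layout, and the descriptor format are all hypothetical, and the sketch leaves out streaming and the MCP protocol entirely.

```python
import inspect

SKILLS = {}  # single source of truth: name -> function plus derived metadata


def skill(fn):
    """Hypothetical decorator: register a function once for both views."""
    params = {
        name: getattr(p.annotation, "__name__", str(p.annotation))
        for name, p in inspect.signature(fn).parameters.items()
    }
    SKILLS[fn.__name__] = {"fn": fn, "params": params, "doc": inspect.getdoc(fn) or ""}
    return fn


def http_routes():
    """View 1: route table an HTTP server could mount."""
    return {f"/skills/{name}": meta["fn"] for name, meta in SKILLS.items()}


def tool_descriptors():
    """View 2: tool listing an agent runtime could consume."""
    return [
        {"name": name, "description": meta["doc"], "parameters": meta["params"]}
        for name, meta in SKILLS.items()
    ]


@skill
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())
```

Because both views are generated from `SKILLS`, changing the signature or docstring of `word_count` updates the route and the tool descriptor together, which is the maintenance problem the paper targets.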

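DeepWeb-Bench (2026-05)

A deep research benchmark proposed by Sixiong Xie et al. The deep research capabilities of current frontier LLMs have saturated existing benchmarks, so DeepWeb-Bench focuses on complex tasks that require large-scale cross-source evidence gathering and long chains of derivation.

Key Features:

Requires agents to search the open web, gather evidence, and reach an answer through extended reasoning
Tasks are designed to be far harder than existing benchmarks and specifically test multi-source information integration
Covers deep research scenarios that can only be answered by combining several independent sources

Source: arXiv:2605.21482 (2026-05-20)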

8.9 Key Takeaways

Agent evaluation is multi-dimensional — not just text quality
SWE-bench is the standard for coding agent evaluation
WebArena and OSWorld test GUI interaction capabilities
LLM-as-a-Judge enables scalable but approximate evaluation
Always combine automated + human evaluation for reliable results
Track cost and latency alongside quality metrics

Start Simple

Begin with task completion rate as your primary metric. Add more dimensions (cost, latency, tool accuracy) as your evaluation matures.

Benchmark Selection

Choose benchmarks that match your use case:

Coding agents → SWE-bench
Web agents → WebArena
Desktop agents → OSWorld
General agents → GAIA / AgentBench
