8. Evaluation & Benchmarks
Evaluating AI agents is fundamentally different from evaluating static LLMs. Agents must be assessed on their ability to reason, plan, use tools, recover from errors, and complete multi-step tasks — not just generate text.
8.1 Why Agent Evaluation Matters
Key Metrics
| Metric | Description | How to Measure |
|---|---|---|
| Success Rate | Percentage of tasks completed correctly | Ground truth comparison |
| Token Cost | Total tokens consumed per task | Token counting |
| Latency | Time to complete a task | Wall-clock timing |
| Tool Accuracy | Correct tool selection and usage | Action log analysis |
| Error Recovery | Ability to recover from failures | Injected failure tests |
| Planning Quality | Efficiency of task decomposition | Expert evaluation |
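As a rough illustration of how the quantitative rows in this table might be aggregated, the sketch below summarizes a handful of logged runs. The `TaskRun` record and its field names are hypothetical, chosen only for this example; a real pipeline would fill them from its own action logs and ground-truth checks.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One logged agent run (hypothetical schema for illustration)."""
    task_id: str
    succeeded: bool          # result of the ground truth comparison
    tokens: int              # total tokens consumed on the task
    latency_s: float         # wall-clock seconds
    tool_calls: int          # tool calls found in the action log
    correct_tool_calls: int  # calls judged correct during log analysis


def summarize(runs):
    """Aggregate per-run records into the headline metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "mean_tokens": sum(r.tokens for r in runs) / n,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        # Tool accuracy is undefined when no tools were called.
        "tool_accuracy": (
            sum(r.correct_tool_calls for r in runs) / total_calls
            if total_calls else None
        ),
    }


runs = [
    TaskRun("t1", True, 1200, 8.5, 4, 4),
    TaskRun("t2", False, 3100, 21.0, 6, 3),
    TaskRun("t3", True, 900, 5.5, 2, 2),
    TaskRun("t4", True, 1800, 13.0, 4, 3),
]
summary = summarize(runs)
print(summary)
```

Error recovery and planning quality are left out on purpose: per the table, they need injected-failure tests and expert review rather than simple aggregation.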
8.2 SWE-bench
The primary benchmark for software engineering agents.
Overview
SWE-bench evaluates an agent's ability to resolve real GitHub issues by generating patches.
Variants
| Variant | Issues | Description |
|---|---|---|
| SWE-bench Full | 2,294 | Complete benchmark from 12 Python repos |
| SWE-bench Lite | 300 | Curated subset for faster evaluation |
| SWE-bench Verified | 500 | Human-verified for reliable evaluation |
Evaluation Process
- Input: Agent receives a GitHub issue (description + repo state)
- Execution: Agent explores codebase, identifies bug, generates patch
- Evaluation: Patch applied, test suite run
- Result: Pass if all relevant tests pass
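The pass rule in the last step can be sketched as a small predicate. This is not the SWE-bench harness, which applies the patch and runs the tests in an isolated environment; the test names and the `"passed"`/`"failed"` status strings below are assumptions made for the example.

```python
def is_resolved(test_outcomes: dict, relevant_tests: list) -> bool:
    """An instance counts as resolved only if every relevant test passed."""
    if not relevant_tests:
        # Without relevant tests there is nothing to verify, so do not count it.
        return False
    return all(test_outcomes.get(name) == "passed" for name in relevant_tests)


def resolve_rate(instances: dict, relevant_tests: list) -> float:
    """Fraction of instances whose patch passes all relevant tests."""
    resolved = [is_resolved(outcomes, relevant_tests) for outcomes in instances.values()]
    return sum(resolved) / len(resolved)


relevant = ["test_parse", "test_roundtrip"]  # hypothetical test names
instances = {
    "issue-1": {"test_parse": "passed", "test_roundtrip": "passed"},
    "issue-2": {"test_parse": "passed", "test_roundtrip": "failed"},
    "issue-3": {},  # patch failed to apply, so no tests ran
}
rate = resolve_rate(instances, relevant)
print(f"resolved {rate:.0%} of instances")
```

A missing test result is treated the same as a failure, so a patch that does not apply can never be counted as a pass.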
Progress Over Time
| Period | Top Score (Verified) | Key Agent |
|---|---|---|
| 2024 Q1 | ~4% | Early SWE-Agent |
| 2024 Q2 | ~12% | SWE-Agent + GPT-4 |
| 2024 Q3 | ~22% | Agentless, AutoCodeRover |
| 2024 Q4 | ~33% | OpenAI, Anthropic agents |
| 2025 Q1 | ~42% | Claude Code, Codex |
| 2025 Q2 | ~48% | Multi-agent approaches |
| 2026 Q1 | ~53% | Continued improvement |
For the latest scores, see swebench.com
8.3 WebArena & OSWorld
WebArena
Evaluates agents on realistic web interaction tasks.
Key Features:
- 812 web tasks across 5 web applications
- Tasks include: information retrieval, form filling, navigation
- Fully reproducible web environment
- Realistic, open-ended tasks
| Category | Example Tasks |
|---|---|
| Information Finding | "Find the cheapest 4-star hotel in NYC" |
| Form Filling | "Submit an expense report for $50 lunch" |
| Navigation | "Go to Settings → Privacy → Change to Friends Only" |
| Data Entry | "Create a new contact with the following details" |
OSWorld
Evaluates agents on real desktop OS tasks.
Key Features:
- 369 tasks across Ubuntu, Windows, macOS
- Real operating system environments
- Multi-application workflows
- File, application, and system operations
| Metric | WebArena (2025) | OSWorld (2025) |
|---|---|---|
| Top Agent Score | ~48% | ~22% |
| Human Baseline | ~78% | ~72% |
| Gap (percentage points) | ~30 | ~50 |
8.4 General Agent Benchmarks
GAIA (General AI Assistants)
Evaluates general AI assistant capabilities across difficulty levels.
| Level | Tasks | Description |
|---|---|---|
| Level 1 | 153 | Straightforward, single-step |
| Level 2 | 251 | Multi-step reasoning, tool use |
| Level 3 | 96 | Complex, multi-modal, long-horizon |
AgentBench
Multi-dimensional agent evaluation across 8 environments.
τ-bench
Evaluates agents on realistic customer service tasks with policy compliance.
- Tests following complex policies
- Evaluates tool usage accuracy
- Measures conversation quality
- Includes adversarial user scenarios
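Claw-Eval-Live (2026)
A dynamic agent benchmark built around real-world workflows. Unlike traditional benchmarks with a frozen task set, Claw-Eval-Live's tasks evolve over time.
Key Features:
- Living benchmark: the task set is continuously updated to reflect changes in real workflows
- End-to-end verification: evaluates not only the final response but also whether the task was actually carried out
- Multi-tool coverage: spans software tools, business services, and local workspaces
- Evolvability: new workflow scenarios can be added dynamically
This represents a new paradigm for agent evaluation: a shift from static benchmarks to dynamic, evolving evaluation systems.
Source: arXiv 2604.28139 (2026-04-30)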
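Synthetic Computers at Scale (2026)
A method from Microsoft Research for building synthetic computers at scale to simulate long-horizon productivity work. The core idea is to create simulated user computer environments (with folder hierarchies and content-rich documents) and then run agent simulations that span 8+ hours and average 2,000+ turns.
Key Innovations:
- 1,000 synthetic computer environments: each has a realistic directory structure and documents (PPT, Excel, etc.)
- Dual-agent architecture: one agent creates the task objectives (simulating a month of human work), and another agent carries them out
- Scaling potential: in principle extensible to millions or even billions of synthetic user worlds
Experimental Results:
- Each run averages more than 8 hours of agent runtime
- Produced significant performance gains on both in-domain and out-of-domain productivity evaluations
- Provides a foundational data layer for agent self-improvement and agentic RL
This opens a new paradigm for agent training: instead of relying on human-annotated trajectory data, learning signals are generated automatically from large-scale synthetic environments.
Source: arXiv 2604.28181 (2026-04-30)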
8.5 LLM-as-a-Judge
Using LLMs to evaluate agent outputs when human evaluation is impractical.
How It Works
Evaluation Approaches
| Approach | Description | Best For |
|---|---|---|
| Single Score | Rate output on 1-5 scale | Quick assessment |
| Pairwise Comparison | Compare two outputs | Relative ranking |
| Reference-Based | Compare to ground truth | Task completion |
| Multi-Criteria | Score on multiple dimensions | Detailed analysis |
| Chain-of-Thought Judge | Judge explains reasoning | Reliability |
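For the pairwise row, one common safeguard is to ask the judge twice with the presentation order swapped and accept a winner only when both verdicts agree. The sketch below assumes a `judge(first, second)` callable that returns `"first"` or `"second"`; both toy judges are stand-ins for a real LLM call.

```python
def pairwise_verdict(judge, output_a, output_b):
    """Compare A and B in both orders; disagreement is reported as a tie."""
    forward = judge(output_a, output_b)   # A shown first
    backward = judge(output_b, output_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # verdict flipped with position, so it is not trusted


def length_judge(first, second):
    """Toy judge that prefers the longer answer regardless of position."""
    return "first" if len(first) >= len(second) else "second"


def position_biased_judge(first, second):
    """Toy judge that always prefers whatever is shown first."""
    return "first"


a, b = "short", "a much longer answer"
consistent = pairwise_verdict(length_judge, a, b)
biased = pairwise_verdict(position_biased_judge, a, b)
print(consistent, biased)
```

The position-biased judge ends up with a tie on every pair, which is the point: a preference that depends only on ordering never becomes a ranking signal.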
Best Practices
- Use strong models as judges (GPT-4o, Claude Opus)
- Provide clear rubrics with specific criteria
- Calibrate with human evaluation on a subset
- Use multiple judges for high-stakes evaluation
- Randomize presentation order for pairwise comparisons
- Track inter-annotator agreement between LLM and human judges
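One way to track agreement between an LLM judge and human raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal pure-Python sketch; the labels are made-up sample data, and libraries such as scikit-learn provide an equivalent `cohen_kappa_score`.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both raters pick the same label independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on the same eight agent outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
llm = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

kappa = cohens_kappa(human, llm)
print(f"raw agreement: {sum(h == m for h, m in zip(human, llm)) / len(human):.2f}")
print(f"Cohen's kappa: {kappa:.3f}")
```

Here raw agreement is 0.75, but kappa is noticeably lower because both raters say "pass" most of the time, so some of that agreement would occur by chance.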
Example Judge Prompt
You are evaluating an AI agent's response.
Task: {original_task}
Agent Response: {agent_response}
Evaluate on these criteria (1-5 scale each):
1. Task Completion: Did the agent fully complete the task?
2. Accuracy: Is the information correct?
3. Tool Usage: Were tools used appropriately?
4. Efficiency: Was the approach efficient?
5. Clarity: Is the response clear and well-structured?
Provide a score for each criterion and overall assessment.
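A prompt like this is typically filled in programmatically and the reply parsed back into numbers. The sketch below assumes the judge is additionally asked to answer with one `<criterion>: <score>` line per criterion; that output format, the sample task, and the canned reply are all assumptions for illustration, and no model is actually called.

```python
import re

JUDGE_TEMPLATE = """You are evaluating an AI agent's response.
Task: {original_task}
Agent Response: {agent_response}
Evaluate on these criteria (1-5 scale each):
1. Task Completion: Did the agent fully complete the task?
2. Accuracy: Is the information correct?
3. Tool Usage: Were tools used appropriately?
4. Efficiency: Was the approach efficient?
5. Clarity: Is the response clear and well-structured?
Provide a score for each criterion and overall assessment.
Answer with one line per criterion in the form "<criterion>: <score>"."""

CRITERIA = ["Task Completion", "Accuracy", "Tool Usage", "Efficiency", "Clarity"]


def build_prompt(original_task, agent_response):
    """Fill the two placeholders in the judge template."""
    return JUDGE_TEMPLATE.format(
        original_task=original_task, agent_response=agent_response
    )


def parse_scores(reply):
    """Extract one 1-5 score per criterion; fail loudly if any is missing."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{re.escape(name)}\s*:\s*([1-5])\b", reply)
        if match is None:
            raise ValueError(f"no valid score found for {name!r}")
        scores[name] = int(match.group(1))
    return scores


prompt = build_prompt("Book a flight to Berlin", "Booked flight LH123 for 12 May.")

# Canned reply standing in for the judge model's output.
canned_reply = (
    "Task Completion: 4\nAccuracy: 5\nTool Usage: 3\n"
    "Efficiency: 4\nClarity: 4\nOverall: solid but used one redundant tool call"
)
scores = parse_scores(canned_reply)
overall = sum(scores.values()) / len(scores)
print(scores, overall)
```

Raising on a missing score, rather than defaulting to some value, keeps malformed judge replies from silently skewing the averages.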
8.6 Building an Evaluation Pipeline
Architecture
Implementation Checklist
- Define clear success criteria for each task type
- Create diverse test set covering edge cases
- Implement action logging for all agent steps
- Set up automated evaluation with ground truth
- Add LLM-as-Judge for qualitative assessment
- Establish baseline with human performance
- Track metrics over time (regression testing)
- Include cost and latency metrics
- Regular calibration with human evaluators
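The "track metrics over time" item can be turned into an automated gate by comparing each new evaluation run against a stored baseline. The sketch below is one possible shape, assuming three metrics and a 5% relative tolerance; the metric names, the threshold, and the sample numbers are illustrative choices, not recommendations from this guide.

```python
# Direction of improvement per metric (hypothetical metric names).
HIGHER_IS_BETTER = {
    "success_rate": True,
    "mean_tokens": False,
    "mean_latency_s": False,
}


def find_regressions(baseline, current, rel_tolerance=0.05):
    """Return metrics that got worse by more than rel_tolerance (relative change)."""
    regressions = {}
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        old, new = baseline[name], current[name]
        if old == 0:
            continue  # relative change is undefined; handle separately
        change = (new - old) / old
        worsening = -change if higher_is_better else change
        if worsening > rel_tolerance:
            regressions[name] = round(change, 4)
    return regressions


baseline = {"success_rate": 0.60, "mean_tokens": 2000, "mean_latency_s": 10.0}
current = {"success_rate": 0.54, "mean_tokens": 2050, "mean_latency_s": 12.5}

regressions = find_regressions(baseline, current)
print(regressions)
```

In this sample, success rate and latency are flagged while the 2.5% rise in token use stays within tolerance. A check like this could run in CI and fail the build when the returned dictionary is non-empty.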
8.7 Evaluation Frameworks & Tools
| Tool | Type | Description |
|---|---|---|
| LangSmith | Platform | Tracing, evaluation, and testing for LangChain |
| Promptfoo | CLI | Prompt evaluation and comparison |
| Ragas | Library | RAG-specific evaluation metrics |
| DeepEval | Library | LLM evaluation framework |
| Arize Phoenix | Platform | LLM observability and evaluation |
| Braintrust | Platform | Evaluation and experiment tracking |
8.8 Evaluation Costs: The Emerging Bottleneck (2026)
Evaluation is becoming prohibitively expensive — in some cases rivaling training costs:
| Benchmark | Cost | Details |
|---|---|---|
| HAL (Agent Leaderboard) | ~$40,000 | 21,730 agent rollouts across 9 models |
| GAIA (single frontier model) | $2,829 | Per run, before caching |
| MLE-Bench (full sweep) | ~$100,000 | 75 competitions × 3 seeds × 6 models |
| HELM (30 models) | ~$100,000 | Aggregate API + GPU costs |
| The Well (scientific ML) | 960 H100-hours | Single architecture evaluation |
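To put these totals in perspective, a quick back-of-the-envelope calculation gives the implied average cost per run. It uses only the approximate figures from the table above, so the results are rough averages rather than reported prices.

```python
# HAL: ~$40,000 spread over 21,730 agent rollouts.
hal_cost_per_rollout = 40_000 / 21_730

# MLE-Bench full sweep: 75 competitions x 3 seeds x 6 models.
mle_runs = 75 * 3 * 6
mle_cost_per_run = 100_000 / mle_runs

print(f"HAL: about ${hal_cost_per_rollout:.2f} per rollout")
print(f"MLE-Bench: {mle_runs} runs, about ${mle_cost_per_run:.2f} per run")
```

The per-run averages differ by more than an order of magnitude, which suggests that the length of each task drives cost at least as much as the number of tasks.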
Why Agent Evals Are Especially Expensive
- Scaffold sensitivity: Exgentic's $22K sweep found a 33× cost spread on identical tasks just from scaffold choice
- Non-compressibility: Static benchmarks can be subsampled aggressively (MMLU can be compressed from 14K to 100 items with 2% error), but agent benchmarks are noisy and scaffold-sensitive, so a small subset is far less representative
- Repeated runs: Adding reliability requires multiple runs, multiplying costs
- Inference scaling: UK-AISI scaled agentic steps into millions to study inference-time compute
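The cost of repeated runs follows from basic sampling statistics: the uncertainty of a measured success rate shrinks only with the square root of the number of independent attempts. The sketch below uses the normal approximation to a binomial proportion, which is an assumption that works reasonably well for moderate sample sizes and rates that are not close to 0 or 1.

```python
import math


def success_rate_margin(p, n, z=1.96):
    """Approximate 95% margin of error for a success rate p measured over n attempts."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1 - p) / n)


margin_100 = success_rate_margin(0.5, 100)
margin_400 = success_rate_margin(0.5, 400)

print(f"n=100: +/- {margin_100:.3f}")
print(f"n=400: +/- {margin_400:.3f}")
```

Halving the margin of error takes four times as many attempts, so tightening confidence intervals on an agent benchmark multiplies an already high per-run cost.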
Cost Reduction Strategies
- Flash-HELM approach: Run cheap evaluations first, spend high-resolution compute only on top candidates (100-200× reduction)
- Anchor Points: 1-30 examples can rank-order 87 model/prompt pairs on GLUE
- tinyBenchmarks: Item Response Theory to select minimal representative subsets
- Caching: Reuse rollouts across evaluations where possible
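The "cheap first, expensive only for top candidates" idea can be sketched as a two-stage sweep. This is a generic coarse-to-fine illustration, not the Flash-HELM implementation; the agent names, the integer skill table, and the toy `evaluate` function are hypothetical stand-ins for real rollouts.

```python
# Hypothetical skill levels that stand in for real agent behavior.
SKILL = {"agent-a": 90, "agent-b": 50, "agent-c": 70, "agent-d": 20}


def evaluate(agent, tasks):
    """Toy evaluator: a task is solved when its difficulty is below the agent's skill."""
    return sum(difficulty < SKILL[agent] for difficulty in tasks) / len(tasks)


def two_stage_eval(candidates, cheap_tasks, full_tasks, keep_top):
    """Screen everyone on a small task set, then fully evaluate only the finalists."""
    cheap_scores = {c: evaluate(c, cheap_tasks) for c in candidates}
    finalists = sorted(candidates, key=cheap_scores.get, reverse=True)[:keep_top]
    full_scores = {c: evaluate(c, full_tasks) for c in finalists}
    task_runs = len(candidates) * len(cheap_tasks) + len(finalists) * len(full_tasks)
    return full_scores, task_runs


cheap_tasks = [10, 40, 60, 80]         # 4 screening tasks
full_tasks = list(range(0, 100, 5))    # 20 tasks in the full benchmark

full_scores, task_runs = two_stage_eval(list(SKILL), cheap_tasks, full_tasks, keep_top=2)
exhaustive_runs = len(SKILL) * len(full_tasks)

print(full_scores)
print(f"{task_runs} task runs instead of {exhaustive_runs}")
```

The saving grows with the number of candidates and the size of the full task set. The trade-off is that a noisy screening stage can eliminate a strong candidate early.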
Industry implication: As models improve, evaluation becomes the bottleneck. Teams must budget for eval compute alongside training compute.
🔗 Reference: HuggingFace: AI evals are becoming the new compute bottleneck
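Open Agent Leaderboard (2026)
An open agent evaluation leaderboard launched by HuggingFace together with IBM Research, intended to provide a standardized, reproducible benchmark of what AI agents can actually do.
Key Features:
- Open evaluation: all evaluation tasks are public and transparent, and the community can contribute to and reproduce them
- Multi-dimensional coverage: covers core agent capabilities such as tool calling, reasoning, and multi-turn dialogue
- Enterprise participation: jointly maintained by IBM Research and HuggingFace, balancing academic rigor with industrial practicality
This marks a key step in moving agent evaluation from academia toward industry standardization. As agents are deployed widely in production, standardized evaluation will become the basis for selecting and optimizing them.
Source: HuggingFace Blog: The Open Agent Leaderboard (2026-05-18)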
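DeepWeb-Bench (2026-05)
A Deep Research benchmark proposed by Sixiong Xie et al. The deep research capabilities of current frontier LLMs have saturated existing benchmarks, so DeepWeb-Bench focuses on complex tasks that require large-scale, cross-source evidence gathering and long chains of derivation.
Key Features:
- Requires the agent to search the open web, collect evidence, and reach an answer through extended reasoning
- Tasks are designed to be far harder than those in existing benchmarks and specifically test multi-source information integration
- Covers deep research scenarios that can only be answered by combining several independent sources
Source: arXiv:2605.21482 (2026-05-20)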
8.9 Key Takeaways
- Agent evaluation is multi-dimensional — not just text quality
- SWE-bench is the standard for coding agent evaluation
- WebArena and OSWorld test GUI interaction capabilities
- LLM-as-a-Judge enables scalable but approximate evaluation
- Always combine automated + human evaluation for reliable results
- Track cost and latency alongside quality metrics
Begin with task completion rate as your primary metric. Add more dimensions (cost, latency, tool accuracy) as your evaluation matures.
Choose benchmarks that match your use case:
- Coding agents → SWE-bench
- Web agents → WebArena
- Desktop agents → OSWorld
- General agents → GAIA / AgentBench