8. Evaluation & Benchmarks

Evaluating AI agents is fundamentally different from evaluating static LLMs. Agents must be assessed on their ability to reason, plan, use tools, recover from errors, and complete multi-step tasks — not just generate text.


8.1 Why Agent Evaluation Matters

Key Metrics

Metric | Description | How to Measure
Success Rate | Percentage of tasks completed correctly | Ground truth comparison
Token Cost | Total tokens consumed per task | Token counting
Latency | Time to complete a task | Wall-clock timing
Tool Accuracy | Correct tool selection and usage | Action log analysis
Error Recovery | Ability to recover from failures | Injected failure tests
Planning Quality | Efficiency of task decomposition | Expert evaluation
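
As a minimal sketch, the first three metrics in the table can be aggregated from per-task run records. The `RunRecord` schema and the `summarize` helper are hypothetical names chosen for illustration; they are not part of any benchmark's API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One agent attempt at one task (hypothetical schema)."""
    task_id: str
    succeeded: bool         # result of the ground-truth comparison
    tokens_used: int        # total tokens consumed for the whole task
    latency_seconds: float  # wall-clock time from start to final answer


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate success rate, mean token cost, and mean latency."""
    if not records:
        raise ValueError("no run records to summarize")
    return {
        "success_rate": sum(r.succeeded for r in records) / len(records),
        "mean_tokens": mean(r.tokens_used for r in records),
        "mean_latency_seconds": mean(r.latency_seconds for r in records),
    }


records = [
    RunRecord("task-1", True, 1200, 8.0),
    RunRecord("task-2", False, 3000, 20.0),
    RunRecord("task-3", True, 1800, 11.0),
    RunRecord("task-4", True, 2000, 9.0),
]
summary = summarize(records)
print(summary)
```

Tool accuracy, error recovery, and planning quality need richer inputs (action logs, injected failures, expert ratings), so they are left out of this sketch.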

8.2 SWE-bench

SWE-bench is the primary benchmark for software engineering agents.

Overview

SWE-bench evaluates an agent's ability to resolve real GitHub issues by generating patches.

Variants

Variant | Issues | Description
SWE-bench Full | 2,294 | Complete benchmark from 12 Python repos
SWE-bench Lite | 300 | Curated subset for faster evaluation
SWE-bench Verified | 500 | Human-verified for reliable evaluation

Evaluation Process

  1. Input: Agent receives a GitHub issue (description + repo state)
  2. Execution: Agent explores codebase, identifies bug, generates patch
  3. Evaluation: Patch applied, test suite run
  4. Result: Pass if all relevant tests pass
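
The pass criterion in step 4 can be sketched as a check over two groups of tests. The argument names mirror the FAIL_TO_PASS and PASS_TO_PASS test lists in the SWE-bench dataset; the `is_resolved` function itself is an illustrative assumption, not the official harness.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True only if every relevant test passed after the patch.

    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that passed before the fix and must not regress.
    A test that is missing from test_results counts as a failure.
    """
    relevant = fail_to_pass + pass_to_pass
    return all(test_results.get(name, False) for name in relevant)


# Hypothetical test names, for illustration only.
fixed = is_resolved(
    {"test_bug": True, "test_existing": True},
    fail_to_pass=["test_bug"],
    pass_to_pass=["test_existing"],
)
regressed = is_resolved(
    {"test_bug": True, "test_existing": False},
    fail_to_pass=["test_bug"],
    pass_to_pass=["test_existing"],
)
print(fixed, regressed)
```

A patch that fixes the target bug but breaks a previously passing test is therefore counted as a failure.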

Progress Over Time

Period | Top Score (Verified) | Key Agent
2024 Q1 | ~4% | Early SWE-Agent
2024 Q2 | ~12% | SWE-Agent + GPT-4
2024 Q3 | ~22% | Agentless, AutoCodeRover
2024 Q4 | ~33% | OpenAI, Anthropic agents
2025 Q1 | ~42% | Claude Code, Codex
2025 Q2 | ~48% | Multi-agent approaches
2026 Q1 | ~53% | Continued improvement

Official Leaderboard

For the latest scores, see swebench.com


8.3 WebArena & OSWorld

WebArena

Evaluates agents on realistic web interaction tasks.

Key Features:

  • 812 web tasks across 5 web applications
  • Tasks include: information retrieval, form filling, navigation
  • Fully reproducible web environment
  • Realistic, open-ended tasks

Category | Example Tasks
Information Finding | "Find the cheapest 4-star hotel in NYC"
Form Filling | "Submit an expense report for $50 lunch"
Navigation | "Go to Settings → Privacy → Change to Friends Only"
Data Entry | "Create a new contact with the following details"

OSWorld

Evaluates agents on real desktop OS tasks.

Key Features:

  • 369 tasks across Ubuntu, Windows, macOS
  • Real operating system environments
  • Multi-application workflows
  • File, application, and system operations

Metric | WebArena (2025) | OSWorld (2025)
Top Agent Score | ~48% | ~22%
Human Baseline | ~78% | ~72%
Gap | ~30 percentage points | ~50 percentage points

8.4 General Agent Benchmarks

GAIA (General AI Assistants)

Evaluates general AI assistant capabilities across difficulty levels.

Level | Tasks | Description
Level 1 | 153 | Straightforward, single-step
Level 2 | 251 | Multi-step reasoning, tool use
Level 3 | 96 | Complex, multi-modal, long-horizon

AgentBench

Multi-dimensional agent evaluation across 8 environments.

τ-bench

Evaluates agents on realistic customer service tasks with policy compliance.

  • Tests following complex policies
  • Evaluates tool usage accuracy
  • Measures conversation quality
  • Includes adversarial user scenarios

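Claw-Eval-Live (2026)

A dynamic agent benchmark built around real-world workflows. Unlike traditional frozen task sets, Claw-Eval-Live's tasks evolve over time.

Key Features

  • Live benchmark: the task set is continuously updated to reflect changes in real workflows
  • End-to-end verification: evaluates not only the final response but also whether the task was actually executed
  • Multi-tool coverage: spans software tools, business services, and local workspaces
  • Evolvability: supports dynamically adding new workflow scenarios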
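
End-to-end verification means checking the environment, not just the agent's reply. The sketch below illustrates the general idea with a local workspace; the `verify_report_task` helper and the `report.txt` artifact are hypothetical, and nothing here is taken from Claw-Eval-Live's actual implementation.

```python
import tempfile
from pathlib import Path


def verify_report_task(workspace: Path, agent_reply: str) -> dict[str, bool]:
    """Compare what the agent says it did with what is actually on disk."""
    report = workspace / "report.txt"  # hypothetical expected artifact
    return {
        "claimed_done": "saved" in agent_reply.lower(),
        "actually_done": report.is_file() and report.stat().st_size > 0,
    }


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    # The agent claims success but the file was never written.
    before = verify_report_task(workspace, "I saved the report.")
    # Simulate an agent that really performed the action.
    (workspace / "report.txt").write_text("Q3 summary", encoding="utf-8")
    after = verify_report_task(workspace, "I saved the report.")

print(before)
print(after)
```

A response-only evaluation would score both cases the same, since the reply text is identical; the state check separates them.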

This represents a new paradigm for agent evaluation: a shift from static benchmarks to dynamic, evolvable evaluation systems.

Source: arXiv 2604.28139 (2026-04-30)

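Synthetic Computers at Scale (2026)

A large-scale synthetic computer approach proposed by Microsoft Research for long-horizon productivity simulation. The core idea is to create simulated user computer environments (with folder hierarchies and content-rich documents), then run agent simulations that span 8+ hours and average 2,000+ turns.

Key Innovations

  • 1,000 synthetic computer environments: each has a realistic directory structure and documents (PPT, Excel, etc.)
  • Dual-agent architecture: one agent creates task objectives (simulating a month of human work), and another agent executes those tasks
  • Potential for scale: in theory, extensible to millions or even billions of synthetic user worlds

Experimental Results

  • Each run averages more than 8 hours of agent runtime
  • Produced significant performance gains on both in-domain and out-of-domain productivity evaluations
  • Provides a foundational data layer for agent self-improvement and agentic RL

This opens a new paradigm for agent training: rather than relying on human-annotated trajectory data, learning signals are generated automatically from large-scale synthetic environments.

Source: arXiv 2604.28181 (2026-04-30)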


8.5 LLM-as-a-Judge

Using LLMs to evaluate agent outputs when human evaluation is impractical.

How It Works

Evaluation Approaches

Approach | Description | Best For
Single Score | Rate output on 1-5 scale | Quick assessment
Pairwise Comparison | Compare two outputs | Relative ranking
Reference-Based | Compare to ground truth | Task completion
Multi-Criteria | Score on multiple dimensions | Detailed analysis
Chain-of-Thought Judge | Judge explains reasoning | Reliability
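
Pairwise comparison is easy to get subtly wrong, because a judge may favor whichever output it sees first. The sketch below shows one way to randomize presentation order and map the verdict back to the original labels. The `Judge` callable interface and the `longer_is_better` stand-in are assumptions for the demo; a real judge would call an LLM.

```python
import random
from typing import Callable

# Assumed interface: (task, first_output, second_output) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_compare(task: str, output_a: str, output_b: str,
                     judge: Judge, rng: random.Random) -> str:
    """Return "A" or "B", showing the two outputs in random order."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    verdict = judge(task, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def longer_is_better(task: str, first: str, second: str) -> str:
    """Stand-in judge for the demo; it simply prefers the longer output."""
    return "first" if len(first) >= len(second) else "second"


rng = random.Random(0)
winners = [
    pairwise_compare("Summarize the report", "short", "a longer answer",
                     longer_is_better, rng)
    for _ in range(10)
]
print(winners)
```

Because the stand-in judge has no position bias, output B wins every round regardless of order. With a real LLM judge, disagreement between swapped and unswapped rounds is a signal of position bias.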

Best Practices

  1. Use strong models as judges (GPT-4o, Claude Opus)
  2. Provide clear rubrics with specific criteria
  3. Calibrate with human evaluation on a subset
  4. Use multiple judges for high-stakes evaluation
  5. Randomize presentation order for pairwise comparisons
  6. Track inter-annotator agreement between LLM and human judges
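
For practices 3 and 6, one common agreement statistic is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The source does not name a specific statistic, so treat the choice of kappa, and the sample labels below, as assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Probability that both raters pick the same label by chance.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on eight agent outputs, for illustration only.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
llm = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 6 of 8 items (0.75), chance agreement is 0.5, and kappa comes out to 0.5. Raw agreement alone would overstate how well the LLM judge tracks the human one.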

Example Judge Prompt

You are evaluating an AI agent's response.

Task: {original_task}
Agent Response: {agent_response}

Evaluate on these criteria (1-5 scale each):
1. Task Completion: Did the agent fully complete the task?
2. Accuracy: Is the information correct?
3. Tool Usage: Were tools used appropriately?
4. Efficiency: Was the approach efficient?
5. Clarity: Is the response clear and well-structured?

Provide a score for each criterion and overall assessment.
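
A minimal sketch of how the prompt above might be filled in and the judge's reply parsed. It assumes the judge answers with lines such as "Accuracy: 5"; the prompt does not enforce that format, so in practice an explicit output-format instruction would be needed. The `build_prompt` and `parse_scores` helpers are hypothetical.

```python
import re

JUDGE_PROMPT = """You are evaluating an AI agent's response.

Task: {original_task}
Agent Response: {agent_response}

Evaluate on these criteria (1-5 scale each):
1. Task Completion: Did the agent fully complete the task?
2. Accuracy: Is the information correct?
3. Tool Usage: Were tools used appropriately?
4. Efficiency: Was the approach efficient?
5. Clarity: Is the response clear and well-structured?

Provide a score for each criterion and overall assessment."""

CRITERIA = ["Task Completion", "Accuracy", "Tool Usage", "Efficiency", "Clarity"]


def build_prompt(original_task: str, agent_response: str) -> str:
    """Fill the two placeholders in the judge prompt."""
    return JUDGE_PROMPT.format(original_task=original_task,
                               agent_response=agent_response)


def parse_scores(judge_reply: str) -> dict[str, int]:
    """Extract one 1-5 score per criterion from 'Name: N' lines (assumed format)."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{re.escape(name)}\s*:\s*([1-5])\b",
                          judge_reply, re.IGNORECASE)
        if match is None:
            raise ValueError(f"no score found for {name!r}")
        scores[name] = int(match.group(1))
    return scores


prompt = build_prompt("Book a flight to Paris", "Booked flight AF123.")
# A made-up judge reply, for illustration only.
reply = ("Task Completion: 4\nAccuracy: 5\nTool Usage: 3\n"
         "Efficiency: 4\nClarity: 5\nOverall: solid, minor tool misuse.")
scores = parse_scores(reply)
print(scores, sum(scores.values()) / len(scores))
```

Raising on a missing score, rather than defaulting to zero, keeps malformed judge replies from silently dragging averages down.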

8.6 Building an Evaluation Pipeline

Architecture

Implementation Checklist

  • Define clear success criteria for each task type
  • Create a diverse test set that covers edge cases
  • Implement action logging for all agent steps
  • Set up automated evaluation against ground truth
  • Add LLM-as-Judge for qualitative assessment
  • Establish a baseline from human performance
  • Track metrics over time (regression testing)
  • Include cost and latency metrics
  • Calibrate regularly with human evaluators
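
Several checklist items (action logging, ground-truth comparison, latency tracking, regression testing) fit into one small harness. The sketch below is one possible shape, assuming an agent is a callable that returns a final answer plus an action log; the `Agent` interface, the task fields, and `toy_agent` are all hypothetical.

```python
import time
from typing import Callable

# Assumed interface: prompt -> (final_answer, action_log).
Agent = Callable[[str], tuple[str, list[str]]]


def evaluate(agent: Agent, tasks: list[dict], baseline_success_rate: float) -> dict:
    """Run every task, log actions and latency, and flag regressions."""
    if not tasks:
        raise ValueError("no tasks to evaluate")
    results = []
    for task in tasks:
        start = time.perf_counter()
        answer, actions = agent(task["prompt"])
        latency = time.perf_counter() - start
        results.append({
            "task_id": task["id"],
            "passed": answer.strip() == task["expected"],  # ground-truth check
            "latency_seconds": latency,
            "actions": actions,                            # kept for log analysis
        })
    success_rate = sum(r["passed"] for r in results) / len(results)
    return {
        "success_rate": success_rate,
        "regressed": success_rate < baseline_success_rate,
        "results": results,
    }


def toy_agent(prompt: str) -> tuple[str, list[str]]:
    """Stand-in agent that only knows two answers."""
    known = {"2+2": "4", "3*3": "9"}
    return known.get(prompt, "unknown"), ["lookup"]


tasks = [
    {"id": "t1", "prompt": "2+2", "expected": "4"},
    {"id": "t2", "prompt": "3*3", "expected": "9"},
    {"id": "t3", "prompt": "10/4", "expected": "2.5"},
]
report = evaluate(toy_agent, tasks, baseline_success_rate=0.5)
print(report["success_rate"], report["regressed"])
```

Exact string matching is the simplest ground-truth check; tasks with free-form answers would need the LLM-as-Judge step from the checklist instead.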

8.7 Evaluation Frameworks & Tools

Tool | Type | Description
LangSmith | Platform | Tracing, evaluation, and testing for LangChain
Promptfoo | CLI | Prompt evaluation and comparison
Ragas | Library | RAG-specific evaluation metrics
DeepEval | Library | LLM evaluation framework
Arize Phoenix | Platform | LLM observability and evaluation
Braintrust | Platform | Evaluation and experiment tracking

8.8 Evaluation Costs: The Emerging Bottleneck (2026)

Evaluation is becoming prohibitively expensive, in some cases rivaling training costs:

Benchmark | Cost | Details
HAL (Agent Leaderboard) | ~$40,000 | 21,730 agent rollouts across 9 models
GAIA (single frontier model) | $2,829 | Per run, before caching
MLE-Bench (full sweep) | ~$100,000 | 75 competitions × 3 seeds × 6 models
HELM (30 models) | ~$100,000 | Aggregate API + GPU costs
The Well (scientific ML) | 960 H100-hours | Single architecture evaluation
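
The table's figures imply rough per-run costs, which is a useful way to budget a sweep of your own. The arithmetic below uses only the numbers in the table; since the totals are approximate, the results are too. The helper names are illustrative.

```python
def estimate_runs(tasks: int, seeds: int, models: int) -> int:
    """Total rollouts in a full sweep: every task, every seed, every model."""
    return tasks * seeds * models


def cost_per_run(total_cost_usd: float, runs: int) -> float:
    """Average cost of a single rollout."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return total_cost_usd / runs


# MLE-Bench: 75 competitions x 3 seeds x 6 models, about $100,000 in total.
mle_runs = estimate_runs(tasks=75, seeds=3, models=6)
mle_cost = cost_per_run(100_000, mle_runs)

# HAL: about $40,000 over 21,730 rollouts.
hal_cost = cost_per_run(40_000, 21_730)

print(f"MLE-Bench: {mle_runs} runs, about ${mle_cost:.2f} per run")
print(f"HAL: about ${hal_cost:.2f} per rollout")
```

This works out to 1,350 MLE-Bench runs at about $74 each and about $1.84 per HAL rollout. Because the total is a product of tasks, seeds, and models, adding seeds for reliability scales cost linearly.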

Why Agent Evals Are Especially Expensive

  • Scaffold sensitivity: Exgentic's $22K sweep found a 33× cost spread on identical tasks from scaffold choice alone
  • Non-compressibility: Static benchmarks compress well (MMLU can be cut from about 14K items to 100 with roughly 2% error), but agent benchmarks are noisy and scaffold-sensitive
  • Repeated runs: Reliable estimates require multiple runs per task, which multiplies cost
  • Inference scaling: UK-AISI scaled agentic steps into the millions to study inference-time compute

Cost Reduction Strategies

  1. Flash-HELM approach: Run cheap evaluations first, spend high-resolution compute only on top candidates (100-200× reduction)
  2. Anchor Points: 1-30 examples can rank-order 87 model/prompt pairs on GLUE
  3. tinyBenchmarks: Item Response Theory to select minimal representative subsets
  4. Caching: Reuse rollouts across evaluations where possible
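
The first strategy, a cheap pass followed by an expensive pass on the top candidates only, can be sketched as a two-stage funnel. This illustrates the general idea and is not the Flash-HELM implementation; the `Scorer` interface, the candidate names, and the scores are made up for the demo.

```python
from typing import Callable

# Assumed interface: candidate name -> score (higher is better).
Scorer = Callable[[str], float]


def two_stage_rank(candidates: list[str], cheap_score: Scorer,
                   full_score: Scorer, keep_top: int) -> list[tuple[str, float]]:
    """Shortlist with the cheap scorer, then rank the shortlist with the full one."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep_top]
    full_scores = {name: full_score(name) for name in shortlist}
    return sorted(full_scores.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for four hypothetical candidates.
cheap = {"a": 0.20, "b": 0.70, "c": 0.65, "d": 0.10}
full = {"a": 0.25, "b": 0.68, "c": 0.74, "d": 0.12}
expensive_calls = []


def run_full_eval(name: str) -> float:
    """Stand-in for the expensive evaluation; records each call."""
    expensive_calls.append(name)
    return full[name]


ranking = two_stage_rank(list(cheap), cheap.get, run_full_eval, keep_top=2)
print(ranking, len(expensive_calls))
```

Only two of the four candidates reach the expensive stage, and the full evaluation reorders them (c overtakes b). The risk is that a strong candidate is dropped by a noisy cheap score, so `keep_top` trades cost against that risk.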

Industry implication: As models improve, evaluation becomes the bottleneck. Teams must budget for eval compute alongside training compute.

Reference: HuggingFace: AI evals are becoming the new compute bottleneck

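Open Agent Leaderboard (2026)

An open agent evaluation leaderboard launched by HuggingFace together with IBM Research. It aims to provide standardized, reproducible benchmarks for the practical capabilities of AI agents.

Key Features

  • Open evaluation: all evaluation tasks are public and transparent, with support for community contributions and reproduction
  • Multi-dimensional coverage: spans core agent capabilities such as tool calling, reasoning, and multi-turn dialogue
  • Enterprise participation: jointly maintained by IBM Research and HuggingFace, balancing academic rigor with industrial practicality

This marks a key step for agent evaluation, from academia toward industry standardization. As agents are widely deployed in production, standardized evaluation will become the basis for selecting and optimizing them.

Source: HuggingFace Blog: The Open Agent Leaderboard (2026-05-18)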

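DeepWeb-Bench (2026-05)

A deep research benchmark proposed by Sixiong Xie et al. The deep research capabilities of current frontier LLMs have saturated existing benchmarks, so DeepWeb-Bench focuses on complex tasks that require large-scale cross-source evidence collection and long chains of derivation.

Key Features

  • Requires the agent to search the open web, collect evidence, and reach an answer through extended reasoning
  • Tasks are designed to be far harder than existing benchmarks, specifically testing multi-source information integration
  • Covers deep research scenarios that can only be answered by synthesizing multiple independent sources

Source: arXiv:2605.21482 (2026-05-20)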


8.9 Key Takeaways

  1. Agent evaluation is multi-dimensional; text quality is only one part of it
  2. SWE-bench is the standard for coding agent evaluation
  3. WebArena and OSWorld test GUI interaction capabilities
  4. LLM-as-a-Judge enables scalable but approximate evaluation
  5. Always combine automated + human evaluation for reliable results
  6. Track cost and latency alongside quality metrics

Start Simple

Begin with task completion rate as your primary metric. Add more dimensions (cost, latency, tool accuracy) as your evaluation matures.

Benchmark Selection

Choose benchmarks that match your use case:

  • Coding agents → SWE-bench
  • Web agents → WebArena
  • Desktop agents → OSWorld
  • General agents → GAIA / AgentBench