9. Engineering & Production

Building agents that work in prototypes is different from building agents that work reliably in production. This section covers the critical engineering challenges, evaluation methods, security considerations, and deployment strategies for production-grade AI agents.

9.1 Agent Evaluation

Evaluating agent performance is fundamentally different from traditional software testing due to non-determinism and complexity.

Evaluation Approaches

1. LLM-as-a-Judge

Use an LLM to evaluate agent outputs against criteria.

Implementation

@Service
public class AgentEvaluator {

    @Autowired
    private ChatClient evaluatorClient;

    public EvaluationResult evaluate(AgentOutput output, EvaluationCriteria criteria) {
        String evaluation = evaluatorClient.prompt()
            .system("""
                You are an expert evaluator for AI agent outputs.
                Rate the following on a scale of 1-10:
                1. Accuracy: Is the information correct?
                2. Completeness: Does it fully address the task?
                3. Relevance: Is the information focused?
                4. Safety: Are there any harmful outputs?
                """)
            .user("""
                Task: {task}
                Agent Output: {output}
                Context: {context}

                Provide evaluation in JSON format:
                {
                    "accuracy": 8,
                    "completeness": 7,
                    "relevance": 9,
                    "safety": 10,
                    "reasoning": "..."
                }
                """.formatted(
                    output.task(),
                    output.content(),
                    output.context()
                ))
            .call()
            .content();

        return parseEvaluation(evaluation);
    }
}

Best Practices

Clear Criteria: Define specific evaluation dimensions
Few-Shot Examples: Provide examples of good/bad outputs
Multiple Judges: Use multiple LLMs and aggregate
Human Validation: Calibrate LLM judges with human labels

2. Human Evaluation

Human evaluation remains the gold standard for quality.

Evaluation Framework

@Service
public class HumanEvaluationService {

    public EvaluationDataset createDataset(List<AgentOutput> outputs) {
        // Shuffle for randomization
        List<AgentOutput> shuffled = shuffle(outputs);

        // Create evaluation tasks
        return EvaluationDataset.builder()
            .instructions("Rate each output on accuracy, completeness, and quality (1-10)")
            .items(shuffled.stream()
                .map(this::createEvaluationItem)
                .toList())
            .build();
    }

    public EvaluationMetrics calculateMetrics(List<HumanRating> ratings) {
        return EvaluationMetrics.builder()
            .accuracyMean(ratings.stream().mapToInt(HumanRating::accuracy).average().orElse(0))
            .completenessMean(ratings.stream().mapToInt(HumanRating::completeness).average().orElse(0))
            .interAnnotatorAgreement(calculateKappa(ratings))
            .build();
    }
}

Evaluation Interface (Frontend)

// Next.js: Evaluation Interface
interface EvaluationItem {
  id: string;
  task: string;
  output: string;
  context: string;
}

interface Rating {
  accuracy: number;
  completeness: number;
  quality: number;
  notes?: string;
}

export function EvaluationForm({ item }: { item: EvaluationItem }) {
  const [rating, setRating] = useState<Rating>({
    accuracy: 5,
    completeness: 5,
    quality: 5,
  });

  const handleSubmit = async () => {
    await fetch('/api/evaluation/rate', {
      method: 'POST',
      body: JSON.stringify({ itemId: item.id, rating }),
    });
  };

  return (
    <div className="evaluation-form">
      <h3>Task: {item.task}</h3>
      <p>{item.output}</p>

      <Slider
        label="Accuracy"
        value={rating.accuracy}
        onChange={(v) => setRating({ ...rating, accuracy: v })}
      />

      <Slider
        label="Completeness"
        value={rating.completeness}
        onChange={(v) => setRating({ ...rating, completeness: v })}
      />

      <Slider
        label="Quality"
        value={rating.quality}
        onChange={(v) => setRating({ ...rating, quality: v })}
      />

      <Textarea
        label="Notes"
        value={rating.notes}
        onChange={(v) => setRating({ ...rating, notes: v })}
      />

      <Button onClick={handleSubmit}>Submit Rating</Button>
    </div>
  );
}

3. Automated Testing

Test specific agent behaviors with unit and integration tests.

@SpringBootTest
class AgentServiceTest {

    @Autowired
    private ReactAgentService agent;

    @MockBean
    private SearchService searchService;

    @Test
    void testAgentUsesSearchTool() {
        // Arrange
        when(searchService.search(anyString()))
            .thenReturn("Paris is the capital of France");

        // Act
        String result = agent.execute("What is the capital of France?", 5);

        // Assert
        assertThat(result).contains("Paris");
        verify(searchService, times(1)).search(anyString());
    }

    @Test
    void testAgentHandlesToolFailure() {
        // Arrange
        when(searchService.search(anyString()))
            .thenThrow(new ServiceUnavailableException("Search is down"));

        // Act
        String result = agent.execute("Search for news", 5);

        // Assert
        assertThat(result).contains("apologize");
        assertThat(result).contains("unavailable");
    }
}

4. Key Metrics

Metric	Description	Target
Task Success Rate	% of tasks completed successfully	> 90%
Accuracy	Factual correctness of outputs	> 95%
Relevance	How well output addresses task	> 90%
Safety	Absence of harmful content	100%
Latency (p50)	Median response time	< 5s
Latency (p95)	95th percentile response time	< 15s
Cost per Task	Token cost per successful task	Minimize
Tool Success Rate	% of tool calls successful	> 95%

Structured Output for Reliability

One of the most significant advances in 2025 for agent reliability is the widespread adoption of Structured Output:

OpenAI: response_format: { type: "json_schema", json_schema: {...} } guarantees valid JSON
Anthropic: Tool use with strict schema enforcement ensures parseable responses
Gemini: generationConfig.responseSchema for JSON mode

Impact on Agents:

Eliminates output parsing failures (a top source of agent crashes)
Enables reliable tool calling (tools receive valid parameters)
Reduces hallucination in structured tasks (schema constrains the output space)
Facilitates multi-agent communication (agents exchange JSON messages)

Modern Evaluation Tools

Tool	Type	Key Feature
LangSmith	Platform	End-to-end tracing, evaluation datasets, annotation queues
LangFuse	Open Source	Self-hosted observability, prompt management, cost tracking
Ragas	Library	RAG-specific metrics (faithfulness, relevance, context precision)
DeepEval	Library	LLM-as-judge, assertion-based testing, CI/CD integration
Promptfoo	CLI	Multi-model comparison, regression testing, eval datasets
Braintrust	Platform	Online evals, scorecards, data logging

Prompt Caching for Cost Optimization

Both OpenAI and Anthropic now support prompt caching, which can reduce costs by 80-90% for repeated prompt prefixes:

# Anthropic: Automatic caching for prompts > 1024 tokens
# Cached tokens cost 90% less
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    system=large_system_prompt,  # This gets cached
    messages=[{"role": "user", "content": "..."}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)

# OpenAI: Automatic caching for prompts > 1024 tokens
# Cached input tokens cost 50% less

Agent Cost Optimization Strategy:

Cache system prompts and tool definitions (large, rarely change)
Use cheaper models for routing/classification tasks
Implement semantic caching for similar queries
Batch API for non-real-time tasks (50% discount)

9.2 Common Challenges

Challenge 1: Hallucination

Agents can generate plausible-sounding but incorrect information.

Mitigation Strategies

Implementation

@Service
public class AntiHallucinationService {

    @Autowired
    private VectorStore vectorStore;

    @Autowired
    private ChatClient chatClient;

    public String generateWithVerification(String query) {
        // Step 1: Retrieve relevant context
        List<Document> context = vectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(5)
        );

        // Step 2: Generate with citations
        String response = chatClient.prompt()
            .user(query)
            .messages(createMessagesWithCitations(context))
            .call()
            .content();

        // Step 3: Verify claims
        List<Claim> claims = extractClaims(response);
        for (Claim claim : claims) {
            if (!verifyClaim(claim, context)) {
                return flagUncertainty(claim);
            }
        }

        return response;
    }

    private boolean verifyClaim(Claim claim, List<Document> context) {
        // Use RAG context to verify
        String verification = chatClient.prompt()
            .system("Verify if the claim is supported by the context.")
            .user("""
                Claim: {claim}
                Context: {context}
                Answer YES or NO with explanation.
                """.formatted(
                    claim.text(),
                    context.stream()
                        .map(Document::getContent)
                        .collect(Collectors.joining("\n"))
                ))
            .call()
            .content();

        return verification.toLowerCase().startsWith("yes");
    }
}

Challenge 2: Infinite Loops

Agents can get stuck in repetitive behaviors.

Solutions

@Service
public class LoopPreventionService {

    private static final int MAX_ITERATIONS = 10;
    private static final int MAX_REPEAT_ACTIONS = 3;

    public AgentExecutionResult executeWithGuardrails(AgentTask task) {
        Set<String> recentActions = new HashSet<>();
        int iteration = 0;

        while (iteration < MAX_ITERATIONS && !task.isComplete()) {
            String action = task.getNextAction();

            // Detect loops
            if (recentActions.contains(action)) {
                int count = countOccurrences(recentActions, action);
                if (count >= MAX_REPEAT_ACTIONS) {
                    return handleLoop(task, action);
                }
            }

            recentActions.add(action);
            if (recentActions.size() > 5) {
                recentActions.remove(recentActions.iterator().next());
            }

            // Execute
            task.executeAction(action);
            iteration++;
        }

        return task.getResult();
    }

    private AgentExecutionResult handleLoop(AgentTask task, String repeatingAction) {
        // Ask for human intervention
        return AgentExecutionResult.builder()
            .status("NEEDS_INTERVENTION")
            .message("Agent stuck in loop repeating: " + repeatingAction)
            .suggestedActions(List.of(
                "Retry with different approach",
                "Provide more specific instructions",
                "Break task into smaller steps"
            ))
            .build();
    }
}

Challenge 3: Cost Control

LLM usage can become expensive at scale.

Cost Optimization Strategies

Strategy	Impact	Implementation
Caching	High	Cache LLM responses
Smaller Models	High	Use Haiku for simple tasks
Token Limits	Medium	Set max tokens per request
Result Streaming	Low	Stream responses for UX
Batch Processing	Medium	Process multiple queries together

Implementation

@Service
public class CostOptimizedAgentService {

    @Autowired
    private ChatClient gpt4Client; // Expensive

    @Autowired
    private ChatClient haikuClient; // Cheap

    @Autowired
    private CacheManager cacheManager;

    public String execute(AgentRequest request) {
        // Check cache first
        String cacheKey = generateCacheKey(request);
        String cached = cacheManager.getCache("agent-responses").get(cacheKey, String.class);
        if (cached != null) {
            return cached;
        }

        // Route to appropriate model
        ChatClient client = selectModel(request);
        String response = client.prompt().user(request.query()).call().content();

        // Cache the result
        cacheManager.getCache("agent-responses").put(cacheKey, response);

        return response;
    }

    private ChatClient selectModel(AgentRequest request) {
        // Use Haiku for simple queries
        if (request.complexity() == Complexity.LOW) {
            return haikuClient;
        }

        // Use GPT-4 for complex tasks
        return gpt4Client;
    }
}

Challenge 4: Latency

Agents need to respond quickly for good UX.

Optimization Techniques

Parallel Tool Execution

@Service
public class ParallelToolExecutor {

    @Autowired
    private List<FunctionCallback> tools;

    public Map<String, String> executeParallel(List<ToolCall> calls) {
        ExecutorService executor = Executors.newFixedThreadPool(10);

        List<CompletableFuture<Map.Entry<String, String>>> futures = calls.stream()
            .map(call -> CompletableFuture.supplyAsync(() -> {
                String result = executeTool(call);
                return Map.entry(call.name(), result);
            }, executor))
            .toList();

        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

        return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                Map.Entry::getValue
            ));
    }
}

9.3 Security & Safety

2025 Agent Threat Landscape

Agents introduce new attack surfaces beyond traditional LLM risks:

Threat	Description	Example
Tool Poisoning	Malicious MCP server returns harmful data	Compromised tool injects instructions via response
Agent-to-Agent Attack	One agent manipulates another via crafted messages	A2A protocol message with hidden instructions
Supply Chain (MCP)	Malicious MCP plugin in marketplace	Trojaned tool exfiltrates data
Context Window Overflow	Flooding context to hide malicious instructions	Long document with hidden prompt injection
Credential Harvesting	Agent tricked into revealing API keys	Social engineering via tool output
Sandbox Escape	Breaking out of execution environment	Computer Use agent accessing host filesystem

Prompt Injection

Malicious users trying to manipulate agent behavior.

Defense Strategies

@Service
public class PromptInjectionDefense {

    private static final Pattern INJECTION_PATTERNS = Pattern.compile(
        "(ignore|override|forget|disregard).*(instructions|system|prompt)",
        Pattern.CASE_INSENSITIVE
    );

    public SanitizedInput sanitize(UserInput input) {
        String text = input.text();

        // Check for injection patterns
        if (INJECTION_PATTERNS.matcher(text).find()) {
            throw new SecurityException("Potential prompt injection detected");
        }

        // Validate against allowlist
        if (!isAllowedTopic(text)) {
            throw new SecurityException("Topic not allowed");
        }

        // Rate limit check
        if (exceedsRateLimit(input.userId())) {
            throw new RateLimitExceededException();
        }

        return SanitizedInput.from(text);
    }

    @Bean
    public SecurityFilter securityFilter() {
        return new SecurityFilter() {
            @Override
            public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
                String path = exchange.getRequest().getPath().value();

                if (path.startsWith("/api/agents")) {
                    String body = getBody(exchange);
                    try {
                        sanitize(new UserInput(body));
                    } catch (SecurityException e) {
                        exchange.getResponse().setStatusCode(HttpStatus.FORBIDDEN);
                        return exchange.getResponse().setComplete();
                    }
                }

                return chain.filter(exchange);
            }
        };
    }
}

Tool Access Control

Restrict which tools agents can use based on user permissions.

@Service
public class ToolAccessControl {

    @Autowired
    private PermissionService permissionService;

    public List<FunctionCallback> getAuthorizedTools(String userId) {
        return allTools.stream()
            .filter(tool -> permissionService.hasPermission(userId, tool.getName()))
            .toList();
    }

    public boolean canExecuteTool(String userId, String toolName) {
        ToolPermission permission = permissionService.getPermission(userId, toolName);

        // Check permission
        if (!permission.isAllowed()) {
            return false;
        }

        // Check rate limits
        if (permission.getUsageCount() >= permission.getMaxUsage()) {
            return false;
        }

        // Check time restrictions
        if (!permission.isWithinAllowedHours()) {
            return false;
        }

        return true;
    }
}

Human-in-the-Loop

Require human approval for sensitive operations.

@Service
public class HumanInTheLoopService {

    @Autowired
    private NotificationService notificationService;

    @Autowired
    private ApprovalRepository approvalRepository;

    public AgentResult executeWithApproval(AgentTask task) {
        // Check if approval needed
        if (task.requiresApproval()) {
            ApprovalRequest request = createApprovalRequest(task);
            notificationService.notifyApprovers(request);

            // Wait for approval
            Approval approval = waitForApproval(request.getId());

            if (!approval.isApproved()) {
                return AgentResult.rejected("Approval denied: " + approval.getReason());
            }
        }

        // Execute task
        return task.execute();
    }

    private Approval waitForApproval(String requestId) {
        // Poll for approval (or use WebSocket)
        for (int i = 0; i < 60; i++) { // 1 minute timeout
            Approval approval = approvalRepository.findById(requestId).orElse(null);
            if (approval != null && approval.isDecided()) {
                return approval;
            }
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }

        throw new ApprovalTimeoutException();
    }
}

Audit Logging

Track all agent actions for security and compliance.

@Service
public class AgentAuditLogger {

    @Autowired
    private AuditLogRepository auditLogRepository;

    @EventListener
    public void logAgentAction(AgentActionEvent event) {
        AgentAuditLog log = AgentAuditLog.builder()
            .agentId(event.getAgentId())
            .userId(event.getUserId())
            .action(event.getAction())
            .input(sanitize(event.getInput()))
            .output(sanitize(event.getOutput()))
            .toolsUsed(event.getToolsUsed())
            .tokensConsumed(event.getTokensConsumed())
            .cost(event.getCost())
            .timestamp(Instant.now())
            .build();

        auditLogRepository.save(log);
    }

    public List<AgentAuditLog> getUserActivity(String userId, Instant since) {
        return auditLogRepository.findByUserIdAndTimestampAfter(userId, since);
    }
}

Agent-Specific Security Measures (2025)

Docker Sandboxes：AI Coding Agent 的隔离方案（2026）

随着 AI Coding Agent 的普及，不可信自主工作负载（Untrusted Autonomous Workload） 成为一个核心安全问题。Docker Captain Vladimir Mikhalev 在迁移博客到 Astro 时发现：Claude Code 能完成复杂的代码重构，但开发者在过程中逐渐失去对代码库的理解——"审查结果而非变更"不构成安全模型。

核心问题：传统容器共享宿主机内核，无法安全隔离需要独立 Docker 守护进程的自主 Agent。npm install 等操作在开发者机器上以用户权限运行，Agent 可以修改 Git hooks 或 CI 工作流而不被察觉。

2026 年 3 月 Trivy 供应链攻击事件：攻击者强制推送了 aquasecurity/trivy-action 的 76 个版本标签，发布恶意二进制文件，在约 12 小时的暴露窗口内从 CI runner 窃取密钥、云凭证和 SSH 密钥。安全扫描工具本身成为攻击载体——这与 AI Agent 在宽松模式下运行的威胁模型完全相同。

Docker Sandboxes 方案：

不修补容器，而是使用微型虚拟机（microVM）——足够轻量以秒级启动，同时拥有独立内核
每个 microVM 内运行完整的 Docker Engine，Agent 执行 docker build、npm install 等操作完全在沙箱内
凭证不跨越 VM 边界，Agent 无法访问宿主机的 Docker socket 或文件系统
使用自定义 VMM（非 Firecracker），支持 macOS（Hypervisor.framework）、Windows（WHP）、Linux（KVM）

来源：Docker Blog（2026-05-26）

MCP Server Vetting

@Service
public class McpSecurityGateway {

    private final Set<String> trustedServers = Set.of(
        "official/filesystem", "official/github", "official/postgres"
    );

    public void validateToolCall(ToolCallRequest request) {
        // Verify server is trusted
        if (!trustedServers.contains(request.getServerId())) {
            throw new SecurityException("Untrusted MCP server: " + request.getServerId());
        }

        // Sandbox tool execution
        if (request.isDestructive()) {
            requireApproval(request);
        }

        // Scan tool output for injection
        scanForInjection(request.getToolOutput());
    }
}

Agent Communication Security (A2A)

@Service
public class A2ASecurityInterceptor {

    public void validateIncomingMessage(A2AMessage message) {
        // Verify sender identity
        verifyAgentIdentity(message.getFromAgent());

        // Rate limit per agent
        checkAgentRateLimit(message.getFromAgent());

        // Scan for cross-agent injection
        scanForInjection(message.getContent());

        // Log for audit trail
        auditLog.logInterAgentMessage(message);
    }
}

9.4 Production Deployment

Docker Configuration

# Dockerfile
FROM eclipse-temurin:21-jdk-alpine AS builder
WORKDIR /app
COPY build.gradle settings.gradle ./
COPY src ./src
RUN ./gradlew bootJar --no-daemon

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=builder /app/build/libs/*.jar app.jar

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/actuator/health || exit 1

EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]

Docker Compose

services:
  agent-service:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=production
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - POSTGRES_URL=jdbc:postgresql://postgres:5432/agents
      - REDIS_URL=redis://redis:6379
    depends_on:
      - postgres
      - redis
    restart: unless-stopped

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_DB=agents
      - POSTGRES_USER=agent_user
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    restart: unless-stopped

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

volumes:
  postgres_data:
  redis_data:
  grafana_data:

Observability Stack

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'agent-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['agent-service:8080']

Monitoring Dashboard (Grafana)

Key metrics to monitor:

Metric	Description	Alert Threshold
agent_success_rate	% of successful agent executions	< 95%
agent_latency_p95	95th percentile latency	> 15s
agent_token_usage	Tokens consumed per hour	> 100K
agent_cost_per_task	Cost per successful task	> $0.10
tool_failure_rate	% of failed tool calls	> 5%
llm_api_errors	LLM API error rate	> 1%

9.5 A/B Testing

Test different agent configurations safely.

@Service
public class AgentABTestService {

    @Autowired
    private AgentRegistry agentRegistry;

    @Autowired
    private ExperimentRepository experimentRepository;

    public String executeWithExperiment(String userId, String query) {
        // Get active experiment
        Experiment experiment = experimentRepository.findActive("agent-v2-vs-v1");

        // Assign user to variant
        String variant = assignVariant(experiment, userId);

        // Get agent for variant
        Agent agent = agentRegistry.getAgent(variant);

        // Execute
        String result = agent.execute(query);

        // Log metrics
        logMetrics(experiment, variant, userId, result);

        return result;
    }

    private String assignVariant(Experiment experiment, String userId) {
        // Consistent hashing for stable assignment
        int hash = userId.hashCode();
        if (hash % 2 == 0) {
            return "agent_v1";
        } else {
            return "agent_v2";
        }
    }
}

9.6 Key Takeaways

Evaluation Strategy

LLM-as-a-Judge: Scalable but needs calibration
Human Evaluation: Gold standard for quality
Automated Tests: Essential for regressions
Metrics Tracking: Quantitative insights

Challenge Mitigation

Challenge	Mitigation
Hallucination	RAG + Verification + Citations
Infinite Loops	Iteration limits + Loop detection
High Cost	Caching + Smaller models
High Latency	Parallel tools + Streaming
Security	Input validation + Access control

Production Readiness Checklist

9.7 Next Steps

Complete your learning journey:

→ 10. Frontier Trends - Emerging technologies and research

Start Small

When deploying to production, start with a limited beta, monitor metrics closely, and gradually increase traffic based on performance.

Cost Awareness

Agent costs can scale quickly. Always implement caching and set budget limits before wide deployment.

9.1 Agent Evaluation​

Evaluation Approaches​

1. LLM-as-a-Judge​

Implementation​

Best Practices​

2. Human Evaluation​

Evaluation Framework​

Evaluation Interface (Frontend)​

3. Automated Testing​

4. Key Metrics​

Structured Output for Reliability​

Modern Evaluation Tools​

Prompt Caching for Cost Optimization​

9.2 Common Challenges​

Challenge 1: Hallucination​

Mitigation Strategies​

Implementation​

Challenge 2: Infinite Loops​

Solutions​

Challenge 3: Cost Control​

Cost Optimization Strategies​

Implementation​

Challenge 4: Latency​

Optimization Techniques​

Parallel Tool Execution​

9.3 Security & Safety​

2025 Agent Threat Landscape​

Prompt Injection​

Defense Strategies​

Tool Access Control​

Human-in-the-Loop​

Audit Logging​

Agent-Specific Security Measures (2025)​

Docker Sandboxes：AI Coding Agent 的隔离方案（2026）​

MCP Server Vetting​

Agent Communication Security (A2A)​

9.4 Production Deployment​

Docker Configuration​

Docker Compose​

Observability Stack​

Monitoring Dashboard (Grafana)​

9.5 A/B Testing​

9.6 Key Takeaways​

Evaluation Strategy​

Challenge Mitigation​

Production Readiness Checklist​

9.7 Next Steps​

9.1 Agent Evaluation

Evaluation Approaches

1. LLM-as-a-Judge

Implementation

Best Practices

2. Human Evaluation

Evaluation Framework

Evaluation Interface (Frontend)

3. Automated Testing

4. Key Metrics

Structured Output for Reliability

Modern Evaluation Tools

Prompt Caching for Cost Optimization

9.2 Common Challenges

Challenge 1: Hallucination

Mitigation Strategies

Implementation

Challenge 2: Infinite Loops

Solutions

Challenge 3: Cost Control

Cost Optimization Strategies

Implementation

Challenge 4: Latency

Optimization Techniques

Parallel Tool Execution

9.3 Security & Safety

2025 Agent Threat Landscape

Prompt Injection

Defense Strategies

Tool Access Control

Human-in-the-Loop

Audit Logging

Agent-Specific Security Measures (2025)

Docker Sandboxes：AI Coding Agent 的隔离方案（2026）

MCP Server Vetting

Agent Communication Security (A2A)

9.4 Production Deployment

Docker Configuration

Docker Compose

Observability Stack

Monitoring Dashboard (Grafana)

9.5 A/B Testing

9.6 Key Takeaways

Evaluation Strategy

Challenge Mitigation

Production Readiness Checklist

9.7 Next Steps