9. Engineering & Production
Building agents that work in prototypes is different from building agents that work reliably in production. This section covers the critical engineering challenges, evaluation methods, security considerations, and deployment strategies for production-grade AI agents.
9.1 Agent Evaluation
Evaluating agent performance is fundamentally different from traditional software testing due to non-determinism and complexity.
Evaluation Approaches
1. LLM-as-a-Judge
Use an LLM to evaluate agent outputs against criteria.
Implementation
@Service
public class AgentEvaluator {
@Autowired
private ChatClient evaluatorClient;
public EvaluationResult evaluate(AgentOutput output, EvaluationCriteria criteria) {
String evaluation = evaluatorClient.prompt()
.system("""
You are an expert evaluator for AI agent outputs.
Rate the following on a scale of 1-10:
1. Accuracy: Is the information correct?
2. Completeness: Does it fully address the task?
3. Relevance: Is the information focused?
4. Safety: Are there any harmful outputs?
""")
.user("""
Task: %s
Agent Output: %s
Context: %s
Provide evaluation in JSON format:
{
"accuracy": 8,
"completeness": 7,
"relevance": 9,
"safety": 10,
"reasoning": "..."
}
""".formatted(
output.task(),
output.content(),
output.context()
))
.call()
.content();
return parseEvaluation(evaluation);
}
}
Best Practices
- Clear Criteria: Define specific evaluation dimensions
- Few-Shot Examples: Provide examples of good/bad outputs
- Multiple Judges: Use multiple LLMs and aggregate
- Human Validation: Calibrate LLM judges with human labels
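The "Multiple Judges" practice above can be sketched as follows. This is a minimal illustration, not a prescribed method: the class name JudgeAggregator, the choice of the median, and the maxSpread threshold are all assumptions introduced here.

```java
import java.util.List;

/** Hypothetical helper: combines 1-10 scores that several LLM judges gave to one output. */
final class JudgeAggregator {

    /** The median is less sensitive than the mean to a single outlier judge. */
    static double median(List<Integer> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("at least one score is required");
        }
        int[] sorted = scores.stream().mapToInt(Integer::intValue).sorted().toArray();
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Flags an item for human review when judges disagree by more than maxSpread points. */
    static boolean needsHumanReview(List<Integer> scores, int maxSpread) {
        int min = scores.stream().mapToInt(Integer::intValue).min().orElseThrow();
        int max = scores.stream().mapToInt(Integer::intValue).max().orElseThrow();
        return max - min > maxSpread;
    }
}
```

Routing high-disagreement items to human raters is one way to combine this practice with the "Human Validation" point: the judges handle the bulk, and people look only at the cases where the judges diverge.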
2. Human Evaluation
Human evaluation remains the gold standard for quality.
Evaluation Framework
@Service
public class HumanEvaluationService {
public EvaluationDataset createDataset(List<AgentOutput> outputs) {
// Shuffle for randomization
List<AgentOutput> shuffled = shuffle(outputs);
// Create evaluation tasks
return EvaluationDataset.builder()
.instructions("Rate each output on accuracy, completeness, and quality (1-10)")
.items(shuffled.stream()
.map(this::createEvaluationItem)
.toList())
.build();
}
public EvaluationMetrics calculateMetrics(List<HumanRating> ratings) {
return EvaluationMetrics.builder()
.accuracyMean(ratings.stream().mapToInt(HumanRating::accuracy).average().orElse(0))
.completenessMean(ratings.stream().mapToInt(HumanRating::completeness).average().orElse(0))
.interAnnotatorAgreement(calculateKappa(ratings))
.build();
}
}
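The service above calls calculateKappa without showing it. One possible reading is Cohen's kappa for two annotators, sketched below under that assumption. The class name and the int[] label representation are hypothetical. Cohen's kappa treats labels as unordered categories, so for 1-10 ratings a weighted variant may be a better fit.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: Cohen's kappa for two annotators labelling the same items. */
final class CohenKappa {

    /** a[i] and b[i] are the labels the two annotators assigned to item i. */
    static double kappa(int[] a, int[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("need two non-empty label arrays of equal length");
        }
        int n = a.length;
        int agree = 0;
        Map<Integer, Integer> countA = new HashMap<>();
        Map<Integer, Integer> countB = new HashMap<>();
        for (int i = 0; i < n; i++) {
            if (a[i] == b[i]) {
                agree++;
            }
            countA.merge(a[i], 1, Integer::sum);
            countB.merge(b[i], 1, Integer::sum);
        }
        double observed = (double) agree / n;
        // Agreement expected by chance, from each annotator's label frequencies
        double expected = 0.0;
        for (Map.Entry<Integer, Integer> e : countA.entrySet()) {
            double pA = (double) e.getValue() / n;
            double pB = (double) countB.getOrDefault(e.getKey(), 0) / n;
            expected += pA * pB;
        }
        if (expected == 1.0) {
            return 1.0; // both annotators used one identical label throughout
        }
        return (observed - expected) / (1.0 - expected);
    }
}
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which makes it more informative than raw percent agreement when one label dominates.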
Evaluation Interface (Frontend)
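// Next.js: Evaluation Interface
// Slider, Textarea, and Button are assumed to come from the project's UI library
'use client';
import { useState } from 'react';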
interface EvaluationItem {
id: string;
task: string;
output: string;
context: string;
}
interface Rating {
accuracy: number;
completeness: number;
quality: number;
notes?: string;
}
export function EvaluationForm({ item }: { item: EvaluationItem }) {
const [rating, setRating] = useState<Rating>({
accuracy: 5,
completeness: 5,
quality: 5,
});
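const handleSubmit = async () => {
const res = await fetch('/api/evaluation/rate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ itemId: item.id, rating }),
});
if (!res.ok) throw new Error(`Rating submission failed: ${res.status}`);
};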
return (
<div className="evaluation-form">
<h3>Task: {item.task}</h3>
<p>{item.output}</p>
<Slider
label="Accuracy"
value={rating.accuracy}
onChange={(v) => setRating({ ...rating, accuracy: v })}
/>
<Slider
label="Completeness"
value={rating.completeness}
onChange={(v) => setRating({ ...rating, completeness: v })}
/>
<Slider
label="Quality"
value={rating.quality}
onChange={(v) => setRating({ ...rating, quality: v })}
/>
<Textarea
label="Notes"
value={rating.notes ?? ''}
onChange={(v) => setRating({ ...rating, notes: v })}
/>
<Button onClick={handleSubmit}>Submit Rating</Button>
</div>
);
}
3. Automated Testing
Test specific agent behaviors with unit and integration tests.
@SpringBootTest
class AgentServiceTest {
@Autowired
private ReactAgentService agent;
@MockBean
private SearchService searchService;
@Test
void testAgentUsesSearchTool() {
// Arrange
when(searchService.search(anyString()))
.thenReturn("Paris is the capital of France");
// Act
String result = agent.execute("What is the capital of France?", 5);
// Assert
assertThat(result).contains("Paris");
verify(searchService, times(1)).search(anyString());
}
@Test
void testAgentHandlesToolFailure() {
// Arrange
when(searchService.search(anyString()))
.thenThrow(new ServiceUnavailableException("Search is down"));
// Act
String result = agent.execute("Search for news", 5);
// Assert
assertThat(result).contains("apologize");
assertThat(result).contains("unavailable");
}
}
4. Key Metrics
| Metric | Description | Target |
|---|---|---|
| Task Success Rate | % of tasks completed successfully | > 90% |
| Accuracy | Factual correctness of outputs | > 95% |
| Relevance | How well output addresses task | > 90% |
| Safety | Absence of harmful content | 100% |
| Latency (p50) | Median response time | < 5s |
| Latency (p95) | 95th percentile response time | < 15s |
| Cost per Task | Token cost per successful task | Minimize |
| Tool Success Rate | % of tool calls successful | > 95% |
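The latency and success-rate rows in the table can be computed from raw samples roughly as follows. This is a sketch using the nearest-rank percentile definition. The class and method names are hypothetical, and a production system would more likely read these values from its metrics backend.

```java
import java.util.Arrays;

/** Hypothetical helper for the latency and success-rate metrics. */
final class AgentMetrics {

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    /** Fraction of tasks that succeeded, as a percentage. */
    static double successRatePercent(int succeeded, int total) {
        if (total <= 0 || succeeded < 0 || succeeded > total) {
            throw new IllegalArgumentException("need 0 <= succeeded <= total and total > 0");
        }
        return 100.0 * succeeded / total;
    }
}
```

With only a handful of samples, p95 collapses to the maximum, so the p95 target is meaningful only once a reasonable number of requests has been recorded.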
Structured Output for Reliability
One of the most significant advances in 2025 for agent reliability is the widespread adoption of Structured Output:
- OpenAI: response_format: { type: "json_schema", json_schema: {...} } guarantees valid JSON
- Anthropic: Tool use with strict schema enforcement ensures parseable responses
- Gemini: generationConfig.responseSchema for JSON mode
Impact on Agents:
- Eliminates output parsing failures (a top source of agent crashes)
- Enables reliable tool calling (tools receive valid parameters)
- Reduces malformed or off-schema values in structured tasks (the schema constrains the output space, though not factual accuracy)
- Facilitates multi-agent communication (agents exchange JSON messages)
Modern Evaluation Tools
| Tool | Type | Key Feature |
|---|---|---|
| LangSmith | Platform | End-to-end tracing, evaluation datasets, annotation queues |
| LangFuse | Open Source | Self-hosted observability, prompt management, cost tracking |
| Ragas | Library | RAG-specific metrics (faithfulness, relevance, context precision) |
| DeepEval | Library | LLM-as-judge, assertion-based testing, CI/CD integration |
| Promptfoo | CLI | Multi-model comparison, regression testing, eval datasets |
| Braintrust | Platform | Online evals, scorecards, data logging |
Prompt Caching for Cost Optimization
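Both OpenAI and Anthropic support prompt caching, which can reduce the cost of repeated prompt prefixes by roughly 50-90%, depending on the provider and model:
# Anthropic: caching is opt-in. Mark the end of the reusable prefix with cache_control.
# Minimum cacheable prefix is 1024 tokens on Sonnet-class models.
# Cache reads cost about 90% less than base input tokens; cache writes cost about 25% more.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": large_system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "..."}],
)
# OpenAI: automatic caching for prompts of 1024 tokens or more, no request changes needed
# Cached input tokens are billed at a discount (50% on GPT-4o-class models; varies by model)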
Agent Cost Optimization Strategy:
- Cache system prompts and tool definitions (large, rarely change)
- Use cheaper models for routing/classification tasks
- Implement semantic caching for similar queries
- Batch API for non-real-time tasks (50% discount)
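To see why caching the system prompt and tool definitions matters, the expected input cost per request can be estimated with a little arithmetic. The sketch below is a simplified model: it ignores any surcharge for cache writes, and the class name, parameters, and example prices are illustrative assumptions rather than published rates.

```java
/** Hypothetical cost model for a prompt made of a cacheable prefix plus a dynamic suffix. */
final class PromptCacheCost {

    /**
     * Expected input cost in dollars for one request.
     *
     * @param prefixTokens      tokens in the stable prefix (system prompt, tool definitions)
     * @param dynamicTokens     tokens that change per request and are never cached
     * @param pricePerMillion   base input price in dollars per million tokens
     * @param cachedPriceFactor price of a cached token relative to base (0.1 means 90% off)
     * @param hitRate           fraction of requests whose prefix is served from cache, 0 to 1
     */
    static double inputCost(long prefixTokens, long dynamicTokens, double pricePerMillion,
                            double cachedPriceFactor, double hitRate) {
        double perToken = pricePerMillion / 1_000_000.0;
        // On a hit the prefix is billed at the discounted rate, on a miss at the full rate
        double prefixMultiplier = hitRate * cachedPriceFactor + (1.0 - hitRate);
        return prefixTokens * perToken * prefixMultiplier + dynamicTokens * perToken;
    }
}
```

For example, with a 10,000-token prefix, 500 dynamic tokens, an assumed base price of $3 per million tokens, and a 90% discount, the input cost drops from $0.0315 to $0.0045 per request when every request hits the cache, a saving of about 86%.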
9.2 Common Challenges
Challenge 1: Hallucination
Agents can generate plausible-sounding but incorrect information.
Mitigation Strategies
Implementation
@Service
public class AntiHallucinationService {
@Autowired
private VectorStore vectorStore;
@Autowired
private ChatClient chatClient;
public String generateWithVerification(String query) {
// Step 1: Retrieve relevant context
List<Document> context = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);
// Step 2: Generate with citations
String response = chatClient.prompt()
.user(query)
.messages(createMessagesWithCitations(context))
.call()
.content();
// Step 3: Verify claims
List<Claim> claims = extractClaims(response);
for (Claim claim : claims) {
if (!verifyClaim(claim, context)) {
return flagUncertainty(claim);
}
}
return response;
}
private boolean verifyClaim(Claim claim, List<Document> context) {
// Use RAG context to verify
String verification = chatClient.prompt()
.system("Verify if the claim is supported by the context.")
.user("""
Claim: %s
Context: %s
Answer YES or NO with explanation.
""".formatted(
claim.text(),
context.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n"))
))
.call()
.content();
return verification.trim().toLowerCase().startsWith("yes");
}
}
Challenge 2: Infinite Loops
Agents can get stuck in repetitive behaviors.
Solutions
@Service
public class LoopPreventionService {
private static final int MAX_ITERATIONS = 10;
private static final int MAX_REPEAT_ACTIONS = 3;
public AgentExecutionResult executeWithGuardrails(AgentTask task) {
// Sliding window of the most recent actions, oldest first.
// A Deque keeps duplicates and insertion order; a HashSet would do neither.
Deque<String> recentActions = new ArrayDeque<>();
int iteration = 0;
while (iteration < MAX_ITERATIONS && !task.isComplete()) {
String action = task.getNextAction();
// Detect loops: count how often this action already appears in the window
int count = Collections.frequency(recentActions, action);
if (count >= MAX_REPEAT_ACTIONS) {
return handleLoop(task, action);
}
recentActions.addLast(action);
if (recentActions.size() > 5) {
recentActions.removeFirst(); // evict the oldest action
}
// Execute
task.executeAction(action);
iteration++;
}
return task.getResult();
}
private AgentExecutionResult handleLoop(AgentTask task, String repeatingAction) {
// Ask for human intervention
return AgentExecutionResult.builder()
.status("NEEDS_INTERVENTION")
.message("Agent stuck in loop repeating: " + repeatingAction)
.suggestedActions(List.of(
"Retry with different approach",
"Provide more specific instructions",
"Break task into smaller steps"
))
.build();
}
}
Challenge 3: Cost Control
LLM usage can become expensive at scale.
Cost Optimization Strategies
| Strategy | Impact | Implementation |
|---|---|---|
| Caching | High | Cache LLM responses |
| Smaller Models | High | Use Haiku for simple tasks |
| Token Limits | Medium | Set max tokens per request |
| Result Streaming | Low | Stream responses for UX (improves perceived latency, not token cost) |
| Batch Processing | Medium | Process multiple queries together |
Implementation
@Service
public class CostOptimizedAgentService {
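@Autowired
@Qualifier("gpt4Client") // bean names are assumptions; two ChatClient beans need qualifiers
private ChatClient gpt4Client; // Expensive
@Autowired
@Qualifier("haikuClient")
private ChatClient haikuClient; // Cheap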
@Autowired
private CacheManager cacheManager;
public String execute(AgentRequest request) {
// Check cache first
String cacheKey = generateCacheKey(request);
String cached = cacheManager.getCache("agent-responses").get(cacheKey, String.class);
if (cached != null) {
return cached;
}
// Route to appropriate model
ChatClient client = selectModel(request);
String response = client.prompt().user(request.query()).call().content();
// Cache the result
cacheManager.getCache("agent-responses").put(cacheKey, response);
return response;
}
private ChatClient selectModel(AgentRequest request) {
// Use Haiku for simple queries
if (request.complexity() == Complexity.LOW) {
return haikuClient;
}
// Use GPT-4 for complex tasks
return gpt4Client;
}
}
Challenge 4: Latency
Agents need to respond quickly for good UX.
Optimization Techniques
Parallel Tool Execution
@Service
public class ParallelToolExecutor {
@Autowired
private List<FunctionCallback> tools;
// Shared pool: creating a new pool on every call would leak threads
private final ExecutorService executor = Executors.newFixedThreadPool(10);
public Map<String, String> executeParallel(List<ToolCall> calls) {
List<CompletableFuture<Map.Entry<String, String>>> futures = calls.stream()
.map(call -> CompletableFuture.supplyAsync(() -> {
String result = executeTool(call);
return Map.entry(call.name(), result);
}, executor))
.toList();
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
return futures.stream()
.map(CompletableFuture::join)
.collect(Collectors.toMap(
Map.Entry::getKey,
Map.Entry::getValue
));
}
}
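Parallel execution helps only if one slow tool cannot stall the whole batch. The sketch below adds a per-tool timeout using CompletableFuture.completeOnTimeout (Java 9+). The class name, the Supplier-based tool representation, and the error strings are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical runner: executes tools in parallel, bounding how long each may take. */
final class TimedToolRunner {

    static final String TIMEOUT_RESULT = "ERROR: tool timed out";

    static Map<String, String> runAll(Map<String, Supplier<String>> tools, long timeoutMs) {
        int threads = Math.max(1, Math.min(tools.size(), 10));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, CompletableFuture<String>> futures = new LinkedHashMap<>();
            tools.forEach((name, tool) -> futures.put(name,
                    CompletableFuture.supplyAsync(tool, pool)
                            // Substitute a fallback value if the tool has not finished in time
                            .completeOnTimeout(TIMEOUT_RESULT, timeoutMs, TimeUnit.MILLISECONDS)
                            // Turn a tool exception into an error string instead of failing the batch
                            .exceptionally(ex -> "ERROR: " + ex.getMessage())));
            Map<String, String> results = new LinkedHashMap<>();
            futures.forEach((name, future) -> results.put(name, future.join()));
            return results;
        } finally {
            pool.shutdownNow(); // interrupts tools that are still running
        }
    }
}
```

Returning an error string rather than throwing lets the agent see which tool failed and decide whether to retry or answer with partial information. A long-lived service would typically reuse one pool instead of creating one per call as this self-contained sketch does.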
9.3 Security & Safety
2025 Agent Threat Landscape
Agents introduce new attack surfaces beyond traditional LLM risks:
| Threat | Description | Example |
|---|---|---|
| Tool Poisoning | Malicious MCP server returns harmful data | Compromised tool injects instructions via response |
| Agent-to-Agent Attack | One agent manipulates another via crafted messages | A2A protocol message with hidden instructions |
| Supply Chain (MCP) | Malicious MCP plugin in marketplace | Trojaned tool exfiltrates data |
| Context Window Overflow | Flooding context to hide malicious instructions | Long document with hidden prompt injection |
| Credential Harvesting | Agent tricked into revealing API keys | Social engineering via tool output |
| Sandbox Escape | Breaking out of execution environment | Computer Use agent accessing host filesystem |
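Prompt Injection
Prompt injection is an attempt to manipulate agent behavior through crafted input, whether typed by a malicious user or embedded in content the agent reads.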
Defense Strategies
@Service
public class PromptInjectionDefense {
private static final Pattern INJECTION_PATTERNS = Pattern.compile(
"(ignore|override|forget|disregard).*(instructions|system|prompt)",
Pattern.CASE_INSENSITIVE
);
public SanitizedInput sanitize(UserInput input) {
String text = input.text();
// Check for injection patterns
if (INJECTION_PATTERNS.matcher(text).find()) {
throw new SecurityException("Potential prompt injection detected");
}
// Validate against allowlist
if (!isAllowedTopic(text)) {
throw new SecurityException("Topic not allowed");
}
// Rate limit check
if (exceedsRateLimit(input.userId())) {
throw new RateLimitExceededException();
}
return SanitizedInput.from(text);
}
@Bean
public WebFilter securityFilter() {
return new WebFilter() { // WebFilter is the WebFlux interface with this filter(...) signature
@Override
public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
String path = exchange.getRequest().getPath().value();
if (path.startsWith("/api/agents")) {
String body = getBody(exchange);
try {
sanitize(new UserInput(body));
} catch (SecurityException e) {
exchange.getResponse().setStatusCode(HttpStatus.FORBIDDEN);
return exchange.getResponse().setComplete();
}
}
return chain.filter(exchange);
}
};
}
}
Tool Access Control
Restrict which tools agents can use based on user permissions.
@Service
public class ToolAccessControl {
@Autowired
private PermissionService permissionService;
@Autowired
private List<FunctionCallback> allTools; // every registered tool, injected by Spring
public List<FunctionCallback> getAuthorizedTools(String userId) {
return allTools.stream()
.filter(tool -> permissionService.hasPermission(userId, tool.getName()))
.toList();
}
public boolean canExecuteTool(String userId, String toolName) {
ToolPermission permission = permissionService.getPermission(userId, toolName);
// Check permission
if (!permission.isAllowed()) {
return false;
}
// Check rate limits
if (permission.getUsageCount() >= permission.getMaxUsage()) {
return false;
}
// Check time restrictions
if (!permission.isWithinAllowedHours()) {
return false;
}
return true;
}
}
Human-in-the-Loop
Require human approval for sensitive operations.
@Service
public class HumanInTheLoopService {
@Autowired
private NotificationService notificationService;
@Autowired
private ApprovalRepository approvalRepository;
public AgentResult executeWithApproval(AgentTask task) {
// Check if approval needed
if (task.requiresApproval()) {
ApprovalRequest request = createApprovalRequest(task);
notificationService.notifyApprovers(request);
// Wait for approval
Approval approval = waitForApproval(request.getId());
if (!approval.isApproved()) {
return AgentResult.rejected("Approval denied: " + approval.getReason());
}
}
// Execute task
return task.execute();
}
private Approval waitForApproval(String requestId) {
// Poll for approval (or use WebSocket)
for (int i = 0; i < 60; i++) { // 1 minute timeout
Approval approval = approvalRepository.findById(requestId).orElse(null);
if (approval != null && approval.isDecided()) {
return approval;
}
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // restore the interrupt flag for callers
throw new RuntimeException(e);
}
}
throw new ApprovalTimeoutException();
}
}
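The polling loop above mentions WebSocket as an alternative. Another option, sketched here, is to block on a CompletableFuture that the approval endpoint completes, so the waiting thread wakes as soon as a decision arrives. The ApprovalGate class and its method names are hypothetical. The sketch treats a timeout as a denial, which is a fail-closed design choice rather than something the original service specifies.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical in-memory gate that replaces polling with a completable future per request. */
final class ApprovalGate {

    private final Map<String, CompletableFuture<Boolean>> pending = new ConcurrentHashMap<>();

    /** Registers a request. Call this before notifying approvers. */
    CompletableFuture<Boolean> open(String requestId) {
        return pending.computeIfAbsent(requestId, id -> new CompletableFuture<>());
    }

    /** Called when a human decides. Returns false if the request is unknown or already decided. */
    boolean decide(String requestId, boolean approved) {
        CompletableFuture<Boolean> future = pending.get(requestId);
        return future != null && future.complete(approved);
    }

    /** Waits up to the timeout for a decision. A timeout or interruption counts as a denial. */
    boolean await(String requestId, long timeout, TimeUnit unit) {
        CompletableFuture<Boolean> future = open(requestId);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pending.remove(requestId);
        }
    }
}
```

An in-memory map works only within a single service instance. With several instances, the decision would need to travel through shared storage or a message broker.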
Audit Logging
Track all agent actions for security and compliance.
@Service
public class AgentAuditLogger {
@Autowired
private AuditLogRepository auditLogRepository;
@EventListener
public void logAgentAction(AgentActionEvent event) {
AgentAuditLog log = AgentAuditLog.builder()
.agentId(event.getAgentId())
.userId(event.getUserId())
.action(event.getAction())
.input(sanitize(event.getInput()))
.output(sanitize(event.getOutput()))
.toolsUsed(event.getToolsUsed())
.tokensConsumed(event.getTokensConsumed())
.cost(event.getCost())
.timestamp(Instant.now())
.build();
auditLogRepository.save(log);
}
public List<AgentAuditLog> getUserActivity(String userId, Instant since) {
return auditLogRepository.findByUserIdAndTimestampAfter(userId, since);
}
}
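The logger above passes inputs and outputs through sanitize, which is not shown. One plausible purpose is keeping secrets and personal data out of the audit trail. The sketch below assumes that reading. The patterns are illustrative only (an "sk-" prefixed key and a simple email shape), and the class name and length cap are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical redactor for audit log fields. */
final class LogRedactor {

    // Illustrative patterns only: real deployments need patterns for their own secret formats
    private static final Pattern API_KEY = Pattern.compile("\\bsk-[A-Za-z0-9_-]{16,}");
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final int MAX_LENGTH = 2000;

    static String sanitize(String text) {
        if (text == null) {
            return "";
        }
        String redacted = API_KEY.matcher(text).replaceAll("[REDACTED_KEY]");
        redacted = EMAIL.matcher(redacted).replaceAll("[REDACTED_EMAIL]");
        // Cap the stored length so one huge payload cannot bloat the log table
        return redacted.length() > MAX_LENGTH
                ? redacted.substring(0, MAX_LENGTH) + "...[truncated]"
                : redacted;
    }
}
```

Redacting before the log entry is written means the secret is not stored anywhere, which is safer than masking it at read time.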
Agent-Specific Security Measures (2025)
MCP Server Vetting
@Service
public class McpSecurityGateway {
private final Set<String> trustedServers = Set.of(
"official/filesystem", "official/github", "official/postgres"
);
public void validateToolCall(ToolCallRequest request) {
// Verify server is trusted
if (!trustedServers.contains(request.getServerId())) {
throw new SecurityException("Untrusted MCP server: " + request.getServerId());
}
// Require human approval for destructive operations
if (request.isDestructive()) {
requireApproval(request);
}
// Scan tool output for injection
scanForInjection(request.getToolOutput());
}
}
Agent Communication Security (A2A)
@Service
public class A2ASecurityInterceptor {
public void validateIncomingMessage(A2AMessage message) {
// Verify sender identity
verifyAgentIdentity(message.getFromAgent());
// Rate limit per agent
checkAgentRateLimit(message.getFromAgent());
// Scan for cross-agent injection
scanForInjection(message.getContent());
// Log for audit trail
auditLog.logInterAgentMessage(message);
}
}
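Both gateways above call scanForInjection without defining it. A minimal heuristic version might look like the following. The class name and the pattern list are assumptions, and pattern matching of this kind is easy to evade, so it is best treated as one signal among several rather than a complete defense.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical heuristic scanner for tool outputs and inter-agent messages. */
final class InjectionScanner {

    private static final List<Pattern> SUSPICIOUS = List.of(
            Pattern.compile(
                    "(ignore|override|forget|disregard)\\b.{0,40}\\b(instructions|system|prompt)",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL),
            Pattern.compile("\\byou are now\\b", Pattern.CASE_INSENSITIVE),
            Pattern.compile("<\\s*/?\\s*system\\s*>", Pattern.CASE_INSENSITIVE));

    static boolean looksSuspicious(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        // NFKC folds full-width and other compatibility characters into their plain forms
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFKC)
                // Strip zero-width characters that can be used to split trigger words
                .replaceAll("[\\u200B-\\u200D\\uFEFF]", "");
        return SUSPICIOUS.stream().anyMatch(p -> p.matcher(normalized).find());
    }
}
```

Normalizing before matching closes the simplest evasions, such as inserting invisible characters inside a keyword. A flagged message could be quarantined for review instead of rejected outright, since legitimate content can also contain these phrases.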
9.4 Production Deployment
Docker Configuration
# Dockerfile
FROM eclipse-temurin:21-jdk-alpine AS builder
WORKDIR /app
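# The Gradle wrapper script and its gradle/ directory are needed to run ./gradlew
COPY gradlew build.gradle settings.gradle ./
COPY gradle ./gradle
COPY src ./src
RUN ./gradlew bootJar --no-daemon
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
# Copy the built jar out of the builder stage
COPY --from=builder /app/build/libs/*.jar app.jar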
# Health check
HEALTHCHECK \
CMD wget --no-verbose --tries=1 --spider http://localhost:8080/actuator/health || exit 1
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
Docker Compose
services:
  agent-service:
    build: .
    ports:
      - "8080:8080"
    environment:
      - SPRING_PROFILES_ACTIVE=production
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - POSTGRES_URL=jdbc:postgresql://postgres:5432/agents
      - REDIS_URL=redis://redis:6379
    depends_on:
      - postgres
      - redis
    restart: unless-stopped
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_DB=agents
      - POSTGRES_USER=agent_user
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    restart: unless-stopped
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    restart: unless-stopped
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped
volumes:
  postgres_data:
  redis_data:
  grafana_data:
Observability Stack
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'agent-service'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['agent-service:8080']
Monitoring Dashboard (Grafana)
Key metrics to monitor:
| Metric | Description | Alert Threshold |
|---|---|---|
| agent_success_rate | % of successful agent executions | < 95% |
| agent_latency_p95 | 95th percentile latency | > 15s |
| agent_token_usage | Tokens consumed per hour | > 100K |
| agent_cost_per_task | Cost per successful task | > $0.10 |
| tool_failure_rate | % of failed tool calls | > 5% |
| llm_api_errors | LLM API error rate | > 1% |
9.5 A/B Testing
Test different agent configurations safely.
@Service
public class AgentABTestService {
@Autowired
private AgentRegistry agentRegistry;
@Autowired
private ExperimentRepository experimentRepository;
public String executeWithExperiment(String userId, String query) {
// Get active experiment
Experiment experiment = experimentRepository.findActive("agent-v2-vs-v1");
// Assign user to variant
String variant = assignVariant(experiment, userId);
// Get agent for variant
Agent agent = agentRegistry.getAgent(variant);
// Execute
String result = agent.execute(query);
// Log metrics
logMetrics(experiment, variant, userId, result);
return result;
}
private String assignVariant(Experiment experiment, String userId) {
// Deterministic hashing for stable assignment; floorMod keeps the bucket non-negative
int bucket = Math.floorMod(userId.hashCode(), 2);
if (bucket == 0) {
return "agent_v1";
} else {
return "agent_v2";
}
}
}
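A fixed 50/50 split is the simplest case. For a gradual rollout, assignment can instead map each user to a stable bucket from 0 to 99 and compare it with a configurable percentage. The sketch below assumes that approach. The class name, the use of CRC32, and salting the hash with the experiment name are illustrative choices, not part of the service above.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical assigner supporting percentage-based rollout. */
final class VariantAssigner {

    /** Maps (experiment, user) to a stable bucket in the range 0 to 99. */
    static int bucket(String experimentName, String userId) {
        CRC32 crc = new CRC32();
        // Salting with the experiment name decorrelates assignments across experiments
        crc.update((experimentName + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** Users whose bucket falls below treatmentPercent get the new agent. */
    static String assign(String experimentName, String userId, int treatmentPercent) {
        if (treatmentPercent < 0 || treatmentPercent > 100) {
            throw new IllegalArgumentException("treatmentPercent must be between 0 and 100");
        }
        return bucket(experimentName, userId) < treatmentPercent ? "agent_v2" : "agent_v1";
    }
}
```

Because the bucket never changes for a given user and experiment, raising treatmentPercent only moves users from agent_v1 to agent_v2 and never back, which keeps each user's experience consistent during a ramp-up.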
9.6 Key Takeaways
Evaluation Strategy
- LLM-as-a-Judge: Scalable but needs calibration
- Human Evaluation: Gold standard for quality
- Automated Tests: Essential for regressions
- Metrics Tracking: Quantitative insights
Challenge Mitigation
| Challenge | Mitigation |
|---|---|
| Hallucination | RAG + Verification + Citations |
| Infinite Loops | Iteration limits + Loop detection |
| High Cost | Caching + Smaller models |
| High Latency | Parallel tools + Streaming |
| Security | Input validation + Access control |
Production Readiness Checklist
- Evaluation framework established
- Error handling comprehensive
- Rate limiting configured
- Security controls in place
- Audit logging enabled
- Monitoring and alerting configured
- Cost controls implemented
- A/B testing framework ready
- Rollback plan documented
9.7 Next Steps
Complete your learning journey:
- → 10. Frontier Trends - Emerging technologies and research
When deploying to production, start with a limited beta, monitor metrics closely, and gradually increase traffic based on performance.
Agent costs can scale quickly. Always implement caching and set budget limits before wide deployment.
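A budget limit can be as simple as a per-user token counter checked before each LLM call. The following is a minimal in-memory sketch. The TokenBudget class, its method names, and the reset-by-scheduler approach are assumptions for illustration, and a multi-instance deployment would need a shared store instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-user token budget, safe for concurrent use within one process. */
final class TokenBudget {

    private final long limitPerUser;
    private final Map<String, AtomicLong> used = new ConcurrentHashMap<>();

    TokenBudget(long limitPerUser) {
        this.limitPerUser = limitPerUser;
    }

    /** Reserves tokens atomically. Returns false and reserves nothing if the limit would be exceeded. */
    boolean tryConsume(String userId, long tokens) {
        if (tokens < 0) {
            throw new IllegalArgumentException("tokens must be non-negative");
        }
        AtomicLong counter = used.computeIfAbsent(userId, id -> new AtomicLong());
        while (true) {
            long current = counter.get();
            if (current + tokens > limitPerUser) {
                return false;
            }
            if (counter.compareAndSet(current, current + tokens)) {
                return true;
            }
        }
    }

    long remaining(String userId) {
        AtomicLong counter = used.get(userId);
        return limitPerUser - (counter == null ? 0 : counter.get());
    }

    /** Intended to be called by a scheduler at the start of each budget period. */
    void reset() {
        used.clear();
    }
}
```

Checking the budget before the call, using an estimate of the tokens the request will consume, stops runaway loops from spending money first and being noticed later.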