6 Evaluation & Version Control
Why Evaluation Matters
Without measurement, prompt engineering is guesswork. Production AI systems require systematic evaluation, version control, and continuous improvement — just like traditional software.
The Evaluation Gap
| Traditional Software | AI/Prompt Development |
|---|---|
| ✅ Unit tests | ❌ "It looks good" |
| ✅ Integration tests | ❌ Manual spot checks |
| ✅ Coverage metrics | ❌ Vibes-based iteration |
| ✅ CI/CD gates | ❌ Ship and pray |
| ✅ Performance benchmarks | ❌ Unknown regressions |
Good Engineering vs "Prompt Vibes"
The Professional Approach
Systematic Prompt Engineering:
Define → Measure → Iterate → Validate → Deploy → Monitor → Repeat

- ✅ Evaluation datasets with ground truth
- ✅ Automated metrics (accuracy, relevance, coherence)
- ✅ LLM-as-Judge for subjective quality
- ✅ A/B testing infrastructure
- ✅ Version control for prompts
- ✅ CI/CD quality gates
- ✅ Production monitoring and alerting
1. Evaluation Fundamentals
1.1 What is an Eval?
An eval (evaluation) is a structured test measuring prompt performance on a specific task. It consists of:
Eval Components:
| Component | Example |
|---|---|
| 1. Dataset | Input: "What is the capital of France?" → Expected: "Paris" |
| 2. Metric | Accuracy: 95%, Relevance: 0.87, Latency: 1.2 s |
| 3. Threshold | Pass: >90%, Fail: <90% |
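The three components map onto a very small harness. A minimal sketch in plain Java — the EvalSample record, the runPrompt function, and the 0.90 threshold are illustrative choices, not a fixed API:

import java.util.List;
import java.util.function.Function;

public class MiniEvalHarness {

    /** One labeled example: the model input plus the expected answer. */
    record EvalSample(String input, String expected) {}

    /** Runs every sample through the prompt under test and checks the threshold. */
    static boolean passes(List<EvalSample> dataset,
                          Function<String, String> runPrompt,
                          double threshold) {
        long correct = dataset.stream()                               // 1. DATASET
                .filter(s -> runPrompt.apply(s.input()).trim()
                        .equalsIgnoreCase(s.expected()))
                .count();
        double accuracy = (double) correct / dataset.size();          // 2. METRIC
        return accuracy >= threshold;                                 // 3. THRESHOLD
    }

    public static void main(String[] args) {
        List<EvalSample> dataset = List.of(
                new EvalSample("What is the capital of France?", "Paris"));
        // Stubbed model call for illustration; in practice this wraps a real chat client.
        boolean pass = passes(dataset, input -> "Paris", 0.90);
        System.out.println(pass ? "PASS" : "FAIL");
    }
}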
1.2 Types of Evaluation
| Type | Description | When to Use |
|---|---|---|
| Offline Eval | Batch evaluation on test dataset | Development, CI/CD |
| Online Eval | A/B testing with real users | Production validation |
| LLM-as-Judge | Another LLM evaluates responses | No ground truth available |
| Human Eval | Expert human annotation | Gold standard, calibration |
| Automated Metrics | BLEU, ROUGE, BERTScore | Translation, summarization |
1.3 Evaluation Dataset Design
Dataset Size Guidelines
Minimum dataset sizes vary with task complexity: too few samples give noisy, unreliable metrics, while oversized datasets waste evaluation time and budget.
| Task Type | Minimum Samples | Recommended | Notes |
|---|---|---|---|
| Binary Classification | 100 | 500+ | Balance classes |
| Multi-class (5 classes) | 200 | 1000+ | 40+ per class |
| Open-ended Generation | 50 | 200+ | Diverse scenarios |
| RAG Evaluation | 100 | 300+ | Varied query types |
| Summarization | 50 | 150+ | Different document lengths |
| Code Generation | 100 | 500+ | Cover edge cases |
Dataset Structure:
{
"dataset_id": "customer-support-v2",
"created": "2025-01-21",
"task_type": "classification",
"samples": [
{
"id": "cs-001",
"input": "My order hasn't arrived yet, it's been 2 weeks",
"expected_output": "shipping_delay",
"metadata": {
"category": "shipping",
"difficulty": "easy",
"source": "production_logs"
}
},
{
"id": "cs-002",
"input": "I want to return this item but the return button doesn't work",
"expected_output": "return_technical_issue",
"metadata": {
"category": "returns",
"difficulty": "medium",
"source": "manual_annotation"
}
}
]
}
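A sketch of loading this structure in Java, assuming Jackson for JSON binding; the record names are chosen here for illustration and mirror the fields above:

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;

public class EvalDatasetLoader {

    record EvalSample(String id, String input, String expected_output,
                      Map<String, String> metadata) {}

    record EvalDataset(String dataset_id, String created, String task_type,
                       List<EvalSample> samples) {}

    public static EvalDataset load(File file) throws IOException {
        // Jackson (2.12+) maps the JSON fields onto the record components by name.
        return new ObjectMapper().readValue(file, EvalDataset.class);
    }
}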
2. Evaluation Metrics Deep Dive
2.1 Classification Metrics
import java.util.List;

public class ClassificationMetrics {

    /** One evaluated example: the model's predicted label vs. the expected label. */
    public record Prediction(String predicted, String expected) {}
public static double accuracy(List<Prediction> predictions) {
long correct = predictions.stream()
.filter(p -> p.predicted().equals(p.expected()))
.count();
return (double) correct / predictions.size();
}
public static double precision(List<Prediction> predictions, String positiveClass) {
long truePositives = predictions.stream()
.filter(p -> p.predicted().equals(positiveClass) &&
p.expected().equals(positiveClass))
.count();
long predictedPositives = predictions.stream()
.filter(p -> p.predicted().equals(positiveClass))
.count();
return predictedPositives == 0 ? 0 : (double) truePositives / predictedPositives;
}
public static double recall(List<Prediction> predictions, String positiveClass) {
long truePositives = predictions.stream()
.filter(p -> p.predicted().equals(positiveClass) &&
p.expected().equals(positiveClass))
.count();
long actualPositives = predictions.stream()
.filter(p -> p.expected().equals(positiveClass))
.count();
return actualPositives == 0 ? 0 : (double) truePositives / actualPositives;
}
public static double f1Score(double precision, double recall) {
if (precision + recall == 0) return 0;
return 2 * (precision * recall) / (precision + recall);
}
}
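Example usage on a small batch, using the Prediction record defined above (expected values shown in comments):

import java.util.List;

public class ClassificationMetricsExample {
    public static void main(String[] args) {
        List<ClassificationMetrics.Prediction> predictions = List.of(
                new ClassificationMetrics.Prediction("shipping_delay", "shipping_delay"),
                new ClassificationMetrics.Prediction("returns", "shipping_delay"),
                new ClassificationMetrics.Prediction("returns", "returns"));

        double accuracy  = ClassificationMetrics.accuracy(predictions);             // 0.667
        double precision = ClassificationMetrics.precision(predictions, "returns"); // 0.5
        double recall    = ClassificationMetrics.recall(predictions, "returns");    // 1.0
        double f1        = ClassificationMetrics.f1Score(precision, recall);        // 0.667

        System.out.printf("acc=%.2f p=%.2f r=%.2f f1=%.2f%n", accuracy, precision, recall, f1);
    }
}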
2.2 Text Generation Metrics
| Metric | Formula/Description | Best For | Limitations |
|---|---|---|---|
| BLEU | N-gram precision overlap | Translation | Penalizes paraphrasing |
| ROUGE-N | N-gram recall overlap | Summarization | Ignores semantics |
| ROUGE-L | Longest common subsequence | Summarization | Order-sensitive |
| BERTScore | Semantic embedding similarity | Any generation | Compute intensive |
| METEOR | Harmonic mean with synonyms | Translation | Requires resources |
Implementation:
# Using evaluate library
import evaluate
# BLEU Score
bleu = evaluate.load("bleu")
results = bleu.compute(
predictions=["The cat sat on the mat"],
references=[["The cat is on the mat"]]
)
print(f"BLEU: {results['bleu']:.3f}")
# ROUGE Score
rouge = evaluate.load("rouge")
results = rouge.compute(
predictions=["AI is transforming healthcare"],
references=["Artificial intelligence is revolutionizing the healthcare industry"]
)
print(f"ROUGE-L: {results['rougeL']:.3f}")
# BERTScore (semantic similarity)
bertscore = evaluate.load("bertscore")
results = bertscore.compute(
predictions=["The weather is nice today"],
references=["It's a beautiful day outside"],
lang="en"
)
print(f"BERTScore F1: {results['f1'][0]:.3f}")
2.3 RAG-Specific Metrics
import static java.util.stream.Collectors.joining;

import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;

public class RagMetrics {
/**
* Measures how much of the retrieved context is relevant to the query
*/
public static double contextRelevance(
String query,
List<Document> retrievedDocs,
EmbeddingModel embeddingModel) {
float[] queryEmbedding = embeddingModel.embed(query);
return retrievedDocs.stream()
.mapToDouble(doc -> {
float[] docEmbedding = embeddingModel.embed(doc.getContent());
return cosineSimilarity(queryEmbedding, docEmbedding);
})
.average()
.orElse(0.0);
}
/**
* Measures how well the answer is grounded in the retrieved context
*/
public static double faithfulness(
String answer,
List<Document> context,
ChatClient judgeClient) {
String prompt = """
Given the context and answer below, rate how well the answer
is supported by the context on a scale of 0-1.
Context:
%s
Answer:
%s
Return only a number between 0 and 1.
""".formatted(
context.stream().map(Document::getContent).collect(joining("\n\n")),
answer
);
String score = judgeClient.prompt().user(prompt).call().content();
return Double.parseDouble(score.trim());
}
/**
* Measures if the answer actually addresses the question
*/
public static double answerRelevance(
String query,
String answer,
ChatClient judgeClient) {
String prompt = """
Rate how well this answer addresses the question on a scale of 0-1.
Question: %s
Answer: %s
Return only a number between 0 and 1.
""".formatted(query, answer);
String score = judgeClient.prompt().user(prompt).call().content();
return Double.parseDouble(score.trim());
}
    /** Plain cosine similarity over the raw embedding vectors. */
    private static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
2.4 RAG Evaluation Framework (RAGAS-style)
RAG Evaluation Dimensions:
- Context Relevance: "Are the retrieved docs relevant to the query?"
- Faithfulness: "Is the answer grounded in the retrieved context?"
- Answer Relevance: "Does the answer address the question?"
- Overall RAG Score = weighted average of the three dimensions
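The final aggregation is just that weighted average; a sketch to sit alongside the RagMetrics methods above (the weights are illustrative and should reflect what matters for your application):

    /** Weighted overall RAG score; illustrative weights, chosen to sum to 1.0. */
    public static double overallRagScore(double contextRelevance,
                                          double faithfulness,
                                          double answerRelevance) {
        double wContext = 0.25, wFaithfulness = 0.45, wAnswer = 0.30;
        return wContext * contextRelevance
                + wFaithfulness * faithfulness
                + wAnswer * answerRelevance;
    }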
3. LLM-as-Judge Evaluation
When no ground truth exists or quality is inherently subjective, use another LLM as the evaluator.
3.1 Single-Point Grading
@Service
public class LlmJudgeService {
private final ChatClient judgeClient;
public EvaluationResult evaluateResponse(
String query,
String response,
List<String> criteria) {
String criteriaList = criteria.stream()
.map(c -> "- " + c)
.collect(Collectors.joining("\n"));
String prompt = """
You are an expert evaluator. Rate the following response.
## Query
%s
## Response
%s
## Evaluation Criteria
%s
## Instructions
For each criterion, provide:
1. Score (1-5, where 5 is excellent)
2. Brief justification
Return your evaluation as JSON:
{
"scores": {
"criterion_name": {"score": X, "reason": "..."}
},
"overall_score": X.X,
"summary": "Overall assessment..."
}
""".formatted(query, response, criteriaList);
String result = judgeClient.prompt()
.user(prompt)
.call()
.content();
return parseEvaluationResult(result);
}
}
3.2 Pairwise Comparison
public class PairwiseJudge {
private final ChatClient judgeClient;
public ComparisonResult compare(
String query,
String responseA,
String responseB) {
String template = """
Compare these two responses to the same query.
## Query
%s
## Response A
%s
## Response B
%s
## Instructions
Which response is better? Consider:
- Accuracy and correctness
- Completeness
- Clarity and helpfulness
- Conciseness
Return JSON:
{
"winner": "A" or "B" or "tie",
"confidence": 0.0-1.0,
"reasoning": "..."
}
""".formatted(query, responseA, responseB);
// Reduce position bias by also testing reverse order
String promptReversed = prompt
.replace("Response A", "Response X")
.replace("Response B", "Response A")
.replace("Response X", "Response B");
String result1 = judgeClient.prompt().user(prompt).call().content();
String result2 = judgeClient.prompt().user(promptReversed).call().content();
return reconcileResults(result1, result2);
}
}
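reconcileResults is referenced above but not shown. A sketch, assuming ComparisonResult is a record with winner(), confidence(), and reasoning(), and parseComparison extracts it from the judge's JSON (both are assumptions); because the second run swaps the response positions, its verdict is inverted before comparing:

    private ComparisonResult reconcileResults(String resultJson, String reversedJson) {
        ComparisonResult first = parseComparison(resultJson);
        ComparisonResult second = parseComparison(reversedJson);

        // In the reversed run, "A" actually referred to responseB, so flip its winner.
        String secondWinner = switch (second.winner()) {
            case "A" -> "B";
            case "B" -> "A";
            default -> "tie";
        };

        if (first.winner().equals(secondWinner)) {
            // Both orderings agree: average the confidences.
            return new ComparisonResult(first.winner(),
                    (first.confidence() + second.confidence()) / 2,
                    first.reasoning());
        }
        // The verdict flipped with ordering, which suggests position bias; call it a tie.
        return new ComparisonResult("tie",
                Math.min(first.confidence(), second.confidence()),
                "Verdict changed with response ordering; treating as a tie");
    }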
3.3 Reference-Based Grading
public class ReferenceGrader {
private final ChatClient judgeClient;
public GradingResult gradeWithReference(
String query,
String response,
String referenceAnswer) {
String prompt = """
Grade this response against the reference answer.
## Query
%s
## Student Response
%s
## Reference Answer
%s
## Grading Rubric
- 5: Equivalent or better than reference
- 4: Mostly correct, minor omissions
- 3: Partially correct, some errors
- 2: Significant errors or missing content
- 1: Incorrect or irrelevant
Return JSON:
{
"grade": X,
"correct_elements": ["..."],
"missing_elements": ["..."],
"errors": ["..."],
"feedback": "..."
}
""".formatted(query, response, referenceAnswer);
return parseGradingResult(
judgeClient.prompt().user(prompt).call().content()
);
}
}
3.4 Multi-Judge Ensemble
@Service
public class EnsembleJudge {
private final List<ChatClient> judges; // Different models
public EnsembleResult evaluate(String query, String response) {
List<Double> scores = judges.parallelStream()
.map(judge -> evaluateWithJudge(judge, query, response))
.toList();
double mean = scores.stream().mapToDouble(d -> d).average().orElse(0);
double variance = scores.stream()
.mapToDouble(s -> Math.pow(s - mean, 2))
.average()
.orElse(0);
return new EnsembleResult(
mean,
Math.sqrt(variance), // Standard deviation
scores,
variance > 0.5 ? "High disagreement - needs human review" : "Consistent"
);
}
private double evaluateWithJudge(ChatClient judge, String query, String response) {
// Same evaluation prompt for all judges
String prompt = createEvaluationPrompt(query, response);
return Double.parseDouble(judge.prompt().user(prompt).call().content().trim());
}
}
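The shared createEvaluationPrompt helper is not shown above; a sketch whose exact wording is an assumption — it must ask for a bare number, because evaluateWithJudge parses the reply with Double.parseDouble:

    private String createEvaluationPrompt(String query, String response) {
        return """
                You are an impartial evaluator. Score the response to the query
                below on a 0-1 scale, where 1 is accurate, complete, and helpful.

                Query: %s
                Response: %s

                Return only a number between 0 and 1.
                """.formatted(query, response);
    }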
4. A/B Testing Infrastructure
4.1 Experiment Framework
@Component
public class PromptExperimentService {
private final ExperimentRepository experimentRepo;
private final MetricsCollector metricsCollector;
private final Map<String, ChatClient> variants;
public ExperimentResult runExperiment(
String experimentId,
String userId,
String query) {
Experiment experiment = experimentRepo.findById(experimentId)
.orElseThrow(() -> new ExperimentNotFoundException(experimentId));
// Deterministic assignment based on user ID
String variantId = assignVariant(userId, experiment);
ChatClient client = variants.get(variantId);
// Execute and measure
long startTime = System.currentTimeMillis();
String response = client.prompt().user(query).call().content();
long latency = System.currentTimeMillis() - startTime;
// Record metrics
metricsCollector.record(ExperimentMetric.builder()
.experimentId(experimentId)
.variantId(variantId)
.userId(userId)
.query(query)
.response(response)
.latencyMs(latency)
.timestamp(Instant.now())
.build());
return new ExperimentResult(variantId, response, latency);
}
private String assignVariant(String userId, Experiment experiment) {
// Consistent hashing for stable assignment
int hash = Math.floorMod(userId.hashCode(), 100); // floorMod avoids a negative bucket when hashCode() is Integer.MIN_VALUE
int cumulative = 0;
for (Variant variant : experiment.getVariants()) {
cumulative += variant.getTrafficPercentage();
if (hash < cumulative) {
return variant.getId();
}
}
return experiment.getVariants().get(0).getId(); // Fallback
}
}
4.2 Experiment Configuration
# experiments/chat-prompt-v2.yaml
experiment:
id: "chat-prompt-v2-test"
name: "Test new system prompt"
description: "Compare concise vs detailed system prompts"
start_date: "2025-01-21"
end_date: "2025-02-21"
variants:
- id: "control"
name: "Current Production"
traffic_percentage: 50
prompt_version: "chat-v1.0"
- id: "treatment"
name: "New Concise Prompt"
traffic_percentage: 50
prompt_version: "chat-v2.0"
metrics:
primary:
- name: "user_satisfaction"
type: "thumbs_up_rate"
minimum_improvement: 0.05 # 5% improvement needed
secondary:
- name: "response_latency_p95"
type: "latency_percentile"
threshold_ms: 3000
- name: "token_usage"
type: "average_tokens"
- name: "task_completion_rate"
type: "conversion"
guardrails:
min_sample_size: 1000
max_degradation: 0.10 # Stop if 10% worse
confidence_level: 0.95
4.3 Statistical Analysis
@Service
public class ExperimentAnalyzer {
public AnalysisResult analyze(String experimentId) {
List<ExperimentMetric> controlMetrics = metricsRepo
.findByExperimentAndVariant(experimentId, "control");
List<ExperimentMetric> treatmentMetrics = metricsRepo
.findByExperimentAndVariant(experimentId, "treatment");
// Sample size check
if (controlMetrics.size() < 1000 || treatmentMetrics.size() < 1000) {
return AnalysisResult.insufficientData();
}
// Calculate metrics
double controlSatisfaction = calculateSatisfactionRate(controlMetrics);
double treatmentSatisfaction = calculateSatisfactionRate(treatmentMetrics);
// Statistical significance (two-proportion z-test)
double zScore = calculateZScore(
controlSatisfaction, controlMetrics.size(),
treatmentSatisfaction, treatmentMetrics.size()
);
double pValue = calculatePValue(zScore);
// Effect size
double relativeImprovement =
(treatmentSatisfaction - controlSatisfaction) / controlSatisfaction;
return AnalysisResult.builder()
.controlMetric(controlSatisfaction)
.treatmentMetric(treatmentSatisfaction)
.absoluteDifference(treatmentSatisfaction - controlSatisfaction)
.relativeImprovement(relativeImprovement)
.pValue(pValue)
.isSignificant(pValue < 0.05)
.recommendation(generateRecommendation(pValue, relativeImprovement))
.build();
}
private String generateRecommendation(double pValue, double improvement) {
if (pValue >= 0.05) {
return "CONTINUE - Not yet statistically significant";
}
if (improvement > 0.05) {
return "SHIP - Significant positive improvement";
}
if (improvement < -0.05) {
return "ROLLBACK - Significant negative impact";
}
return "NO_CHANGE - Difference too small to matter";
}
}
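calculateZScore and calculatePValue are referenced above but not implemented. For a two-proportion z-test, z = (p1 − p2) / sqrt(p̂(1 − p̂)(1/n1 + 1/n2)) with p̂ the pooled proportion, and the two-sided p-value is 2·(1 − Φ(|z|)). A self-contained sketch; the normal CDF uses the Abramowitz–Stegun erf approximation:

public final class TwoProportionZTest {

    /** z = (p2 - p1) / sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2)) */
    public static double zScore(double p1, long n1, double p2, long n2) {
        double pooled = (p1 * n1 + p2 * n2) / (n1 + n2);
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2));
        return (p2 - p1) / se;
    }

    /** Two-sided p-value: 2 * (1 - Phi(|z|)). */
    public static double pValue(double z) {
        return 2 * (1 - normalCdf(Math.abs(z)));
    }

    private static double normalCdf(double x) {
        return 0.5 * (1 + erf(x / Math.sqrt(2)));
    }

    /** Abramowitz-Stegun 7.1.26 approximation, accurate to about 1e-7. */
    private static double erf(double x) {
        double sign = Math.signum(x);
        x = Math.abs(x);
        double t = 1 / (1 + 0.3275911 * x);
        double y = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
        return sign * y;
    }
}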
5. Prompt Version Control
5.1 File-Based Version Control
prompts/
├── system/
│ ├── customer-support/
│ │ ├── v1.0.yaml
│ │ ├── v1.1.yaml
│ │ └── v2.0.yaml
│ └── code-assistant/
│ └── v1.0.yaml
├── tasks/
│ ├── summarization/
│ │ └── v1.0.yaml
│ └── classification/
│ └── v1.0.yaml
└── experiments/
├── exp-001-concise-prompt/
│ ├── control.yaml
│ └── treatment.yaml
└── exp-002-few-shot/
├── zero-shot.yaml
└── three-shot.yaml
5.2 Prompt Template Schema
# prompts/system/customer-support/v2.0.yaml
metadata:
id: "customer-support-v2.0"
version: "2.0.0"
created: "2025-01-21"
author: "ai-team"
status: "production" # draft, staging, production, deprecated
parent_version: "1.1.0"
change_log: |
- Added product return handling
- Improved tone for frustrated customers
- Reduced response length by 20%
evaluation:
dataset: "customer-support-eval-v3"
metrics:
accuracy: 0.94
user_satisfaction: 0.88
avg_latency_ms: 1200
evaluated_at: "2025-01-20"
config:
model: "gpt-4o"
temperature: 0.7
max_tokens: 500
top_p: 0.95
prompt:
system: |
You are a customer support agent for TechCorp.
## Guidelines
- Be helpful, concise, and empathetic
- If customer is frustrated, acknowledge their feelings first
- Always offer to escalate if you can't resolve the issue
- Never make promises about refunds without checking policy
## Capabilities
- Check order status
- Process returns (within 30 days)
- Answer product questions
- Schedule callbacks
## Limitations
- Cannot access payment details
- Cannot modify existing orders
- Must escalate billing disputes
user: |
Customer message: {customer_message}
Order history: {order_history}
Previous conversation: {conversation_history}
5.3 Prompt Registry Service
@Service
public class PromptRegistry {
private final PromptRepository promptRepo;
private final CacheManager cacheManager;
@Cacheable(value = "prompts", key = "#promptId + ':' + #version")
public PromptTemplate getPrompt(String promptId, String version) {
return promptRepo.findByIdAndVersion(promptId, version)
.map(this::toPromptTemplate)
.orElseThrow(() -> new PromptNotFoundException(promptId, version));
}
public PromptTemplate getLatestPrompt(String promptId) {
return promptRepo.findLatestByStatus(promptId, "production")
.map(this::toPromptTemplate)
.orElseThrow(() -> new PromptNotFoundException(promptId));
}
@Transactional
public PromptVersion createVersion(String promptId, PromptVersionRequest request) {
// Validate prompt syntax
validatePromptSyntax(request.getPromptContent());
// Create new version
PromptVersion newVersion = PromptVersion.builder()
.promptId(promptId)
.version(incrementVersion(promptId))
.content(request.getPromptContent())
.config(request.getConfig())
.status("draft")
.createdBy(getCurrentUser())
.build();
promptRepo.save(newVersion);
        // Cache keys are "promptId:version", so evicting by promptId alone would miss;
        // clear the prompt cache instead (promoteToProduction does the same below)
        cacheManager.getCache("prompts").clear();
return newVersion;
}
@Transactional
public void promoteToProduction(String promptId, String version) {
// Demote current production version
promptRepo.findByIdAndStatus(promptId, "production")
.ifPresent(current -> {
current.setStatus("deprecated");
promptRepo.save(current);
});
// Promote new version
PromptVersion newProd = promptRepo.findByIdAndVersion(promptId, version)
.orElseThrow();
newProd.setStatus("production");
newProd.setPromotedAt(Instant.now());
promptRepo.save(newProd);
// Clear all caches for this prompt
cacheManager.getCache("prompts").clear();
}
}
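incrementVersion is left abstract above. A sketch that bumps the minor component of a semantic version; findLatestVersionNumber is a hypothetical repository method, and treating a new draft as a backwards-compatible change is an assumption:

    private String incrementVersion(String promptId) {
        // Hypothetical query returning the highest existing version string, e.g. "2.0.0".
        String latest = promptRepo.findLatestVersionNumber(promptId).orElse("0.0.0");
        String[] parts = latest.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        // New drafts bump the minor version; a breaking prompt rewrite would bump major.
        return major + "." + (minor + 1) + ".0";
    }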
6. CI/CD Integration
6.1 GitHub Actions Workflow
# .github/workflows/prompt-evaluation.yml
name: Prompt Evaluation Pipeline
on:
pull_request:
paths:
- 'prompts/**'
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
jobs:
syntax-check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Validate YAML syntax
run: |
pip install yamllint
yamllint prompts/
- name: Validate prompt schema
run: |
python scripts/validate_prompts.py prompts/
evaluate:
runs-on: ubuntu-latest
needs: syntax-check
steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so the diff against origin/main below works
- name: Set up Java
uses: actions/setup-java@v4
with:
java-version: '21'
distribution: 'temurin'
- name: Identify changed prompts
id: changes
run: |
          # Comma-join the list so the multi-line diff output fits GITHUB_OUTPUT's key=value format
          CHANGED=$(git diff --name-only origin/main...HEAD | grep "^prompts/" | head -20 | paste -sd "," -)
          echo "changed_prompts=$CHANGED" >> "$GITHUB_OUTPUT"
- name: Run evaluations
run: |
./mvnw test -Dtest=PromptEvaluationTest \
-Dprompts.changed="${{ steps.changes.outputs.changed_prompts }}"
- name: Check quality gates
run: |
python scripts/check_quality_gates.py \
--results target/eval-results.json \
--min-accuracy 0.90 \
--min-relevance 0.85
- name: Upload evaluation report
uses: actions/upload-artifact@v4
with:
name: evaluation-report
path: target/eval-results.json
- name: Comment PR with results
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('target/eval-results.json'));
const comment = `## Prompt Evaluation Results
| Metric | Value | Threshold | Status |
|--------|-------|-----------|--------|
| Accuracy | ${results.accuracy.toFixed(3)} | 0.90 | ${results.accuracy >= 0.90 ? '✅' : '❌'} |
| Relevance | ${results.relevance.toFixed(3)} | 0.85 | ${results.relevance >= 0.85 ? '✅' : '❌'} |
| Avg Latency | ${results.latency_ms}ms | 2000ms | ${results.latency_ms <= 2000 ? '✅' : '❌'} |
${results.passed ? '**✅ All quality gates passed**' : '**❌ Quality gates failed**'}
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: comment
});
  regression-test:
    runs-on: ubuntu-latest
    needs: evaluate
    steps:
      - uses: actions/checkout@v4
      - name: Download evaluation report
        uses: actions/download-artifact@v4
        with:
          name: evaluation-report
          path: target/
      - name: Compare with baseline
        run: |
          python scripts/regression_check.py \
            --current target/eval-results.json \
            --baseline baselines/production.json \
            --max-degradation 0.05
6.2 Quality Gate Implementation
@Service
public class QualityGateService {
private final EvaluationService evaluationService;
private final PromptRegistry promptRegistry;
public QualityGateResult evaluate(String promptId, String version) {
PromptTemplate prompt = promptRegistry.getPrompt(promptId, version);
EvaluationResult evalResult = evaluationService.runFullEvaluation(prompt);
List<GateCheck> checks = new ArrayList<>();
// Accuracy gate
checks.add(new GateCheck(
"accuracy",
evalResult.getAccuracy(),
0.90,
evalResult.getAccuracy() >= 0.90
));
// Relevance gate (for RAG)
if (prompt.isRagEnabled()) {
checks.add(new GateCheck(
"relevance",
evalResult.getRelevance(),
0.85,
evalResult.getRelevance() >= 0.85
));
}
// Latency gate
checks.add(new GateCheck(
"latency_p95_ms",
evalResult.getLatencyP95(),
2000.0,
evalResult.getLatencyP95() <= 2000
));
// Token efficiency
checks.add(new GateCheck(
"avg_tokens",
evalResult.getAvgTokens(),
1500.0,
evalResult.getAvgTokens() <= 1500
));
// Regression check against production baseline
if (promptRegistry.hasProductionVersion(promptId)) {
EvaluationResult baseline = getProductionBaseline(promptId);
double degradation = (baseline.getAccuracy() - evalResult.getAccuracy())
/ baseline.getAccuracy();
checks.add(new GateCheck(
"regression",
degradation,
0.05, // Max 5% degradation
degradation <= 0.05
));
}
boolean allPassed = checks.stream().allMatch(GateCheck::passed);
return new QualityGateResult(
promptId,
version,
allPassed,
checks,
allPassed ? "Ready for deployment" : "Quality gates failed"
);
}
}
7. Production Monitoring
7.1 Metrics Collection
@Component
public class PromptMetricsCollector {
private final MeterRegistry meterRegistry;
public void recordRequest(PromptExecution execution) {
// Latency
meterRegistry.timer("prompt.latency",
"prompt_id", execution.getPromptId(),
"version", execution.getVersion())
.record(Duration.ofMillis(execution.getLatencyMs()));
// Token usage
meterRegistry.counter("prompt.tokens.input",
"prompt_id", execution.getPromptId())
.increment(execution.getInputTokens());
meterRegistry.counter("prompt.tokens.output",
"prompt_id", execution.getPromptId())
.increment(execution.getOutputTokens());
// Cost estimation
double cost = calculateCost(
execution.getModel(),
execution.getInputTokens(),
execution.getOutputTokens()
);
meterRegistry.counter("prompt.cost.usd",
"prompt_id", execution.getPromptId(),
"model", execution.getModel())
.increment(cost);
// Error tracking
if (execution.isError()) {
meterRegistry.counter("prompt.errors",
"prompt_id", execution.getPromptId(),
"error_type", execution.getErrorType())
.increment();
}
}
public void recordFeedback(String promptId, boolean positive) {
meterRegistry.counter("prompt.feedback",
"prompt_id", promptId,
"sentiment", positive ? "positive" : "negative")
.increment();
}
}
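calculateCost is referenced above but not shown. A sketch with illustrative per-million-token prices; real prices vary by provider and change over time, so they belong in configuration rather than code:

import java.util.Map;

public class CostCalculator {

    // Illustrative placeholder prices in USD per 1M tokens (input, output); load real values from config.
    private static final Map<String, double[]> PRICE_PER_MILLION = Map.of(
            "gpt-4o",      new double[]{2.50, 10.00},
            "gpt-4o-mini", new double[]{0.15, 0.60});

    public static double calculateCost(String model, long inputTokens, long outputTokens) {
        double[] price = PRICE_PER_MILLION.getOrDefault(model, new double[]{0, 0});
        return (inputTokens / 1_000_000.0) * price[0]
                + (outputTokens / 1_000_000.0) * price[1];
    }
}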
7.2 Monitoring Dashboard Queries
# Grafana dashboard configuration
panels:
- title: "Prompt Latency (P95)"
query: |
histogram_quantile(0.95,
sum(rate(prompt_latency_seconds_bucket[5m])) by (le, prompt_id)
)
alert:
threshold: 3
condition: "> 3s for 5 minutes"
- title: "Error Rate by Prompt"
query: |
sum(rate(prompt_errors_total[5m])) by (prompt_id, error_type)
/
sum(rate(prompt_requests_total[5m])) by (prompt_id)
alert:
threshold: 0.05
condition: "> 5% error rate"
- title: "User Satisfaction Rate"
query: |
sum(prompt_feedback_total{sentiment="positive"}) by (prompt_id)
/
sum(prompt_feedback_total) by (prompt_id)
alert:
threshold: 0.80
condition: "< 80% satisfaction"
- title: "Cost per 1K Requests"
query: |
(sum(rate(prompt_cost_usd_total[1h])) by (prompt_id) * 1000)
/
(sum(rate(prompt_requests_total[1h])) by (prompt_id))
7.3 Alerting Configuration
# alerts/prompt-alerts.yaml
alerts:
- name: high_error_rate
description: "Prompt error rate above threshold"
query: |
sum(rate(prompt_errors_total[5m])) by (prompt_id)
/ sum(rate(prompt_requests_total[5m])) by (prompt_id)
> 0.05
severity: critical
channels: ["pagerduty", "slack-ai-alerts"]
- name: latency_degradation
description: "P95 latency significantly increased"
query: |
histogram_quantile(0.95, rate(prompt_latency_seconds_bucket[10m]))
> 3
severity: warning
channels: ["slack-ai-alerts"]
- name: satisfaction_drop
description: "User satisfaction dropped below 80%"
query: |
sum(prompt_feedback_total{sentiment="positive"}) by (prompt_id)
/ sum(prompt_feedback_total) by (prompt_id)
< 0.80
severity: warning
channels: ["slack-ai-alerts"]
- name: cost_spike
description: "Unusual cost increase detected"
query: |
rate(prompt_cost_usd_total[1h])
> 2 * avg_over_time(rate(prompt_cost_usd_total[1h])[24h:1h])
severity: warning
channels: ["slack-ai-alerts"]
8. Continuous Improvement Workflow
Continuous Improvement Loop:

COLLECT → ANALYZE → ITERATE → VALIDATE → DEPLOY → MONITOR → (back to COLLECT)

- COLLECT: traces, errors, user feedback, metrics
- ANALYZE: find failure patterns, cluster issues
- ITERATE: create prompt variants, set up A/B tests
- VALIDATE: run evals, enforce quality gates
- DEPLOY: promote the winning variant, update the registry
- MONITOR: alerts, dashboards, anomaly detection; findings feed back into COLLECT
8.1 Production Trace Collection
@Service
public class TraceCollector {
private final TraceRepository traceRepo;
private final EvaluationDatasetBuilder datasetBuilder;
@Async
public void collectTrace(PromptTrace trace) {
// Store trace
traceRepo.save(trace);
// Automatically flag interesting cases
if (shouldFlagForReview(trace)) {
flagForHumanReview(trace);
}
// Convert negative feedback to test cases
if (trace.getFeedback() != null && !trace.getFeedback().isPositive()) {
datasetBuilder.addNegativeExample(
trace.getPromptId(),
trace.getQuery(),
trace.getResponse(),
trace.getFeedback().getComment()
);
}
}
private boolean shouldFlagForReview(PromptTrace trace) {
return trace.getLatencyMs() > 5000 || // Slow
trace.isError() || // Failed
trace.getOutputTokens() > 2000 || // Too verbose
containsSensitivePattern(trace.getResponse()); // Safety concern
}
}
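containsSensitivePattern can start as a simple regex screen; the patterns below are illustrative, not exhaustive, and a production system would add proper PII detection:

import java.util.List;
import java.util.regex.Pattern;

public class SensitivePatternCheck {

    // Illustrative screens: card-like digit runs, SSN-like patterns, API-key prefixes.
    private static final List<Pattern> PATTERNS = List.of(
            Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b"),
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b"),
            Pattern.compile("\\bsk-[A-Za-z0-9]{20,}\\b"));

    public static boolean containsSensitivePattern(String text) {
        return PATTERNS.stream().anyMatch(p -> p.matcher(text).find());
    }
}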
8.2 Automated Test Case Generation
@Service
public class TestCaseGenerator {
private final TraceRepository traceRepo;
private final ChatClient judgeClient;
public List<TestCase> generateFromProduction(
String promptId,
int count,
TestCaseStrategy strategy) {
List<PromptTrace> traces = switch (strategy) {
case FAILURES -> traceRepo.findFailedTraces(promptId, count);
case EDGE_CASES -> traceRepo.findEdgeCases(promptId, count);
case DIVERSE -> traceRepo.findDiverseTraces(promptId, count);
case NEGATIVE_FEEDBACK -> traceRepo.findNegativeFeedback(promptId, count);
};
return traces.stream()
.map(this::traceToTestCase)
.filter(Objects::nonNull)
.toList();
}
private TestCase traceToTestCase(PromptTrace trace) {
// Use LLM to generate expected output from human feedback
if (trace.getFeedback() != null) {
String expectedOutput = generateExpectedOutput(
trace.getQuery(),
trace.getResponse(),
trace.getFeedback().getComment()
);
return new TestCase(
trace.getQuery(),
expectedOutput,
TestCase.Source.PRODUCTION_FEEDBACK,
trace.getId()
);
}
return null;
}
}
9. Best Practices Summary
Evaluation Checklist
- Define success metrics before building
- Create evaluation dataset with 100+ samples minimum
- Implement automated evals in CI/CD
- Use LLM-as-Judge for subjective quality
- Run A/B tests before full deployment
- Set quality gates with clear thresholds
- Monitor production with real-time dashboards
- Collect user feedback (thumbs up/down)
- Convert failures to test cases automatically
- Version control prompts like code
Metric Targets by Use Case
| Use Case | Primary Metric | Target | Secondary Metrics |
|---|---|---|---|
| Classification | Accuracy | >95% | F1, Latency |
| RAG Q&A | Faithfulness | >90% | Relevance, Latency |
| Summarization | ROUGE-L | >0.4 | BERTScore, Length |
| Code Gen | Pass@1 | >70% | Syntax valid, Latency |
| Customer Support | Satisfaction | >85% | Resolution rate |
| Translation | BLEU | >0.3 | BERTScore |
References
- Anthropic. (2024). Evaluating AI Models. Anthropic Research
- OpenAI. (2024). Building Evals. OpenAI Cookbook
- Braintrust. (2025). Best Prompt Evaluation Tools 2025.
- RAGAS. (2024). RAG Evaluation Framework. GitHub
- Spring AI. (2025). Evaluation Documentation. Spring.io
- Lakera. (2025). Ultimate Guide to Prompt Engineering.