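6 Evaluation and Version Control

Why Evaluation Matters

Without measurement, prompt engineering is guesswork. Production AI systems need systematic evaluation, version control, and continuous improvement, just like traditional software.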

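The Evaluation Gap

Traditional software development:      AI/prompt development:
┌─────────────────────────────┐        ┌─────────────────────────────┐
│ ✅ Unit tests               │        │ ❌ "Looks good to me"       │
│ ✅ Integration tests        │        │ ❌ Manual spot checks       │
│ ✅ Code coverage            │        │ ❌ Iterating by feel        │
│ ✅ CI/CD quality gates      │        │ ❌ Deploy and pray          │
│ ✅ Performance benchmarks   │        │ ❌ Unknown regressions      │
└─────────────────────────────┘        └─────────────────────────────┘

   Sound engineering practice     vs        "Prompting by vibes"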

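A Professional Approach

Systematic prompt engineering:
┌──────────────────────────────────────────────────────────────────────┐
│  Define → Measure → Iterate → Validate → Deploy → Monitor → Repeat   │
├──────────────────────────────────────────────────────────────────────┤
│  ✅ Evaluation datasets with ground-truth labels                     │
│  ✅ Automated metrics (accuracy, relevance, coherence)               │
│  ✅ LLM-as-judge for subjective quality assessment                   │
│  ✅ A/B testing infrastructure                                       │
│  ✅ Prompt version control                                           │
│  ✅ CI/CD quality gates                                              │
│  ✅ Production monitoring and alerting                               │
└──────────────────────────────────────────────────────────────────────┘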

1. Evaluation Fundamentals

1.1 What Is an Eval?

An eval (evaluation) is a structured test that measures how well a prompt performs on a specific task. It has three parts:

Eval components:
┌───────────────────────────────────────────────────────────────────────────────┐
│                                                                               │
│  1. Dataset                          2. Metrics              3. Threshold     │
│  ┌───────────────────────────┐       ┌─────────────────┐     ┌─────────────┐  │
│  │ Input: "What is the       │       │ Accuracy: 95%   │     │ Pass: ≥90%  │  │
│  │   capital of France?"     │  →    │ Relevance: 0.87 │  →  │ Fail: <90%  │  │
│  │ Expected: Paris           │       │ Latency: 1.2s   │     │             │  │
│  └───────────────────────────┘       └─────────────────┘     └─────────────┘  │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
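The three components can be wired together in a few lines. The sketch below is illustrative only: `Sample`, `run_eval`, and the stubbed `fake_model` are hypothetical names standing in for a real dataset loader and a real model call, and case-insensitive exact match is assumed as the metric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Sample:
    input: str      # what is sent to the prompt
    expected: str   # ground-truth answer


def run_eval(samples: Sequence[Sample],
             predict: Callable[[str], str],
             threshold: float = 0.90) -> Tuple[float, bool]:
    """Score every sample with exact match and compare accuracy to the threshold."""
    if not samples:
        raise ValueError("dataset is empty")
    correct = sum(
        1 for s in samples
        if predict(s.input).strip().lower() == s.expected.strip().lower()
    )
    accuracy = correct / len(samples)
    return accuracy, accuracy >= threshold


def fake_model(question: str) -> str:
    """Stub standing in for a real LLM call (assumption for the example)."""
    answers = {"What is the capital of France?": "Paris"}
    return answers.get(question, "unknown")


dataset = [
    Sample("What is the capital of France?", "Paris"),
    Sample("What is the capital of Japan?", "Tokyo"),
]
accuracy, passed = run_eval(dataset, fake_model)
print(f"accuracy={accuracy:.2f} passed={passed}")
```

Swapping `fake_model` for a function that calls the prompt under test is the only change needed to turn this into a real offline eval.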

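1.2 Types of Evaluation

Type	Description	When to use
Offline eval	Batch evaluation on a test dataset	Development, CI/CD
Online eval	A/B testing with real users	Production validation
LLM-as-judge	A second LLM scores the responses	When no ground truth exists
Human eval	Manual labeling by experts	Gold standard; calibrating LLM judges
Automated metrics	BLEU, ROUGE, BERTScore	Translation, summarization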

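1.3 Evaluation Dataset Design

Dataset Size Guidelines

The minimum dataset size depends on task complexity. Too few samples make the metrics unreliable; too many waste time and API budget.

Task type	Minimum samples	Recommended samples	Notes
Binary classification	100	500+	Balance the class distribution
Multi-class (5 classes)	200	1000+	At least 40 per class
Open-ended generation	50	200+	Cover diverse scenarios
RAG evaluation	100	300+	Varied query types
Summarization	50	150+	Varied document lengths
Code generation	100	500+	Cover edge cases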
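To see why small datasets give unreliable metrics, it helps to look at the uncertainty around a measured accuracy. The sketch below uses the normal-approximation 95% confidence interval for a proportion; `accuracy_margin` is a hypothetical helper name, and the approximation is rough when accuracy is close to 0 or 1 or when n is very small.

```python
import math


def accuracy_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for accuracy."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# How much a measured 90% accuracy could be off at different dataset sizes
for n in (50, 100, 500, 1000):
    print(f"n={n:4d}  margin=±{accuracy_margin(0.90, n):.3f}")
```

With 100 samples, a measured 90% accuracy carries a margin of roughly ±6 percentage points, so a 2-point difference between two prompts cannot be distinguished from noise at that size.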

Example dataset structure:

{
  "dataset_id": "customer-support-v2",
  "created": "2025-01-21",
  "task_type": "classification",
  "samples": [
    {
      "id": "cs-001",
      "input": "My order still hasn't arrived and it has been two weeks",
      "expected_output": "shipping_delay",
      "metadata": {
        "category": "shipping",
        "difficulty": "easy",
        "source": "production_logs"
      }
    },
    {
      "id": "cs-002",
      "input": "I want to return an item, but the return button doesn't work",
      "expected_output": "return_technical_issue",
      "metadata": {
        "category": "returns",
        "difficulty": "medium",
        "source": "manual_annotation"
      }
    }
  ]
}
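A dataset in this shape can be checked against the size guidelines before it is used. The sketch below is a minimal example, assuming the `samples` / `expected_output` field names from the JSON above; `underrepresented_labels` is a hypothetical helper, and the default of 40 per class comes from the multi-class row of the table.

```python
from collections import Counter
from typing import Dict, List


def underrepresented_labels(samples: List[dict], min_per_class: int = 40) -> Dict[str, int]:
    """Return every label that has fewer than min_per_class samples, with its count."""
    counts = Counter(sample["expected_output"] for sample in samples)
    return {label: n for label, n in counts.items() if n < min_per_class}


# Tiny in-memory dataset mirroring the structure shown above
dataset = {
    "dataset_id": "customer-support-v2",
    "samples": [
        {"id": "cs-001", "expected_output": "shipping_delay"},
        {"id": "cs-002", "expected_output": "return_technical_issue"},
        {"id": "cs-003", "expected_output": "shipping_delay"},
    ],
}

print(underrepresented_labels(dataset["samples"], min_per_class=2))
```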

2. Evaluation Metrics in Depth

2.1 Classification Metrics

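import java.util.List;

// Prediction is assumed to be a simple value type, for example:
// record Prediction(String predicted, String expected) {}
public class ClassificationMetrics {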

    public static double accuracy(List<Prediction> predictions) {
        long correct = predictions.stream()
            .filter(p -> p.predicted().equals(p.expected()))
            .count();
        return (double) correct / predictions.size();
    }

    public static double precision(List<Prediction> predictions, String positiveClass) {
        long truePositives = predictions.stream()
            .filter(p -> p.predicted().equals(positiveClass) &&
                        p.expected().equals(positiveClass))
            .count();
        long predictedPositives = predictions.stream()
            .filter(p -> p.predicted().equals(positiveClass))
            .count();
        return predictedPositives == 0 ? 0 : (double) truePositives / predictedPositives;
    }

    public static double recall(List<Prediction> predictions, String positiveClass) {
        long truePositives = predictions.stream()
            .filter(p -> p.predicted().equals(positiveClass) &&
                        p.expected().equals(positiveClass))
            .count();
        long actualPositives = predictions.stream()
            .filter(p -> p.expected().equals(positiveClass))
            .count();
        return actualPositives == 0 ? 0 : (double) truePositives / actualPositives;
    }

    public static double f1Score(double precision, double recall) {
        if (precision + recall == 0) return 0;
        return 2 * (precision * recall) / (precision + recall);
    }
}

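2.2 Text Generation Metrics

Metric	Formula/description	Best suited for	Limitations
BLEU	N-gram precision overlap	Machine translation	Penalizes paraphrasing
ROUGE-N	N-gram recall overlap	Summarization	Ignores semantics
ROUGE-L	Longest common subsequence	Summarization	Sensitive to word order
BERTScore	Semantic embedding similarity	Any generation task	Computationally expensive
METEOR	Harmonic mean of precision and recall with synonym matching	Machine translation	Requires linguistic resources

Python implementation example: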

# Using the Hugging Face evaluate library
import evaluate

# BLEU score
bleu = evaluate.load("bleu")
results = bleu.compute(
    predictions=["the cat sat on the mat"],
    references=[["the cat is on the mat"]]
)
print(f"BLEU: {results['bleu']:.3f}")

# ROUGE score
rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["AI is transforming healthcare"],
    references=["Artificial intelligence is revolutionizing the healthcare industry"]
)
print(f"ROUGE-L: {results['rougeL']:.3f}")

# BERTScore (semantic similarity)
bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["The weather is nice today"],
    references=["It is lovely weather outside"],
    lang="en"  # language of the texts being compared
)
print(f"BERTScore F1: {results['f1'][0]:.3f}")
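To make the "longest common subsequence" description of ROUGE-L concrete, here is a dependency-free sketch of the idea. It is not the library's implementation: it assumes plain whitespace tokenization and no stemming, so its numbers may differ from what a library reports.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(f"{rouge_l_f1('the cat sat on the mat', 'the cat is on the mat'):.3f}")
```

The two sentences share the subsequence "the cat on the mat" (5 of 6 tokens on each side), which is why a single substituted word still scores high while a reordered sentence would not.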

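2.3 RAG-Specific Metrics

import java.util.List;
import static java.util.stream.Collectors.joining;

public class RagMetrics {

    /**
     * Measures how relevant the retrieved context is to the query.
     */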
    public static double contextRelevance(
            String query,
            List<Document> retrievedDocs,
            EmbeddingModel embeddingModel) {

        float[] queryEmbedding = embeddingModel.embed(query);

        return retrievedDocs.stream()
            .mapToDouble(doc -> {
                float[] docEmbedding = embeddingModel.embed(doc.getContent());
                return cosineSimilarity(queryEmbedding, docEmbedding);
            })
            .average()
            .orElse(0.0);
    }

    /**
     * Measures how well the answer is grounded in the retrieved context.
     */
    public static double faithfulness(
            String answer,
            List<Document> context,
            ChatClient judgeClient) {

        String prompt = """
            Given the context and answer below, rate how well the answer is supported by the context on a scale from 0 to 1.

            Context:
            %s

            Answer:
            %s

            Return only a number between 0 and 1.
            """.formatted(
                context.stream().map(Document::getContent).collect(joining("\n\n")),
                answer
            );

        String score = judgeClient.prompt().user(prompt).call().content();
        return Double.parseDouble(score.trim());
    }

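    /**
     * Measures whether the answer actually addresses the question.
     */
    public static double answerRelevance(
            String query,
            String answer,
            ChatClient judgeClient) {

        String prompt = """
            Rate how well this answer addresses the question on a scale from 0 to 1.

            Question: %s
            Answer: %s

            Return only a number between 0 and 1.
            """.formatted(query, answer);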

        String score = judgeClient.prompt().user(prompt).call().content();
        return Double.parseDouble(score.trim());
    }
}
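Parsing a judge's reply directly as a number fails as soon as the model adds a word of explanation. The sketch below shows one defensive approach, assuming the judge was asked for a value between 0 and 1: take the first number in the reply and clamp it. `parse_unit_score` is a hypothetical helper, and a reply such as "3 out of 5" would still be misread, so a stricter format check may be preferable in practice.

```python
import re
from typing import Optional

# First integer or decimal number in the text, with an optional minus sign
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_unit_score(raw: str) -> Optional[float]:
    """Extract the first number from a judge reply and clamp it to [0, 1].

    Returns None when the reply contains no number at all, so the caller
    can retry or flag the sample instead of crashing.
    """
    match = _NUMBER.search(raw)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


print(parse_unit_score("0.85"))
print(parse_unit_score("Score: 0.7 (mostly supported)"))
print(parse_unit_score("I cannot judge this."))
```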

2.4 RAG Evaluation Framework (RAGAS-Style)

RAG evaluation dimensions:
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  ┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐  │
│  │ Context           │   │ Faithfulness      │   │ Answer            │  │
│  │ relevance         │   │                   │   │ relevance         │  │
│  │                   │   │                   │   │                   │  │
│  │ "Are the retrieved│   │ "Is the answer    │   │ "Does the answer  │  │
│  │  docs relevant to │   │  grounded in the  │   │  address the      │  │
│  │  the query?"      │   │  context?"        │   │  question?"       │  │
│  └─────────┬─────────┘   └─────────┬─────────┘   └─────────┬─────────┘  │
│            │                       │                       │            │
│            └───────────────────────┼───────────────────────┘            │
│                                    ▼                                    │
│                     ┌─────────────────────────────┐                     │
│                     │   Overall RAG score         │                     │
│                     │   = weighted average        │                     │
│                     └─────────────────────────────┘                     │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
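The "weighted average" in the diagram can be sketched as follows. The weights are placeholders chosen for illustration, not recommended values, and `overall_rag_score` is a hypothetical helper name.

```python
from typing import Dict


def overall_rag_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores; weights need not sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


scores = {"context_relevance": 0.8, "faithfulness": 0.9, "answer_relevance": 0.7}
weights = {"context_relevance": 0.3, "faithfulness": 0.4, "answer_relevance": 0.3}  # illustrative only
print(f"{overall_rag_score(scores, weights):.2f}")
```

Weighting faithfulness highest, as in this example, reflects one possible priority (penalizing hallucination more than a weak retrieval); the right balance depends on the application.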

3. LLM-as-Judge Evaluation

When ground truth does not exist or quality is subjective, use a second LLM to do the scoring.

3.1 Pointwise Scoring

@Service
public class LlmJudgeService {

    private final ChatClient judgeClient;

    public EvaluationResult evaluateResponse(
            String query,
            String response,
            List<String> criteria) {

        String criteriaList = criteria.stream()
            .map(c -> "- " + c)
            .collect(Collectors.joining("\n"));

        String prompt = """
            You are an expert evaluator. Assess the response below.

            ## Question
            %s

            ## Response
            %s

            ## Evaluation criteria
            %s

            ## Instructions
            For each criterion, provide:
            1. A score (1-5, where 5 is excellent)
            2. A brief justification

            Return the evaluation as JSON:
            {
              "scores": {
                "criterion_name": {"score": X, "reason": "..."}
              },
              "overall_score": X.X,
              "summary": "Overall assessment..."
            }
            """.formatted(query, response, criteriaList);

        String result = judgeClient.prompt()
            .user(prompt)
            .call()
            .content();

        return parseEvaluationResult(result);
    }
}

3.2 Pairwise Comparison

public class PairwiseJudge {

    private final ChatClient judgeClient;

    public ComparisonResult compare(
            String query,
            String responseA,
            String responseB) {

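        String template = """
            Compare these two responses to the same question.

            ## Question
            %s

            ## Response A
            %s

            ## Response B
            %s

            ## Instructions
            Which response is better? Consider:
            - Accuracy and correctness
            - Completeness
            - Clarity and helpfulness
            - Conciseness

            Return JSON:
            {
              "winner": "A" or "B" or "tie",
              "confidence": 0.0-1.0,
              "reasoning": "..."
            }
            """;

        // Judge both orderings to reduce position bias. In the second call the
        // responses trade places, so a verdict of "A" there refers to responseB
        // and reconcileResults must map it back before comparing.
        String prompt = template.formatted(query, responseA, responseB);
        String promptReversed = template.formatted(query, responseB, responseA);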

        String result1 = judgeClient.prompt().user(prompt).call().content();
        String result2 = judgeClient.prompt().user(promptReversed).call().content();

        return reconcileResults(result1, result2);
    }
}
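The reconciliation step is not shown above. One plausible version, sketched here under the assumption that each judge reply is the JSON object requested in the prompt, maps the second verdict back to the original labels and only declares a winner when both runs agree. `reconcile` and `FLIP` are hypothetical names.

```python
import json

# In the second run the labels (or positions) were swapped, so its verdict must be flipped
FLIP = {"A": "B", "B": "A", "tie": "tie"}


def reconcile(result_original: str, result_swapped: str) -> str:
    """Return "A", "B", or "tie" from two judge replies given in opposite orders.

    A disagreement between the two runs suggests the verdict depended on
    ordering rather than quality, so it is treated as a tie.
    """
    first = json.loads(result_original)["winner"]
    second = FLIP[json.loads(result_swapped)["winner"]]
    return first if first == second else "tie"


# The judge preferred the same underlying response in both orders
print(reconcile('{"winner": "A"}', '{"winner": "B"}'))
# The judge picked whichever response was labeled "A" both times
print(reconcile('{"winner": "A"}', '{"winner": "A"}'))
```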

3.3 Reference-Based Grading

public class ReferenceGrader {

    private final ChatClient judgeClient;

    public GradingResult gradeWithReference(
            String query,
            String response,
            String referenceAnswer) {

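        String prompt = """
            Grade this response against the reference answer.

            ## Question
            %s

            ## Student response
            %s

            ## Reference answer
            %s

            ## Grading rubric
            - 5: Equivalent to or better than the reference answer
            - 4: Mostly correct, with minor omissions
            - 3: Partially correct, with some errors
            - 2: Major errors or missing content
            - 1: Incorrect or irrelevant

            Return JSON:
            {
              "grade": X,
              "correct_elements": ["..."],
              "missing_elements": ["..."],
              "errors": ["..."],
              "feedback": "..."
            }
            """.formatted(query, response, referenceAnswer);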

        return parseGradingResult(
            judgeClient.prompt().user(prompt).call().content()
        );
    }
}

3.4 Multi-Judge Ensemble

@Service
public class EnsembleJudge {

    private final List<ChatClient> judges;  // one client per judge model

    public EnsembleResult evaluate(String query, String response) {
        List<Double> scores = judges.parallelStream()
            .map(judge -> evaluateWithJudge(judge, query, response))
            .toList();

        double mean = scores.stream().mapToDouble(d -> d).average().orElse(0);
        double variance = scores.stream()
            .mapToDouble(s -> Math.pow(s - mean, 2))
            .average()
            .orElse(0);

        return new EnsembleResult(
            mean,
            Math.sqrt(variance),  // standard deviation
            scores,
            variance > 0.5 ? "High disagreement - needs human review" : "Consistent"
        );
    }

    private double evaluateWithJudge(ChatClient judge, String query, String response) {
        // Use the same evaluation prompt for every judge
        String prompt = createEvaluationPrompt(query, response);
        return Double.parseDouble(judge.prompt().user(prompt).call().content().trim());
    }
}

4. A/B Testing Infrastructure

4.1 Experiment Framework

@Component
public class PromptExperimentService {

    private final ExperimentRepository experimentRepo;
    private final MetricsCollector metricsCollector;
    private final Map<String, ChatClient> variants;

    public ExperimentResult runExperiment(
            String experimentId,
            String userId,
            String query) {

        Experiment experiment = experimentRepo.findById(experimentId)
            .orElseThrow(() -> new ExperimentNotFoundException(experimentId));

        // Deterministic assignment based on the user ID
        String variantId = assignVariant(userId, experiment);
        ChatClient client = variants.get(variantId);

        // Execute and measure
        long startTime = System.currentTimeMillis();
        String response = client.prompt().user(query).call().content();
        long latency = System.currentTimeMillis() - startTime;

        // Record metrics
        metricsCollector.record(ExperimentMetric.builder()
            .experimentId(experimentId)
            .variantId(variantId)
            .userId(userId)
            .query(query)
            .response(response)
            .latencyMs(latency)
            .timestamp(Instant.now())
            .build());

        return new ExperimentResult(variantId, response, latency);
    }

    private String assignVariant(String userId, Experiment experiment) {
        // Hash the user ID into one of 100 buckets so the same user always gets the same variant
        int hash = Math.abs(userId.hashCode() % 100);
        int cumulative = 0;

        for (Variant variant : experiment.getVariants()) {
            cumulative += variant.getTrafficPercentage();
            if (hash < cumulative) {
                return variant.getId();
            }
        }

        return experiment.getVariants().get(0).getId();  // fallback if the percentages sum to less than 100
    }
}
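The same bucketing idea can be sketched outside the JVM. This version is an illustration, not a port: it uses SHA-256 rather than a language-level hash (Python's built-in `hash()` is randomized per process, so it would not give stable assignments), and it mixes the experiment ID into the hash so that a user's bucket in one experiment does not determine their bucket in the next. Both choices, and the `assign_variant` name, are assumptions.

```python
import hashlib
from typing import List, Tuple


def assign_variant(user_id: str, experiment_id: str,
                   variants: List[Tuple[str, int]]) -> str:
    """Map a user to a variant deterministically.

    variants is a list of (variant_id, traffic_percentage) pairs that sum to 100.
    """
    if sum(pct for _, pct in variants) != 100:
        raise ValueError("traffic percentages must sum to 100")
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    cumulative = 0
    for variant_id, pct in variants:
        cumulative += pct
        if bucket < cumulative:
            return variant_id
    return variants[-1][0]  # not reached when percentages sum to 100


split = [("control", 50), ("treatment", 50)]
print(assign_variant("user-42", "chat-prompt-v2-test", split))
```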

4.2 Experiment Configuration

# experiments/chat-prompt-v2.yaml
experiment:
  id: "chat-prompt-v2-test"
  name: "Test new system prompt"
  description: "Compare the concise and detailed system prompts"
  start_date: "2025-01-21"
  end_date: "2025-02-21"

  variants:
    - id: "control"
      name: "Current production version"
      traffic_percentage: 50
      prompt_version: "chat-v1.0"

    - id: "treatment"
      name: "New concise prompt"
      traffic_percentage: 50
      prompt_version: "chat-v2.0"

  metrics:
    primary:
      - name: "user_satisfaction"
        type: "thumbs_up_rate"
        minimum_improvement: 0.05  # require a 5% improvement

    secondary:
      - name: "response_latency_p95"
        type: "latency_percentile"
        threshold_ms: 3000

      - name: "token_usage"
        type: "average_tokens"

      - name: "task_completion_rate"
        type: "conversion"

  guardrails:
    min_sample_size: 1000   # minimum samples per variant
    max_degradation: 0.10   # at most 10% degradation allowed
    confidence_level: 0.95  # statistical confidence level

4.3 Statistical Analysis

@Service
public class ExperimentAnalyzer {

    // Repository of recorded experiment metrics (type name assumed)
    private final MetricsRepository metricsRepo;

    public AnalysisResult analyze(String experimentId) {
        List<ExperimentMetric> controlMetrics = metricsRepo
            .findByExperimentAndVariant(experimentId, "control");
        List<ExperimentMetric> treatmentMetrics = metricsRepo
            .findByExperimentAndVariant(experimentId, "treatment");

        // Sample-size check
        if (controlMetrics.size() < 1000 || treatmentMetrics.size() < 1000) {
            return AnalysisResult.insufficientData();
        }

        // Compute the metric for each variant
        double controlSatisfaction = calculateSatisfactionRate(controlMetrics);
        double treatmentSatisfaction = calculateSatisfactionRate(treatmentMetrics);

        // Statistical significance (two-proportion z-test)
        double zScore = calculateZScore(
            controlSatisfaction, controlMetrics.size(),
            treatmentSatisfaction, treatmentMetrics.size()
        );
        double pValue = calculatePValue(zScore);

        // Effect size
        double relativeImprovement =
            (treatmentSatisfaction - controlSatisfaction) / controlSatisfaction;

        return AnalysisResult.builder()
            .controlMetric(controlSatisfaction)
            .treatmentMetric(treatmentSatisfaction)
            .absoluteDifference(treatmentSatisfaction - controlSatisfaction)
            .relativeImprovement(relativeImprovement)
            .pValue(pValue)
            .isSignificant(pValue < 0.05)  // significant at the 5% level
            .recommendation(generateRecommendation(pValue, relativeImprovement))
            .build();
    }

    private String generateRecommendation(double pValue, double improvement) {
        if (pValue >= 0.05) {
            return "Keep running - not yet statistically significant";
        }
        if (improvement > 0.05) {
            return "Ship - significant positive improvement";
        }
        if (improvement < -0.05) {
            return "Roll back - significant negative impact";
        }
        return "No change - difference too small to matter";
    }
}
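The analyzer relies on `calculateZScore` and `calculatePValue`, which are not shown. The sketch below gives the standard pooled two-proportion z-test and a two-sided p-value from the normal distribution; the Python function names are illustrative, and the satisfaction rates in the example are made up.

```python
import math


def two_proportion_z(p_control: float, n_control: int,
                     p_treatment: float, n_treatment: int) -> float:
    """Pooled two-proportion z statistic; positive when treatment is higher."""
    pooled = (p_control * n_control + p_treatment * n_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        return 0.0  # both rates are 0 or both are 1: no measurable difference
    return (p_treatment - p_control) / std_err


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up example: 80% vs 84% thumbs-up rate with 1000 users per variant
z = two_proportion_z(0.80, 1000, 0.84, 1000)
p = two_sided_p_value(z)
print(f"z={z:.3f} p={p:.4f} significant={p < 0.05}")
```

In this example the 4-point absolute lift is significant at the 5% level; with only 200 users per variant the same lift would not be, which is the reason for the minimum sample size guardrail.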

5. Prompt Version Control

5.1 File-Based Versioning

prompts/
├── system/
│   ├── customer-support/
│   │   ├── v1.0.yaml
│   │   ├── v1.1.yaml
│   │   └── v2.0.yaml
│   └── code-assistant/
│       └── v1.0.yaml
├── tasks/
│   ├── summarization/
│   │   └── v1.0.yaml
│   └── classification/
│       └── v1.0.yaml
└── experiments/
    ├── exp-001-concise-prompt/
    │   ├── control.yaml
    │   └── treatment.yaml
    └── exp-002-few-shot/
        ├── zero-shot.yaml
        └── three-shot.yaml

5.2 Prompt Template Schema

# prompts/system/customer-support/v2.0.yaml
metadata:
  id: "customer-support-v2.0"
  version: "2.0.0"
  created: "2025-01-21"
  author: "ai-team"
  status: "production"  # draft, staging, production, deprecated
  parent_version: "1.1.0"

  change_log: |
    - Added handling for product returns
    - Improved tone toward frustrated customers
    - Reduced response length by 20%

  evaluation:
    dataset: "customer-support-eval-v3"
    metrics:
      accuracy: 0.94
      user_satisfaction: 0.88
      avg_latency_ms: 1200
    evaluated_at: "2025-01-20"

config:
  model: "gpt-4o"
  temperature: 0.7
  max_tokens: 500
  top_p: 0.95

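prompt:
  system: |
    You are a customer support agent for TechCorp.

    ## Guidelines
    - Be helpful, concise, and empathetic
    - If the customer is frustrated, acknowledge their feelings first
    - If you cannot resolve the issue, always offer to escalate
    - Never promise a refund before checking the policy

    ## Capabilities
    - Look up order status
    - Process returns (within 30 days)
    - Answer product questions
    - Schedule callbacks

    ## Limitations
    - Cannot access payment details
    - Cannot modify existing orders
    - Must escalate billing disputes

  user: |
    Customer message: {customer_message}

    Order history: {order_history}

    Previous conversation: {conversation_history}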
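A template like the `user` section above is only safe to render when every placeholder has a value. The sketch below shows one way to check that before substitution, using Python's own `{name}` format syntax as a stand-in for whatever template engine the application uses; `placeholders` and `render` are hypothetical helpers. Templates that contain literal braces (for example embedded JSON) would need escaping first.

```python
import string
from typing import Set


def placeholders(template: str) -> Set[str]:
    """Names of all {placeholder} fields in the template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def render(template: str, **values: str) -> str:
    """Fill the template, failing loudly if any placeholder has no value."""
    missing = placeholders(template) - set(values)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return template.format(**values)


user_template = (
    "Customer message: {customer_message}\n\n"
    "Order history: {order_history}\n\n"
    "Previous conversation: {conversation_history}\n"
)

print(sorted(placeholders(user_template)))
print(render(
    user_template,
    customer_message="Where is my order?",
    order_history="Order 1001, shipped",
    conversation_history="(none)",
))
```

Running the same placeholder check in CI catches a renamed or misspelled variable before the prompt version is promoted.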

5.3 Prompt Registry Service

@Service
public class PromptRegistry {

    private final PromptRepository promptRepo;
    private final CacheManager cacheManager;

    @Cacheable(value = "prompts", key = "#promptId + ':' + #version")
    public PromptTemplate getPrompt(String promptId, String version) {
        return promptRepo.findByIdAndVersion(promptId, version)
            .map(this::toPromptTemplate)
            .orElseThrow(() -> new PromptNotFoundException(promptId, version));
    }

    public PromptTemplate getLatestPrompt(String promptId) {
        return promptRepo.findLatestByStatus(promptId, "production")
            .map(this::toPromptTemplate)
            .orElseThrow(() -> new PromptNotFoundException(promptId));
    }

    @Transactional
    public PromptVersion createVersion(String promptId, PromptVersionRequest request) {
        // Validate the prompt syntax
        validatePromptSyntax(request.getPromptContent());

        // Create the new version
        PromptVersion newVersion = PromptVersion.builder()
            .promptId(promptId)
            .version(incrementVersion(promptId))
            .content(request.getPromptContent())
            .config(request.getConfig())
            .status("draft")
            .createdBy(getCurrentUser())
            .build();

        promptRepo.save(newVersion);

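        // Cache keys have the form "promptId:version", so evicting by promptId
        // alone would match nothing. Clear the cache so later lookups are fresh.
        cacheManager.getCache("prompts").clear();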

        return newVersion;
    }

    @Transactional
    public void promoteToProduction(String promptId, String version) {
        // Demote the current production version
        promptRepo.findByIdAndStatus(promptId, "production")
            .ifPresent(current -> {
                current.setStatus("deprecated");
                promptRepo.save(current);
            });

        // Promote the new version to production
        PromptVersion newProd = promptRepo.findByIdAndVersion(promptId, version)
            .orElseThrow();
        newProd.setStatus("production");
        newProd.setPromotedAt(Instant.now());
        promptRepo.save(newProd);

        // Clear the whole "prompts" cache (this drops entries for every prompt, not only this one)
        cacheManager.getCache("prompts").clear();
    }
}

6. CI/CD Integration

6.1 GitHub Actions Workflow

# .github/workflows/prompt-evaluation.yml
name: Prompt Evaluation Pipeline

on:
  pull_request:
    paths:
      - 'prompts/**'

env:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

jobs:
  syntax-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate YAML syntax
        run: |
          pip install yamllint
          yamllint prompts/

      - name: Validate prompt schema
        run: |
          python scripts/validate_prompts.py prompts/

  evaluate:
    runs-on: ubuntu-latest
    needs: syntax-check
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main...HEAD can be diffed

      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'
          cache: maven

      - name: Identify changed prompts
        id: changes
        run: |
          # Join the paths with commas so the value fits on a single GITHUB_OUTPUT line
          CHANGED=$(git diff --name-only origin/main...HEAD | grep "^prompts/" | head -20 | paste -sd, -)
          echo "changed_prompts=$CHANGED" >> "$GITHUB_OUTPUT"

      - name: Run evaluation
        run: |
          ./mvnw test -Dtest=PromptEvaluationTest \
            -Dprompts.changed="${{ steps.changes.outputs.changed_prompts }}"

      - name: Check quality gates
        run: |
          python scripts/check_quality_gates.py \
            --results target/eval-results.json \
            --min-accuracy 0.90 \
            --min-relevance 0.85

      - name: Upload evaluation report
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: target/eval-results.json

      - name: Comment results on the PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('target/eval-results.json'));

            const comment = `## Prompt Evaluation Results

            | Metric | Value | Threshold | Status |
            |--------|-------|-----------|--------|
            | Accuracy | ${results.accuracy.toFixed(3)} | 0.90 | ${results.accuracy >= 0.90 ? '✅' : '❌'} |
            | Relevance | ${results.relevance.toFixed(3)} | 0.85 | ${results.relevance >= 0.85 ? '✅' : '❌'} |
            | Average latency | ${results.latency_ms}ms | 2000ms | ${results.latency_ms <= 2000 ? '✅' : '❌'} |

            ${results.passed ? '**✅ All quality gates passed**' : '**❌ Quality gates failed**'}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

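  regression-test:
    runs-on: ubuntu-latest
    needs: evaluate
    steps:
      - uses: actions/checkout@v4

      # Each job starts on a fresh runner, so fetch the report uploaded by the evaluate job
      - name: Download evaluation report
        uses: actions/download-artifact@v4
        with:
          name: evaluation-report
          path: target

      - name: Compare against baseline
        run: |
          python scripts/regression_check.py \
            --current target/eval-results.json \
            --baseline baselines/production.json \
            --max-degradation 0.05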
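The workflow calls `scripts/check_quality_gates.py` and `scripts/regression_check.py`, which are not shown. The sketch below covers only the core logic such scripts might contain, assuming the results file has top-level `accuracy` and `relevance` fields as the PR comment step suggests; the function names and the baseline value in the example are hypothetical.

```python
from typing import Dict, List


def check_gates(results: Dict[str, float],
                min_accuracy: float = 0.90,
                min_relevance: float = 0.85) -> List[str]:
    """Return human-readable failures; an empty list means every gate passed."""
    minimums = {"accuracy": min_accuracy, "relevance": min_relevance}
    failures = []
    for metric, minimum in minimums.items():
        value = results[metric]
        if value < minimum:
            failures.append(f"{metric} {value:.3f} is below {minimum:.3f}")
    return failures


def relative_degradation(current: float, baseline: float) -> float:
    """Fractional drop from the baseline; negative when current is better."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


results = {"accuracy": 0.93, "relevance": 0.80}
failures = check_gates(results)
regressed = relative_degradation(results["accuracy"], baseline=0.94) > 0.05
print(failures, regressed)
```

In CI, a script built around these helpers would exit with a non-zero status when `failures` is non-empty or `regressed` is true, which is what makes the step block the merge.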

6.2 Quality Gate Implementation

@Service
public class QualityGateService {

    private final EvaluationService evaluationService;
    private final PromptRegistry promptRegistry;

    public QualityGateResult evaluate(String promptId, String version) {
        PromptTemplate prompt = promptRegistry.getPrompt(promptId, version);
        EvaluationResult evalResult = evaluationService.runFullEvaluation(prompt);

        List<GateCheck> checks = new ArrayList<>();

        // Accuracy gate
        checks.add(new GateCheck(
            "accuracy",
            evalResult.getAccuracy(),
            0.90,
            evalResult.getAccuracy() >= 0.90
        ));

        // Relevance gate (RAG prompts only)
        if (prompt.isRagEnabled()) {
            checks.add(new GateCheck(
                "relevance",
                evalResult.getRelevance(),
                0.85,
                evalResult.getRelevance() >= 0.85
            ));
        }

        // Latency gate
        checks.add(new GateCheck(
            "latency_p95_ms",
            evalResult.getLatencyP95(),
            2000.0,
            evalResult.getLatencyP95() <= 2000
        ));

        // Token efficiency gate
        checks.add(new GateCheck(
            "avg_tokens",
            evalResult.getAvgTokens(),
            1500.0,
            evalResult.getAvgTokens() <= 1500
        ));

        // Regression check against the production baseline
        if (promptRegistry.hasProductionVersion(promptId)) {
            EvaluationResult baseline = getProductionBaseline(promptId);
            double degradation = (baseline.getAccuracy() - evalResult.getAccuracy())
                / baseline.getAccuracy();

            checks.add(new GateCheck(
                "regression",
                degradation,
                0.05,  // at most 5% relative degradation allowed
                degradation <= 0.05
            ));
        }

        boolean allPassed = checks.stream().allMatch(GateCheck::passed);

        return new QualityGateResult(
            promptId,
            version,
            allPassed,
            checks,
            allPassed ? "Ready to deploy" : "Quality gates failed"
        );
    }
}

7. Production Monitoring

7.1 Metrics Collection

@Component
public class PromptMetricsCollector {

    private final MeterRegistry meterRegistry;

    public void recordRequest(PromptExecution execution) {
        // Request count; this is the denominator of the error-rate and cost-per-request queries
        meterRegistry.counter("prompt.requests",
            "prompt_id", execution.getPromptId())
            .increment();

        // Latency. publishPercentileHistogram() exports the *_bucket series
        // that histogram_quantile() queries rely on.
        Timer.builder("prompt.latency")
            .tags("prompt_id", execution.getPromptId(),
                  "version", execution.getVersion())
            .publishPercentileHistogram()
            .register(meterRegistry)
            .record(Duration.ofMillis(execution.getLatencyMs()));

        // Token usage
        meterRegistry.counter("prompt.tokens.input",
            "prompt_id", execution.getPromptId())
            .increment(execution.getInputTokens());

        meterRegistry.counter("prompt.tokens.output",
            "prompt_id", execution.getPromptId())
            .increment(execution.getOutputTokens());

        // Cost estimate
        double cost = calculateCost(
            execution.getModel(),
            execution.getInputTokens(),
            execution.getOutputTokens()
        );
        meterRegistry.counter("prompt.cost.usd",
            "prompt_id", execution.getPromptId(),
            "model", execution.getModel())
            .increment(cost);

        // Error tracking
        if (execution.isError()) {
            meterRegistry.counter("prompt.errors",
                "prompt_id", execution.getPromptId(),
                "error_type", execution.getErrorType())
                .increment();
        }
    }

    public void recordFeedback(String promptId, boolean positive) {
        meterRegistry.counter("prompt.feedback",
            "prompt_id", promptId,
            "sentiment", positive ? "positive" : "negative")
            .increment();
    }
}

7.2 Dashboard Queries (Grafana)

# Grafana dashboard configuration
panels:
  - title: "Prompt latency (P95)"
    query: |
      histogram_quantile(0.95,
        sum(rate(prompt_latency_seconds_bucket[5m])) by (le, prompt_id)
      )
    alert:
      threshold: 3
      condition: "> 3s for 5 minutes"

  - title: "Error rate by prompt"
    query: |
      sum(rate(prompt_errors_total[5m])) by (prompt_id, error_type)
      / on(prompt_id) group_left
      sum(rate(prompt_requests_total[5m])) by (prompt_id)
    alert:
      threshold: 0.05
      condition: "> 5% error rate"

  - title: "User satisfaction"
    query: |
      sum(prompt_feedback_total{sentiment="positive"}) by (prompt_id)
      /
      sum(prompt_feedback_total) by (prompt_id)
    alert:
      threshold: 0.80
      condition: "< 80% satisfaction"

  - title: "Cost per 1K requests"
    query: |
      (sum(rate(prompt_cost_usd_total[1h])) by (prompt_id) * 1000)
      /
      (sum(rate(prompt_requests_total[1h])) by (prompt_id))

7.3 Alert Configuration

# alerts/prompt-alerts.yaml
alerts:
  - name: high_error_rate
    description: "Prompt error rate exceeds the threshold"
    query: |
      sum(rate(prompt_errors_total[5m])) by (prompt_id)
      / sum(rate(prompt_requests_total[5m])) by (prompt_id)
      > 0.05
    severity: critical
    channels: ["pagerduty", "slack-ai-alerts"]

  - name: latency_degradation
    description: "P95 latency has increased significantly"
    query: |
      histogram_quantile(0.95, rate(prompt_latency_seconds_bucket[10m]))
      > 3
    severity: warning
    channels: ["slack-ai-alerts"]

  - name: satisfaction_drop
    description: "User satisfaction has dropped below 80%"
    query: |
      sum(prompt_feedback_total{sentiment="positive"}) by (prompt_id)
      / sum(prompt_feedback_total) by (prompt_id)
      < 0.80
    severity: warning
    channels: ["slack-ai-alerts"]

  - name: cost_spike
    description: "Abnormal cost growth detected"
    query: |
      rate(prompt_cost_usd_total[1h])
      > 2 * avg_over_time(rate(prompt_cost_usd_total[1h])[24h:1h]) # twice the average of the past 24 hours
    severity: warning
    channels: ["slack-ai-alerts"]

8. Continuous Improvement Workflow

┌───────────────────────────────────────────────────────────────────────────────┐
│                          Continuous Improvement Loop                          │
├───────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐  │
│  │ Collect     │ ──→ │ Analyze     │ ──→ │ Iterate     │ ──→ │ Validate    │  │
│  │             │     │             │     │             │     │             │  │
│  │ • Traces    │     │ • Find      │     │ • Create    │     │ • Run       │  │
│  │ • Errors    │     │   failure   │     │   variants  │     │   evals     │  │
│  │ • Feedback  │     │   patterns  │     │ • A/B       │     │ • Quality   │  │
│  │ • Metrics   │     │ • Cluster   │     │   tests     │     │   gates     │  │
│  └─────────────┘     │   issues    │     └──────┬──────┘     └──────┬──────┘  │
│         ▲            └─────────────┘            │                   │         │
│         │                                       │                   │         │
│         │            ┌─────────────┐     ┌──────┴──────┐            │         │
│         │            │ Monitor     │ ←── │ Deploy      │ ←──────────┘         │
│         │            │             │     │             │                      │
│         │            │ • Alerts    │     │ • Promote   │                      │
│         │            │ • Dashboards│     │   winner    │                      │
│         │            │ • Anomaly   │     │ • Update    │                      │
│         └────────────┤   detection │     │   registry  │                      │
│                      └─────────────┘     └─────────────┘                      │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘

8.1 Production Trace Collection

@Service
public class TraceCollector {

    private final TraceRepository traceRepo;
    private final EvaluationDatasetBuilder datasetBuilder;

    @Async
    public void collectTrace(PromptTrace trace) {
        // Store the trace
        traceRepo.save(trace);

        // Automatically flag cases that need review
        if (shouldFlagForReview(trace)) {
            flagForHumanReview(trace);
        }

        // Turn negative feedback into test cases
        if (trace.getFeedback() != null && !trace.getFeedback().isPositive()) {
            datasetBuilder.addNegativeExample(
                trace.getPromptId(),
                trace.getQuery(),
                trace.getResponse(),
                trace.getFeedback().getComment()
            );
        }
    }

    private boolean shouldFlagForReview(PromptTrace trace) {
        return trace.getLatencyMs() > 5000 ||    // latency too high
               trace.isError() ||                 // an error occurred
               trace.getOutputTokens() > 2000 ||  // output too verbose
               containsSensitivePattern(trace.getResponse());  // matches a sensitive pattern
    }
}
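The `containsSensitivePattern` check is not shown. A rough sketch of what such a check might look like follows; the two regular expressions are illustrative assumptions only, will produce false positives (a long order number looks like a card number), and are no substitute for a dedicated PII detection tool.

```python
import re
from typing import List

# Illustrative patterns only; a real deployment would need a reviewed, broader set
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "card_like_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def find_sensitive(text: str) -> List[str]:
    """Return the sorted names of all patterns that match the text."""
    return sorted(name for name, pattern in SENSITIVE_PATTERNS.items()
                  if pattern.search(text))


print(find_sensitive("You can reach the customer at jane@example.com"))
print(find_sensitive("Card on file: 4111 1111 1111 1111"))
print(find_sensitive("Your order has shipped."))
```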

8.2 Automated Test Case Generation

@Service
public class TestCaseGenerator {

    private final TraceRepository traceRepo;
    private final ChatClient judgeClient;

    public List<TestCase> generateFromProduction(
            String promptId,
            int count,
            TestCaseStrategy strategy) {

        List<PromptTrace> traces = switch (strategy) {
            case FAILURES -> traceRepo.findFailedTraces(promptId, count);
            case EDGE_CASES -> traceRepo.findEdgeCases(promptId, count);
            case DIVERSE -> traceRepo.findDiverseTraces(promptId, count);
            case NEGATIVE_FEEDBACK -> traceRepo.findNegativeFeedback(promptId, count);
        };

        return traces.stream()
            .map(this::traceToTestCase)
            .filter(Objects::nonNull)
            .toList();
    }

    private TestCase traceToTestCase(PromptTrace trace) {
        // Use an LLM to derive the expected output from the human feedback
        if (trace.getFeedback() != null) {
            String expectedOutput = generateExpectedOutput(
                trace.getQuery(),
                trace.getResponse(),
                trace.getFeedback().getComment()
            );

            return new TestCase(
                trace.getQuery(),
                expectedOutput,
                TestCase.Source.PRODUCTION_FEEDBACK,
                trace.getId()
            );
        }

        return null;
    }
}

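9. Best Practices Summary

Evaluation Checklist

Metric Targets by Use Case

Use case	Primary metric	Target	Secondary metrics
Classification	Accuracy	>95%	F1 score, latency
RAG Q&A	Faithfulness	>90%	Relevance, latency
Summarization	ROUGE-L	>0.4	BERTScore, length
Code generation	Pass@1	>70%	Syntactic validity, latency
Customer support	Satisfaction	>85%	Resolution rate
Translation	BLEU	>0.3	BERTScore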

