Performance Benchmarking for AI Applications: From Metrics to Optimization
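Preface

After our product went live, users started complaining that the AI was slow to respond. But how slow is slow? Do we have numbers for it? And how do we make it faster? This post describes how we built a performance benchmarking system for our AI application.

1. Performance Metrics

1.1 Core Metrics

```python
class PerformanceMetrics:
    METRICS = {
        "latency": {
            "ttft": "Time to First Token: time until the first token arrives",
            "tps": "Tokens Per Second: generation speed",
            "total_latency": "Total response time",
        },
        "throughput": {
            "qps": "Queries Per Second",
            "concurrent_users": "Number of concurrent users",
        },
        "reliability": {
            "availability": "Availability (target above 99.9%)",
            "error_rate": "Error rate (target below 0.1%)",
        },
    }
```

1.2 Target Values

```python
TARGETS = {
    "ttft": 500,            # ms
    "tps": 50,              # tokens/s
    "total_latency": 2000,  # ms
    "qps": 100,
    "error_rate": 0.001,
}
```

2. Implementing the Benchmark

2.1 Test Framework

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    latency_avg: float
    latency_p50: float
    latency_p95: float
    latency_p99: float
    qps: float
    error_rate: float


class Benchmark:
    def __init__(self, client):
        self.client = client
        self.results = []

    async def run_latency_test(self, num_requests: int = 100) -> list:
        """Latency test: send requests one at a time and record each latency in ms."""
        latencies = []
        for _ in range(num_requests):
            # perf_counter is monotonic, so it suits interval timing better than time.time()
            start = time.perf_counter()
            await self.client.generate("test prompt")
            latencies.append((time.perf_counter() - start) * 1000)  # ms
        return latencies

    async def run_concurrent_test(self, concurrent: int = 10, total: int = 100) -> dict:
        """Concurrency test: keep at most `concurrent` requests in flight."""
        semaphore = asyncio.Semaphore(concurrent)

        async def worker():
            async with semaphore:
                # Timing starts after the semaphore is acquired,
                # so time spent waiting in the queue is excluded.
                start = time.perf_counter()
                await self.client.generate("test prompt")
                return time.perf_counter() - start

        start_time = time.perf_counter()
        results = await asyncio.gather(*[worker() for _ in range(total)])
        total_time = time.perf_counter() - start_time

        return {
            "total_requests": total,
            "total_time": total_time,
            "qps": total / total_time,
            "latencies": [r * 1000 for r in results],  # ms
        }
```

2.2 Statistical Analysis

```python
import math
import statistics


class BenchmarkAnalyzer:
    def analyze(self, latencies: list) -> dict:
        """Summarize a list of benchmark latencies."""
        if not latencies:
            raise ValueError("latencies must not be empty")
        sorted_latencies = sorted(latencies)
        return {
            "count": len(latencies),
            "mean": statistics.mean(latencies),
            "median": statistics.median(latencies),
            "p50": self._percentile(sorted_latencies, 50),
            "p95": self._percentile(sorted_latencies, 95),
            "p99": self._percentile(sorted_latencies, 99),
            "min": min(latencies),
            "max": max(latencies),
        }

    def _percentile(self, sorted_data: list, p: float) -> float:
        """Nearest-rank percentile of data that is already sorted."""
        rank = math.ceil(len(sorted_data) * p / 100)  # 1-based rank
        return sorted_data[min(max(rank, 1), len(sorted_data)) - 1]
```

3. Performance Test Scenarios

3.1 Common Scenarios

```python
class TestScenarios:
    SCENARIOS = {
        "simple_chat": {
            "prompt_length": 50,
            "max_tokens": 100,
            "description": "Simple Q&A",
        },
        "complex_reasoning": {
            "prompt_length": 500,
            "max_tokens": 500,
            "description": "Complex reasoning",
        },
        "long_context": {
            "prompt_length": 3000,
            "max_tokens": 300,
            "description": "Long context",
        },
    }
```

3.2 Scenario Tests

```python
class ScenarioBenchmark:
    def __init__(self, client):
        self.client = client

    async def benchmark_scenario(self, scenario: str) -> dict:
        """Benchmark the named scenario."""
        config = TestScenarios.SCENARIOS[scenario]

        # Run the test. The _run_test helper is not shown in this post; it is
        # expected to return a list of latencies in ms for the given sizes.
        latencies = await self._run_test(
            config["prompt_length"],
            config["max_tokens"],
        )

        return {
            "scenario": scenario,
            "config": config,
            "results": BenchmarkAnalyzer().analyze(latencies),
        }
```

4. Comparison Tests

4.1 Model Comparison

```python
class ModelComparison:
    def compare_models(self, models: list, scenario: str) -> dict:
        """Compare several models on the same scenario.

        The helpers _create_client, _benchmark, and _determine_winner
        are not shown in this post.
        """
        results = {}

        for model in models:
            client = self._create_client(model)
            # asyncio.run starts a new event loop, so call this method
            # from synchronous code, not from inside a running loop.
            result = asyncio.run(self._benchmark(client, scenario))
            results[model] = result

        return {
            "scenario": scenario,
            "models": results,
            "winner": self._determine_winner(results),
        }
```

4.2 Verifying an Optimization

```python
class OptimizationVerifier:
    # Metrics where a smaller value is better.
    LOWER_IS_BETTER = {"latency_avg"}

    def verify(self, before_result: dict, after_result: dict) -> dict:
        """Verify an optimization; a positive percentage means the metric improved."""
        improvement = {}
        for metric in ["latency_avg", "qps"]:
            change = (after_result[metric] - before_result[metric]) / before_result[metric] * 100
            # Flip the sign for latency so that "lower" counts as an improvement.
            improvement[metric] = -change if metric in self.LOWER_IS_BETTER else change

        return {
            "before": before_result,
            "after": after_result,
            "improvement": improvement,
            # A change of more than 10% in any metric counts as significant.
            "significant": any(abs(v) > 10 for v in improvement.values()),
        }
```

5. Performance Optimization

5.1 Optimization Strategies

```python
class PerformanceOptimizer:
    STRATEGIES = {
        "caching": "Cache the results of repeated requests",
        "batching": "Process multiple requests as a batch",
        "quantization": "Quantize the model to reduce computation",
        "hardware": "Upgrade the hardware",
    }

    def suggest_optimizations(self, current_metrics: dict) -> list:
        """Suggest optimizations based on the current metrics."""
        suggestions = []

        # Average latency above 1000 ms: suggest caching.
        if current_metrics["latency_avg"] > 1000:
            suggestions.append({
                "strategy": "caching",
                "expected_improvement": "30-50%",
            })

        # Throughput below 50 QPS: suggest batching.
        if current_metrics["qps"] < 50:
            suggestions.append({
                "strategy": "batching",
                "expected_improvement": "2-3x",
            })

        return suggestions
```

5.2 Continuous Monitoring

```python
class PerformanceMonitor:
    # Metrics where a larger value is better; for these, a drop is the regression.
    HIGHER_IS_BETTER = {"qps", "tps"}

    def __init__(self):
        self.baseline = {}

    def set_baseline(self, metrics: dict):
        """Set the performance baseline."""
        self.baseline = metrics

    def check_health(self, current: dict) -> dict:
        """Compare current metrics with the baseline, allowing a 20% tolerance."""
        alerts = []

        for metric, value in current.items():
            if metric not in self.baseline:
                continue
            if metric in self.HIGHER_IS_BETTER:
                threshold = self.baseline[metric] * 0.8
                if value < threshold:
                    alerts.append(f"{metric} fell below the baseline threshold {threshold}")
            else:
                threshold = self.baseline[metric] * 1.2
                if value > threshold:
                    alerts.append(f"{metric} exceeded the baseline threshold {threshold}")

        return {
            "healthy": len(alerts) == 0,
            "alerts": alerts,
        }
```

6. Best Practices

6.1 Testing Principles

✅ Representative data: use data from real usage scenarios
✅ Stable environment: control the variables in the test environment
✅ Enough samples: collect enough requests for the statistics to be meaningful
✅ Continuous testing: make benchmarks part of the CI/CD pipeline

6.2 Optimization Principles

✅ Measure first: measure before you optimize
✅ Optimize incrementally: change one factor at a time
✅ Verify the effect: rerun the benchmark after each optimization
✅ Regression test: make sure the change introduces no new problems

7. Summary

Performance benchmarking is the foundation of optimization. The key points are:

1. Clear metrics: build a sound set of metrics
2. Disciplined testing: use standard test methods
3. Continuous monitoring: track system performance in real time
4. Data-driven decisions: let the data guide optimization work

Remember: if you can't measure it, you can't optimize it.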
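
The latency tests in section 2.1 time a whole request, so they cannot tell TTFT and TPS apart, even though both are core metrics in section 1.1. Below is a minimal sketch of measuring the two from a streamed response. It assumes the client can expose its output as an async iterator of tokens; `fake_stream` and `measure_stream` are hypothetical names used for illustration and are not part of the original framework or of any real SDK.

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_stream(n_tokens: int = 5, first_delay: float = 0.05,
                      token_delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM client."""
    await asyncio.sleep(first_delay)  # simulated wait before the first token
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(token_delay)  # simulated per-token decode time
        yield f"tok{i}"


async def measure_stream(stream: AsyncIterator[str]) -> dict:
    """Return TTFT (ms), TPS, total latency (ms), and token count for one stream."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    async for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1
    end = time.perf_counter()

    if first_token_at is None:
        raise ValueError("stream produced no tokens")

    decode_time = end - first_token_at
    # TPS here counts the tokens generated after the first one,
    # divided by the time elapsed since the first token arrived.
    tps = (tokens - 1) / decode_time if tokens > 1 and decode_time > 0 else 0.0
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "tps": tps,
        "total_latency_ms": (end - start) * 1000,
        "tokens": tokens,
    }


metrics = asyncio.run(measure_stream(fake_stream()))
```

Separating the two matters because they respond to different optimizations: TTFT is dominated by queueing and prompt processing, while TPS reflects decode speed. The resulting `ttft_ms` and `tps` values can be compared directly with the `TARGETS` from section 1.2.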