Real-Time Phone Detection (General) Hands-On Guide: Monitoring Service Resource Usage and Circuit Breaking on Anomalies with psutil

1. Why monitor the phone detection service

Picture this: you have deployed a high-performance phone detection service built on Alibaba's DAMO-YOLO model, with a stated detection accuracy of 88.8% and an inference time of just 3.83 ms. The service runs smoothly and handles thousands of image detection requests every day.

Then one day users start complaining that the service is slow or completely unreachable. You log in to the server and find memory exhausted, CPU pinned at 100%, and the service process hung. Worse, you have no idea when the problem started or what caused it.

That is why resource monitoring matters. A good monitoring system lets you:

- Catch problems early: spot anomalies before the service goes down completely
- Locate the cause quickly: tell a memory leak from CPU overload or something else
- Recover automatically: restart or trip a circuit breaker when an anomaly is detected
- Right-size resources: adjust the server specification based on actual usage

This guide walks through a practical setup that uses Python's psutil library to monitor the phone detection service and implements a circuit breaker for anomalies.

2. Preparation: know your phone detection service

Before monitoring anything, here is a quick recap of the service itself.

2.1 Basic service information

According to the image information, the service has these key characteristics:

- Model name: DAMO-YOLO phone detection model
- Model ID: damo/cv_tinynas_object-detection_damoyolo_phone
- Service port: 7860 (Gradio web interface)
- Startup: via app.py or start.sh
- Process signature: a Python script whose command line contains python3 app.py

2.2 Key metrics to monitor

For an AI inference service like this one, the following metrics deserve the most attention:

- CPU usage: model inference is compute-intensive, so CPU usage directly reflects service load
- Memory usage: a deep learning model stays resident in memory once loaded, so watch for leaks
- Process state: whether the service is running and responding
- Port listening: whether port 7860 is being listened on
- Inference latency: the model's nominal figure is 3.83 ms, but real-world latency fluctuates

3. Basic monitoring with psutil

psutil is a powerful cross-platform system monitoring library for Python. It exposes CPU, memory, disk, network, and process information.

3.1 Installation and basic usage

First make sure psutil is installed:

    pip install psutil

Then create a simple monitoring script:

    # monitor_basic.py
    import time

    import psutil


    def get_service_info():
        """Return basic information about the phone detection service process."""
        # Find the service process by its command line
        service_process = None
        for proc in psutil.process_iter(["pid", "name", "cmdline"]):
            try:
                cmdline = proc.info["cmdline"]
                if cmdline and "app.py" in " ".join(cmdline):
                    service_process = psutil.Process(proc.info["pid"])
                    break
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue

        if not service_process:
            return None

        # Collect process details; oneshot() caches the underlying system calls
        with service_process.oneshot():
            cpu_percent = service_process.cpu_percent(interval=0.1)
            memory_info = service_process.memory_info()
            memory_percent = service_process.memory_percent()
            create_time = service_process.create_time()
            status = service_process.status()

        return {
            "pid": service_process.pid,
            "cpu_percent": cpu_percent,
            "memory_mb": memory_info.rss / 1024 / 1024,  # bytes to MB
            "memory_percent": memory_percent,
            "status": status,
            "running_time": time.time() - create_time,
            "cmdline": service_process.cmdline(),
        }


    def get_system_info():
        """Return system-wide resource information."""
        # CPU
        cpu_percent = psutil.cpu_percent(interval=0.1)
        cpu_count = psutil.cpu_count()
        # Memory
        memory = psutil.virtual_memory()
        # Disk
        disk = psutil.disk_usage("/")

        return {
            "cpu_percent": cpu_percent,
            "cpu_count": cpu_count,
            "memory_total_gb": memory.total / 1024 / 1024 / 1024,
            "memory_used_gb": memory.used / 1024 / 1024 / 1024,
            "memory_percent": memory.percent,
            "disk_total_gb": disk.total / 1024 / 1024 / 1024,
            "disk_used_gb": disk.used / 1024 / 1024 / 1024,
            "disk_percent": disk.percent,
        }


    def check_port_listening(port=7860):
        """Check whether the service port is in the LISTEN state."""
        # Listing system-wide connections may need elevated privileges on some platforms
        for conn in psutil.net_connections(kind="inet"):
            if conn.laddr and conn.laddr.port == port and conn.status == psutil.CONN_LISTEN:
                return True
        return False


    if __name__ == "__main__":
        print("=== Phone detection service monitor ===")

        # Check the service process
        service_info = get_service_info()
        if service_info:
            print(f"✅ Service running (PID: {service_info['pid']})")
            print(f"   CPU usage: {service_info['cpu_percent']:.1f}%")
            print(f"   Memory usage: {service_info['memory_mb']:.1f} MB")
            print(f"   Process status: {service_info['status']}")
            print(f"   Uptime: {service_info['running_time']:.0f} s")
        else:
            print("❌ Service not running")

        # Check the port
        if check_port_listening():
            print("✅ Port 7860 is listening")
        else:
            print("❌ Port 7860 is not listening")

        # System resources
        system_info = get_system_info()
        print("\n=== System resources ===")
        print(f"CPU usage: {system_info['cpu_percent']:.1f}% ({system_info['cpu_count']} cores)")
        print(f"Memory: {system_info['memory_used_gb']:.1f} / {system_info['memory_total_gb']:.1f} GB ({system_info['memory_percent']:.1f}%)")
        print(f"Disk: {system_info['disk_used_gb']:.1f} / {system_info['disk_total_gb']:.1f} GB ({system_info['disk_percent']:.1f}%)")

This basic script gives you a quick snapshot of the service. Running it prints output similar to:

    === Phone detection service monitor ===
    ✅ Service running (PID: 12345)
       CPU usage: 15.3%
       Memory usage: 512.4 MB
       Process status: running
       Uptime: 3600 s
    ✅ Port 7860 is listening

    === System resources ===
    CPU usage: 25.7% (4 cores)
    Memory: 3.2 / 8.0 GB (40.0%)
    Disk: 50.0 / 100.0 GB (50.0%)

4. Real-time monitoring and alerting

Basic monitoring only shows the current state. What we actually need is continuous monitoring with automatic alerts, so the rest of this section builds a more complete monitor.

4.1 Monitoring configuration

Start by defining the thresholds and settings:

    # config.py
    class MonitorConfig:
        """Monitoring configuration."""

        # Monitoring interval (seconds)
        MONITOR_INTERVAL = 5

        # Alert thresholds
        ALERT_THRESHOLDS = {
            "cpu_percent": 80.0,     # alert when CPU usage exceeds 80%
            "memory_mb": 1024,       # alert when memory exceeds 1 GB
            "memory_percent": 70.0,  # alert when memory usage exceeds 70%
            "response_time": 10.0,   # alert when response time exceeds 10 s
        }

        # Circuit breaker threshold (consecutive abnormal checks)
        CIRCUIT_BREAKER_THRESHOLD = 3

        # Service check timeout (seconds)
        SERVICE_CHECK_TIMEOUT = 30

        # Logging
        LOG_FILE = "/var/log/phone_detector_monitor.log"
        ALERT_LOG_FILE = "/var/log/phone_detector_alerts.log"

        # Notifications (optional)
        ENABLE_EMAIL_ALERT = False
        EMAIL_CONFIG = {
            "smtp_server": "smtp.example.com",
            "smtp_port": 587,
            "sender": "monitor@example.com",
            "receivers": ["admin@example.com"],
        }

4.2 The complete monitoring service

Now create the full monitor:

    # phone_detector_monitor.py
    import json
    import logging
    import subprocess
    import threading
    import time
    from datetime import datetime
    from typing import Dict, Optional

    import psutil

    from config import MonitorConfig


    class PhoneDetectorMonitor:
        """Monitor for the phone detection service."""

        def __init__(self, config: MonitorConfig):
            self.config = config
            self.service_pid = None
            self.alert_history = []
            self.abnormal_count = 0
            self.circuit_breaker_triggered = False
            self.setup_logging()

        def setup_logging(self):
            """Configure logging."""
            logging.basicConfig(
                level=logging.INFO,
                format="%(asctime)s - %(levelname)s - %(message)s",
                handlers=[
                    logging.FileHandler(self.config.LOG_FILE),
                    logging.StreamHandler(),
                ],
            )
            self.logger = logging.getLogger(__name__)

            # Alerts go to a separate log file
            self.alert_logger = logging.getLogger("alert")
            alert_handler = logging.FileHandler(self.config.ALERT_LOG_FILE)
            alert_handler.setFormatter(logging.Formatter("%(asctime)s - ALERT - %(message)s"))
            self.alert_logger.addHandler(alert_handler)
            self.alert_logger.propagate = False

        def find_service_process(self) -> Optional[psutil.Process]:
            """Find the phone detection service process."""
            for proc in psutil.process_iter(["pid", "name", "cmdline"]):
                try:
                    cmdline = proc.info["cmdline"]
                    if cmdline:
                        cmdline_str = " ".join(cmdline)
                        # Look for a Python process running app.py
                        if "app.py" in cmdline_str and "python" in cmdline_str:
                            return psutil.Process(proc.info["pid"])
                except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
                    continue
            return None

        def check_service_health(self) -> Dict:
            """Check the health of the service."""
            health_info = {
                "timestamp": datetime.now().isoformat(),
                "service_running": False,
                "port_listening": False,
                "metrics": {},
                "alerts": [],
            }
            thresholds = self.config.ALERT_THRESHOLDS

            # Check the process
            process = self.find_service_process()
            if process:
                self.service_pid = process.pid
                health_info["service_running"] = True
                try:
                    with process.oneshot():
                        # Per-process CPU usage can exceed 100% on multi-core machines
                        cpu_percent = process.cpu_percent(interval=0.5)
                        memory_mb = process.memory_info().rss / 1024 / 1024
                        memory_percent = process.memory_percent()
                        health_info["metrics"].update({
                            "pid": process.pid,
                            "cpu_percent": cpu_percent,
                            "memory_mb": memory_mb,
                            "memory_percent": memory_percent,
                            "num_threads": process.num_threads(),
                            "status": process.status(),
                        })

                    # Compare against the thresholds
                    if cpu_percent > thresholds["cpu_percent"]:
                        alert_msg = f"CPU usage too high: {cpu_percent:.1f}%"
                        health_info["alerts"].append(alert_msg)
                        self.record_alert("high_cpu", alert_msg)
                    if memory_mb > thresholds["memory_mb"]:
                        alert_msg = f"Memory usage too high: {memory_mb:.1f} MB"
                        health_info["alerts"].append(alert_msg)
                        self.record_alert("high_memory", alert_msg)
                    if memory_percent > thresholds["memory_percent"]:
                        alert_msg = f"Memory percentage too high: {memory_percent:.1f}%"
                        health_info["alerts"].append(alert_msg)
                        self.record_alert("high_memory_percent", alert_msg)
                except (psutil.NoSuchProcess, psutil.AccessDenied):
                    health_info["service_running"] = False

            # Check the port and the system as a whole
            health_info["port_listening"] = self.check_port()
            health_info["system"] = self.get_system_metrics()
            return health_info

        def check_port(self, port=7860) -> bool:
            """Check whether the service port is listening."""
            try:
                for conn in psutil.net_connections(kind="inet"):
                    if conn.laddr and conn.laddr.port == port and conn.status == psutil.CONN_LISTEN:
                        return True
            except psutil.AccessDenied:
                pass
            return False

        def get_system_metrics(self) -> Dict:
            """Return system-level metrics."""
            return {
                "cpu_percent": psutil.cpu_percent(interval=0.1),
                "memory_percent": psutil.virtual_memory().percent,
                "disk_percent": psutil.disk_usage("/").percent,
                "load_avg": psutil.getloadavg(),
            }

        def record_alert(self, alert_type: str, message: str):
            """Record an alert."""
            alert = {
                "timestamp": datetime.now().isoformat(),
                "type": alert_type,
                "message": message,
                "pid": self.service_pid,
            }
            self.alert_history.append(alert)
            self.alert_logger.warning(f"{alert_type}: {message}")

            # Send an email notification if enabled
            if self.config.ENABLE_EMAIL_ALERT:
                self.send_email_alert(alert)

        def send_email_alert(self, alert: Dict):
            """Send an email alert (placeholder)."""
            # Real SMTP logic would go here; this example only writes a log entry
            self.logger.info(f"Email alert sent: {alert}")

        def check_service_response(self) -> Optional[float]:
            """Measure how long the service takes to answer an HTTP request."""
            try:
                start_time = time.time()
                # Request the root page; point this at a /health endpoint if the service has one
                result = subprocess.run(
                    ["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
                     "http://localhost:7860",
                     "--max-time", str(self.config.SERVICE_CHECK_TIMEOUT)],
                    capture_output=True,
                    text=True,
                )
                response_time = time.time() - start_time

                if result.returncode == 0:
                    return response_time
                self.record_alert("service_unreachable",
                                  f"Service unreachable, curl exit code: {result.returncode}")
                return None
            except Exception as e:
                self.record_alert("check_failed", f"Service check failed: {e}")
                return None

        def restart_service(self):
            """Restart the phone detection service."""
            self.logger.info("Trying to restart the phone detection service...")
            try:
                # Stop the current service
                if self.service_pid:
                    try:
                        process = psutil.Process(self.service_pid)
                        process.terminate()
                        process.wait(timeout=10)
                    except psutil.TimeoutExpired:
                        process.kill()  # still alive after terminate()
                    except psutil.NoSuchProcess:
                        pass

                # Start the service
                subprocess.Popen(
                    ["/root/cv_tinynas_object-detection_damoyolo_phone/start.sh"],
                    cwd="/root/cv_tinynas_object-detection_damoyolo_phone",
                )
                self.logger.info("Service restart command executed")
                self.record_alert("service_restarted", "Service restarted")
            except Exception as e:
                self.logger.error(f"Failed to restart the service: {e}")
                self.record_alert("restart_failed", f"Restart failed: {e}")

        def circuit_breaker_check(self, health_info: Dict):
            """Circuit breaker check."""
            # A stopped process or closed port counts as abnormal, not only threshold alerts
            abnormal = (bool(health_info["alerts"])
                        or not health_info["service_running"]
                        or not health_info["port_listening"])
            if abnormal:
                self.abnormal_count += 1
                self.logger.warning(f"Anomaly detected, consecutive count: {self.abnormal_count}")

                if (self.abnormal_count >= self.config.CIRCUIT_BREAKER_THRESHOLD
                        and not self.circuit_breaker_triggered):
                    self.logger.error("Circuit breaker tripped: repeated anomalies, attempting recovery...")
                    self.circuit_breaker_triggered = True
                    self.restart_service()
            else:
                # Back to normal: reset the counter
                if self.abnormal_count > 0:
                    self.logger.info("Service back to normal, resetting the anomaly counter")
                self.abnormal_count = 0
                self.circuit_breaker_triggered = False

        def run_monitoring(self):
            """Run the monitoring loop."""
            self.logger.info("Phone detection service monitor started")

            while True:
                try:
                    # Check service health
                    health_info = self.check_service_health()

                    # Check response time
                    response_time = self.check_service_response()
                    if response_time is not None:
                        health_info["metrics"]["response_time"] = response_time
                        if response_time > self.config.ALERT_THRESHOLDS["response_time"]:
                            alert_msg = f"Response time too long: {response_time:.2f} s"
                            health_info["alerts"].append(alert_msg)
                            self.record_alert("slow_response", alert_msg)
                    else:
                        # Already logged by check_service_response; count it for the breaker
                        health_info["alerts"].append("Service did not respond")

                    # Log the status
                    status_msg = f"Service: {'running' if health_info['service_running'] else 'stopped'}, "
                    status_msg += f"port: {'listening' if health_info['port_listening'] else 'closed'}"
                    if "response_time" in health_info["metrics"]:
                        status_msg += f", response time: {health_info['metrics']['response_time']:.2f}s"
                    self.logger.info(status_msg)

                    # Log alert details
                    for alert in health_info["alerts"]:
                        self.logger.warning(f"Alert: {alert}")

                    # Circuit breaker check
                    self.circuit_breaker_check(health_info)

                    # Persist monitoring data (optional)
                    self.save_monitoring_data(health_info)
                except Exception as e:
                    self.logger.error(f"Monitoring loop error: {e}")

                # Wait for the next check
                time.sleep(self.config.MONITOR_INTERVAL)

        def save_monitoring_data(self, health_info: Dict):
            """Append monitoring data to a JSON Lines file."""
            try:
                with open("/tmp/phone_detector_monitor.json", "a") as f:
                    f.write(json.dumps(health_info) + "\n")
            except OSError:
                pass


    def main():
        """Entry point."""
        config = MonitorConfig()
        monitor = PhoneDetectorMonitor(config)

        # Run the monitor in a background thread
        monitor_thread = threading.Thread(target=monitor.run_monitoring, daemon=True)
        monitor_thread.start()

        # Keep the main thread alive
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            print("\nMonitor stopped")


    if __name__ == "__main__":
        main()

This monitoring service provides:

- Real-time monitoring: checks the service every 5 seconds
- Multi-dimensional checks: process, port, and response time
- Alerting: records an alert automatically when a threshold is exceeded
- Circuit breaking: restarts the service after consecutive anomalies (once per anomaly streak)
- Logging: all monitoring data and alerts are written to files

5. Deploying and running the monitor

5.1 Create a startup script

To launch the monitor conveniently in the background, create a startup script:

    #!/bin/bash
    # start_monitor.sh

    # Switch to the monitor script directory
    cd /root/phone_detector_monitor

    # Start the monitor
    nohup python3 phone_detector_monitor.py > monitor.log 2>&1 &

    # Save the PID
    echo $! > monitor.pid
    echo "Monitor started, PID: $!"

5.2 Create a systemd service (recommended)

For production, manage the monitor with systemd so that it starts at boot and is restarted if it exits:

    # /etc/systemd/system/phone-detector-monitor.service
    [Unit]
    Description=Phone Detector Monitor Service
    After=network.target

    [Service]
    Type=simple
    User=root
    WorkingDirectory=/root/phone_detector_monitor
    ExecStart=/usr/bin/python3 phone_detector_monitor.py
    Restart=always
    RestartSec=10
    StandardOutput=journal
    StandardError=journal

    [Install]
    WantedBy=multi-user.target

Then enable the service:

    # Reload the systemd configuration
    sudo systemctl daemon-reload

    # Start the monitor
    sudo systemctl start phone-detector-monitor

    # Enable start at boot
    sudo systemctl enable phone-detector-monitor

    # Check the service status
    sudo systemctl status phone-detector-monitor

    # Follow the logs
    sudo journalctl -u phone-detector-monitor -f

5.3 Visualizing monitoring data (optional)

For a more visual view of the data, add a simple web interface:

    # monitor_web.py
    import json
    from datetime import datetime, timedelta

    from flask import Flask, jsonify, render_template

    app = Flask(__name__)


    def read_monitor_data(hours=24):
        """Read monitoring records from the last N hours."""
        data = []
        cutoff = datetime.now() - timedelta(hours=hours)
        try:
            with open("/tmp/phone_detector_monitor.json", "r") as f:
                for line in f:
                    try:
                        record = json.loads(line)
                        # Keep only recent records
                        if datetime.fromisoformat(record["timestamp"]) >= cutoff:
                            data.append(record)
                    except (ValueError, KeyError):
                        continue
        except FileNotFoundError:
            pass
        return data[-1000:]  # return at most 1000 records


    @app.route("/")
    def dashboard():
        """Monitoring dashboard."""
        data = read_monitor_data(hours=1)  # last hour

        # Compute summary statistics
        stats = {
            "total_checks": len(data),
            "service_up_time": 0,
            "avg_response_time": 0,
            "alerts_count": 0,
        }
        if data:
            up_count = sum(1 for d in data if d.get("service_running", False))
            stats["service_up_time"] = (up_count / len(data)) * 100

            # Average only over checks that actually measured a response time
            response_times = [d["metrics"]["response_time"] for d in data
                              if "response_time" in d.get("metrics", {})]
            if response_times:
                stats["avg_response_time"] = sum(response_times) / len(response_times)

            stats["alerts_count"] = sum(len(d.get("alerts", [])) for d in data)

        return render_template("dashboard.html", stats=stats, recent_data=data[-10:])


    @app.route("/api/metrics")
    def get_metrics():
        """Monitoring data API."""
        return jsonify(read_monitor_data(hours=24))


    @app.route("/api/alerts")
    def get_alerts():
        """Alert data API."""
        alerts = []
        try:
            with open("/var/log/phone_detector_alerts.log", "r") as f:
                for line in f:
                    if "ALERT" in line:
                        alerts.append(line.strip())
        except FileNotFoundError:
            pass
        return jsonify(alerts[-50:])  # the 50 most recent alerts


    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000, debug=False)

The matching HTML template:

    <!-- templates/dashboard.html -->
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="utf-8">
        <title>Phone Detection Service Monitor</title>
        <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
        <style>
            body { font-family: Arial, sans-serif; margin: 20px; }
            .stats { display: flex; gap: 20px; margin-bottom: 30px; }
            .stat-card { background: #f5f5f5; padding: 20px; border-radius: 8px; flex: 1; text-align: center; }
            .stat-value { font-size: 24px; font-weight: bold; }
            .stat-label { color: #666; margin-top: 5px; }
            table { width: 100%; border-collapse: collapse; }
            th, td { padding: 10px; text-align: left; border-bottom: 1px solid #ddd; }
            th { background: #f5f5f5; }
            .alert { color: #d32f2f; }
            .ok { color: #388e3c; }
        </style>
    </head>
    <body>
        <h1>Phone Detection Service Monitoring Panel</h1>

        <div class="stats">
            <div class="stat-card">
                <div class="stat-value">{{ "%.1f"|format(stats.service_up_time) }}%</div>
                <div class="stat-label">Service availability</div>
            </div>
            <div class="stat-card">
                <div class="stat-value">{{ "%.2f"|format(stats.avg_response_time) }}s</div>
                <div class="stat-label">Average response time</div>
            </div>
            <div class="stat-card">
                <div class="stat-value">{{ stats.alerts_count }}</div>
                <div class="stat-label">Alerts</div>
            </div>
            <div class="stat-card">
                <div class="stat-value">{{ stats.total_checks }}</div>
                <div class="stat-label">Checks</div>
            </div>
        </div>

        <h2>Recent monitoring records</h2>
        <table>
            <thead>
                <tr>
                    <th>Time</th>
                    <th>Status</th>
                    <th>CPU</th>
                    <th>Memory</th>
                    <th>Response time</th>
                    <th>Alerts</th>
                </tr>
            </thead>
            <tbody>
                {% for record in recent_data %}
                <tr>
                    <td>{{ record.timestamp[11:19] }}</td>
                    <td class="{{ 'ok' if record.service_running else 'alert' }}">
                        {{ "✅" if record.service_running else "❌" }}
                    </td>
                    <td>{{ "%.1f"|format(record.metrics.get("cpu_percent", 0)) }}%</td>
                    <td>{{ "%.1f"|format(record.metrics.get("memory_mb", 0)) }}MB</td>
                    <td>{{ "%.2f"|format(record.metrics.get("response_time", 0)) }}s</td>
                    <td>
                        {% if record.alerts %}
                        <span class="alert">⚠️ {{ record.alerts|length }}</span>
                        {% else %}
                        <span class="ok">✓</span>
                        {% endif %}
                    </td>
                </tr>
                {% endfor %}
            </tbody>
        </table>

        <div style="margin-top: 30px;">
            <canvas id="metricsChart" width="800" height="300"></canvas>
        </div>

        <script>
            // Chart.js code for visualizing the monitoring data can go here.
            // The implementation is omitted for brevity.
        </script>
    </body>
    </html>

6. Tuning the monitoring strategy

6.1 Adjust thresholds to the workload

The phone detection service has its own resource usage pattern, so adjust the thresholds to match what you actually observe:

    # Configuration tuned for the phone detection service
    from config import MonitorConfig


    class PhoneDetectorConfig(MonitorConfig):
        """Configuration specific to the phone detection service."""

        # The DAMO-YOLO model tends to use a fair amount of memory
        ALERT_THRESHOLDS = {
            "cpu_percent": 90.0,     # CPU can run high during inference
            "memory_mb": 1500,       # roughly 1.2-1.5 GB once the model is loaded
            "memory_percent": 80.0,  # keep a 20% buffer
            "response_time": 5.0,    # inference itself is about 3.83 ms; 5 s is a safe ceiling for the HTTP round trip
        }

        # Model loading needs extra time
        SERVICE_CHECK_TIMEOUT = 60   # the first start can take longer

        # Check more often
        MONITOR_INTERVAL = 3         # every 3 seconds

6.2 Add a performance baseline test

To judge more accurately whether the service is behaving normally, add a performance baseline test:

    import json
    import time
    from datetime import datetime

    import cv2
    import numpy as np
    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks


    def run_performance_baseline():
        """Run a performance baseline test."""
        # Create a random test image standing in for a phone photo
        test_image = np.random.randint(0, 256, (640, 480, 3), dtype=np.uint8)
        cv2.imwrite("/tmp/test_phone.jpg", test_image)

        # Load the detector
        detector = pipeline(
            Tasks.domain_specific_object_detection,
            model="damo/cv_tinynas_object-detection_damoyolo_phone",
            cache_dir="/root/ai-models",
            trust_remote_code=True,
        )

        # Warm up
        for _ in range(5):
            detector("/tmp/test_phone.jpg")

        # Timed runs
        times = []
        for _ in range(10):
            start = time.perf_counter()
            detector("/tmp/test_phone.jpg")
            times.append((time.perf_counter() - start) * 1000)  # seconds to ms

        baseline = {
            "avg_inference_time_ms": sum(times) / len(times),
            "max_inference_time_ms": max(times),
            "min_inference_time_ms": min(times),
            "std_deviation": float(np.std(times)),
            "test_timestamp": datetime.now().isoformat(),
        }

        # Save the baseline
        with open("/tmp/performance_baseline.json", "w") as f:
            json.dump(baseline, f, indent=2)

        return baseline

6.3 Integrate into the existing service

You can also build monitoring into the phone detection service itself. The example below is written as a Flask app; since the service's web interface is Gradio, treat it as a sketch to adapt to however your app.py exposes HTTP routes.

    # Add a health check endpoint in app.py
    import os
    from datetime import datetime

    import psutil
    from flask import Flask, jsonify

    app = Flask(__name__)


    @app.route("/health")
    def health_check():
        """Health check endpoint."""
        health_status = {
            "status": "healthy",
            "timestamp": datetime.now().isoformat(),
            "service": "phone-detector",
            "version": "1.0.0",
            "metrics": {},
        }

        # Add process information
        process = psutil.Process(os.getpid())
        with process.oneshot():
            health_status["metrics"].update({
                # Non-blocking call: the first reading on a new Process object is 0.0
                "cpu_percent": process.cpu_percent(),
                "memory_mb": process.memory_info().rss / 1024 / 1024,
                "num_threads": process.num_threads(),
                # Renamed to net_connections() in psutil 6.0
                "num_connections": len(process.connections()),
            })

        return jsonify(health_status)


    @app.route("/metrics")
    def metrics():
        """Metrics in the Prometheus text format."""
        process = psutil.Process(os.getpid())
        cpu_times = process.cpu_times()
        lines = [
            "# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.",
            "# TYPE process_cpu_seconds_total counter",
            f"process_cpu_seconds_total {cpu_times.user + cpu_times.system}",
            "# HELP process_resident_memory_bytes Resident memory size in bytes.",
            "# TYPE process_resident_memory_bytes gauge",
            f"process_resident_memory_bytes {process.memory_info().rss}",
            "# HELP process_threads_total Total number of threads.",
            "# TYPE process_threads_total gauge",
            f"process_threads_total {process.num_threads()}",
        ]
        return "\n".join(lines) + "\n", 200, {"Content-Type": "text/plain"}

7. Summary

With this monitoring setup in place, your phone detection service gains the following.

7.1 Core monitoring capabilities

- Real-time resource monitoring: CPU, memory, port, and process state
- Alerting: automatic alerts when thresholds are exceeded, with optional email notification
- Automatic circuit breaking and recovery: the service is restarted after consecutive anomalies
- Historical data: all monitoring data is persisted
- Visual dashboard: a web interface for checking service status

7.2 Deployment recommendations

- Deploy the monitor separately: do not put it in the same container as the detection service (the psutil process checks still need visibility into the service's processes, for example through the host PID namespace)
- Configure sensible thresholds: adjust them to your server's specification
- Review logs regularly: check the alert log once a day
- Plan for failure of the monitor itself: the monitoring service needs its own high-availability safeguard
- Run baseline tests: run the performance baseline periodically to know what normal looks like

7.3 Future improvements

- Integrate more monitoring tools, such as Prometheus and Grafana
- Add business metrics, such as detection accuracy and throughput
- Implement autoscaling that adjusts resources to the load
- Add predictive maintenance that anticipates problems from historical data

This setup not only helps you find and fix problems promptly, it also gives you a deeper view of how the service behaves and supplies data for performance tuning and capacity planning.

Remember: good monitoring is not something you look at only after things break. It lets you foresee and prevent problems before they happen.
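As a closing illustration, the consecutive-anomaly circuit breaker described in section 4.2 can be isolated into a small class that is easy to unit test. This is a minimal sketch, not the monitor's actual implementation: the class name `ConsecutiveFailureBreaker`, the `cooldown_s` parameter, and the injectable `clock` are all assumptions introduced here. The cooldown in particular is an addition that the original monitor does not have; it allows a second recovery attempt if the service is still unhealthy some time after the first restart.

```python
import time
from typing import Callable, Optional


class ConsecutiveFailureBreaker:
    """Trips after `threshold` consecutive abnormal checks (hypothetical helper).

    After tripping, further recovery actions are suppressed until
    `cooldown_s` seconds have passed or a healthy check resets the state.
    """

    def __init__(self, threshold: int = 3, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._clock = clock  # injectable so tests do not need to sleep
        self.abnormal_count = 0
        self._tripped_at: Optional[float] = None

    def record(self, abnormal: bool) -> bool:
        """Record one check result; return True when a recovery action should run."""
        if not abnormal:
            # A healthy check ends the anomaly streak and re-arms the breaker
            self.abnormal_count = 0
            self._tripped_at = None
            return False

        self.abnormal_count += 1
        if self.abnormal_count < self.threshold:
            return False

        now = self._clock()
        if self._tripped_at is not None and now - self._tripped_at < self.cooldown_s:
            return False  # already tripped recently; still cooling down

        self._tripped_at = now
        return True
```

In the monitoring loop, a call such as `if breaker.record(abnormal): monitor.restart_service()` would replace the manual counter and flag. Using `time.monotonic` as the default clock keeps the cooldown unaffected by wall-clock adjustments.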