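# Building End-to-End APISIX Monitoring from Scratch: A Hands-On Prometheus + Grafana Guide for CentOS

Once an API gateway becomes the traffic hub of a microservice architecture, knowing its runtime state in real time becomes a core concern for the operations team. APISIX, a widely used cloud-native API gateway, pairs well with Prometheus and Grafana to form a complete observability pipeline, from metric collection to visualization. This guide walks through building that setup on CentOS, with a focus on keeping configuration consistent across multiple nodes.

## 1. Basic Environment Preparation

### 1.1 System Initialization

Before starting, apply the following baseline settings on every node (CentOS 7 is used as the example):

```bash
# Disable SELinux (takes effect after a reboot)
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

# Open the required ports in the firewall:
# 9090 Prometheus, 9091 APISIX metrics, 3000 Grafana, 9080 APISIX proxy
firewall-cmd --permanent --add-port={9090/tcp,9091/tcp,3000/tcp,9080/tcp}
firewall-cmd --reload

# Install the basic toolchain
yum install -y wget vim net-tools epel-release
```

> Tip: In production, keep SELinux enabled and add the policy rules it needs. This guide disables it only to keep the walkthrough short.

### 1.2 APISIX Cluster Deployment

Install APISIX from the officially recommended RPM packages (node 1 is used as the example):

```bash
# Add the APISIX repository
sudo yum install -y https://repos.apiseven.com/packages/centos/apache-apisix-repo-1.0-1.noarch.rpm

# Install etcd and APISIX
sudo yum install -y etcd apisix

# Start the services and enable them at boot
sudo systemctl start etcd
sudo systemctl enable etcd
sudo systemctl start apisix
sudo systemctl enable apisix
```

All cluster nodes must agree on the following:

- The etcd cluster address (`/usr/local/apisix/conf/config.yaml`)
- Time synchronization between nodes (chrony or ntpd)
- The same plugin configuration policy

## 2. Deploying the Monitoring Components

### 2.1 Automated Prometheus Installation

The following script automates the Prometheus deployment:

```bash
#!/bin/bash
PROM_VERSION=2.51.0
INSTALL_DIR=/opt/monitoring

mkdir -p ${INSTALL_DIR}
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
tar xvf prometheus-*.tar.gz -C ${INSTALL_DIR}
ln -s ${INSTALL_DIR}/prometheus-${PROM_VERSION}.linux-amd64 ${INSTALL_DIR}/prometheus

# Create the systemd service (${INSTALL_DIR} is expanded when the file is written)
cat > /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=root
ExecStart=${INSTALL_DIR}/prometheus/prometheus \
  --config.file=${INSTALL_DIR}/prometheus/prometheus.yml \
  --storage.tsdb.path=${INSTALL_DIR}/prometheus/data \
  --web.listen-address=:9090 \
  --web.enable-lifecycle
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
```

Key settings to tune:

| Parameter | Recommended value | Description |
| --- | --- | --- |
| `scrape_interval` | 15s | Scrape frequency (`global` section of `prometheus.yml`) |
| `evaluation_interval` | 15s | Rule evaluation frequency (`global` section of `prometheus.yml`) |
| `retention.time` | 30d | Data retention period (set with the `--storage.tsdb.retention.time` flag) |

### 2.2 Grafana Configuration

The official Grafana Enterprise build is recommended because it gives access to additional data source plugins:

```bash
cat > /etc/yum.repos.d/grafana.repo <<EOF
[grafana]
name=grafana
baseurl=https://packages.grafana.com/enterprise/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF

yum install -y grafana-enterprise
systemctl start grafana-server
systemctl enable grafana-server
```

Security hardening recommendations:

- Change the default port 3000 (`/etc/grafana/grafana.ini`)
- Enforce HTTPS
- Configure LDAP/AD integrated authentication

## 3. End-to-End Monitoring Configuration

### 3.1 Exposing APISIX Metrics

Configure the Prometheus plugin on every APISIX node, then restart APISIX so the change takes effect:

```yaml
# /usr/local/apisix/conf/config.yaml
plugin_attr:
  prometheus:
    export_addr:
      ip: 0.0.0.0
      port: 9091
    metrics:
      - name: http_requests_total
        type: counter
        desc: Total number of HTTP requests
      - name: http_request_duration_seconds
        type: histogram
        desc: HTTP request duration in seconds
```

Verify the metric output:

```bash
curl http://localhost:9091/apisix/prometheus/metrics | grep apisix_
```

Check the metric and label names in this output before building queries. Stock APISIX releases export names such as `apisix_http_status` (with a `code` label) and `apisix_http_latency_bucket`, so the PromQL in the following sections may need adjusting to match what your nodes actually expose.

### 3.2 Service Discovery for Prometheus

Rather than maintaining the targets list by hand, use file-based service discovery:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: apisix-cluster
    metrics_path: /apisix/prometheus/metrics
    file_sd_configs:
      - files:
          - /opt/monitoring/targets/apisix*.json
        refresh_interval: 5m
```

Example of a dynamic target file, saved as `/opt/monitoring/targets/apisix-nodes.json`:

```json
[
  {
    "targets": ["node1:9091"],
    "labels": {
      "env": "production",
      "role": "gateway"
    }
  }
]
```

Prometheus picks up changes to the target files on its own. Changes to `prometheus.yml` itself need a reload; because the service was started with `--web.enable-lifecycle`, an HTTP POST to `http://localhost:9090/-/reload` is enough.

### 3.3 Advanced Visualization in Grafana

After importing the official dashboard, consider adding the following custom panels:

- Traffic heatmap: shows how requests are distributed across routes
- Anomaly detection panel: anomaly detection built on PromQL
- Golden-signals dashboard:
  - Request rate: `sum(rate(apisix_http_requests_total[1m])) by (service)`
  - Error rate: `sum(rate(apisix_http_requests_total{status=~"5.."}[1m])) by (service) / sum(rate(apisix_http_requests_total[1m])) by (service)`
  - P99 latency: `histogram_quantile(0.99, sum(rate(apisix_http_request_duration_seconds_bucket[1m])) by (le, service))`

## 4. Production-Grade Optimization Tips

### 4.1 Performance Tuning Parameters

Key kernel parameters (`/etc/sysctl.conf`), applied with `sysctl -p`:

```
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_tw_reuse = 1
vm.swappiness = 10
```

APISIX-specific tuning (`/usr/local/apisix/conf/config.yaml`):

```yaml
nginx_config:
  worker_processes: auto
  event:
    worker_connections: 20480
  http:
    keepalive_timeout: 60s
```

### 4.2 High-Availability Deployment Architecture

Recommended multi-node layout:

```
               +---------------+
               | Load Balancer |
               +-------+-------+
                       |
      +----------------+----------------+
      |                |                |
+-----+------+   +-----+------+   +-----+------+
|  APISIX-1  |   |  APISIX-2  |   |  APISIX-3  |
+-----+------+   +-----+------+   +-----+------+
      |                |                |
+-----+------+   +-----+------+   +-----+------+
| Prometheus |   | Prometheus |   |  Grafana   |
+------------+   +------------+   +------------+
```

### 4.3 Example Alerting Rules

Key alerting rule (`prometheus.rules.yml`):

```yaml
groups:
  - name: apisix-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(apisix_http_requests_total{status=~"5.."}[1m])) by (service)
            / sum(rate(apisix_http_requests_total[1m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value }}"
```

Once Prometheus is integrated with Alertmanager, alerts can be pushed through multiple channels such as email, Slack, and webhooks.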
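
To make the Alertmanager integration concrete, here is a minimal sketch rather than a definitive setup. It assumes the rule file from section 4.3 is saved as `/opt/monitoring/prometheus/prometheus.rules.yml`, that Alertmanager runs on the Prometheus host on its default port 9093, and that a webhook receiver listens at the placeholder URLs shown. None of these are specified in the guide, and the receiver names are hypothetical.

```yaml
# prometheus.yml (additions; the path and address are assumptions)
rule_files:
  - /opt/monitoring/prometheus/prometheus.rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093   # Alertmanager's default listen port
```

```yaml
# alertmanager.yml (sketch; receiver names and URLs are placeholders)
route:
  receiver: ops-webhook            # default receiver for every alert
  group_by: [alertname, service]   # one notification per alert and service
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Alerts labeled severity: critical, such as HighErrorRate, go to on-call
    - matchers:
        - severity="critical"
      receiver: oncall-webhook

receivers:
  - name: ops-webhook
    webhook_configs:
      - url: http://127.0.0.1:5001/alerts
  - name: oncall-webhook
    webhook_configs:
      - url: http://127.0.0.1:5001/critical
```

Both files can be checked before use with `promtool check config prometheus.yml` and `amtool check-config alertmanager.yml`. Email or Slack delivery would replace `webhook_configs` with `email_configs` or `slack_configs` on the relevant receiver, plus the credentials those channels require.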