ELK监控与告警

2025-05-12 | 阅读：次

监控体系概述

ELK技术栈的监控体系包括对Elasticsearch、Logstash、Kibana以及底层系统的全面监控。完善的监控体系可以及时发现和解决问题，保证系统稳定运行。

监控维度

# 监控维度
monitoring_dimensions = {
    "性能监控": ["响应时间", "吞吐量", "资源使用率"],
    "健康监控": ["集群状态", "节点状态", "服务可用性"],
    "错误监控": ["错误率", "异常日志", "失败请求"],
    "容量监控": ["存储使用", "索引大小", "分片数量"]
}

Elasticsearch监控

1. 集群健康监控

# 集群健康检查
curl -X GET "localhost:9200/_cluster/health?pretty"

# 健康状态说明
# green: 所有主分片和副本分片都正常
# yellow: 所有主分片正常，部分副本分片未分配
# red: 部分主分片未分配

# 详细健康信息
curl -X GET "localhost:9200/_cluster/health?level=indices&pretty"

2. 节点监控

# 节点统计
curl -X GET "localhost:9200/_nodes/stats?pretty"

# 节点信息
curl -X GET "localhost:9200/_nodes?pretty"

# JVM统计
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# 线程池统计
curl -X GET "localhost:9200/_nodes/stats/thread_pool?pretty"

3. 索引监控

# 索引统计
curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"

# 详细索引统计
curl -X GET "localhost:9200/my_index/_stats?pretty"

# 分片统计
curl -X GET "localhost:9200/_cat/shards?v"

# 未分配分片
curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

4. 性能监控

# 集群性能统计
curl -X GET "localhost:9200/_cluster/stats?pretty"

# 索引性能
curl -X GET "localhost:9200/my_index/_stats/search,indexing?pretty"

# 慢查询日志
GET /my_index/_search
{
  "profile": true,
  "query": {
    "match": {
      "title": "test"
    }
  }
}

Logstash监控

1. 节点监控

# Logstash节点统计
curl -X GET "http://localhost:9600/_node/stats?pretty"

# 管道统计
curl -X GET "http://localhost:9600/_node/stats/pipelines?pretty"

# 插件统计
curl -X GET "http://localhost:9600/_node/stats/plugins?pretty"

2. 性能监控

# Logstash性能指标
性能指标:
  处理速率: "events_per_second"
  队列大小: "queue_size_in_bytes"
  错误率: "failures"
  CPU使用: "cpu_percent"
  内存使用: "mem_total_virtual_in_bytes"

3. 健康检查

# Logstash健康检查端点
# 在logstash.conf中添加
input {
  http {
    port => 8080
    codec => json
  }
}

filter {
  if [path] == "/health" {
    mutate {
      add_field => {
        "status" => "healthy"
        "timestamp" => "%{@timestamp}"
      }
    }
  }
}

output {
  if [path] == "/health" {
    http {
      url => "http://localhost:8080/health"
      http_method => "get"
    }
  }
}

Kibana监控

1. 状态监控

# Kibana状态API
curl -X GET "http://localhost:5601/api/status?pretty"

# 健康检查
curl -X GET "http://localhost:5601/api/status/v1/health?pretty"

2. 性能监控

# Kibana性能指标
性能指标:
  响应时间: "请求响应时间"
  错误率: "HTTP错误率"
  连接状态: "Elasticsearch连接状态"
  内存使用: "Node.js内存使用"

监控工具

1. Elastic Stack Monitoring

# Elastic Stack内置监控
监控组件:
  Metricbeat: "收集系统和应用指标"
  Filebeat: "收集日志"
  Monitoring UI: "Kibana监控界面"
  
配置:
  - 启用X-Pack监控
  - 配置Metricbeat收集指标
  - 在Kibana中查看监控数据

2. Prometheus监控

# Prometheus监控配置
# 使用elasticsearch_exporter
exporter配置:
  - 安装elasticsearch_exporter
  - 配置Prometheus抓取
  - 配置Grafana仪表板

3. 自定义监控

# Python自定义监控脚本
import requests
import json
from datetime import datetime

def check_cluster_health():
    """检查集群健康状态"""
    url = "http://localhost:9200/_cluster/health"
    response = requests.get(url)
    data = response.json()
    
    if data['status'] == 'red':
        send_alert("集群状态为红色！")
    elif data['status'] == 'yellow':
        send_warning("集群状态为黄色")
    
    return data

def check_node_status():
    """检查节点状态"""
    url = "http://localhost:9200/_cat/nodes?v"
    response = requests.get(url)
    # 解析节点状态
    return response.text

def send_alert(message):
    """发送告警"""
    print(f"[ALERT] {datetime.now()}: {message}")
    # 发送到告警系统（邮件、短信、钉钉等）

告警配置

1. Watcher告警

// Elasticsearch Watcher配置
PUT _watcher/watch/cluster_health_alert
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "http": {
      "request": {
        "scheme": "http",
        "host": "localhost",
        "port": 9200,
        "path": "/_cluster/health"
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": {
        "eq": "red"
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": ["admin@example.com"],
        "subject": "Elasticsearch集群告警",
        "body": "集群状态为红色，请立即检查！"
      }
    }
  }
}

2. 告警规则

# 告警规则配置
告警规则:
  集群健康:
    - 状态为red: "立即告警"
    - 状态为yellow: "警告"
    - 节点离线: "告警"
  
  性能指标:
    - JVM使用率>80%: "警告"
    - JVM使用率>90%: "告警"
    - 磁盘使用率>85%: "警告"
    - 磁盘使用率>95%: "告警"
  
  错误率:
    - 错误率>5%: "警告"
    - 错误率>10%: "告警"

3. 告警通知

# 告警通知方式
通知方式:
  邮件: "发送邮件告警"
  短信: "发送短信告警"
  钉钉: "发送钉钉消息"
  企业微信: "发送企业微信消息"
  Webhook: "调用Webhook接口"
  PagerDuty: "集成PagerDuty"

监控仪表板

1. Elasticsearch监控仪表板

# Elasticsearch监控仪表板
仪表板组件:
  集群健康:
    类型: Metric
    指标: 集群状态
  
  节点状态:
    类型: Data Table
    字段: 节点名称、状态、角色
  
  JVM使用率:
    类型: Line Chart
    指标: JVM堆内存使用率
    分组: 按节点
  
  索引统计:
    类型: Data Table
    字段: 索引名、文档数、大小
    排序: 大小 desc
  
  分片状态:
    类型: Pie Chart
    分组: 分片状态

2. Logstash监控仪表板

# Logstash监控仪表板
仪表板组件:
  处理速率:
    类型: Line Chart
    指标: events_per_second
  
  队列大小:
    类型: Line Chart
    指标: queue_size_in_bytes
  
  错误率:
    类型: Line Chart
    指标: failures
  
  插件性能:
    类型: Data Table
    字段: 插件名、处理时间

3. 系统监控仪表板

# 系统监控仪表板
仪表板组件:
  CPU使用率:
    类型: Line Chart
    指标: system.cpu.total.pct
  
  内存使用率:
    类型: Line Chart
    指标: system.memory.used.pct
  
  磁盘使用率:
    类型: Line Chart
    指标: system.diskio.used.pct
  
  网络流量:
    类型: Area Chart
    指标: 
      - system.network.in.bytes
      - system.network.out.bytes

最佳实践

1. 监控策略

# 监控策略
策略:
  关键指标: "重点监控关键指标"
  告警阈值: "合理设置告警阈值"
  告警频率: "避免告警风暴"
  告警分级: "区分告警级别"

2. 监控优化

# 监控优化
优化建议:
  采样: "大数据集使用采样"
  聚合: "使用聚合减少数据量"
  索引: "监控数据单独索引"
  清理: "定期清理监控数据"

总结

ELK监控与告警的关键要点：

监控体系：监控维度、监控工具
Elasticsearch监控：集群健康、节点监控、索引监控、性能监控
Logstash监控：节点监控、性能监控、健康检查
Kibana监控：状态监控、性能监控
监控工具：Elastic Stack监控、Prometheus、自定义监控
告警配置：Watcher告警、告警规则、告警通知
监控仪表板：Elasticsearch、Logstash、系统监控仪表板
最佳实践：监控策略、监控优化

掌握监控与告警，可以及时发现和解决问题，保证ELK系统的稳定运行。

转载请注明：周志洋的博客 » ELK监控与告警