ELK故障排查指南

2025-05-20 | 阅读：次

故障排查概述

ELK技术栈的故障排查需要系统性的方法和工具。掌握故障排查技巧可以快速定位和解决问题，减少系统 downtime。

故障类型

# 故障类型
trouble_types = {
    "集群问题": ["节点离线", "分片未分配", "脑裂问题"],
    "性能问题": ["慢查询", "高CPU", "高内存", "磁盘IO"],
    "数据问题": ["数据丢失", "数据不一致", "索引损坏"],
    "连接问题": ["无法连接", "认证失败", "网络问题"]
}

常见问题诊断

1. 集群健康问题

集群状态为Red

# 诊断步骤
# 1. 查看集群健康
curl -X GET "localhost:9200/_cluster/health?pretty"

# 2. 查看未分配的分片
curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

# 3. 查看节点状态
curl -X GET "localhost:9200/_cat/nodes?v"

# 4. 查看集群设置
curl -X GET "localhost:9200/_cluster/settings?pretty"

# 解决方案
# - 检查节点是否在线
# - 检查磁盘空间
# - 检查网络连接
# - 手动分配分片

集群状态为Yellow

# 诊断步骤
# 1. 查看未分配的副本分片
curl -X GET "localhost:9200/_cat/shards?v" | grep UNASSIGNED

# 2. 检查节点数量
curl -X GET "localhost:9200/_cat/nodes?v"

# 解决方案
# - 增加数据节点
# - 减少副本数量
# - 检查节点配置

2. 节点问题

节点离线

# 诊断步骤
# 1. 检查节点进程
ps aux | grep elasticsearch

# 2. 检查节点日志
tail -f /var/log/elasticsearch/my-cluster.log

# 3. 检查系统资源
free -h
df -h
top

# 4. 检查网络连接
netstat -tulpn | grep 9300

# 解决方案
# - 重启节点
# - 检查JVM内存
# - 检查磁盘空间
# - 检查网络配置

JVM内存问题

# 诊断步骤
# 1. 查看JVM统计
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# 2. 查看GC日志
tail -f /var/log/elasticsearch/gc.log

# 3. 检查堆内存使用
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"

# 解决方案
# - 增加堆内存
# - 优化查询
# - 减少索引数量
# - 调整GC参数

3. 分片问题

分片未分配

# 诊断步骤
# 1. 查看未分配分片
curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

# 2. 查看分片分配解释
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# 3. 查看集群设置
curl -X GET "localhost:9200/_cluster/settings?pretty"

# 解决方案
# - 启用分片分配
# - 手动分配分片
# - 检查磁盘空间
# - 检查节点配置

分片损坏

# 诊断步骤
# 1. 检查分片状态
curl -X GET "localhost:9200/_cat/shards/my_index?v"

# 2. 尝试打开索引
curl -X POST "localhost:9200/my_index/_open?pretty"

# 解决方案
# - 从副本恢复
# - 从快照恢复
# - 重建索引

性能问题诊断

1. 慢查询

# 诊断步骤
# 1. 启用慢查询日志
PUT /my_index/_settings
{
  "index": {
    "search.slowlog.threshold.query.warn": "10s",
    "search.slowlog.threshold.query.info": "5s"
  }
}

# 2. 查看慢查询日志
GET /my_index/_search
{
  "profile": true,
  "query": {
    "match": {
      "title": "test"
    }
  }
}

# 3. 分析查询性能
GET /my_index/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": "test"
    }
  }
}

# 解决方案
# - 优化查询
# - 使用filter代替query
# - 添加索引
# - 优化映射

2. 高CPU使用

# 诊断步骤
# 1. 查看线程池统计
curl -X GET "localhost:9200/_cat/thread_pool?v"

# 2. 查看热点线程
curl -X GET "localhost:9200/_nodes/hot_threads?pretty"

# 3. 查看节点统计
curl -X GET "localhost:9200/_nodes/stats?pretty"

# 解决方案
# - 优化查询
# - 增加节点
# - 调整线程池大小
# - 减少索引操作

3. 高内存使用

# 诊断步骤
# 1. 查看JVM内存
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# 2. 查看字段数据缓存
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?pretty"

# 3. 查看查询缓存
curl -X GET "localhost:9200/_nodes/stats/indices/query_cache?pretty"

# 解决方案
# - 增加堆内存
# - 清理字段数据缓存
# - 优化查询
# - 减少索引数量

Logstash问题诊断

1. 处理延迟

# 诊断步骤
# 1. 查看管道统计
curl -X GET "http://localhost:9600/_node/stats/pipelines?pretty"

# 2. 查看队列大小
curl -X GET "http://localhost:9600/_node/stats?pretty" | grep queue

# 3. 查看处理速率
curl -X GET "http://localhost:9600/_node/stats?pretty" | grep events

# 解决方案
# - 增加工作线程
# - 优化过滤器
# - 增加批量大小
# - 检查输出性能

2. 解析错误

# 诊断步骤
# 1. 查看Logstash日志
tail -f /var/log/logstash/logstash.log

# 2. 检查Grok模式
# 使用Grok Debugger测试模式

# 3. 查看错误统计
curl -X GET "http://localhost:9600/_node/stats?pretty" | grep failures

# 解决方案
# - 修复Grok模式
# - 添加错误处理
# - 使用条件判断

Kibana问题诊断

1. 无法连接Elasticsearch

# 诊断步骤
# 1. 检查Kibana配置
cat /etc/kibana/kibana.yml | grep elasticsearch

# 2. 测试Elasticsearch连接
curl -X GET "http://localhost:9200/?pretty"

# 3. 查看Kibana日志
tail -f /var/log/kibana/kibana.log

# 解决方案
# - 检查网络连接
# - 检查认证配置
# - 检查防火墙规则

2. 查询超时

# 诊断步骤
诊断:
  1: "检查查询复杂度"
  2: "检查数据量"
  3: "检查时间范围"
  4: "检查索引模式"

解决方案:
  - 优化查询
  - 缩小时间范围
  - 使用采样
  - 增加超时时间

故障恢复

1. 数据恢复

# 从快照恢复
POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "my_index",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored_$1"
}

# 从副本恢复
# 如果主分片损坏，可以从副本恢复

2. 集群恢复

# 恢复节点
# 1. 启动节点
systemctl start elasticsearch

# 2. 等待节点加入集群
curl -X GET "localhost:9200/_cat/nodes?v"

# 3. 检查分片分配
curl -X GET "localhost:9200/_cat/shards?v"

3. 索引恢复

# 重建索引
# 1. 创建新索引
PUT /my_index_new

# 2. 使用reindex重建
POST /_reindex
{
  "source": {
    "index": "my_index_old"
  },
  "dest": {
    "index": "my_index_new"
  }
}

# 3. 切换别名
POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": "my_index_old",
        "alias": "my_index"
      }
    },
    {
      "add": {
        "index": "my_index_new",
        "alias": "my_index"
      }
    }
  ]
}

诊断工具

1. 内置工具

# Elasticsearch诊断工具
# 1. 集群健康API
curl -X GET "localhost:9200/_cluster/health?pretty"

# 2. 节点统计API
curl -X GET "localhost:9200/_nodes/stats?pretty"

# 3. 索引统计API
curl -X GET "localhost:9200/my_index/_stats?pretty"

# 4. 分片分配解释API
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# 5. 热点线程API
curl -X GET "localhost:9200/_nodes/hot_threads?pretty"

2. 外部工具

# 外部诊断工具
工具:
  - Elasticsearch Head
  - Cerebro
  - ElasticHQ
  - Marvel (X-Pack)
  - Prometheus + Grafana

最佳实践

1. 预防措施

# 预防措施
措施:
  - 完善的监控和告警
  - 定期健康检查
  - 容量规划
  - 性能优化
  - 定期备份
  - 文档记录

2. 故障处理流程

# 故障处理流程
流程:
"发现问题（监控告警）"
"收集信息（日志、指标）"
"定位问题（诊断工具）"
"分析原因"
"制定方案"
"执行修复"
"验证修复"
"记录问题"

总结

ELK故障排查指南的关键要点：

故障类型：集群问题、性能问题、数据问题、连接问题
常见问题诊断：集群健康、节点问题、分片问题
性能问题诊断：慢查询、高CPU、高内存
Logstash问题：处理延迟、解析错误
Kibana问题：连接问题、查询超时
故障恢复：数据恢复、集群恢复、索引恢复
诊断工具：内置工具、外部工具
最佳实践：预防措施、故障处理流程

掌握故障排查技巧，可以快速定位和解决问题，保证ELK系统的稳定运行。

转载请注明：周志洋的博客 » ELK故障排查指南