apache-kafka - Kafka broker takes a long time to recover indexes, then shuts down

Tags: apache-kafka kubernetes-helm azure-aks confluent-platform

I have a 3-broker, replication-factor-1 Kafka setup on Azure AKS, deployed with the cp-kafka 5.0.1 Helm chart (using the 5.0.1 image).

At some point (unfortunately I have no logs from it), one of the Kafka brokers crashed, and when it restarted it entered an endless, painful restart loop. It appears to be trying to recover some corrupted log segments, which takes a very long time, after which it is killed with a SIGTERM. Worse, I can no longer fully consume from or produce to the affected topic. Logs are attached below, along with a monitoring screenshot showing Kafka slowly walking through the log files, filling the disk cache.

Now, I have log.retention.bytes set to 180 GiB, and I would like to keep it that way without Kafka getting stuck in an infinite loop. Suspecting this might be a bug in an older version, I searched the Kafka JIRA for relevant keywords ("still starting up", "SIGTERM", "corrupted index file"), but found nothing.

So I cannot count on a newer version to fix this, and I don't want to count on a smaller retention size either, since a large number of corrupted logs could still occur.

So my question is: is there a way to do any/all of the following:

  • Prevent the SIGTERM from firing, so that Kafka can recover fully?
  • Resume consuming/producing on the unaffected partitions (it seems only 4 of the 30 partitions have corrupted entries)?
  • Or otherwise prevent this madness from happening at all?

(If not, I will resort to: (a) upgrading Kafka; (b) shrinking log.retention.bytes by an order of magnitude; (c) enabling replication, hoping that helps; (d) improving logging to find out what caused the crash in the first place.)
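For option (b), retention can also be overridden per topic, leaving the broker-wide default untouched. A minimal sketch, assuming the tooling bundled with the Confluent 5.0.1 image (Apache Kafka 2.0, where topic configs are still altered via ZooKeeper) and hypothetical host/topic names:

```shell
# Hypothetical ZooKeeper address and topic name -- adjust for your cluster.
# retention.bytes is the per-topic override of the broker-wide
# log.retention.bytes; 19327352832 bytes is 18 GiB, an order of
# magnitude below the current 180 GiB setting.
kafka-configs --zookeeper zookeeper:2181 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config retention.bytes=19327352832
```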


Logs

Log loading completes, but the cleaner/flusher startup is interrupted:

[2019-10-10 00:05:36,562] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,598] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 00:05:37,802] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 00:42:27,037] INFO Logs loading complete in 2210438 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,052] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,054] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,057] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2019-10-10 00:42:27,738] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 00:42:27,763] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)  

Where log loading itself is interrupted:

[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,504] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,549] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 01:55:27,123] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 02:17:01,249] INFO [ProducerStateManager partition=my-topic-12] Loading producer state from snapshot file '/opt/kafka/data-0/logs/my-topic-12/00000000000000004443.snapshot' (kafka.log.ProducerStateManager)
[2019-10-10 02:17:07,090] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 02:17:07,093] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2019-10-10 02:17:07,093] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,097] INFO [KafkaServer id=2] shutting down (kafka.server.KafkaServer)
[2019-10-10 02:17:07,105] ERROR [KafkaServer id=2] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
[2019-10-10 02:17:07,110] ERROR Caught exception when trying to shut down KafkaServer. Exiting forcefully. (io.confluent.support.metrics.SupportedServerStartable)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)

Monitoring

(screenshot: memory consumption of Kafka throughout the crashes)

Best Answer

I found your question while looking for a solution to a similar problem.
Did you ever resolve it?
In the meantime: who is sending the SIGTERM? It is most likely Kubernetes (or whatever orchestrator you run); you can adjust the liveness/readiness probes to allow more failed attempts before the container is killed.
Also make sure your Xmx setting is smaller than the resources allocated to the pod/container; otherwise Kubernetes will kill the pod (assuming Kubernetes is indeed in play here).
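A minimal sketch of the kind of probe relaxation meant here, assuming a Kubernetes StatefulSet like the one the cp-kafka chart generates (the probe command and all numbers are illustrative, not the chart's actual defaults):

```yaml
# Illustrative values only -- size them to exceed your worst-case recovery
# time. Here the kubelet tolerates 30 x 80 s = 40 minutes of failed probes
# (plus a 120 s initial delay) before restarting the container, which
# comfortably covers the ~37-minute "Logs loading" phase seen above.
livenessProbe:
  exec:
    command: ["sh", "-c", "nc -z localhost 9092"]  # simple port check
  initialDelaySeconds: 120
  periodSeconds: 80
  failureThreshold: 30
```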
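A hedged sketch of the heap-versus-limit relationship at the container level, with illustrative numbers (the cp-kafka chart exposes these through its values file; KAFKA_HEAP_OPTS is the environment variable the Kafka start scripts read):

```yaml
# Keep the JVM heap (-Xmx) comfortably below the container memory limit,
# leaving headroom for off-heap buffers and the page cache; if the process
# exceeds the limit, the kubelet OOM-kills the pod. Numbers are illustrative.
env:
  - name: KAFKA_HEAP_OPTS
    value: "-Xms4g -Xmx4g"
resources:
  requests:
    memory: 6Gi
  limits:
    memory: 6Gi
```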

Regarding "apache-kafka - Kafka broker takes a long time to recover indexes, then shuts down", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58314946/
