I have a 3-broker, no-replica Kafka setup on Azure Kubernetes, deployed with the cp-kafka 5.0.1 Helm chart (using the 5.0.1 image).
At some point (unfortunately I have no logs from it), one of the Kafka brokers crashed, and when it restarted it entered an endless, painful restart loop. It appears to be trying to recover some corrupted log segments, which takes a very long time, and then it gets cut off mid-recovery by a SIGTERM. Worse, I can no longer fully consume from or produce to the affected topic. Logs are attached below, along with a monitoring screenshot showing Kafka slowly working through the log files, filling the disk cache.
Right now I have log.retention.bytes set to 180 GiB, and I'd like to keep it that way without Kafka falling into this infinite loop. Suspecting this might be an issue with the older version, I searched the Kafka JIRA for relevant keywords ("still starting up", "SIGTERM", "corrupted index file") but found nothing.
So I can't count on a newer version to fix this, and I don't want to rely on a smaller retention size either, since there could just as well be a large amount of corrupted log data.
So my question is: is there a way to do any/all of the following:
- Prevent the SIGTERM from firing, so that Kafka can recover completely?
- Allow consumption/production to resume on the unaffected partitions (it seems only 4 of the 30 partitions have corrupted entries)?
- Or otherwise stop this madness from happening?
(If not, I'll fall back to: (a) upgrading Kafka; (b) shrinking log.retention.bytes by an order of magnitude; (c) enabling replication and hoping that helps; (d) improving logging to find out what caused the crash in the first place.)
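One knob that may shorten the recovery itself (not mentioned in the question, just a suggestion): Kafka rebuilds corrupted indexes using num.recovery.threads.per.data.dir threads per log directory, which defaults to 1, so the 30 partitions above are recovered essentially one segment at a time. A minimal server.properties sketch, assuming the broker has spare CPU cores:

```properties
# Default is 1: segments are recovered serially per data dir.
# With 4 threads, index rebuilds for different partitions run in
# parallel, which may let startup finish before the probe deadline.
num.recovery.threads.per.data.dir=4
```

This doesn't fix the corruption, but it can reduce the ~37-minute recovery window seen in the logs below, making it less likely the orchestrator kills the broker mid-recovery.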
Logs
Log loading completes, but the cleanup + flush startup is interrupted:
[2019-10-10 00:05:36,562] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,598] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 00:05:37,802] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 00:42:27,037] INFO Logs loading complete in 2210438 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,052] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,054] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,057] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2019-10-10 00:42:27,738] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 00:42:27,763] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
Where log loading itself is interrupted:
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,504] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,549] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 01:55:27,123] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 02:17:01,249] INFO [ProducerStateManager partition=my-topic-12] Loading producer state from snapshot file '/opt/kafka/data-0/logs/my-topic-12/00000000000000004443.snapshot' (kafka.log.ProducerStateManager)
[2019-10-10 02:17:07,090] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 02:17:07,093] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2019-10-10 02:17:07,093] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,097] INFO [KafkaServer id=2] shutting down (kafka.server.KafkaServer)
[2019-10-10 02:17:07,105] ERROR [KafkaServer id=2] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
[2019-10-10 02:17:07,110] ERROR Caught exception when trying to shut down KafkaServer. Exiting forcefully. (io.confluent.support.metrics.SupportedServerStartable)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
Monitoring
Best answer
I found your question while looking for a solution to a similar problem.
Did you ever manage to solve it?
In the meantime: who is sending the SIGTERM? Most likely Kubernetes (or another orchestrator) is killing the container. You can adjust the liveness probe (readiness probes only take a pod out of service; it's the liveness probe that triggers restarts) to allow more failed attempts before the container is killed.
Also make sure your -Xmx setting is smaller than the resources allocated to the pod/container; otherwise Kubernetes will kill the pod for exceeding its memory limit (if Kubernetes is indeed what's in play here).
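To make both points concrete, here is a hedged sketch of what the relevant parts of the broker's pod spec might look like. The names, ports, and numbers are illustrative, not taken from the original chart; the two ideas it shows are a liveness probe generous enough to survive a long recovery (the recovery above took roughly 37 minutes) and a heap comfortably below the container's memory limit:

```yaml
containers:
  - name: kafka-broker
    env:
      - name: KAFKA_HEAP_OPTS        # keep the heap well under the limit below,
        value: "-Xmx4G -Xms4G"       # leaving headroom for page cache and off-heap
    resources:
      limits:
        memory: 6Gi
    livenessProbe:
      tcpSocket:
        port: 9092                   # broker listener port
      initialDelaySeconds: 3600      # allow up to an hour of log recovery
      periodSeconds: 30
      failureThreshold: 10           # 10 consecutive failures before a kill
```

With the defaults in many charts (initialDelaySeconds in the tens of seconds), a broker that spends half an hour rebuilding indexes will be SIGTERMed long before it finishes, which matches the loop described in the question.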
Original question: "Kafka broker index recovery takes a long time, eventually gets shut down", on Stack Overflow: https://stackoverflow.com/questions/58314946/