I have a 3-broker, no-replica Kafka setup on Azure Kubernetes, deployed with the cp-kafka 5.0.1 Helm chart (using the 5.0.1 image).
At some point (unfortunately I have no logs from it), one of the Kafka brokers crashed, and when it restarted it entered an endless, painful restart loop. It appears to be trying to recover some corrupted log segments, which takes a very long time, and then it gets cut off mid-recovery by a SIGTERM. Worse, I can no longer fully consume from or produce to the affected topic. Logs are attached below, along with a monitoring screenshot showing Kafka slowly working through the log files, filling the disk cache.
Right now I have log.retention.bytes set to 180 GiB, and I'd like to keep it that way without Kafka falling into this infinite loop. Suspecting this might be an issue with the older version, I searched the Kafka JIRA for relevant keywords ("still starting up", "SIGTERM", "corrupted index file") but found nothing.
So I can't count on a newer version to fix this, and I don't want to rely on a smaller retention size either, since there could just as well be a large amount of corrupted log data.
So my question is: is there a way to do any/all of the following:
- Prevent the SIGTERM from firing, so that Kafka can recover completely?
- Allow consumption/production to resume on the unaffected partitions (it seems only 4 of the 30 partitions have corrupted entries)?
- Or otherwise stop this madness from happening?
(If not, I'll fall back to: (a) upgrading Kafka; (b) shrinking log.retention.bytes by an order of magnitude; (c) enabling replication and hoping that helps; (d) improving logging to find out what caused the crash in the first place.)
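One knob that may shorten the recovery itself (not mentioned in the question, just a suggestion): Kafka rebuilds corrupted indexes using num.recovery.threads.per.data.dir threads per log directory, which defaults to 1, so the 30 partitions above are recovered essentially one segment at a time. A minimal server.properties sketch, assuming the broker has spare CPU cores:

```properties
# Default is 1: segments are recovered serially per data dir.
# With 4 threads, index rebuilds for different partitions run in
# parallel, which may let startup finish before the probe deadline.
num.recovery.threads.per.data.dir=4
```

This doesn't fix the corruption, but it can reduce the ~37-minute recovery window seen in the logs below, making it less likely the orchestrator kills the broker mid-recovery.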
Logs
Log loading completes, but the cleanup + flush startup is interrupted:
[2019-10-10 00:05:36,562] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,598] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 00:05:37,802] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 00:42:27,037] INFO Logs loading complete in 2210438 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,052] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,054] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,057] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2019-10-10 00:42:27,738] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 00:42:27,763] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
Where log loading itself is interrupted:
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,504] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,549] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 01:55:27,123] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 02:17:01,249] INFO [ProducerStateManager partition=my-topic-12] Loading producer state from snapshot file '/opt/kafka/data-0/logs/my-topic-12/00000000000000004443.snapshot' (kafka.log.ProducerStateManager)
[2019-10-10 02:17:07,090] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 02:17:07,093] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2019-10-10 02:17:07,093] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,097] INFO [KafkaServer id=2] shutting down (kafka.server.KafkaServer)
[2019-10-10 02:17:07,105] ERROR [KafkaServer id=2] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
[2019-10-10 02:17:07,110] ERROR Caught exception when trying to shut down KafkaServer. Exiting forcefully. (io.confluent.support.metrics.SupportedServerStartable)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
Monitoring
Best answer
I found your question while looking for a solution to a similar problem.
Did you ever manage to solve it?
In the meantime: who is sending the SIGTERM? Most likely Kubernetes (or another orchestrator) is killing the container. You can adjust the liveness probe (readiness probes only take a pod out of service; it's the liveness probe that triggers restarts) to allow more failed attempts before the container is killed.
Also make sure your -Xmx setting is smaller than the resources allocated to the pod/container; otherwise Kubernetes will kill the pod for exceeding its memory limit (if Kubernetes is indeed what's in play here).
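To make both points concrete, here is a hedged sketch of what the relevant parts of the broker's pod spec might look like. The names, ports, and numbers are illustrative, not taken from the original chart; the two ideas it shows are a liveness probe generous enough to survive a long recovery (the recovery above took roughly 37 minutes) and a heap comfortably below the container's memory limit:

```yaml
containers:
  - name: kafka-broker
    env:
      - name: KAFKA_HEAP_OPTS        # keep the heap well under the limit below,
        value: "-Xmx4G -Xms4G"       # leaving headroom for page cache and off-heap
    resources:
      limits:
        memory: 6Gi
    livenessProbe:
      tcpSocket:
        port: 9092                   # broker listener port
      initialDelaySeconds: 3600      # allow up to an hour of log recovery
      periodSeconds: 30
      failureThreshold: 10           # 10 consecutive failures before a kill
```

With the defaults in many charts (initialDelaySeconds in the tens of seconds), a broker that spends half an hour rebuilding indexes will be SIGTERMed long before it finishes, which matches the loop described in the question.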
Original question: "Kafka broker index recovery takes a long time, eventually gets shut down", on Stack Overflow: https://stackoverflow.com/questions/58314946/