distributed - In Apache Kafka, why can't there be more consumer instances than partitions?

Tags: distributed apache-kafka

I'm learning Kafka and have been reading the introduction section here

https://kafka.apache.org/documentation.html#introduction

specifically the part about consumers. The second-to-last paragraph of the introduction says:

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

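My confusion stems from the last sentence, because in the image directly above that paragraph the author depicts two consumer groups and a 4-partition topic, and there are more consumer instances than partitions!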

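It also doesn't make sense to me that there can't be more consumer instances than partitions, because then the partitions would have to be very small, and the overhead of creating a new partition for every consumer instance seems like it would bog Kafka down. I understand that partitions are used for fault tolerance and to reduce the load on any one server, but the sentence above makes no sense in the context of a distributed system that is supposed to handle thousands of consumers at a time.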

Best answer

OK, to understand this you need to understand several parts.

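  1. To provide a total order, a message can be delivered to only one consumer. Otherwise it would be extremely inefficient, because Kafka would have to wait for every consumer to receive a message before sending the next one: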

However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

Kafka only provides a total order over messages within a partition, not between different partitions in a topic.

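Also, what you see as a performance penalty (multiple partitions) is actually a performance gain, because Kafka can carry out operations on different partitions completely in parallel, working on one while others are still in progress.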
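
The rule quoted above (each partition is consumed by exactly one consumer in the group) can be sketched with a small model. This is a simplified round-robin assignment written only for illustration, not Kafka's actual assignment strategy; the function name `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer of a single group.

    Simplified round-robin model (an assumption for illustration,
    not Kafka's real assignment strategy).
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# 4 partitions, 6 consumers in the same group: two consumers get nothing.
group = [f"c{i}" for i in range(6)]
assignment = assign_partitions(4, group)
idle = [c for c, parts in assignment.items() if not parts]
print(assignment)
print("idle consumers:", idle)
```

In this model the two extra consumers are simply left without a partition, which is the practical reading of "there cannot be more consumer instances than partitions": within one group, consumers beyond the partition count have nothing to read.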

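  • The figure shows different consumer groups, but the limit of at most one consumer per partition applies only within a group. You can still have multiple consumer groups.
  • Two scenarios are described at the beginning: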

    If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

    If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

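    So the more subscriber groups you have, the lower the performance, because Kafka has to deliver every message to each of those groups while still guaranteeing the order within each partition.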

    On the other hand, the fewer groups and the more partitions you have, the more you gain from parallelizing message processing.
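
To make the group-level behaviour concrete, here is a minimal sketch of the setup from the figure: one 4-partition topic read by two consumer groups. The group sizes (2 and 4), the names, and the `deliver` helper are assumptions made for this example; it models only the routing rule, not a real Kafka client.

```python
from collections import defaultdict


def deliver(messages, groups, num_partitions):
    """Route each (partition, value) message to one consumer per group.

    Hypothetical model: within a group, partition p belongs to consumer
    p % len(consumers); every group receives every message.
    """
    received = defaultdict(list)
    for partition, value in messages:
        assert 0 <= partition < num_partitions
        for group, consumers in groups.items():
            owner = consumers[partition % len(consumers)]
            received[(group, owner)].append((partition, value))
    return dict(received)


# One 4-partition topic, two consumer groups (sizes assumed: 2 and 4).
groups = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3", "b4"]}
messages = [(0, "m0"), (1, "m1"), (2, "m2"), (3, "m3"), (0, "m4")]
received = deliver(messages, groups, num_partitions=4)
for (group, consumer), msgs in sorted(received.items()):
    print(group, consumer, msgs)
```

There are six consumers in total for four partitions, yet no group has more consumers than partitions. Each group sees all five messages, and within a group every partition is read by a single consumer, so per-partition order is preserved.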

A similar question on Stack Overflow: https://stackoverflow.com/questions/25896109/
