google-cloud-platform - What is the default behavior of the global window for an unbounded PCollection in Apache Beam?

Tags: google-cloud-platform apache-beam dataflow

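I have recently read many articles, including the official documentation, to understand how the global window works in Apache Beam. I have also read similar questions on Stack Overflow, but I still don't understand it.

According to the official documentation: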

You can use the single global window if you are working with an unbounded data set (e.g. from a streaming data source) but use caution when applying aggregating transforms such as GroupByKey and Combine. The single global window with a default trigger generally requires the entire data set to be available before processing, which is not possible with continuously updating data.

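So the global window has no end, which makes sense given that it is global. The documentation recommends setting a non-default trigger when aggregating, because the default trigger only fires a pane when the window closes: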

Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur.

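This is what confuses me. The logic here is that the global window can never emit events to the next step of the pipeline, because it never ends and so the default trigger never fires. But that is not what happens in practice. If I read from an unbounded PCollection with the global window, events are still pushed downstream.

Can someone clarify this for me? How does the default global window with the default trigger work for unbounded PCollections in Apache Beam? I assume it does not aggregate results at all and simply processes events one by one as they arrive. I would like to confirm whether that is the case.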

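Best answer

The default trigger fires when the watermark reaches the end of the window, measured in event time. For the GlobalWindow this never happens, so if you use the GlobalWindow, the default trigger never fires.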

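But if you set a non-default trigger, for example one that fires after a certain number of elements has been processed (using the AfterCount trigger), your elements can be emitted even for the global window. See the Beam documentation on triggers for more information.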
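To make the distinction concrete, here is a toy model in plain Python. It is not Beam code and not how Beam is implemented; the class names and the `process` method are made up for illustration. It assumes that triggers only matter at aggregating steps such as GroupByKey, which buffer elements per key, while element-wise steps pass each element along as it arrives. The count-based variant roughly models a repeated AfterCount trigger that discards fired panes.

```python
from collections import defaultdict


class ElementWiseStep:
    """Hypothetical stand-in for an element-wise transform: nothing is buffered."""

    def process(self, key, value):
        # Each element goes downstream immediately, whatever the window is.
        return [(key, value)]


class GlobalWindowGroupByKey:
    """Hypothetical stand-in for a GroupByKey in a single global window.

    fire_after_count=None models the default trigger: panes would only be
    emitted when the window ends, which never happens on an unbounded stream.
    """

    def __init__(self, fire_after_count=None):
        self.fire_after_count = fire_after_count
        self.buffers = defaultdict(list)

    def process(self, key, value):
        self.buffers[key].append(value)
        count_reached = (
            self.fire_after_count is not None
            and len(self.buffers[key]) >= self.fire_after_count
        )
        if count_reached:
            # Emit a pane for this key and discard the buffered values.
            return [(key, self.buffers.pop(key))]
        return []  # still waiting for the trigger


def run(events, step):
    """Feed a prefix of an (in principle endless) stream through one step."""
    emitted = []
    for key, value in events:
        emitted.extend(step.process(key, value))
    return emitted


events = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

element_wise_output = run(events, ElementWiseStep())
default_output = run(events, GlobalWindowGroupByKey())
count_output = run(events, GlobalWindowGroupByKey(fire_after_count=2))
```

In this model `element_wise_output` contains all five events, `default_output` stays empty because nothing ever closes the window, and `count_output` holds one pane per key as soon as two values have been buffered for that key. That would be consistent with what the question observes: events flow downstream as long as no aggregation sits in between.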

Regarding "google-cloud-platform - What is the default behavior of the global window for an unbounded PCollection in Apache Beam?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67539285/
