apache-spark - Apache Spark 流 : accumulate data in memory and output it only much later

标签 apache-spark spark-streaming pyspark apache-flink

如果我理解正确的话，Spark Streaming 是通过一组转换来传输 RDD 批处理，并在转换后进行输出操作。这是针对每个批处理执行的，因此输出操作也是针对每个批处理执行的。但由于每次进行输出的成本太高，我想处理批处理并累积结果，并且仅在某些事件(例如在特定时间段之后)写出累积结果并结束程序。

我知道我可以积累数据，例如与 updateStateByKey 但我不知道如何告诉 Spark 使用输出操作(例如 saveAsTextFiles)，直到很久以后，当某些条件到达时。

这可能吗？

这在 flink 中可能吗？

最佳答案

免责声明:我是 Apache Flink 的贡献者。

由于丰富的窗口语义，使用 Flink 应该可以做到这一点:http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#window-operators Flink 有一堆预定义的窗口。此外，您还可以实现自己的窗口策略，以根据需要获得自定义行为。

关于apache-spark - Apache Spark 流 : accumulate data in memory and output it only much later，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/31263993/

上一篇：openstreetmap - osm2pgsql数据转换: lost columns

下一篇：.net - 安全地允许多个客户端共享单个资源

apache-spark - 如何修复 NetworkWordCount Spark Streaming 应用程序中的 "org.apache.spark.shuffle.FetchFailedException: Failed to connect"？

apache-spark - Hadoop数据管道用例

apache-spark - Spark Streaming 批处理之间的数据共享

PySpark reduceByKey 多个值

scala - Spark : NullPointerException when RDD isn't collected before map

apache-spark - Spark + Amazon S3 "s3a://"网址

apache-spark - UDF 原因警告 : CachedKafkaConsumer is not running in UninterruptibleThread (KAFKA-1894)

apache-spark - 使用 pyspark 分层采样

scala - 在 Scala Spark 和 PySpark 之间传递 sparkSession