java - Apache Flink: transforming Broadcast variables fails, but I cannot figure out why

Tags: java scala apache-flink

I am trying to put together a small example application on Apache Flink, mainly to demonstrate how broadcast variables are used. The application reads a CSV file and prepares a DataSet[BuildingInformation]:

case class BuildingInformation(
     buildingID: Int, buildingManager: String, buildingAge: Int,
     productID: String, country: String
)

This is how I am currently creating the DataSet of BuildingInformation:

val buildingsBroadcastSet =
  envDefault
    .fromElements(
      readBuildingInfo(
        envDefault,
        "./SensorFiles/building.csv"))

Then I begin the transformation like this:

val hvacStream = readHVACReadings(envDefault, "./SensorFiles/HVAC.csv")

hvacStream
  .map(new HVACToBuildingMapper)
  .withBroadcastSet(buildingsBroadcastSet, "buildingData")
  .writeAsCsv("./hvacTemp.csv")

A map of (buildingID -> BuildingInformation) is the reference data that I want to broadcast. To prepare it, I implemented a RichMapFunction:

class HVACToBuildingMapper
  extends RichMapFunction[HVACData, EnhancedHVACTempReading] {

  var allBuildingDetails: Map[Int, BuildingInformation] = _

  // open() materializes the broadcast variable "buildingData" into a
  // Map[Int, BuildingInformation] keyed by buildingID before map() is first called
  override def open(configuration: Configuration): Unit = {

    allBuildingDetails =
      getRuntimeContext
        .getBroadcastVariableWithInitializer(
          "buildingData",
          new BroadcastVariableInitializer[BuildingInformation, Map[Int, BuildingInformation]] {

            def initializeBroadcastVariable(valuesPushed: java.lang.Iterable[BuildingInformation]): Map[Int, BuildingInformation] = {
              valuesPushed
                .asScala
                .toList
                .map(nextBuilding => (nextBuilding.buildingID, nextBuilding))(breakOut)
            }
          }
        )
  }

  override def map(nextReading: HVACData): EnhancedHVACTempReading = {
    val buildingDetails = allBuildingDetails.getOrElse(nextReading.buildingID, UndefinedBuildingInformation)
    // ... more intermediate data creation logic here
    EnhancedHVACTempReading(
      nextReading.buildingID,
      rangeOfTempRecorded,
      isExtremeTempRecorded,
      buildingDetails.country,
      buildingDetails.productID,
      buildingDetails.buildingAge,
      buildingDetails.buildingManager
    )
  }

}
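For reference, the asScala and breakOut calls above rely on these imports, which I have left out of the snippet:

import scala.collection.JavaConverters._   // provides asScala on java.lang.Iterable
import scala.collection.breakOut           // lets map(...) build the Map[Int, BuildingInformation] directly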

In the function signature

def initializeBroadcastVariable(valuesPushed:java.lang.Iterable[BuildingInformation]): Map[Int,BuildingInformation]

the java.lang.Iterable qualification is my addition. Without it, the compiler complains in IntelliJ.
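That makes sense once I look at the interface itself: BroadcastVariableInitializer lives in org.apache.flink.api.common.functions and is defined in Java, so its parameter really is a java.lang.Iterable. Roughly rendered in Scala, it looks like this:

// Rough Scala rendering of Flink's Java interface (the actual definition is Java,
// hence the java.lang.Iterable parameter rather than a Scala collection)
trait BroadcastVariableInitializer[T, O] {
  def initializeBroadcastVariable(data: java.lang.Iterable[T]): O
}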

At runtime, the application fails at the point where I build the map from the Iterable[BuildingInformation] that the framework passes to the open() function:

java.lang.Exception: The user defined 'open()' method caused an exception: scala.collection.immutable.$colon$colon cannot be cast to org.nirmalya.hortonworks.tutorial.BuildingInformation
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:475)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to org.nirmalya.hortonworks.tutorial.BuildingInformation
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7$$anonfun$initializeBroadcastVariable$1.apply(HVACReadingsAnalysis.scala:139)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7.initializeBroadcastVariable(HVACReadingsAnalysis.scala:139)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper$$anon$7.initializeBroadcastVariable(HVACReadingsAnalysis.scala:133)
    at org.apache.flink.runtime.broadcast.BroadcastVariableMaterialization.getVariable(BroadcastVariableMaterialization.java:234)
    at org.apache.flink.runtime.operators.util.DistributedRuntimeUDFContext.getBroadcastVariableWithInitializer(DistributedRuntimeUDFContext.java:84)
    at org.nirmalya.hortonworks.tutorial.HVACReadingsAnalysis$HVACToBuildingMapper.open(HVACReadingsAnalysis.scala:131)
    at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:38)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:471)
    ... 3 more
09:28:54,389 INFO  org.apache.flink.runtime.client.JobClientActor                - 04/29/2016 09:28:54  Job execution switched to status FAILED.

Assuming this might be a specific case of a case class failing to be converted out of a (Java) Iterable (even though I did not really believe that myself), I tried replacing BuildingInformation with a Tuple5 of all its member fields. The behaviour did not change.

I could perhaps have tried providing a CanBuildFrom, but I did not: I refused to accept that a simple case class could not be mapped into another data structure unless something else was wrong, something that is not obvious to me.

To round off this post: I tried the Flink builds for both Scala 2.11.x and Scala 2.10.x, and the behaviour is the same.

Also, here is EnhancedHVACTempReading (to help make sense of the code):

case class EnhancedHVACTempReading(buildingID: Int, rangeOfTemp: String, extremeIndicator: Boolean, country: String, productID: String, buildingAge: Int, buildingManager: String)

I have a hunch that the JVM's confusion has something to do with a Java Iterable being treated as a Scala List, but of course I am not sure.

Can anyone help me spot the mistake?

Best Answer

The problem is that you have to return something from the map function in readBuildingInfo. Furthermore, if you have a List[BuildingInformation], you should not use fromElements but rather fromCollection if you want to flatten the list into its elements. The following code snippet shows the necessary changes.

def main(args: Array[String]): Unit = {
    val envDefault = ExecutionEnvironment.getExecutionEnvironment

    val buildingsBroadcastSet = readBuildingInfo(envDefault,"./SensorFiles/building.csv")

    val hvacStream = readHVACReadings(envDefault,"./SensorFiles/HVAC.csv")

    hvacStream
      .map(new HVACToBuildingMapper)
      .withBroadcastSet(buildingsBroadcastSet,"buildingData")
      .writeAsCsv("./hvacTemp.csv")

    envDefault.execute("HVAC Simulation")
}

private def readBuildingInfo(env: ExecutionEnvironment, inputPath: String): DataSet[BuildingInformation] = {
    // Requires import scala.io.Source for the file reading below
    val input = Source.fromFile(inputPath).getLines.drop(1).map(datum => {

      val fields = datum.split(",")
      BuildingInformation(
          fields(0).toInt,     // buildingID
          fields(1),           // buildingManager
          fields(2).toInt,     // buildingAge
          fields(3),           // productID
          fields(4)            // Country
        )
    })
    env.fromCollection(input.toList)
}
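To make the difference explicit, here is a minimal sketch (the value names are illustrative) contrasting the two factory methods. It also explains the original ClassCastException: fromElements wrapped the whole list into a single element, so the broadcast variable contained one Scala List (hence scala.collection.immutable.$colon$colon) instead of individual BuildingInformation records.

import org.apache.flink.api.scala._

val envDefault = ExecutionEnvironment.getExecutionEnvironment

// Illustrative placeholder for the parsed CSV rows
val buildings: List[BuildingInformation] = List.empty

// fromElements keeps the list as ONE record: DataSet[List[BuildingInformation]]
val asSingleElement: DataSet[List[BuildingInformation]] = envDefault.fromElements(buildings)

// fromCollection unpacks the list: one record per BuildingInformation
val asManyElements: DataSet[BuildingInformation] = envDefault.fromCollection(buildings)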

The original question, "java - Apache Flink: transforming Broadcast variables fails, but I cannot figure out why", can be found on Stack Overflow: https://stackoverflow.com/questions/36929423/
