apache-spark - Column data to a nested JSON object in Spark Structured Streaming

Tags: apache-spark elasticsearch spark-structured-streaming

In our application, we use Spark SQL to extract field values as columns. I am trying to figure out how to put the column values into a nested JSON object and push it to Elasticsearch. Also, is there a way to parameterize the values in selectExpr to pass to the regular expression?

We are currently using the Spark Java API.

Dataset<Row> data = rowExtracted.selectExpr("split(value,\"[|]\")[0] as channelId",
                "split(value,\"[|]\")[1] as country",
                "split(value,\"[|]\")[2] as product",
                "split(value,\"[|]\")[3] as sourceId",
                "split(value,\"[|]\")[4] as systemId",
                "split(value,\"[|]\")[5] as destinationId",
                "split(value,\"[|]\")[6] as batchId",
                "split(value,\"[|]\")[7] as orgId",
                "split(value,\"[|]\")[8] as businessId",
                "split(value,\"[|]\")[9] as orgAccountId",
                "split(value,\"[|]\")[10] as orgBankCode",
                "split(value,\"[|]\")[11] as beneAccountId",
                "split(value,\"[|]\")[12] as beneBankId",
                "split(value,\"[|]\")[13] as currencyCode",
                "split(value,\"[|]\")[14] as amount",
                "split(value,\"[|]\")[15] as processingDate",
                "split(value,\"[|]\")[16] as status",
                "split(value,\"[|]\")[17] as rejectCode",
                "split(value,\"[|]\")[18] as stageId",
                "split(value,\"[|]\")[19] as stageStatus",
                "split(value,\"[|]\")[20] as stageUpdatedTime",
                "split(value,\"[|]\")[21] as receivedTime",
                "split(value,\"[|]\")[22] as sendTime"
        );
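On the parameterization question: since the 23 expressions differ only in index and alias, one option is to generate the expression strings in a loop and pass the resulting array to selectExpr (which accepts varargs, so a `String[]` works directly in Java). A minimal sketch, with the field list and order taken from the expressions above; the generated strings use single quotes, which Spark SQL also accepts:

```java
import java.util.stream.IntStream;

public class SelectExprBuilder {
    // Field names in pipe-delimited order, copied from the selectExpr above.
    public static final String[] FIELDS = {
            "channelId", "country", "product", "sourceId", "systemId",
            "destinationId", "batchId", "orgId", "businessId", "orgAccountId",
            "orgBankCode", "beneAccountId", "beneBankId", "currencyCode",
            "amount", "processingDate", "status", "rejectCode", "stageId",
            "stageStatus", "stageUpdatedTime", "receivedTime", "sendTime"
    };

    // Builds "split(value,'[|]')[i] as <field>" for each field in order.
    public static String[] buildExprs(String[] fields) {
        return IntStream.range(0, fields.length)
                .mapToObj(i -> String.format("split(value,'[|]')[%d] as %s", i, fields[i]))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        for (String e : buildExprs(FIELDS)) {
            System.out.println(e);
        }
    }
}
```

The original call would then collapse to something like `Dataset<Row> data = rowExtracted.selectExpr(SelectExprBuilder.buildExprs(SelectExprBuilder.FIELDS));`.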
StreamingQuery query = data.writeStream()
                .outputMode(OutputMode.Append()).format("es").option("checkpointLocation", "C:\\checkpoint")
                .start("spark_index/doc");

Actual output:
{
  "_index": "spark_index",
  "_type": "doc",
  "_id": "test123",
  "_version": 1,
  "_score": 1,
  "_source": {
    "channelId": "test",
    "country": "SG",
    "product": "test",
    "sourceId": "",
    "systemId": "test123",
    "destinationId": "",
    "batchId": "",
    "orgId": "test",
    "businessId": "test",
    "orgAccountId": "test",
    "orgBankCode": "",
    "beneAccountId": "test",
    "beneBankId": "test",
    "currencyCode": "SGD",
    "amount": "53.0000",
    "processingDate": "",
    "status": "Pending",
    "rejectCode": "test",
    "stageId": "123",
    "stageStatus": "Comment",
    "stageUpdatedTime": "2019-08-05 18:11:05.999000",
    "receivedTime": "2019-08-05 18:10:12.701000",
    "sendTime": "2019-08-05 18:11:06.003000"
  }
}

We need the above columns under a node "txn_summary", like the following JSON:

Expected output:
{
  "_index": "spark_index",
  "_type": "doc",
  "_id": "test123",
  "_version": 1,
  "_score": 1,
  "_source": {
    "txn_summary": {
      "channelId": "test",
      "country": "SG",
      "product": "test",
      "sourceId": "",
      "systemId": "test123",
      "destinationId": "",
      "batchId": "",
      "orgId": "test",
      "businessId": "test",
      "orgAccountId": "test",
      "orgBankCode": "",
      "beneAccountId": "test",
      "beneBankId": "test",
      "currencyCode": "SGD",
      "amount": "53.0000",
      "processingDate": "",
      "status": "Pending",
      "rejectCode": "test",
      "stageId": "123",
      "stageStatus": "Comment",
      "stageUpdatedTime": "2019-08-05 18:11:05.999000",
      "receivedTime": "2019-08-05 18:10:12.701000",
      "sendTime": "2019-08-05 18:11:06.003000"
    }
  }
}
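To make the desired reshaping concrete independently of Spark, here is a plain-Java illustration (no Spark involved): every flat field moves under a single "txn_summary" key, while the field names and values themselves stay unchanged. The sample values are taken from the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NestedShape {
    // Wraps the flat field map as the single value of a "txn_summary" key.
    public static Map<String, Object> nest(Map<String, Object> flat) {
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("txn_summary", flat);
        return root;
    }

    public static void main(String[] args) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flat.put("channelId", "test");
        flat.put("country", "SG");
        System.out.println(nest(flat)); // {txn_summary={channelId=test, country=SG}}
    }
}
```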

Best Answer

Adding all the columns to a top-level struct should give the expected output. In Scala:

import org.apache.spark.sql.functions._

data.select(struct(data.columns:_*).as("txn_summary"))

In Java I suspect something like the following; note that `functions.struct` takes `Column...` (or `String, String...`), so the `String[]` returned by `data.columns()` must be mapped to `Column[]` first:

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

Column[] cols = Arrays.stream(data.columns())
        .map(functions::col)
        .toArray(Column[]::new);
Dataset<Row> nested = data.select(functions.struct(cols).as("txn_summary"));

Regarding "apache-spark - Column data to a nested JSON object in Spark Structured Streaming", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57368924/
