scala - 如何在 Spark/Scala 中使用爆炸

标签 scala apache-spark apache-spark-sql

这是我的父数据框。

+------------------+-------------------------+---------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+--------------------------+----------------------+-----------------------------+----------+------------+-------------+-------------------------+--------------------------+----------------------------+-----------------+---------------+-------------------------+-----------------------+-------------------------+---------------------------+-----------+
|DataPartition     |TimeStamp                |_organizationId|_sourceId|sr:Auditors                                                                                                                                                         |sr:CapitalChangeAdjustmentDate|sr:ContainsPreliminaryData|sr:ContainsRestatement|sr:CumulativeAdjustmentFactor|sr:Dcn    |sr:DocFormat|sr:DocumentId|sr:FilingDateTime        |sr:FilingDateTimeUTCOffset|sr:IsFilingDateTimeEstimated|sr:SourceTypeCode|sr:SourceTypeId|sr:StatementDate         |sr:ThirdPartySourceCode|sr:ThirdPartySourceCodeId|sr:ThirdPartySourcePriority|FFAction|!||
+------------------+-------------------------+---------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+--------------------------+----------------------+-----------------------------+----------+------------+-------------+-------------------------+--------------------------+----------------------------+-----------------+---------------+-------------------------+-----------------------+-------------------------+---------------------------+-----------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|4298009288     |80       |[WrappedArray([16165,null,UWE,3010547,3020538,true,false,true])]                                                                                                    |2017-07-31T00:00:00+00:00     |false                     |false                 |1.0                          |171105584 |ASFILED     |null         |2017-09-28T23:00:00+00:00|-300                      |false                       |10K              |3011835        |2017-07-31T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-21T12:09:23+00:00|4295904170     |364      |null                                                                                                                                                                |2017-07-30T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-08-08T17:00:00+00:00|-300                      |false                       |10Q              |3011836        |2017-07-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-21T12:09:23+00:00|4295904170     |365      |[WrappedArray([3541,3024068,UNQ,3010546,null,true,true,false])]                                                                                                     |2017-09-30T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-10-10T17:00:00+00:00|-300                      |false                       |10K              |3011835        |2017-09-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-21T12:17:49+00:00|4295904170     |365      |[WrappedArray([3541,3024068,UNQ,3010546,null,true,true,false])]                                                                                                     |2017-09-30T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-10-10T17:00:00+00:00|-300                      |false                       |10K              |3011835        |2017-09-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-21T12:18:55+00:00|4295904170     |364      |null                                                                                                                                                                |2017-07-30T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-08-08T17:00:00+00:00|-300                      |false                       |10Q              |3011836        |2017-07-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-21T12:18:55+00:00|4295904170     |365      |[WrappedArray([3541,3024068,UNQ,3010546,null,true,true,false])]                                                                                                     |2017-09-30T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-10-10T17:00:00+00:00|-300                      |false                       |10K              |3011835        |2017-09-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-03T12:30:00+00:00|4295858941     |10       |null                                                                                                                                                                |2016-03-31T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2016-04-04T12:00:00+00:00|0                         |false                       |AUTINR           |3011930        |2016-03-31T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|Japan             |2017-04-25T07:34:37+00:00|4295876606     |157      |null                                                                                                                                                                |2016-09-30T00:00:00+00:00     |true                      |false                 |1.0                          |tn01037828|PDFNTV      |42149999     |2016-11-11T06:40:00+00:00|540                       |false                       |TAN              |3012925        |2016-09-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|Japan             |2017-04-25T07:34:37+00:00|4295876606     |158      |null                                                                                                                                                                |2016-09-30T00:00:00+00:00     |false                     |false                 |1.0                          |yo00185376|PDFNTV      |42148210     |2016-11-11T06:41:00+00:00|540                       |false                       |YUH              |3013057        |2016-09-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|Japan             |2017-04-25T07:34:37+00:00|4295876606     |159      |null                                                                                                                                                                |2016-12-31T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-01-01T03:00:00+00:00|540                       |false                       |JPNINR           |3012418        |2016-12-31T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-04-20T07:14:46+00:00|4296803503     |1        |null                                                                                                                                                                |2016-03-31T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2016-04-04T17:00:00+00:00|-300                      |false                       |10Q              |3011836        |2016-03-31T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|ThirdPartyPrivate |2017-04-19T09:49:23+00:00|4296803503     |1        |null                                                                                                                                                                |2016-03-31T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2016-04-04T17:00:00+00:00|-300                      |false                       |10Q              |3011836        |2016-03-31T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-05-04T09:46:05+00:00|4295907168     |446      |null                                                                                                                                                                |2016-06-30T00:00:00+00:00     |true                      |false                 |1.0                          |161773588 |ASFILED     |41178144     |2016-07-19T16:00:00+00:00|-240                      |false                       |8K0              |3011852        |2016-06-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-05-04T09:46:05+00:00|4295907168     |447      |[WrappedArray([2815,3023351,UNQ,3010546,null,true,false,false])]                                                                                                    |2016-06-30T00:00:00+00:00     |false                     |false                 |1.0                          |161790278 |ASFILED     |41254569     |2016-07-28T17:00:00+00:00|-300                      |false                       |10K              |3011835        |2016-06-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-05-04T09:46:05+00:00|4295907168     |450      |null                                                                                                                                                                |null                          |false                     |false                 |1.0                          |null      |null        |null         |2016-06-30T00:00:00+00:00|0                         |true                        |ESGWEB           |1002198005     |2016-06-30T00:00:00+00:00|ATD                    |null                     |null                       |I|!|       |
|SelfSourcedPublic |2017-05-04T09:46:05+00:00|4295907168     |461      |null                                                                                                                                                                |2016-06-30T00:00:00+00:00     |false                     |false                 |1.0                          |161940929 |ASFILED     |41914564     |2016-10-18T17:00:00+00:00|-300                      |false                       |14A              |3011840        |2016-06-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-22T07:56:09+00:00|4295906830     |344      |[WrappedArray([3541,3024068,UNQ,3010546,null,true,false,false], [9574,3030421,UWE,3010547,null,true,false,false])]                                                  |2017-10-30T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-11-11T17:00:00+00:00|-300                      |false                       |10K              |3011835        |2017-10-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2018-01-11T10:00:22+00:00|5045881237     |15       |[WrappedArray([2719,3023331,UNQ,3010546,null,true,false,false], [5937,3026578,UWE,3010547,null,true,false,false], [8348,3029157,UNQ,3010546,null,true,false,false])]|2017-12-31T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2018-01-01T04:00:00+00:00|480                       |false                       |MYSARS           |3012622        |2017-12-31T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-22T07:56:09+00:00|4295906830     |344      |[WrappedArray([3541,3024068,UNQ,3010546,null,true,false,false], [9574,3030421,UWE,3010547,null,true,false,false])]                                                  |2017-10-30T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-11-11T17:00:00+00:00|-300                      |false                       |10K              |3011835        |2017-10-30T00:00:00+00:00|SS                     |1000716240               |1                          |I|!|       |
|SelfSourcedPublic |2017-11-28T07:13:51+00:00|4295859031     |59       |null                                                                                                                                                                |2017-03-31T00:00:00+00:00     |false                     |false                 |1.0                          |null      |null        |null         |2017-04-04T12:00:00+00:00|-360                      |false                       |10Q              |null           |2017-03-31T00:00:00+00:00|SS                     |null                     |null                       |D|!|       |
+------------------+-------------------------+---------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+--------------------------+----------------------+-----------------------------+----------+------------+-------------+-------------------------+--------------------------+----------------------------+-----------------+---------------+-------------------------+-----------------------+-------------------------+---------------------------+-----------+

除此之外,我想为 sr:Auditor 创建一个数据框,其中包含上面数据框的四列(DataPatition、TimeStamp、_organizationId、_sourceId)。

我该怎么做?

这是数据框的架构

  root
 |-- _al: string (nullable = true)
 |-- _cr: string (nullable = true)
 |-- _env: string (nullable = true)
 |-- _fl: string (nullable = true)
 |-- _fs: string (nullable = true)
 |-- _fsb: string (nullable = true)
 |-- _ful: string (nullable = true)
 |-- _fun: string (nullable = true)
 |-- _ir: string (nullable = true)
 |-- _majVers: long (nullable = true)
 |-- _md: string (nullable = true)
 |-- _minVers: double (nullable = true)
 |-- _pe: string (nullable = true)
 |-- _pubStyle: string (nullable = true)
 |-- _schemaLocation: string (nullable = true)
 |-- _seg: string (nullable = true)
 |-- _sli: string (nullable = true)
 |-- _sr: string (nullable = true)
 |-- _ss: string (nullable = true)
 |-- _xsi: string (nullable = true)
 |-- env:Body: struct (nullable = true)
 |    |-- _contentSet: string (nullable = true)
 |    |-- _majVers: long (nullable = true)
 |    |-- _minVers: double (nullable = true)
 |    |-- env:ContentItem: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _action: string (nullable = true)
 |    |    |    |-- env:Data: struct (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |    |-- sr:Source: struct (nullable = true)
 |    |    |    |    |    |-- _organizationId: long (nullable = true)
 |    |    |    |    |    |-- _sourceId: long (nullable = true)
 |    |    |    |    |    |-- sr:Auditors: struct (nullable = true)
 |    |    |    |    |    |    |-- sr:Auditor: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |-- _auditorId: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- sr:AuditorEnumerationId: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- sr:AuditorOpinionCode: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- sr:AuditorOpinionId: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- sr:AuditorOpinionOnInternalControlsId: long (nullable = true)
 |    |    |    |    |    |    |    |    |-- sr:IsPlayingAuditorRole: boolean (nullable = true)
 |    |    |    |    |    |    |    |    |-- sr:IsPlayingCSRAuditorRole: boolean (nullable = true)
 |    |    |    |    |    |    |    |    |-- sr:IsPlayingTaxAdvisorRole: boolean (nullable = true)
 |    |    |    |    |    |-- sr:CapitalChangeAdjustmentDate: string (nullable = true)
 |    |    |    |    |    |-- sr:ContainsPreliminaryData: boolean (nullable = true)
 |    |    |    |    |    |-- sr:ContainsRestatement: boolean (nullable = true)
 |    |    |    |    |    |-- sr:CumulativeAdjustmentFactor: double (nullable = true)
 |    |    |    |    |    |-- sr:Dcn: string (nullable = true)
 |    |    |    |    |    |-- sr:DocFormat: string (nullable = true)
 |    |    |    |    |    |-- sr:DocumentId: long (nullable = true)
 |    |    |    |    |    |-- sr:FilingDateTime: string (nullable = true)
 |    |    |    |    |    |-- sr:FilingDateTimeUTCOffset: long (nullable = true)
 |    |    |    |    |    |-- sr:IsFilingDateTimeEstimated: boolean (nullable = true)
 |    |    |    |    |    |-- sr:SourceTypeCode: string (nullable = true)
 |    |    |    |    |    |-- sr:SourceTypeId: long (nullable = true)
 |    |    |    |    |    |-- sr:StatementDate: string (nullable = true)
 |    |    |    |    |    |-- sr:ThirdPartySourceCode: string (nullable = true)
 |    |    |    |    |    |-- sr:ThirdPartySourceCodeId: long (nullable = true)
 |    |    |    |    |    |-- sr:ThirdPartySourcePriority: long (nullable = true)
 |-- env:Header: struct (nullable = true)
 |    |-- env:Info: struct (nullable = true)
 |    |    |-- env:Id: string (nullable = true)
 |    |    |-- env:TimeStamp: string (nullable = true)
 |    |-- fun:DataPartitionId: long (nullable = true)
 |    |-- fun:OrgId: long (nullable = true)

这就是我正在做的

val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartition"),$"env:Header.env:info.env:TimeStamp".as("TimeStamp"),$"column1.*")


//val childDF=dfType.select($"_organizationId".as("organizationId"), $"_sourceId".as("sourceId"), explode($"sr:Auditors.sr:Auditor").as("Auditors"), getFFActionChild($"FFAction|!|").as("FFAction|!|"))

//childDF.show()

我不知道我错过了什么?

注意: 我知道对此有很多问题,但有人可以解释我何时使用爆炸以及为什么这在我的情况下不起作用?

最佳答案

根据你的问题,我假设你正在尝试爆炸 Auditor具有外部四列的数组 schema (DataPatition,TimeStamp,_organizationId,_sourceId) 。如果我的假设是正确的,那么在得到 dfContentItem 后执行以下三个步骤dataframe 应该可以解决您面临的问题

val sourceDF=dfContentItem.select($"DataPartition", $"TimeStamp", $"env:Data.sr:Source.*")
val childDF = sourceDF.select($"DataPartition", $"TimeStamp", $"_organizationId", $"_sourceId", explode($"sr:Auditors.sr:Auditor").as("Auditors"))
childDF.select($"DataPartition", $"TimeStamp", $"-organizationId", $"_sourceId", $"Auditors.*").show(false)

说明

DataPatition,TimeStamp创建 dfContentItem 时已选择列数据框。 _organizationId,_sourceId列绑定(bind)在 "env:Data.sr:Source" 内,所以上面解决方案中的第一行是分隔 struct Source字段分成单独的列,其中将有 _organizationId,_sourceId字段分成单独的列。上面第二行代码explodes Auditor Auditors 的元素最后一行代码是将分解的结构分成单独的列。

关于scala - 如何在 Spark/Scala 中使用爆炸,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48740610/

相关文章:

apache-spark - 连接具有相同分区的两个 DataFrame 后,Spark DataFrame 具有哪些属性?

java - 如何在Java中的javaPairRDD上使用aggregateByKey?

scala - Spark 1.6 : java. lang.IllegalArgumentException : spark. sql.execution.id 已经设置

scala - 如何在 Scala 中使用模式匹配以惯用的方式安排返回顺序

scala - 将 csv 转换为 RDD

scala - 我如何开始学习 scala 中的无形概念

scala - 在 Spark 中连接 Maptype 值时如何处理空值

python - PySpark 1.5 如何将时间戳从秒截断到最近的分钟

apache-spark - Pyspark udf 对于不带参数的函数失败,但对于不带参数的 lambda 有效

java - 如何在不运行App(main)的情况下初始化App的成员?