Java/Spark : how to find key with max value in a col with array of struct of map

Tags: java dataframe apache-spark-sql aggregation apache-spark-mllib

I have a dataframe, and I want to get the key with the maximum value from each map.

Dataframe creation:

Dataset<Row> data = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/home/path/to/file/verify.csv");
// loading a Spark ML model
PipelineModel gloveModel = PipelineModel.load("models/gloveModel");
Dataset<Row> df = gloveModel.transform(data);

df.printSchema();

 |-- id: integer (nullable = true)
 |-- description: string (nullable = true)
 |-- class: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- result: string (nullable = true)     
 |    |    |-- metadata: map (nullable = true)      
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)

// The map entries look like this:

df.select("class.metadata").show(10,50);

+-----------------------------------------------------------------------------------------------------------------+
|                                                                                                         metadata|
+-----------------------------------------------------------------------------------------------------------------+
|  [[Sports -> 3.2911853E-9, Business -> 5.1852658E-6, World -> 3.96135E-9, Sci/Tech -> 0.9999949, sentence -> 0]]|
|      [[Sports -> 1.9902605E-10, Business -> 1.0305631E-8, World -> 1.0, Sci/Tech -> 3.543277E-9, sentence -> 0]]|
|    [[Sports -> 1.0, Business -> 8.1944885E-12, World -> 4.554111E-13, Sci/Tech -> 1.7239962E-12, sentence -> 0]]|
+-----------------------------------------------------------------------------------------------------------------+

I want to achieve the following result (the key with the highest value from each row's map):

+--------------+
|    prediction|
+--------------+
|      Sci/Tech|
|         World|
|        Sports|
+--------------+

What I have tried:

df.select(map_values(col("class.metadata"))).show(10, 50); but I end up with this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'map_values(`class`.`metadata`)' due to data type mismatch: argument 1 requires map type, however, '`class`.`metadata`' is of array<map<string,string>> type.;;
'Project [map_values(class#95.metadata) AS map_values(class.metadata)#106]...

df.select(flatten(col("class"))).show(); error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'flatten(`class`)' due to data type mismatch: The argument should be an array of arrays, but '`class`' is of array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>> type.;;
'Project [flatten(class#95) AS flatten(class)#106]

My Spark SQL version is 2.4.0 (where the explode function is deprecated).

Any suggestions/advice is much appreciated. Thanks!

Best Answer

class.metadata is an Array of Map type, but the map_values function only accepts a Map type.

Use explode to extract the maps from the array, then pass each map to the map_values function. Please check below.

// Scala: the $"..." column syntax requires spark.implicits to be in scope
import spark.implicits._
import org.apache.spark.sql.functions.{explode, map_values}

df.select(explode($"class.metadata").as("metadata"))
  .select(map_values($"metadata"))
  .show(false)
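The snippet above only surfaces the values; to get the desired prediction column you still need an arg-max step that picks the key with the largest value. The core comparison can be sketched in plain Java on one row's in-memory map (a minimal sketch, Spark-free; the ArgMax class name is hypothetical, and the bookkeeping "sentence" entry is skipped because it is not a score):

```java
import java.util.Comparator;
import java.util.Map;

public class ArgMax {
    // Returns the key whose numeric value is largest, ignoring the "sentence" entry.
    static String predict(Map<String, String> metadata) {
        return metadata.entrySet().stream()
                .filter(e -> !e.getKey().equals("sentence"))
                // compare the string-typed values as doubles
                .max(Map.Entry.comparingByValue(Comparator.comparingDouble(Double::parseDouble)))
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("empty metadata"));
    }

    public static void main(String[] args) {
        // one row's metadata map from the example above
        Map<String, String> row = Map.of(
                "Sports", "3.2911853E-9",
                "Business", "5.1852658E-6",
                "World", "3.96135E-9",
                "Sci/Tech", "0.9999949",
                "sentence", "0");
        System.out.println(predict(row)); // Sci/Tech
    }
}
```

In Spark this same logic could be wrapped in a UDF applied to the exploded metadata column to produce the prediction column directly.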

Regarding "Java/Spark: how to find key with max value in a col with array of struct of map", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61456882/
