scala - Function to find overlapping data in a Spark DataFrame

Tags: scala function apache-spark apache-spark-sql

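I have the following DataFrame:

I am trying to find passengers who have ridden more than 3 trains together.

So in the example above, passengers 1 and 2 have shared more than 3 trains [2, 3, 4, 6], and passengers 4 and 5 have shared more than 3 trains [7, 32, 44, 54].

Is there a Scala function I can write for this? I tried intersect-style functions, but I can't seem to apply them across the whole DataFrame.

Thanks for your help.

As for the expected output, I think it would return a DataFrame containing the following:

My DataFrame has about 15,000 rows.

Thanks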

Best answer

You can combine a self-join with the built-in array_intersect function, among others:

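// array_intersect and array_sort require Spark 2.4 or later.
// toDF and the $"..." column syntax come from spark.implicits._, which
// spark-shell imports automatically; elsewhere, import it from your
// SparkSession (assumed here to be named `spark`).
import spark.implicits._

val df = Seq(
  (1, Seq(1, 2, 3, 4, 6)),
  (2, Seq(2, 3, 4, 6, 7)),
  (3, Seq(1, 2, 5, 9, 100)),
  (4, Seq(11, 2, 4, 5, 7, 32, 44, 54)),
  (5, Seq(7, 12, 34, 32, 44, 54)),
  (6, Seq(5, 21))
).toDF("passengerId", "trainId")

// Self-join: pair every passenger with every other passenger.
df.as("d1").join(df.as("d2"), $"d1.passengerId" =!= $"d2.passengerId")
  .selectExpr("d1.passengerId as passengerId1", "d2.passengerId as passengerId2", "d1.trainId as trainId1", "d2.trainId as trainId2")
  // Keep only the pairs that share more than 3 trains.
  .where("size(array_intersect(trainId1, trainId2)) > 3")
  // Each pair appears twice, as (a, b) and (b, a); sort the IDs and deduplicate.
  .selectExpr("array_sort(array(passengerId1, passengerId2)) as ar")
  .distinct()
  .selectExpr("ar[0] as usr1", "ar[1] as usr2")
  .show()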

// +----+----+
// |usr1|usr2|
// +----+----+
// |1   |2   |
// |4   |5   |
// +----+----+
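
To sanity-check the result on a small sample without a Spark session, the same pairwise-intersection logic can be sketched on plain Scala collections. This is only an illustration: the object CoTravellers, the method frequentPairs and the minShared parameter are hypothetical names that are not part of the original answer, and the sample data is copied from the answer's df.

```scala
// Hypothetical helper, not part of the original answer.
object CoTravellers {
  // Same sample data as the DataFrame in the answer: passengerId -> trainIds.
  val sampleRides: Map[Int, Seq[Int]] = Map(
    1 -> Seq(1, 2, 3, 4, 6),
    2 -> Seq(2, 3, 4, 6, 7),
    3 -> Seq(1, 2, 5, 9, 100),
    4 -> Seq(11, 2, 4, 5, 7, 32, 44, 54),
    5 -> Seq(7, 12, 34, 32, 44, 54),
    6 -> Seq(5, 21)
  )

  // Returns (passengerA, passengerB, sharedTrains) for every unordered pair
  // that shares strictly more than `minShared` distinct trains.
  def frequentPairs(
      rides: Map[Int, Seq[Int]],
      minShared: Int = 3
  ): Seq[(Int, Int, Set[Int])] = {
    // Convert each train list to a set so intersection ignores duplicates.
    val trainSets = rides.map { case (id, trains) => id -> trains.toSet }
    val ids = trainSets.keys.toSeq.sorted
    for {
      a <- ids
      b <- ids
      if a < b // visit each unordered pair once, so no deduplication is needed
      shared = trainSets(a) intersect trainSets(b)
      if shared.size > minShared
    } yield (a, b, shared)
  }

  def main(args: Array[String]): Unit =
    frequentPairs(sampleRides).foreach { case (a, b, shared) =>
      println(s"$a and $b share trains ${shared.toSeq.sorted.mkString("[", ", ", "]")}")
    }
}
```

Comparing IDs with a < b plays the role of the array_sort plus distinct step in the Spark version. Like the self-join, this sketch compares every pair of passengers, so it is meant for checking small samples rather than for processing the full data set.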

This question, "scala - Function to find overlapping data in a Spark DataFrame", originally appeared on Stack Overflow: https://stackoverflow.com/questions/64479286/
