arrays - How to remove elements from an array Column in Spark?

Tags: arrays scala apache-spark dataframe seq

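I have a Seq and a DataFrame. The DataFrame contains a column of array type. I am trying to remove the elements of the Seq from that column.

For example: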

val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

    +---------------------------------------------------+
    |sorted_items                                       |
    +---------------------------------------------------+
    |[flannel, and, for, s, shirts, sleeve, warm]       |
    |[3, 5, kitchenaid, s]                              |
    |[5, 6, case, flip, inch, iphone, on, xs]           |
    |[almonds, chocolate, covered, dark, joe, s, the]   |
    |null                                               |
    |[]                                                 |
    |[animation, book]                                  |
    +---------------------------------------------------+

Expected output:

+---------------------------------------------------+
|sorted_items                                       |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm]                    |
|[3, 5, kitchenaid]                                 |
|[5, 6, case, flip, inch, iphone, xs]               |
|[almonds, chocolate, covered, dark, joe]           |
|null                                               |
|[]                                                 |
|[animation, book]                                  |
+---------------------------------------------------+

How can I do this in an efficient and optimized way?

Best answer

Use array_except from org.apache.spark.sql.functions (available since Spark 2.4):

import org.apache.spark.sql.{functions => F}

val stopWords = Array("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

val newDF = df.withColumn("sorted_items", F.array_except(df("sorted_items"), F.lit(stopWords)))

newDF.show(false)

Output:

+----------------------------------------+
|sorted_items                            |
+----------------------------------------+
|[flannel, shirts, sleeve, warm]         |
|[3, 5, kitchenaid]                      |
|[5, 6, case, flip, inch, iphone, xs]    |
|[almonds, chocolate, covered, dark, joe]|
|null                                    |
|[]                                      |
|[animation, book]                       |
+----------------------------------------+
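
One caveat worth knowing: array_except returns its result without duplicates, so a word that appears twice in a row is kept only once. If repeated words need to survive, one possible alternative is the higher-order filter function (in the Scala API since Spark 3.0). The sketch below is only an illustration and not part of the original answer: the object name RemoveStopWords, the helper dropStopWords, the local master setting, and the sample rows are all assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical helper object for illustration; not from the original answer.
object RemoveStopWords {

  // Keeps every array element that is not in stopWords.
  // Unlike array_except, repeated elements are preserved. A null array stays null.
  def dropStopWords(df: DataFrame, colName: String, stopWords: Seq[String]): DataFrame =
    df.withColumn(colName, F.filter(F.col(colName), x => !x.isin(stopWords: _*)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local run, for demonstration only
      .appName("remove-stop-words")
      .getOrCreate()

    val stopWords = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

    // Made-up sample rows: one with a repeated word, one empty array, one null.
    val schema = StructType(Seq(
      StructField("sorted_items", ArrayType(StringType), nullable = true)))
    val rows = java.util.Arrays.asList(
      Row(Seq("shirts", "and", "shirts", "s")),
      Row(Seq.empty[String]),
      Row(null)
    )
    val df = spark.createDataFrame(rows, schema)

    // array_except collapses the repeated "shirts" into a single element ...
    df.withColumn("sorted_items",
      F.array_except(F.col("sorted_items"), F.lit(stopWords.toArray))).show(false)

    // ... while filter keeps both occurrences.
    dropStopWords(df, "sorted_items", stopWords).show(false)

    spark.stop()
  }
}
```

Which variant fits depends on the data: if the arrays are already deduplicated, as the column name sorted_items may suggest, array_except is the shorter choice; filter is mainly of interest when element counts matter.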

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/56180887/
