python - Filter by whether column value equals a list in Spark

Tags: python apache-spark pyspark apache-spark-sql

I am trying to filter a Spark DataFrame based on whether the value in a column equals a list. I would like to do something like this:

filtered_df = df.where(df.a == ['list','of' , 'stuff'])

filtered_df should contain only the rows where filtered_df.a equals ['list','of' , 'stuff']. The type of a is array (nullable = true).

Best answer

Update:

In current versions you can compare the column against an array of literals:

from pyspark.sql.functions import array, lit

df.where(df.a == array(*[lit(x) for x in ['list','of' , 'stuff']]))

Original answer:

Well, a slightly hacky approach that does not require a Python batch job looks something like this:

from pyspark.sql.functions import col, lit, size
from functools import reduce
from operator import and_

def array_equal(c, an_array):
    same_size = size(c) == len(an_array)  # Check if the same size
    # Check if all items equal
    same_items = reduce(
        and_, 
        (c.getItem(i) == an_array[i] for i in range(len(an_array)))
    )
    return and_(same_size, same_items)

Quick test:

df = sc.parallelize([
    (1, ['list','of' , 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list','of' , 'stuff', 'and', 'foo']),
    (5, ['a', 'list','of' , 'stuff']),
]).toDF(['id', 'a'])

df.where(array_equal(col('a'), ['list','of' , 'stuff'])).show()
## +---+-----------------+
## | id|                a|
## +---+-----------------+
## |  1|[list, of, stuff]|
## +---+-----------------+
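
The two checks inside array_equal (same size, then the same item at every position) can be modelled in plain Python on ordinary lists, which makes the logic easy to try without a Spark session. The sketch below is only an illustration: list_equal, rows, and target are hypothetical names, not part of the answer or of PySpark. It also passes True as the initial value to reduce, because reduce over an empty sequence with no initial value raises TypeError, which is what array_equal as written above would hit for an empty list. In the Spark version the analogous initial value would presumably be a literal true column such as lit(True); treat that as an assumption.

```python
from functools import reduce
from operator import and_


def list_equal(row_value, expected):
    """Plain-Python model of array_equal: same length, then same items by position."""
    # A Python list raises IndexError on a missing index, so check the size first.
    if len(row_value) != len(expected):
        return False
    # Fold the per-position comparisons with `and`; the initial True keeps
    # reduce from raising TypeError when `expected` is empty.
    return reduce(
        and_,
        (row_value[i] == expected[i] for i in range(len(expected))),
        True,
    )


# Same sample data as the quick test, keyed by id (hypothetical stand-in for the DataFrame).
rows = {
    1: ['list', 'of', 'stuff'],
    2: ['foo', 'bar'],
    3: ['foobar'],
    4: ['list', 'of', 'stuff', 'and', 'foo'],
    5: ['a', 'list', 'of', 'stuff'],
}
target = ['list', 'of', 'stuff']

matching_ids = [row_id for row_id, a in rows.items() if list_equal(a, target)]
print(matching_ids)  # [1]
```

As in the Spark output, only id 1 matches: a longer array that merely starts with or contains the target (ids 4 and 5) is rejected by the size check.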

Regarding python - Filter by whether column value equals a list in Spark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36207112/
