I'm trying to filter a Spark DataFrame based on whether a column's value equals a list. I want to do something like this:
filtered_df = df.where(df.a == ['list', 'of', 'stuff'])
where filtered_df contains only the rows whose filtered_df.a value is ['list', 'of', 'stuff'], and the type of a is array (nullable = true).
Best Answer
Update:
In current versions you can use an array of literals:
from pyspark.sql.functions import array, lit
df.where(df.a == array(*[lit(x) for x in ['list', 'of', 'stuff']]))
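To make the `array(*[lit(x) for x in ...])` idiom concrete: `lit` wraps each Python value as a literal Column and `array` combines those Columns into a single array Column, so the column can be compared against one constant array expression. A minimal sketch with stand-in functions (not Spark's real implementations, just mimics of the call shape) shows the unpacking mechanics:

```python
# Stand-ins that only mimic the call shape of pyspark.sql.functions.lit/array;
# they are NOT the real Spark functions.
def lit(x):
    return ('lit', x)             # pretend literal Column

def array(*cols):
    return ('array', list(cols))  # pretend array Column built from its arguments

# The comprehension builds one "lit" per element, and * unpacks the list
# into separate positional arguments for array().
expr = array(*[lit(x) for x in ['list', 'of', 'stuff']])
print(expr)
# ('array', [('lit', 'list'), ('lit', 'of'), ('lit', 'stuff')])
```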
Original answer:
Well, a slightly hacky way to do it, which doesn't require a Python batch job, is something like this:
from pyspark.sql.functions import col, size
from functools import reduce
from operator import and_

def array_equal(c, an_array):
    same_size = size(c) == len(an_array)  # Check if the sizes match
    # Check if all items are equal
    same_items = reduce(
        and_,
        (c.getItem(i) == an_array[i] for i in range(len(an_array)))
    )
    return and_(same_size, same_items)
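The composition above can be illustrated without Spark. The plain-Python sketch below (a hypothetical helper, not part of the answer's code) mirrors what array_equal builds: reduce(and_, ...) folds the per-position equality checks into a single conjunction, and the separate size check is what stops longer arrays that merely start with the target from matching.

```python
from functools import reduce
from operator import and_

def list_equal(xs, target):
    # Mirror of array_equal: same length AND every position equal.
    same_size = len(xs) == len(target)
    # reduce(and_, ...) folds the per-item comparisons into one boolean,
    # just as the Spark version folds Columns into one Column expression.
    same_items = reduce(
        and_,
        (i < len(xs) and xs[i] == target[i] for i in range(len(target)))
    )
    return same_size and same_items

print(list_equal(['list', 'of', 'stuff'], ['list', 'of', 'stuff']))            # True
print(list_equal(['list', 'of', 'stuff', 'and', 'foo'], ['list', 'of', 'stuff']))  # False: prefix only
```

Note that in Spark the fold must use `and_` (i.e. `&`) rather than Python's `and`, because `and` cannot be overloaded for Column objects.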
A quick test:
df = sc.parallelize([
    (1, ['list', 'of', 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list', 'of', 'stuff', 'and', 'foo']),
    (5, ['a', 'list', 'of', 'stuff']),
]).toDF(['id', 'a'])

df.where(array_equal(col('a'), ['list', 'of', 'stuff'])).show()
## +---+-----------------+
## | id| a|
## +---+-----------------+
## | 1|[list, of, stuff]|
## +---+-----------------+
Regarding python - filtering on whether a column value equals a list in Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36207112/