apache-spark - Using a regular expression to check whether any of several columns is greater than zero

Tags: apache-spark pyspark apache-spark-sql

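I need to apply the when function across multiple columns. I want to check whether at least one of the columns has a value greater than 0.

Here is my solution: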

df.withColumn("any value", F.when(
   (col("col1") > 0) |
   (col("col2") > 0) |
   (col("col3") > 0) |
   ...
   (col("colX") > 0)
   , "any greater than 0").otherwise(None))

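Is it possible to do the same thing with a regular expression, so that I don't have to write out every column name?

Best answer

Let's create some sample data: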

df = spark.createDataFrame(
    [(0, 0, 0, 0), (0, 0, 2, 0), (0, 0, 0, 0), (1, 0, 0, 0)],
    ['a', 'b', 'c', 'd']
)

You can then use map and reduce to build the condition from a list of columns (for example, all columns of the DataFrame), as follows:

from functools import reduce
from pyspark.sql import functions as F

cols = df.columns
# OR together one "column > 0" test per column
condition = reduce(lambda a, b: a | b, map(lambda c: F.col(c) > 0, cols))
df.withColumn("any value", F.when(condition, "any greater than 0")).show()

This produces:

+---+---+---+---+------------------+
|  a|  b|  c|  d|         any value|
+---+---+---+---+------------------+
|  0|  0|  0|  0|              null|
|  0|  0|  2|  0|any greater than 0|
|  0|  0|  0|  0|              null|
|  1|  0|  0|  0|any greater than 0|
+---+---+---+---+------------------+
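The answer above applies the condition to every column of the DataFrame. To restrict it to columns whose names match a regular expression, as the question asks, one option is to filter df.columns with Python's re module before building the condition. The following is a minimal sketch, not part of the original answer: the helper names matching_columns and any_greater_than_zero, the pattern col\d+ and the sample column names are all hypothetical.

```python
import re
from functools import reduce
from operator import or_


def matching_columns(columns, pattern):
    """Return the column names that fully match the regular expression."""
    regex = re.compile(pattern)
    return [name for name in columns if regex.fullmatch(name)]


def any_greater_than_zero(df, pattern):
    """Build a Column that is true when any matching column is > 0."""
    # Imported here so the name-selection helper works without a Spark install.
    from pyspark.sql import functions as F

    selected = matching_columns(df.columns, pattern)
    if not selected:
        # reduce() over an empty list would fail with a less helpful error.
        raise ValueError(f"no column matches {pattern!r}")
    return reduce(or_, [F.col(name) > 0 for name in selected])


# Pure-Python check of the selection step (hypothetical column names).
print(matching_columns(["id", "col1", "col2", "col10", "name"], r"col\d+"))
# prints ['col1', 'col2', 'col10']
```

With a DataFrame whose columns follow that naming scheme, the result could be passed to when in the same way as above, for example df.withColumn("any value", F.when(any_greater_than_zero(df, r"col\d+"), "any greater than 0")). Using fullmatch rather than search is a deliberate choice here, so that a pattern such as col\d+ does not also pick up a column named, say, col1_flag.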

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/68158225/
