apache-spark - Avoid performance impact of a single partition mode in Spark window functions

Tags: apache-spark pyspark apache-spark-sql partitioning window-functions

My question is triggered by the use case of calculating the differences between consecutive rows in a Spark DataFrame.

For example, I have:

>>> df.show()
+-----+----------+
|index|      col1|
+-----+----------+
|  0.0|0.58734024|
|  1.0|0.67304325|
|  2.0|0.85154736|
|  3.0| 0.5449719|
+-----+----------+


If I choose to compute these using a Window function, I can do it like this:

>>> from pyspark.sql import Window
>>> import pyspark.sql.functions as f
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index|      col1| diffs_col1|
+-----+----------+-----------+
|  0.0|0.58734024|0.085703015|
|  1.0|0.67304325| 0.17850411|
|  2.0|0.85154736|-0.30657548|
|  3.0| 0.5449719|       null|
+-----+----------+-----------+


Question: I am explicitly partitioning the DataFrame into a single partition. What is the performance impact of this, if any, why is that, and how can I avoid it? Because when I do not specify a partition, I get the following warning:

16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Best Answer

In practice the performance impact is almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and then iterated over sequentially, one by one.

The difference is only in the total number of partitions created. Let's illustrate this with an example, using a simple dataset with 10 partitions and 1000 records:

df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42))


If you define a frame without a partitionBy clause

w_unpart = Window.orderBy(f.col("index").asc())


and use it with lag

df_lag_unpart = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
)


there will be only one partition in total:

df_lag_unpart.rdd.glom().map(len).collect()


[1000]
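

The same thing is visible in the physical plan. As a quick sanity check (a sketch only; the exact plan text varies between Spark versions), the unpartitioned window introduces an exchange that moves all rows into a single partition, which is what the WindowExec warning refers to:

# In recent Spark versions the plan includes a step along the lines of
# "Exchange SinglePartition" feeding the Sort and Window operators.
df_lag_unpart.explain()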


Compared to a frame definition with a dummy index (simplified a bit compared to your code):

w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())


which will use a number of partitions equal to spark.sql.shuffle.partitions:

spark.conf.set("spark.sql.shuffle.partitions", 11)

df_lag_part = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1")
)

df_lag_part.rdd.glom().count()


11


with only one non-empty partition (every row carries the same literal key, so all of them hash into the same shuffle partition):

df_lag_part.rdd.glom().filter(lambda x: x).count()


1


Unfortunately, there is no universal solution to this problem in PySpark. It is simply an inherent consequence of the implementation combined with the distributed processing model.

Since the index column is sequential, you can generate an artificial partitioning key with a fixed number of records per block:

rec_per_block  = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions"))

df_with_block = df.withColumn(
    "block", (f.col("index") / rec_per_block).cast("int")
)


and use it to define the frame specification:

w_with_block = Window.partitionBy("block").orderBy("index")

df_lag_with_block = df_with_block.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1")
)


This will use the expected number of partitions:

df_lag_with_block.rdd.glom().count()


11


with a roughly uniform data distribution (we cannot avoid hash collisions):

df_lag_with_block.rdd.glom().map(len).collect()


[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270]
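

The zeros and the larger buckets come from there being 12 distinct block values (1000 // 11 = 90 records per block, so block runs from 0 to 11) hash-partitioned into only 11 shuffle partitions. A quick check, as a sketch:

# 12 distinct block ids are hashed into 11 shuffle partitions, so some
# partitions receive more than one block while others receive none.
df_with_block.select("block").distinct().count()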


but with a number of gaps on the block boundaries (one null for the first row of each of the 12 blocks):

df_lag_with_block.where(f.col("diffs_col1").isNull()).count()


12


Since the boundaries are easy to compute:

from itertools import chain

boundary_idxs = sorted(chain.from_iterable(
    # Here we depend on sequential identifiers
    # This could be generalized to any monotonically increasing
    # id by taking min and max per block
    (idx - 1, idx) for idx in 
    df_lag_with_block.groupBy("block").min("index")
        .drop("block").rdd.flatMap(lambda x: x)
        .collect()))[2:]  # The first boundary doesn't carry useful inf.
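

As the comment above hints, the same boundaries can be derived without relying on index - 1 being the previous identifier. A minimal sketch of that generalization, assuming only a monotonically increasing id (names such as bounds and boundary_idxs_gen are illustrative, not from the original answer):

# Take the min and max id per block and pair each block's minimum with the
# previous block's maximum; those are exactly the rows around each boundary.
bounds = (df_with_block
    .groupBy("block")
    .agg(f.min("index").alias("lo"), f.max("index").alias("hi"))
    .orderBy("block")
    .collect())

boundary_idxs_gen = sorted(
    i for prev, cur in zip(bounds, bounds[1:]) for i in (prev["hi"], cur["lo"])
)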


you can always select the missing rows:

missing = df_with_block.where(f.col("index").isin(boundary_idxs))


and fill these in separately:

# We use window without partitions here. Since number of records
# will be small this won't be a performance issue
# but will generate "Moving all data to a single partition" warning
missing_with_lag = missing.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
).select("index", f.col("diffs_col1").alias("diffs_fill"))


and join:

combined = (df_lag_with_block
    .join(missing_with_lag, ["index"], "leftouter")
    .withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill")))


to get the desired result:

mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
    combined["diffs_col1"] != df_lag_unpart["diffs_col1"]
)
assert mismatched.count() == 0
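
To tie the steps together, here is a consolidated sketch wrapped into one helper. This is not part of the original answer: the function name, its parameters, and the intermediate column names ("block", "_fill") are illustrative, and it assumes a sequential integer id column starting at 0 plus an active SparkSession named spark.

from pyspark.sql import DataFrame, Window, functions as f

def lag_diff(df: DataFrame, id_col: str, value_col: str, out_col: str) -> DataFrame:
    n_part = int(spark.conf.get("spark.sql.shuffle.partitions"))
    rec_per_block = max(df.count() // n_part, 1)

    # Artificial partitioning key with a fixed number of records per block.
    blocked = df.withColumn("block", (f.col(id_col) / rec_per_block).cast("int"))

    # Block-partitioned window: correct everywhere except at block boundaries.
    w_block = Window.partitionBy("block").orderBy(id_col)
    partial = blocked.withColumn(
        out_col, f.lag(value_col, 1).over(w_block) - f.col(value_col)
    )

    # Rows around each boundary: the first row of every block except the
    # first one, together with the row immediately before it.
    mins = [r["m"] for r in
            blocked.groupBy("block").agg(f.min(id_col).alias("m")).collect()]
    boundary_idxs = sorted(i for m in mins for i in (m - 1, m))[2:]

    # Recompute only those rows with an unpartitioned window; the row count
    # is tiny, so the single-partition shuffle is harmless here.
    w_unpart = Window.orderBy(id_col)
    fills = (blocked
        .where(f.col(id_col).isin(boundary_idxs))
        .withColumn(out_col, f.lag(value_col, 1).over(w_unpart) - f.col(value_col))
        .select(id_col, f.col(out_col).alias("_fill")))

    return (partial
        .join(fills, [id_col], "leftouter")
        .withColumn(out_col, f.coalesce(out_col, "_fill"))
        .drop("_fill", "block"))


On the example data, lag_diff(df, "index", "col1", "diffs_col1") should agree with df_lag_unpart, just as the manually combined result does above.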

Regarding apache-spark - Avoid performance impact of a single partition mode in Spark window functions, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41313488/
