scala - Difference between sc.broadcast and the broadcast function in Spark SQL

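I have used sc.broadcast for lookup files to improve performance.

I also came to know that there is a function called broadcast in the Spark SQL functions.

What is the difference between the two?

Which one should I use to broadcast reference/lookup tables?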

Best answer

Short answer:

1) org.apache.spark.sql.functions.broadcast() is an explicit, user-supplied hint for a given SQL join.

2) sc.broadcast is for broadcasting a read-only shared variable.
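The two points above can be sketched side by side. This is a minimal illustration, assuming Spark 2.x or later with a local SparkSession; the object name `BroadcastComparison`, the `countryNames` lookup data, and the column names are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastComparison {
  // Hypothetical lookup data, invented for this example.
  val countryNames: Map[String, String] =
    Map("US" -> "United States", "DE" -> "Germany")

  // sc.broadcast: ship a read-only value to the executors once,
  // then read it inside tasks through `.value`.
  def resolveWithVariable(spark: SparkSession, codes: Seq[String]): Seq[String] = {
    val lookup = spark.sparkContext.broadcast(countryNames)
    spark.sparkContext
      .parallelize(codes)
      .map(code => lookup.value.getOrElse(code, "unknown"))
      .collect()
      .toSeq
  }

  // functions.broadcast: mark a DataFrame with a broadcast hint so the
  // planner can pick a broadcast join for it.
  def joinWithHint(spark: SparkSession, orders: DataFrame): DataFrame = {
    import spark.implicits._
    val countries = countryNames.toSeq.toDF("code", "name")
    orders.join(broadcast(countries), Seq("code"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-comparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    println(resolveWithVariable(spark, Seq("US", "DE", "FR")).mkString(", "))

    val orders = Seq((1, "US"), (2, "DE"), (3, "US")).toDF("orderId", "code")
    // Print the physical plan to see which join strategy was chosen.
    joinWithHint(spark, orders).explain()

    spark.stop()
  }
}
```

For a reference/lookup table that is already a DataFrame and is used in a join, the hint form is usually the more direct choice; sc.broadcast fits better when the lookup is a plain Scala value read inside RDD or UDF code.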

More details about the broadcast function (#1):

Here is the relevant comment from
sql/execution/SparkStrategies.scala

which says:

Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the

Shuffle hash join: if the average size of a single partition is small enough to build a hash table.

Sort merge: if the matching join keys are sortable.

If there is no joining keys, Join implementations are chosen with the following precedence:

BroadcastNestedLoopJoin: if one side of the join could be broadcasted

CartesianProduct: for Inner join

BroadcastNestedLoopJoin

The method below controls this behavior based on the size we set in spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB.

Note: smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does.

/**
 * Matches a plan whose output should be small enough to be used in broadcast join.
 */
private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.statistics.isBroadcastable ||
    plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
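As a rough sketch of how the threshold checked above can be adjusted from user code, the snippet below sets spark.sql.autoBroadcastJoinThreshold at runtime through `spark.conf`. The object name `ThresholdDemo`, the helper `setThreshold`, and the 50 MB value are illustrative assumptions; it assumes Spark 2.x or later.

```scala
import org.apache.spark.sql.SparkSession

object ThresholdDemo {
  // Hypothetical helper: set the automatic broadcast threshold (in bytes)
  // and return the value Spark now reports for it.
  def setThreshold(spark: SparkSession, bytes: Long): String = {
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", bytes)
    spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("threshold-demo")
      .master("local[*]")
      .getOrCreate()

    // Raise the threshold from the 10 MB default to 50 MB (example value).
    println(setThreshold(spark, 50L * 1024 * 1024))

    // Setting the threshold to -1 disables size-based automatic broadcasting.
    println(setThreshold(spark, -1L))

    spark.stop()
  }
}
```

The threshold governs only the size-based decision; the explicit hint from org.apache.spark.sql.functions.broadcast() is the other condition named in the quoted comment.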

In the future, the configurations below will be deprecated in coming versions of Spark.

Regarding scala - Difference between sc.broadcast and the broadcast function in Spark SQL, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40320441/
