scala - Difference between sc.broadcast and the broadcast function in Spark SQL

Tags: scala function apache-spark apache-spark-sql broadcast

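I have used sc.broadcast for lookup files to improve performance.

I also learned that there is a function called broadcast among the Spark SQL functions.

What is the difference between the two?

Which one should I use to broadcast reference/lookup tables?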

Best answer

Short answer:

1) The org.apache.spark.sql.functions.broadcast() function is an explicit, user-supplied hint for a given SQL join.

2) sc.broadcast is for broadcasting a read-only shared variable.
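To make the distinction concrete, here is a minimal sketch that uses both mechanisms side by side. The object name, the sample data, and the column names are hypothetical and chosen only for illustration; it assumes a Spark 2.x or later dependency on the classpath and a local master.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastComparison {
  // Hypothetical lookup data, used only for this illustration.
  val countryNames: Map[String, String] =
    Map("US" -> "United States", "DE" -> "Germany")

  // 1) functions.broadcast: a hint on a DataFrame that asks the planner
  //    to broadcast that side of the join.
  def joinWithHint(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val orders = Seq((1, "US"), (2, "DE"), (3, "US")).toDF("order_id", "country_code")
    val countries = countryNames.toSeq.toDF("country_code", "country_name")
    orders.join(broadcast(countries), Seq("country_code"))
  }

  // 2) sc.broadcast: ships a read-only value to the executors once,
  //    so that tasks can read it through .value.
  def resolveNames(spark: SparkSession, codes: Seq[String]): Seq[String] = {
    val lookup = spark.sparkContext.broadcast(countryNames)
    spark.sparkContext
      .parallelize(codes)
      .map(code => lookup.value.getOrElse(code, "unknown"))
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-comparison")
      .master("local[*]")
      .getOrCreate()

    // Print the physical plan to inspect which join strategy was chosen.
    joinWithHint(spark).explain()
    resolveNames(spark, Seq("US", "DE", "FR")).foreach(println)

    spark.stop()
  }
}
```

With data this small the planner would likely broadcast the lookup side even without the hint, so the hint matters mostly when size estimates are missing or exceed the configured threshold. For a reference table that already lives in a DataFrame and is only used in joins, the hint is usually the simpler choice; sc.broadcast fits when arbitrary code inside a task needs to read the value.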

More details about the broadcast function (#1):

This is from sql/execution/SparkStrategies.scala, which says:

  • Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the ...
  • Shuffle hash join: if the average size of a single partition is small enough to build a hash * table.
  • Sort merge: if the matching join keys are sortable.
  • If there is no joining keys, Join implementations are chosen with the following precedence:
    • BroadcastNestedLoopJoin: if one side of the join could be broadcasted
    • CartesianProduct: for Inner join
    • BroadcastNestedLoopJoin

  • The method below controls this behavior based on the size set in spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB.

  • Note: smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does.


    /** Matches a plan whose output should be small enough to be used in broadcast join. */
    private def canBroadcast(plan: LogicalPlan): Boolean = {
      plan.statistics.isBroadcastable ||
        plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
    }
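The decision in canBroadcast can be modelled without Spark to see how the hint and the threshold interact. This is only a sketch: PlanStats and BroadcastDecision are hypothetical stand-ins for Spark's internal plan statistics and configuration, not part of the Spark API. The 10 MB default is taken from the text above.

```scala
object BroadcastDecision {
  // Hypothetical stand-in for the statistics Spark attaches to a logical plan.
  final case class PlanStats(sizeInBytes: BigInt, isBroadcastable: Boolean)

  // Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
  val defaultThreshold: Long = 10L * 1024 * 1024

  // Same shape as the quoted method: an explicit hint wins,
  // otherwise the estimated size must not exceed the threshold.
  def canBroadcast(stats: PlanStats, threshold: Long = defaultThreshold): Boolean =
    stats.isBroadcastable || stats.sizeInBytes <= BigInt(threshold)

  def main(args: Array[String]): Unit = {
    val small  = PlanStats(BigInt(5L * 1024 * 1024), isBroadcastable = false)
    val large  = PlanStats(BigInt(50L * 1024 * 1024), isBroadcastable = false)
    val hinted = large.copy(isBroadcastable = true)

    println(canBroadcast(small))                  // true: below the 10 MB threshold
    println(canBroadcast(large))                  // false: too large and no hint
    println(canBroadcast(hinted))                 // true: the hint overrides the size check
    println(canBroadcast(small, threshold = -1L)) // false: no size is <= -1
  }
}
```

The last case shows why a threshold of -1 turns off size-based broadcasting while a table marked with the hint still qualifies.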
    

Going forward, the configurations listed in the answer's screenshot (image not reproduced here) will be deprecated in upcoming versions of Spark.

Source: this question, "Difference between sc.broadcast and the broadcast function in Spark SQL", on Stack Overflow: https://stackoverflow.com/questions/40320441/
