scala - Hash function in Spark

Tags: scala apache-spark hash apache-spark-sql

I am trying to add a column to a DataFrame that will contain the hash of another column.

I found this piece of documentation:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
and tried this:

import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"..." column syntax (already in scope in spark-shell)

val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))

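But which hash function does hash() use? Is it murmur, sha, md5, or something else?

The values I get in this column are integers, so the range of values here is probably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash?
How can I specify a particular hashing algorithm for this?
Can I use a custom hash function?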

Best answer

According to the source code, it is Murmur (specifically Murmur3):

  /**
   * Calculates the hash code of given columns, and returns the result as an int column.
   *
   * @group misc_funcs
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def hash(cols: Column*): Column = withExpr {
    new Murmur3Hash(cols.map(_.expr))
  }
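
For the follow-up questions (long values, string hashes, choosing an algorithm, custom functions), here is a minimal sketch rather than a definitive recipe. It assumes Spark 3.0 or later, because xxhash64 is not available in 2.3.0; md5 and sha2 exist in older versions as well. The names HashColumnsSketch, withHashes and sha512Hex, and the in-memory sample data, are made up for illustration and are not part of Spark.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, hash, md5, sha2, udf, xxhash64}

object HashColumnsSketch {

  // Hypothetical custom hash: SHA-512 of the UTF-8 bytes, rendered as lowercase hex.
  // Any String => String (or String => Long) function can be wrapped the same way.
  val sha512Hex: UserDefinedFunction = udf { (s: String) =>
    if (s == null) null
    else MessageDigest.getInstance("SHA-512")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map("%02x".format(_))
      .mkString
  }

  // Adds one column per hashing approach for the given input column.
  def withHashes(df: DataFrame, column: String): DataFrame =
    df.withColumn("hash_int", hash(col(column)))          // Murmur3, integer column
      .withColumn("hash_long", xxhash64(col(column)))     // xxHash64, long column (Spark 3.0+)
      .withColumn("hash_md5", md5(col(column)))           // 32-character hex string
      .withColumn("hash_sha256", sha2(col(column), 256))  // 64-character hex string
      .withColumn("hash_custom", sha512Hex(col(column)))  // custom function via a UDF

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hash-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the Parquet input.
    val df = Seq("alpha", "beta", "gamma").toDF("my_column")

    val result = withHashes(df, "my_column")
    result.printSchema()
    result.show(truncate = false)

    spark.stop()
  }
}
```

The built-in functions are usually preferable where they fit, since a UDF is opaque to Spark's optimizer; the UDF route is mainly useful when the algorithm you need is not built in.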

This question, "scala - Hash function in Spark", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/53634650/
