apache-spark - Hash function in Spark SQL - different strings produce the same hash value

Tags: apache-spark hash apache-spark-sql

I want to generate a distinct hash value for each email address. However, I found that I am getting the same hash for different emails, for example:

select hash('pipohecho@hotmail.com'),
       hash('rozas_huertas@hotmail.com'),
       hash('miguelilloooooooooouu@hotmail.com'),
       hash('rjdzpmsyi@hotmail.com'),
       hash('pepe@hotmail.com')


The calls hash('pipohecho@hotmail.com'), hash('rozas_huertas@hotmail.com'), hash('miguelilloooooooooouu@hotmail.com') and hash('rjdzpmsyi@hotmail.com') all return the same hash, -1517714944. So I have two questions:

  1. How is this possible?
  2. How can I generate a unique hash for each email using Spark SQL?

Thanks

Best Answer

There is an article about collision in Hash probabilities here. Spark SQL's hash function returns a 32-bit integer (it is based on Murmur3), so with only 2^32 possible values, collisions among many distinct inputs are expected.
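The collision risk can be quantified with the standard birthday-paradox approximation. This is a plain-Python sketch (no Spark required); the `collision_probability` helper and the 77,000-row figure are illustrative, not from the original post:

```python
import math

# Spark SQL's hash() returns a 32-bit int, so there are only 2**32
# possible hash values. Birthday-paradox approximation: the probability
# of at least one collision among n distinct inputs is roughly
#   1 - exp(-n * (n - 1) / (2 * 2**bits))
def collision_probability(n, bits=32):
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))

# With a 32-bit hash, roughly 77,000 distinct emails already give
# about a 50% chance of some collision somewhere in the dataset.
print(collision_probability(77_000))            # roughly 0.5
# A 64-bit hash (e.g. xxhash64) makes the same scenario negligible.
print(collision_probability(77_000, bits=64))
```

This is why a handful of colliding emails in a large table is entirely plausible with a 32-bit hash, and why moving to a 64-bit or cryptographic hash helps.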


Try using the xxhash64 (available from Spark 3.0), md5, or sha2 functions to get unique hash values.

Example:

spark.sql("""select xxhash64('pipohecho@hotmail.com'),
       xxhash64('rozas_huertas@hotmail.com'),
       xxhash64('miguelilloooooooooouu@hotmail.com'),
       xxhash64('rjdzpmsyi@hotmail.com'),
       xxhash64('pepe@hotmail.com')""").show()

#+-------------------------------+-----------------------------------+-------------------------------------------+-------------------------------+--------------------------+
#|xxhash64(pipohecho@hotmail.com)|xxhash64(rozas_huertas@hotmail.com)|xxhash64(miguelilloooooooooouu@hotmail.com)|xxhash64(rjdzpmsyi@hotmail.com)|xxhash64(pepe@hotmail.com)|
#+-------------------------------+-----------------------------------+-------------------------------------------+-------------------------------+--------------------------+
#|6332927369894443419            |-8140372026824474906               |-9124920009896762502                       |1936246589584419991            |954028670536665140        |
#+-------------------------------+-----------------------------------+-------------------------------------------+-------------------------------+--------------------------+


spark.sql("""select md5('pipohecho@hotmail.com'),
       md5('rozas_huertas@hotmail.com'),
       md5('miguelilloooooooooouu@hotmail.com'),
       md5('rjdzpmsyi@hotmail.com'),
       md5('pepe@hotmail.com')""").show()

#+------------------------------------------+----------------------------------------------+------------------------------------------------------+------------------------------------------+-------------------------------------+
#|md5(CAST(pipohecho@hotmail.com AS BINARY))|md5(CAST(rozas_huertas@hotmail.com AS BINARY))|md5(CAST(miguelilloooooooooouu@hotmail.com AS BINARY))|md5(CAST(rjdzpmsyi@hotmail.com AS BINARY))|md5(CAST(pepe@hotmail.com AS BINARY))|
#+------------------------------------------+----------------------------------------------+------------------------------------------------------+------------------------------------------+-------------------------------------+
#|7ce30aa0209335873f79e64c2eb465ff          |9d58c495ab87f2e3a4a9adc6c8fbbb76              |c283a7c6f09712fc5ba4ea30334e2c25                      |6766da691171aa5c56a70b89bd4590fa          |ab888b1a15b420b410d23b927a370013     |
#+------------------------------------------+----------------------------------------------+------------------------------------------------------+------------------------------------------+-------------------------------------+


spark.sql("""select sha2('pipohecho@hotmail.com',256),
       sha2('rozas_huertas@hotmail.com',256),
       sha2('miguelilloooooooooouu@hotmail.com',256),
       sha2('rjdzpmsyi@hotmail.com',256),
       sha2('pepe@hotmail.com',256)""").show()

#+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
#|sha2(CAST(pipohecho@hotmail.com AS BINARY), 256)                |sha2(CAST(rozas_huertas@hotmail.com AS BINARY), 256)            |sha2(CAST(miguelilloooooooooouu@hotmail.com AS BINARY), 256)    |sha2(CAST(rjdzpmsyi@hotmail.com AS BINARY), 256)                |sha2(CAST(pepe@hotmail.com AS BINARY), 256)                     |
#+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
#|02068bc029cd26888a4ba630ecfa91b4afc2bf72c4adeabcfcd32459529c61bb|391af34e53d82ce8f12a1396d5ae74d96f3ea583cf3fd864816b29586ed002f8|fde18d7d27497717a8a77a0eace29ad5dbcb7319637be033c3e66a068a2bd983|b07300bee7e68326143c40f75b608201f5db667a18bb73b63f9f909454521753|921efc4884d3c8a32899c079024386641564ec0d0966cc059429bbd33770e421|
#+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
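Since Spark's md5 and sha2 implement the standard MD5 and SHA-256 algorithms, the same digests can be reproduced off-cluster with Python's hashlib. A minimal local sketch checking that all five emails from the question yield distinct digests:

```python
import hashlib

emails = [
    "pipohecho@hotmail.com",
    "rozas_huertas@hotmail.com",
    "miguelilloooooooooouu@hotmail.com",
    "rjdzpmsyi@hotmail.com",
    "pepe@hotmail.com",
]

# MD5 produces a 128-bit digest (32 hex chars), SHA-256 a 256-bit
# digest (64 hex chars); both make accidental collisions on a few
# emails vanishingly unlikely, unlike a 32-bit hash.
md5_digests = [hashlib.md5(e.encode("utf-8")).hexdigest() for e in emails]
sha256_digests = [hashlib.sha256(e.encode("utf-8")).hexdigest() for e in emails]

# All five digests are distinct under each algorithm.
print(len(set(md5_digests)), len(set(sha256_digests)))
```

Note that MD5 and SHA-256 are deterministic but not reversible; if you need a guaranteed-unique key rather than a hash, consider a surrogate key (e.g. monotonically_increasing_id) instead.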

Regarding "apache-spark - Hash function in Spark SQL - different strings produce the same hash value", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68939446/
