algorithm - 用于无序 ID 集的良好哈希函数

我遇到了以下问题:

给定一个无序的、任意大的 id 集(例如 32 位空间中的 1、2、5、6、8)
在更大的空间(例如 64 位)中计算哈希码。

简单的方法是为每个单独的 ID 计算哈希函数，然后将所有内容异或在一起。但是，如果您有一个用于 ID 的 32 位空间和一个用于散列函数的 64 位空间，那么这可能不是解决此问题的最佳方法(冲突等...)。

我一直在考虑使用 Murmur3 终结器，然后将结果一起进行异或运算，但我想出于同样的原因，这也行不通(老实说，我不确定)。同样，简单地将值相乘也应该可行(因为 ab = ba)，但我不确定该哈希函数的“好”程度。

很明显，我想到了对 ID 进行排序，之后 Murmur3 会做得很好。不过，如果可以避免，我不想排序。

这样的哈希函数有什么好的算法？

更新

好吧，我想我可能有点困惑。

关于 Why is XOR the default way to combine hashes? 的第二个答案实际上解释了组合哈希函数。在那里介绍的案例中，XOR 被认为是一个糟糕的散列函数，因为“dab”产生与“abd”相同的代码。在我的例子中，我希望这些东西生成相同的散列值——但我也想尽量减少-say-“abc”也生成与-say-“abd”相同的散列值的机会.

大多数哈希函数的全部目的是，如果您向它们提供数据，它们很可能会使用完整的 key 空间。一般来说，这些散列函数利用数据是顺序这一事实，并乘以一个大数字来随机排列位。所以简单来说:

var hash = SomeInitialConstant;
foreach (var id in ids) {
  hash = hash * SomeConstant + hashCode(id);
}
// ... optionally shuffle bits around as finalizer
return hash;

现在，如果 ID 的顺序始终相同，则此方法可以正常工作。但是，如果 ID 是无序的，这将不起作用，因为 x * constant + y 不可交换。

如果您对 ID 求平方，我认为您最终不会使用整个哈希空间。考虑如果你有大数字会发生什么，比如 100000、100001 等。这些正方形是 10000000000、10000200001 等。你不可能得到一个正方形来生成像 900000 这样的数字(仅仅是因为 sqrt(900000)是一个带分数的数字)。

更一般地说，10000000000 和 10000200001 之间的所有哈希空间很可能会丢失。但是，0 和 10 之间的空间会发生很多冲突，因为小数字的平方之间的可用散列空间也很小。

使用大 key 空间的全部目的显然是减少冲突。我想要一个相当大的散列空间(比如 256 位)以确保在现实生活场景中几乎不存在冲突。

最佳答案

我刚刚检查过:

使用 32 位哈希
在一个 64K 的数组表中
有 64K 个项目(负载因子 = 100%)
8 位值(无符号字符)
(数组大小 4...64)
散列函数 := cnt+ (sum cube (arr[i]))
或 := sum(square (zobrist[arr[i]))
Zobrist 的效果更好一些，(但数组需要进一步随机化)
并且碰撞没有超过最佳哈希函数的预期。
为了避免重新计算(时空权衡)，我实际上存储对象中的哈希值
由于碰撞是生活中的常态，您可以将排序推迟到您真正需要它的那一刻用于最终比较(当链长开始增长到 1 以上时)

#include <stdio.h>
#include <stdlib.h>

struct list {
        struct list *next;
        unsigned hash;
        unsigned short cnt;
        unsigned char *data;
        };

struct list *hashtab[1<<16] = {NULL, };
#define COUNTOF(a) (sizeof a / sizeof a[0])
unsigned zobrist[256] = {0,};
/*************************/
unsigned hash_it(unsigned char *cp, unsigned cnt)
{
unsigned idx;
unsigned long long hash = 0;

for(idx=0; idx < cnt; idx++) {
#if 0   /* cube */
        hash += (cp[idx] * cp[idx] * cp[idx]);
#else
        unsigned val;
        val = zobrist[cp[idx]];
        hash += (val * val);
#endif
        }
#if 0   /* as a tie-breaker: add the count (this avoids pythagorean triplets but *not* taxi-numbers) */
hash += cnt;
#endif
return hash;
}
/*************************/
struct list *list_new(unsigned cnt){
struct list *p;
unsigned idx;

p = malloc( sizeof *p + cnt);
p->data = (unsigned char*)(p+1);
p->cnt = cnt;
p->next = NULL;

for(idx=0; idx < cnt; idx++) {
        p->data[idx] = 0xff & rand();
        }
p->hash = hash_it(p->data, p->cnt);
return p;
}
/*************************/
void do_insert(struct list *this)
{
struct list **pp;
unsigned slot;

slot  = this->hash % COUNTOF(hashtab);
for (pp = &hashtab[slot]; *pp; pp = &(*pp)->next) {;}
*pp = this;
}
/*************************/
void list_print(struct list *this)
{
unsigned idx;
if (!this) return;

printf("%lx data[%u] = ", (unsigned long) this->hash, this->cnt);

for (idx=0; idx < this->cnt; idx++) {
        printf("%c%u"
        , idx ? ',' : '{' , (unsigned int) this->data[idx] );
        }
printf("}\n" );
}
/*************************/
unsigned list_cnt(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) { cnt++; }
return cnt;
}
/*************************/
unsigned list_cnt_collisions(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) {
        struct list *that;
        for(that=this->next; that; that=that->next) {
                if (that->cnt != this->cnt) continue;
                if (that->hash == this->hash) cnt++;
                }
        }
return cnt;
}
/*************************/
int main(void)
{
unsigned idx, val;
struct list *p;
unsigned hist[300] = {0,};

        /* NOTE: you need a better_than_default random generator
        ** , the zobrist array should **not** contain any duplicates
        */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] = val;
        }

        /* a second pass will increase the randomness ... just a bit ... */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] ^= val;
        }
        /* load-factor = 100 % */
for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        do {
          val = random();
          val %= 0x40;
        } while(val < 4); /* array size 4..63 */
        p = list_new(val);
        do_insert(p);
        }

for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        val = list_cnt( hashtab[idx]);
        hist[val] += 1;
        val = list_cnt_collisions(hashtab[idx]);
        if (!val) continue;
        printf("[%u] : %u\n", idx, val);
        for (val=0,p = hashtab[idx]; p; p= p->next) {
                printf("[%u]: ", val++);
                list_print(p);
                }
        }

for (idx = 0; idx < COUNTOF(hist); idx++) {
        if (!hist[idx]) continue;
        printf("[%u] = %u\n", idx, hist[idx]);
        }

return 0;
}
/*************************/

输出直方图(链长，0 := 空槽):

$ ./a.out
[0] = 24192
[1] = 23972
[2] = 12043
[3] = 4107
[4] = 1001
[5] = 181
[6] = 34
[7] = 4
[8] = 2

最后说明:除了 Zobrist[] 的平方和之外，您还可以将它们异或在一起(假设条目是唯一的)

额外的最后说明:C 标准库 rand()功能可能无法使用。 RAND_MAX 只能是 15 位:0x7fff (32767)。要填充 zobrist 表，您需要更多值。这可以通过异或一些额外的 (rand() << shift) 来完成进入更高位。

新结果，使用(来自)一个非常大的源域(32 个元素 * 8 位)，将其散列为 32 位散列键，插入到 1<<20 的散列表中插槽。

Number of elements 1048576 number of slots 1048576
Element size = 8bits, Min setsize=0, max set size=32
(using Cubes, plus adding size) Histogram of chain lengths:
[0] = 386124 (0.36824)
[1] = 385263 (0.36742)
[2] = 192884 (0.18395)
[3] = 64340 (0.06136)
[4] = 16058 (0.01531)
[5] = 3245 (0.00309)
[6] = 575 (0.00055)
[7] = 78 (0.00007)
[8] = 9 (0.00001)

这非常接近最优；对于 100% 加载的哈希表，直方图中的前两个条目应该相等，在完美的情况下，都是 1/e。前两个条目是空槽和只有一个元素的槽。

关于algorithm - 用于无序 ID 集的良好哈希函数，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/41187876/

algorithm - 用于无序 ID 集的良好哈希函数

上一篇：algorithm - 计算器(例如 Wolfram alpha)如何计算极大的阶乘？

下一篇：java - 递归删除链表中的最后一次出现，Java