python - Python 中的滚动哈希非常快？

我在写玩具rsync - 类似 Python 中的工具。像许多类似的工具一样，它首先会使用一个非常快的哈希值，如 rolling hash ，然后在找到匹配项后使用 SHA256(但后者在这里不合时宜:SHA256、MDA5 等作为滚动哈希太慢了)。

我目前正在测试各种快速哈希方法:

import os, random, time

block_size = 1024  # 1 KB blocks
total_size = 10*1024*1024  # 10 MB random bytes
s = os.urandom(total_size)

t0 = time.time()
for i in range(len(s)-block_size):
    h = hash(s[i:i+block_size])
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))

我得到: 0.8 MB/s ... 所以 Python 内置 hash(...) 这里的功能太慢了。

哪种解决方案可以在标准机器上实现至少 10 MB/s 的更快散列？

我试过

import zlib
...
    h = zlib.adler32(s[i:i+block_size])

但也好不到哪里去 (1.1 MB/s)

我试过 sum(s[i:i+block_size]) % modulo而且它也很慢

有趣的事实:即使没有任何散列函数，循环本身也很慢!

t0 = time.time()
for i in range(len(s)-block_size):
    s[i:i+block_size]

我得到:仅 3.0 MB/s!所以在 s 上有一个循环访问滚动块的简单事实已经很慢了。

与其重新发明轮子并编写我自己的散列/或使用自定义 Rabin-Karp 算法，您有什么建议，首先加速这个循环，然后作为散列？

编辑:上面“有趣的事实”慢循环的(部分)解决方案:

import os, random, time, zlib
from numba import jit

@jit()
def main(s):
    for i in range(len(s)-block_size):
        block = s[i:i+block_size]

total_size = 10*1024*1024  # 10 MB random bytes
block_size = 1024  # 1 KB blocks
s = os.urandom(total_size)
t0 = time.time()
main(s)
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))

使用 Numba，有一个巨大的改进:40.0 MB/s，但这里仍然没有完成哈希。至少我们没有以 3 MB/s 的速度被阻塞。

最佳答案

Instead of reinventing the wheel and write my own hash / or use custom Rabin-Karp algorithms, what would you suggest, first to speed up this loop, and then as a hash?

以这种心态开始总是很好，但似乎您没有得到滚动哈希的想法。
使散列函数非常适合滚动的原因在于它能够重用先前的处理。

A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window.

From the same wikipedia page

如果没有 timeit，很难比较不同机器的性能，但我将您的脚本更改为使用带有质数模的简单多项式散列(使用 Mersene prime 会更快，因为模运算可以通过二进制运算完成):

import os, random, time

block_size = 1024  # 1 KB blocks
total_size = 10*1024*1024  # 10 MB random bytes
s = os.urandom(total_size)

base = 256
mod  = int(1e9)+7

def extend(previous_mod, byte):
    return ((previous_mod * base) + ord(byte)) % mod

most_significant = pow(base, block_size-1, mod)

def remove_left(previous_mod, byte):
    return (previous_mod - (most_significant * ord(byte)) % mod) % mod
    
def start_hash(bytes):
    h = 0
    for b in bytes:
        h = extend(h, b)
    return h

t0 = time.time()

h = start_hash(s[:block_size])
for i in range(block_size, len(s)):
    h = remove_left(h, s[i - block_size])
    h = extend(h, s[i])
    
print('rolling hashes computed in %.1f sec (%.1f MB/s)' % (time.time()-t0, total_size/1024/1024/(time.time()-t0)))

显然，您使用 Numba 取得了相当大的改进，它也可以加速此代码。
为了获得更高的性能，您可能需要编写一个 C(或其他低级语言，如 Rust)函数来一次处理列表的很大一部分，并返回一个带有哈希值的数组。
我正在创建一个 rsync - 也类似工具，但正如我在 Rust 中编写的那样，这个级别的性能不是我关心的问题。相反，我正在遵循 creator of rsync 的提示并试图并行化我所能做的一切，这是用 Python 完成的一项痛苦的任务(如果没有 Jython，可能是不可能的)。

关于python - Python 中的滚动哈希非常快？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/60058670/

python - Python 中的滚动哈希非常快？

上一篇：php - 使用 PHP 和 AJAX 进行简单的蜜 jar 表单检查

下一篇：python:从 ScalarMappable 获取整个 numpy 数组的颜色