python - 比较二进制数据的最快方法？

我的数据格式为 0010011 或 [False, False, True, False, False, True, True]。这种格式大约有 1000 万个示例，每个示例有 1000 个值而不是 7 个值。

在我的用例中，我得到了一个相同表单的新条目。然后，我想获取 100 个最相等条目的索引。在这里，最相等被定义为分别具有最多相交的1 或True。例如，两个条目 00011 和 00010 有一个相交 1。

目前，我正在做这样的比较:

similarties = []
for idx, binary_1 in enumerate(complete_list):
    similarties += [(idx, np.binary_repr(binary_1 & binary_2).count('1'))]
similarties.sort(key=lambda t: t[1], reverse=True)

对于 10000 个随机测试条目，这需要 0.2 秒。有没有更快的方法来做到这一点？

最佳答案

更新:发现速度提高了三倍。

这是一种在压缩位上使用 numpy 来节省内存的方法。要在这种格式和 uint8 类型的单个 0 和 1 之间进行转换，numpy 提供了函数 packbits 和 解包位。

下面的代码预先计算了所有 2^16 模式的总和，这些模式可以由 16 位的 block 组成。

(旧版本在数据和模板中查找字节对)

我们使用 View 转换为 uint64 来对 64 位的 block 执行按位交集，然后转换回 uint16 进行查找。

为了找到最接近的 n，我们使用 argpartition (O(N)) 而不是 argsort (O(N log N))。

import numpy as np

n, m = 1_000_000, 1_000

data = np.random.randint(0, 256, (n, (m + 63) // 64 * 8), dtype=np.uint8)
test = np.random.randint(0, 256, ((m + 63) // 64 * 8,), dtype=np.uint8)

def create_lookup_1d():
    x, p = np.ogrid[:1<<16, :16]
    p = 1 << p
    return np.count_nonzero(x & p, axis=1)

lookup_1d = create_lookup_1d()

def find_closest(data, test, n):
    similarities = lookup_1d[(data.view(np.uint64) & test.view(np.uint64))
                             .view(np.uint16)].sum(axis=1)
    top_n = np.argpartition(similarities, len(data)-n)[-n:]
    return top_n, similarities[top_n]

# below is obsolete older version

def create_lookup_2d():
    x, y, p = np.ogrid[:256, :256, :8]
    p = 1 << p
    return np.count_nonzero(x & y & p, axis=2)

lookup_2d = create_lookup_2d()

def find_closest_old(data, test, n):
    similarities = lookup_2d[data, test].sum(axis=1)
    top_n = np.argpartition(similarities, len(data)-n)[-n:]
    return top_n, similarities[top_n]

演示(一百万个条目，每个条目一千位，找到一百个最佳):

>>> import time
>>> t = time.perf_counter(); find_closest(data, test, 100); t = time.perf_counter() - t
(array([913659, 727762, 819589, 810536, 671392, 573459, 197431, 642848,
         8792, 710169, 656667, 692412,  23355, 695527, 276548, 756096,
       286481, 931702, 301590, 309232, 223684, 838507, 888607, 205403,
       988198, 600239, 256834, 876452, 793813,  46501, 559521, 697295,
       948215, 247923, 503962, 808630, 515953,  22821, 614888, 487735,
       443673, 174083, 906902, 613131, 546603, 147657, 332898, 381553,
       808760, 383885, 107547,  85942,  20966, 880034, 522925,  18833,
       547674, 901503, 702596, 773050, 734658, 383581, 973043, 387713,
       645705,  27045, 230226,  77403, 906601, 507193, 828268, 175863,
       708155, 130634, 486701, 534719, 643487, 940071, 694781, 470385,
       954446, 134532, 748100, 110987, 417001, 871320, 993915, 489725,
         6509,  38345, 705618, 637435, 311252, 347282, 536091, 663643,
       830238, 376695, 896090, 823631]), array([305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305,
       305, 306, 305, 306, 306, 305, 305, 305, 305, 305, 306, 305, 306,
       305, 306, 314, 308, 307, 309, 306, 308, 307, 308, 307, 312, 308,
       306, 316, 306, 306, 307, 307, 308, 309, 308, 307, 309, 309, 311,
       309, 310, 310, 307, 307, 306, 307, 306, 307, 309, 308, 309, 308,
       306, 307, 309, 306, 306, 306, 311, 306, 308, 307, 306, 306, 307,
       308, 306, 307, 310, 307, 306, 307, 309, 306, 306, 310, 313, 306,
       306, 307, 310, 306, 307, 307, 309, 311, 307]))
>>> t
0.4612512579988106

关于python - 比较二进制数据的最快方法？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/49668314/

python - 比较二进制数据的最快方法？

上一篇：python - 在 Python 中导入 Statsmodel 时出错

下一篇：Python/Pandas-在趋势转变时应用标签(识别数据集中的局部最大值和最小值)