python - Vectorized string operations in Numpy: why are they rather slow?

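This is one of those "mostly asked out of pure curiosity (in the possibly futile hope that I will learn something)" questions.

I was investigating ways of saving memory when operating on massive amounts of strings, and for some scenarios it seemed that string operations in numpy could be useful. However, I got somewhat surprising results: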

import random
import string

import numpy as np

milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]

npmstr = np.array(milstr, dtype=np.dtype(np.unicode_, 1000000))

Memory consumption measured with memory_profiler:

%memit [x.upper() for x in milstr]
peak memory: 420.96 MiB, increment: 61.02 MiB

%memit np.core.defchararray.upper(npmstr)
peak memory: 391.48 MiB, increment: 31.52 MiB

So far, so good; however, the timing results surprised me:

%timeit [x.upper() for x in milstr]
129 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.core.defchararray.upper(npmstr)
373 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

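Why is that? Numpy uses contiguous chunks of memory for its arrays, its operations are vectorized (as the numpy doc page above says), and numpy string arrays apparently use less memory, so operating on them should be at least potentially more CPU cache-friendly. I therefore expected performance on arrays of strings to be at least similar to that of pure Python.

Environment:

Python 3.6.3 x64, Linux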

numpy==1.14.1

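Best answer

"Vectorized" is used in two ways when talking about numpy, and it is not always clear which one is meant:

1. Operations that act on all elements of an array
2. Operations that internally call optimized (and in many cases multi-threaded) numerical code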

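The second point is what makes vectorized operations much faster than a for loop in Python, and the multi-threaded part is what makes them faster than a list comprehension. When commenters here state that vectorized code is faster, they are referring to the second case as well. In the numpy documentation, however, "vectorized" refers only to the first case. It means you can use a function directly on an array, without having to loop through all the elements and call it on each one. In this sense it makes code more concise, but not necessarily any faster. Some vectorized operations do call multi-threaded code, but as far as I know this is limited to linear algebra routines. Personally, I prefer using vectorized operations since I find them more readable than list comprehensions, even when performance is identical.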
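To make the two senses concrete, here is a minimal sketch. It is not part of the original answer, and the sample data and variable names are made up for illustration. np.char.upper is "vectorized" in the first sense (one call applied to a whole array), while arithmetic on a numeric array is an example of the second sense.

```python
import numpy as np

words = ["alpha", "Beta", "gamma"]  # illustrative data, not from the question
arr = np.array(words, dtype=np.str_)

# First sense: a single call on the whole array, with no explicit
# Python loop in user code. This says nothing about speed.
upper_vectorized = np.char.upper(arr)

# The equivalent explicit loop over a plain list.
upper_loop = [w.upper() for w in words]

# Second sense: a numeric operation that runs in compiled code
# over the array's contiguous buffer.
numbers = np.arange(5)
doubled = numbers * 2

print(upper_vectorized.tolist() == upper_loop)
print(doubled)
```

Both string approaches give the same result; only the second kind of operation is the one people usually have in mind when they say that vectorized code is fast.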

Now, for the code in question, the documentation for np.char (which is an alias for np.core.defchararray) states:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.

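So there are four ways (one of them not recommended) to handle strings in numpy. Some testing is in order, since each way will certainly have different advantages and disadvantages. Using arrays defined as follows: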

npob = np.array(milstr, dtype=np.object_)
npuni = np.array(milstr, dtype=np.unicode_)
npstr = np.array(milstr, dtype=np.string_)
npchar = npstr.view(np.chararray)
npcharU = npuni.view(np.chararray)

This creates arrays (or chararrays for the last two) with the following datatypes:

In [68]: npob.dtype
Out[68]: dtype('O')

In [69]: npuni.dtype
Out[69]: dtype('<U10')

In [70]: npstr.dtype
Out[70]: dtype('S10')

In [71]: npchar.dtype
Out[71]: dtype('S10')

In [72]: npcharU.dtype
Out[72]: dtype('<U10')
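As a side note on what these dtypes mean for storage, the following sketch (an illustration added here, with made-up sample data) compares the per-element size of a unicode array and a bytes array. It relies on the fact that NumPy's U dtype stores 4 bytes per character, while the S dtype stores 1 byte per character. It uses np.str_ and np.bytes_ rather than np.unicode_ and np.string_, on the assumption that the snippet should also run on newer NumPy releases where the old aliases are no longer available.

```python
import numpy as np

words = ["abcdefghij", "klmnopqrst"]  # two illustrative 10-character ASCII strings

uni = np.array(words, dtype=np.str_)
byt = np.array([w.encode("ascii") for w in words], dtype=np.bytes_)

# Each element occupies a fixed-size slot inside the array buffer.
print(uni.dtype.kind, uni.dtype.itemsize)  # unicode: 4 bytes per character
print(byt.dtype.kind, byt.dtype.itemsize)  # bytes: 1 byte per character
print(uni.nbytes, byt.nbytes)
```

For the same ASCII content, a unicode array therefore moves four times as many bytes through memory as a bytes array, which may be one contributing factor to the timings in the benchmarks.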

Benchmarking shows quite a range of performance across these datatypes:

%timeit [x.upper() for x in test]
%timeit np.char.upper(test)

# test = milstr
103 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
377 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npob
110 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
<error on second test, vectorized operations don't work with object arrays>

# test = npuni
295 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npstr
125 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npchar
663 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npcharU
887 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
325 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

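Surprisingly, using a plain old list of strings is still the fastest. Numpy is competitive when the datatype is string_ or object_, but once unicode is involved, performance becomes much worse. Looping over a chararray is by far the slowest, whether it holds unicode or not. It should be clear why it is not recommended for use.

Using unicode strings carries a significant performance penalty. The docs state the following about the differences between these types: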
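For readers who want to repeat this kind of comparison outside IPython, here is a rough sketch using the standard timeit module. It is not the answer's original benchmark: the helper best_time, the dictionary names, and the reduced size N are assumptions chosen to keep the run short, and np.str_ and np.bytes_ are used instead of the older aliases. Absolute numbers will differ by machine and NumPy version.

```python
import random
import string
import timeit

import numpy as np

random.seed(0)
N = 10_000  # assumption: far smaller than the question's 1,000,000

strings = ["".join(random.choices(string.ascii_letters, k=10)) for _ in range(N)]

candidates = {
    "list": strings,
    "object_ array": np.array(strings, dtype=object),
    "unicode array": np.array(strings, dtype=np.str_),
    "bytes array": np.array([s.encode("ascii") for s in strings], dtype=np.bytes_),
}


def best_time(func, number=5, repeat=3):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


timings = {}
for name, data in candidates.items():
    loop_t = best_time(lambda: [x.upper() for x in data])
    # np.char.upper is skipped for object arrays, which the answer reports as failing.
    char_t = None if name == "object_ array" else best_time(lambda: np.char.upper(data))
    timings[name] = (loop_t, char_t)

for name, (loop_t, char_t) in timings.items():
    char_txt = "n/a" if char_t is None else f"{char_t * 1e3:.2f} ms"
    print(f"{name:15s} loop: {loop_t * 1e3:.2f} ms   np.char.upper: {char_txt}")
```

Taking the minimum over repeats is a common way to reduce noise from other processes; %timeit reports mean and standard deviation instead, so the figures are not directly comparable.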

For backward compatibility with Python 2 the S and a typestrings remain zero-terminated bytes and np.string_ continues to map to np.bytes_. To use actual strings in Python 3 use U or np.unicode_. For signed bytes that do not need zero-termination b or i1 can be used.

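In this case, where the character set does not require unicode, it makes sense to use the faster string_ type. If unicode is needed, you may get better performance from a list, or from a numpy array of type object_ if other numpy functionality is required. Another good example of when a list may be the better choice is appending lots of data.

So, the takeaways from this:

1. Python, while generally regarded as slow, is very performant for some common tasks. Numpy is generally quite fast, but it is not optimized for everything.
2. Read the docs. If there is more than one way of doing things (and there usually is), odds are that one of them is better suited to what you are trying to do.
3. Don't blindly assume that vectorized code will be faster. Always profile when you care about performance (this goes for any "optimization" tip).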
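As a small illustration of the ASCII-only route, the sketch below (added here, with made-up sample data and variable names) keeps the data as bytes for the operation and decodes to text only at the end. np.char.upper and np.char.decode are existing functions in the numpy.char module; whether this is worthwhile for a given workload is something to measure rather than assume.

```python
import numpy as np

words = ["numpy", "strings"]  # illustrative ASCII-only data

# Store as fixed-width bytes (1 byte per character).
as_bytes = np.array([w.encode("ascii") for w in words], dtype=np.bytes_)

# Operate on the bytes array.
upper_bytes = np.char.upper(as_bytes)

# Convert back to a unicode array only when text is actually needed.
upper_text = np.char.decode(upper_bytes, "ascii")

print(upper_bytes.dtype, upper_text.dtype)
print(upper_text.tolist())
```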

Source: the Stack Overflow question "Vectorized string operations in Numpy: why are they rather slow?", https://stackoverflow.com/questions/49112552/
