python - Why is statistics.mean() so slow?

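I compared the performance of the mean function from the statistics module with the simple sum(l)/len(l) approach and found that mean is very slow for some reason. I used timeit with the two code snippets below to compare them. Does anyone know what causes the massive difference in execution speed? I am using Python 3.5.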

from timeit import repeat
print(min(repeat('mean(l)',
                 '''from random import randint; from statistics import mean; \
                 l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above takes about 0.043 seconds to execute on my machine.

from timeit import repeat
print(min(repeat('sum(l)/len(l)',
                 '''from random import randint; from statistics import mean; \
                 l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above takes about 0.000565 seconds to execute on my machine.

Best answer

Python's statistics module is not built for speed, but for precision.

In the specs for this module, it appears that:

The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this "torture test"

assert mean([1e30, 1, 3, -1e30]) == 1

returning 0 instead of 1, a purely computational error of 100%.

Using math.fsum inside mean will make it more accurate with float data, but it also has the side-effect of converting any arguments to float even when unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float.
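To make the quoted trade-off concrete, here is a small sketch that runs the "torture test" three ways and then repeats the comparison with Fractions. The variable names are only illustrative; the functions are the standard-library ones mentioned above.

```python
import math
from fractions import Fraction
from statistics import mean

data = [1e30, 1, 3, -1e30]

# Naive mean: the small terms are absorbed by 1e30 and lost.
naive = sum(data) / len(data)

# math.fsum tracks the lost low-order bits, so the float result is correct.
fsum_mean = math.fsum(data) / len(data)

# statistics.mean sums exactly and converts back at the end.
exact = mean(data)

# With Fractions, fsum forces a float result while statistics.mean
# keeps the input type.
fracs = [Fraction(1, 3), Fraction(1, 6)]
fsum_frac_mean = math.fsum(fracs) / len(fracs)
frac_mean = mean(fracs)

print(naive, fsum_mean, exact)
print(type(fsum_frac_mean).__name__, repr(frac_mean))
```

So math.fsum fixes the float case but changes the result type, which is the side effect the specs object to.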

Conversely, if we look at the implementation of _sum() in this module, the first lines of the function's docstring seem to confirm that:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    [...] """

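So yes, the statistics implementation of sum, instead of being a simple one-line call to Python's built-in sum() function, takes about 20 lines by itself, with a nested for loop in its body.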

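This happens because statistics._sum chooses to guarantee maximum precision for every type of number it could encounter (even if they differ widely from one another), instead of simply emphasizing speed.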

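Hence, it seems normal that the built-in sum proves a hundred times faster. The cost is a much lower precision if you happen to call it with exotic numbers.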
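As a rough illustration of where the time goes, here is a minimal sketch of exact summation through Fraction. It is not the actual statistics._sum code, which does more bookkeeping (for example, tracking the type to convert the result back to); exact_mean is a hypothetical helper written only for this answer.

```python
from fractions import Fraction


def exact_mean(data):
    """Hypothetical helper: mean computed with exact rational arithmetic."""
    total = Fraction(0)
    count = 0
    for x in data:
        # Every float is an exact binary rational, so this conversion
        # loses nothing; the price is slow arbitrary-precision arithmetic.
        total += Fraction(x)
        count += 1
    if count == 0:
        raise ValueError("exact_mean requires at least one data point")
    # Round only once, at the very end.
    return float(total / count)


print(exact_mean([1e30, 1, 3, -1e30]))
```

Each addition here works on Python-level big integers instead of a single hardware float add, which is consistent with the large slowdown measured in the question.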

Other options

If you need to prioritize speed in your algorithms, you should have a look at NumPy instead, whose algorithms are implemented in C.

NumPy's mean is not nearly as precise as statistics, but it implements (since 2013) a routine based on pairwise summation, which is better than a naive sum/len (more info in the link).

However...

import numpy as np
import statistics

np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])

print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))

> NumPy mean: 0.0
> Statistics mean: 1.0
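The 0.0 result is what pairwise summation gives on this input, since the technique reduces accumulated rounding error but cannot recover a term that is already lost in a single addition. Below is a plain recursive sketch of the idea. It is an assumption-laden illustration, not NumPy's actual routine, and pairwise_sum and pairwise_mean are hypothetical names.

```python
def pairwise_sum(values):
    """Hypothetical helper: sum by recursively adding the two halves."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return float(values[0])
    mid = n // 2
    # Summing halves separately keeps partial sums at similar magnitudes,
    # which slows the growth of rounding error on long inputs.
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


def pairwise_mean(values):
    return pairwise_sum(values) / len(values)


# 1e30 + 1 already rounds to 1e30 and 3 + (-1e30) rounds to -1e30,
# so the small terms vanish before the halves are combined.
print(pairwise_mean([1e30, 1, 3, -1e30]))
```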

The original question, "python - Why is statistics.mean() so slow?", can be found on Stack Overflow: https://stackoverflow.com/questions/37533666/
