python - Why does this Python script run 4x slower on multiple cores than on a single core?

I'm trying to understand how CPython's GIL works and how the GIL differs between CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:

from __future__ import print_function

import argparse
import resource
import sys
import threading
import time


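def countdown(n):
    # Pure CPU-bound busy loop with no I/O
    while n > 0:
        n -= 1


def get_time():
    # Returns (wall clock, total CPU, user CPU, system CPU) for this process
    stats = resource.getrusage(resource.RUSAGE_SELF)
    total_cpu_time = stats.ru_utime + stats.ru_stime
    return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


def get_time_diff(start_time, end_time):
    # Element-wise difference between two get_time() snapshots
    return tuple((end-start) for start, end in zip(start_time, end_time))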


def main(total_cycles, max_threads, no_headers=False):
    header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
              ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
               "par_r", "par_c", "par_u", "par_s"))
    row_format = ("%(threads)4d "
                  "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                  "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
    if not no_headers:
        print(header)
    for thread_count in range(1, max_threads+1):
        # We don't care about a few lost cycles
        cycles = total_cycles // thread_count

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]

        # Sequential run: each thread finishes before the next one starts
        start_time = get_time()
        for thread in threads:
            thread.start()
            thread.join()
        end_time = get_time()
        sequential = get_time_diff(start_time, end_time)

        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for i in range(thread_count)]
        # Parallel run: start all threads, then wait for all of them
        start_time = get_time()
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        end_time = get_time()
        parallel = get_time_diff(start_time, end_time)

        print(row_format % {"threads": thread_count,
                            "seq_r": sequential[0],
                            "seq_c": sequential[1],
                            "seq_u": sequential[2],
                            "seq_s": sequential[3],
                            "par_r": parallel[0],
                            "par_c": parallel[1],
                            "par_u": parallel[2],
                            "par_s": parallel[3]})


if __name__ == "__main__":
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("max_threads", nargs="?",
                            type=int, default=5)
    arg_parser.add_argument("total_cycles", nargs="?",
                            type=int, default=50000000)
    arg_parser.add_argument("--no-headers",
                            action="store_true")
    args = arg_parser.parse_args()
    sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When I run this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):

  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
   3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
   4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
   5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now if I bind all threads to one core, the results are very different:

taskset -c 0 python countdown.py 
  #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
   1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
   2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
   3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
   4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
   5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15
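
The same pinning can also be sketched from inside the script instead of using taskset. This is only a minimal sketch, assuming Linux and Python 3.3 or later, where os.sched_getaffinity and os.sched_setaffinity are available; pin_to_single_core is a hypothetical helper name and not part of the original benchmark.

```python
import os


def pin_to_single_core():
    """Restrict this process to one allowed core; return it, or None if unsupported."""
    # Assumption: os.sched_setaffinity is only present on some platforms (e.g. Linux).
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Pick the lowest-numbered core the process is currently allowed to use.
    core = min(os.sched_getaffinity(0))
    # pid 0 refers to the caller; threads started afterwards inherit the mask.
    os.sched_setaffinity(0, {core})
    return core


if __name__ == "__main__":
    print("pinned to core:", pin_to_single_core())
```

The helper would need to be called before the worker threads are started, so that they inherit the single-core mask.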

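So the question is: why is running this Python code on multiple cores 1.5x-2x slower by wall clock and 4x-5x slower by CPU clock than running it on a single core?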

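Asking around and googling produced two hypotheses:

1. When running on multiple cores, a thread can be rescheduled to run on a different core, which means its local cache gets invalidated, hence the slowdown.
2. The overhead of suspending a thread on one core and activating it on another core is larger than that of suspending and activating the thread on the same core.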

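Are there any other reasons? I would like to understand what's going on and to be able to back up my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).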

Best answer

This is due to GIL thrashing when multiple native threads compete for the GIL. David Beazley's materials on this subject will tell you everything you want to know.

His material also includes a nice graphical representation of what is happening.

Python 3.2 introduced changes to the GIL that help address this problem, so you should see improved performance with 3.2 and later.
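
To see which GIL hand-off mechanism a given interpreter uses, one can inspect the relevant sys setting. The following is a small sketch: gil_switch_setting is a hypothetical helper name, and it assumes CPython, where 3.2 and later expose a time-based switch interval (5 ms by default) while 2.7 exposes a tick-based check interval (100 by default).

```python
import sys


def gil_switch_setting():
    """Return (setting name, value) describing how often the GIL may be handed off."""
    if hasattr(sys, "getswitchinterval"):
        # CPython 3.2+: time-based; the value is in seconds.
        return "switchinterval", sys.getswitchinterval()
    # Older CPython (e.g. 2.7): tick-based; the value is a count of bytecode ticks.
    return "checkinterval", sys.getcheckinterval()


if __name__ == "__main__":
    name, value = gil_switch_setting()
    print(sys.version_info[:3], name, value)
```

Running it under both interpreters from the question shows which of the two mechanisms each benchmark run was subject to.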

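It should also be noted that the GIL is an implementation detail of CPython, the reference implementation of the language. Other implementations such as Jython do not have a GIL and do not suffer from this particular problem.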

The rest of D. Beazley's info on the GIL will also be helpful to you.

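To specifically answer the question of why performance is so much worse when multiple cores are involved, see slides 29-41 of the Inside the GIL presentation. It goes into a detailed discussion of multicore GIL contention, as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread signaling overhead goes through the roof as you add cores. This is because the threads are now running simultaneously on different cores, which allows them to engage in a true GIL battle, as opposed to multiple threads sharing a single CPU. A good summary bullet from the above presentation is: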

With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.
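
One way to put numbers on this contention is to count context switches around the parallel run, using the ru_nvcsw (voluntary) and ru_nivcsw (involuntary) fields that resource.getrusage already returns. This is only a sketch under assumptions: it measures context switches as a rough proxy rather than the system call counts shown in the slides, the helper names context_switches and measure_parallel are hypothetical, and the cycle count is reduced to keep the run short.

```python
import resource
import threading


def countdown(n):
    # Same CPU-bound loop as in the benchmark above.
    while n > 0:
        n -= 1


def context_switches():
    """Return (voluntary, involuntary) context switch counts for this process."""
    stats = resource.getrusage(resource.RUSAGE_SELF)
    return stats.ru_nvcsw, stats.ru_nivcsw


def measure_parallel(thread_count, total_cycles):
    """Run countdown in thread_count threads; return the context switch deltas."""
    cycles = total_cycles // thread_count
    threads = [threading.Thread(target=countdown, args=(cycles,))
               for _ in range(thread_count)]
    before = context_switches()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    after = context_switches()
    return tuple(a - b for a, b in zip(after, before))


if __name__ == "__main__":
    for thread_count in (1, 2, 4):
        voluntary, involuntary = measure_parallel(thread_count, 2000000)
        print("%d threads: %d voluntary, %d involuntary context switches"
              % (thread_count, voluntary, involuntary))
```

Comparing the output of a normal run with a run under taskset -c 0 gives a direct multi-core versus single-core comparison of the switching overhead.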

Source: Stack Overflow, https://stackoverflow.com/questions/31386382/
