python - Running a C extension in Python is faster than plain C

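I implemented a Python extension in C and found that executing the C function from Python is about 2 times faster than executing the same C code from a C main.

But why is it faster? I would expect the same C code to perform identically whether it is called from Python or from C.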

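Here is my experiment:

Plain C calculation code (a simple triple for loop for matrix-matrix multiplication)
A plain C main function that calls the mmult() function
A Python extension wrapper that calls the mmult() function
All timing happens exclusively inside the C code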

Here are my results:

Plain C - 85us

Python extension - 36us

Here is my code:

--mmult.cpp------------

#include <stdio.h>     /* printf */
#include <sys/time.h>  /* gettimeofday, struct timeval */
#include "mmult.h"

void mmult(int32_t a[1024],int32_t b[1024],int32_t c[1024]) {

  struct timeval t1, t2;
  gettimeofday(&t1, NULL);

  for(int i=0; i<32; i=i+1) {
    for(int j=0; j<32; j=j+1) {
        int32_t result=0;
         for(int k=0; k<32; k=k+1) {
           result+=a[i*32+k]*b[k*32+j];
         }
         c[i*32+j] = result;
      }
  }

  gettimeofday(&t2, NULL);

  double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000;
  printf("elapsed time: %fus\n",elapsedTime);

}

--mmult.h--------

#include <stdint.h>

void mmult(int32_t a[1024],int32_t b[1024],int32_t c[1024]);

--main.cpp------

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mmult.h"

int main() {
  int* a = (int*)malloc(sizeof(int)*1024);
  int* b = (int*)malloc(sizeof(int)*1024);
  int* c = (int*)malloc(sizeof(int)*1024);

  for(int i=0; i<1024; i++) {
    a[i]=i+1;
    b[i]=i+1;
    c[i]=0;
  }

  struct timeval t1, t2;
  gettimeofday(&t1, NULL);
  mmult(a,b,c);
  gettimeofday(&t2, NULL);

  double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000;
  printf("elapsed time: %fus\n",elapsedTime);
  free(a);
  free(b);
  free(c);

  return 0;
}

Here is how I compile main:

gcc -o main main.cpp mmult.cpp -O3

--wrapper.cpp-----

#include <Python.h>
#include <numpy/arrayobject.h>
#include "mmult.h"

static PyObject* mmult_wrapper(PyObject* self, PyObject* args) {
   int32_t* a;
   PyArrayObject* a_obj = NULL;
   int32_t* b;
   PyArrayObject* b_obj = NULL;
   int32_t* c;
   PyArrayObject* c_obj = NULL;

   int res = PyArg_ParseTuple(args, "OOO", &a_obj, &b_obj, &c_obj);

   if (!res)
      return NULL;

   a = (int32_t*) PyArray_DATA(a_obj);
   b = (int32_t*) PyArray_DATA(b_obj);
   c = (int32_t*) PyArray_DATA(c_obj);

   /* call function */
   mmult(a,b,c);

   Py_RETURN_NONE;
}

/*  define functions in module */
static PyMethodDef TheMethods[] = {
   {"mmult_wrapper", mmult_wrapper, METH_VARARGS, "your c function"},
   {NULL, NULL, 0, NULL}
};

static struct PyModuleDef cModPyDem = {
   PyModuleDef_HEAD_INIT,
   "c_module", "Some documentation",  /* name matches PyInit_c_module and setup.py */
   -1,
   TheMethods
};

PyMODINIT_FUNC
PyInit_c_module(void) {
   PyObject* retval = PyModule_Create(&cModPyDem);
   import_array();
   return retval;
}

--setup.py-----

import os
import numpy
from distutils.core import setup, Extension
cur = os.path.dirname(os.path.realpath(__file__))
c_module = Extension("c_module", sources=["wrapper.cpp","mmult.cpp"],include_dirs=[cur,numpy.get_include()])
setup(ext_modules=[c_module])

--code.py-----

import c_module
import time
import numpy as np
if __name__ == "__main__":
    a = np.ndarray((32,32),dtype='int32',buffer=np.linspace(1,1024,1024,dtype='int32').reshape(32,32))
    b = np.ndarray((32,32),dtype='int32',buffer=np.linspace(1,1024,1024,dtype='int32').reshape(32,32))
    c = np.ndarray((32,32),dtype='int32',buffer=np.zeros((32,32),dtype='int32'))

    c_module.mmult_wrapper(a,b,c)

Here is how I compile the Python extension:

python3.6 setup.py build_ext --inplace

Update

I updated the mmult.cpp code to run the triple for loop for 1,000,000 iterations inside the function. This resulted in very similar times:

Plain C - 27us

Python extension - 27us

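Best answer

A latency of 85 microseconds is too small to be measured reliably and repeatably. For example, CPU cache effects (or context switches, or paging) could dominate the computation time, and make it vary so much that the measurement becomes meaningless.

(I guess you are on Linux/x86-64.)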
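To see how noisy such short measurements are, here is a minimal sketch in Python. The helper sample_timings and the placeholder tiny_workload are hypothetical names, not part of the question's code; in the question's setup the workload would be a single call such as c_module.mmult_wrapper(a, b, c). It times many individual calls and reports the spread, so the first (cold) call and any outliers become visible.

```python
import statistics
import time


def sample_timings(fn, samples=1000):
    """Time `samples` individual calls of fn(); return the timings in microseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1e6)
    return timings


def tiny_workload():
    # Hypothetical stand-in for one short call (a few microseconds of work).
    return sum(range(2000))


if __name__ == "__main__":
    t = sample_timings(tiny_workload)
    print(f"first call: {t[0]:.1f} us")
    print(f"min / median / max: {min(t):.1f} / "
          f"{statistics.median(t):.1f} / {max(t):.1f} us")
```

If the maximum is several times the median, a single measurement of that size says little about the code itself.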

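As a rule of thumb, make a run last at least about half a second, and repeat your benchmark several times. You could also use time(1) for the measurement.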
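A minimal sketch of that rule of thumb in Python, under stated assumptions: bench and stand_in are hypothetical names introduced for illustration, and stand_in is only a placeholder for the real call (in the question, c_module.mmult_wrapper(a, b, c) with the arrays built beforehand). The loop count is doubled until one run lasts at least min_duration seconds, then the run is repeated and the fastest result is kept.

```python
import time


def bench(fn, min_duration=0.5, repeats=5):
    """Return (best_seconds_per_call, calls_per_run) for fn().

    The loop count is doubled until a single run lasts at least
    `min_duration` seconds; that run is then repeated and the fastest
    of the `repeats` runs is reported.
    """
    # Calibrate: grow the loop count until one run is long enough.
    calls = 1
    while True:
        start = time.perf_counter()
        for _ in range(calls):
            fn()
        elapsed = time.perf_counter() - start
        if elapsed >= min_duration:
            break
        calls *= 2

    # Repeat the calibrated run; the fastest run is the one least
    # disturbed by context switches and cold caches.
    best = elapsed
    for _ in range(repeats - 1):
        start = time.perf_counter()
        for _ in range(calls):
            fn()
        best = min(best, time.perf_counter() - start)
    return best / calls, calls


def stand_in():
    # Placeholder workload, not the question's mmult().
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    per_call, calls = bench(stand_in)
    print(f"{per_call * 1e6:.2f} us per call ({calls} calls per run)")
```

Python's standard timeit module offers similar functionality (Timer.autorange and Timer.repeat) if a hand-written loop is not wanted.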

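See also time(7). There are several notions of time (elapsed "real" time, monotonic time, process CPU time, thread CPU time, etc.). You could consider using clock(3) or clock_gettime(2) to measure time.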
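The difference between these notions can be illustrated from Python, whose time.monotonic() and time.process_time() roughly correspond to a monotonic clock and a process CPU-time clock. This is a sketch, not the answer author's code: measure and sleepy_work are hypothetical names, and the workload is artificial. Sleeping adds to elapsed time but consumes almost no CPU time, so the two clocks disagree.

```python
import time


def measure(fn):
    """Run fn() once and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.monotonic()      # monotonic elapsed time
    cpu_start = time.process_time()    # CPU time used by this process
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.monotonic() - wall_start
    return wall, cpu


def sleepy_work():
    time.sleep(0.05)                       # counts toward elapsed time only
    sum(i * i for i in range(100000))      # counts toward both clocks


if __name__ == "__main__":
    wall, cpu = measure(sleepy_work)
    print(f"elapsed: {wall * 1e3:.1f} ms, process CPU: {cpu * 1e3:.1f} ms")
```

For a compute-bound benchmark such as the matrix multiplication, a large gap between the two values suggests the process was waiting or preempted during the measurement.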

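By the way, you might compile with a more recent version of GCC (as of November 2017, GCC 7, with GCC 8 expected in a few weeks), and you should compile with gcc -march=native -O3 for benchmarking purposes. Also try other optimization options and tuning. You might also try other compilers, such as Clang/LLVM.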

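See also this answer (about parallelization) to a related question. The numpy package probably uses similar techniques internally (outside of the Python GIL), and so might be faster than naive sequential matrix multiplication code in C.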

Regarding "python - Running a C extension in Python is faster than plain C", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47554439/
