python - How to initialize a NumPy array from a multidimensional buffer?

The documentation of the numpy.frombuffer function explicitly states that the resulting array will be one-dimensional:

Interpret a buffer as a 1-dimensional array.

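I am not sure about the consequences of this sentence. The documentation only tells me that the resulting array will be one-dimensional, but it never says that the input buffer has to describe a one-dimensional object.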
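As a minimal sketch of what the quoted sentence implies, the following uses a bytearray as a hypothetical stand-in for an external buffer. The result of np.frombuffer is one-dimensional, but for a contiguous buffer it can be reshaped into a 2-D view without copying:

```python
import numpy as np

# Hypothetical stand-in for an external buffer: 6 doubles (48 bytes) of writable memory.
buf = bytearray(6 * 8)

flat = np.frombuffer(buf, dtype=np.float64)  # always one-dimensional
matrix = flat.reshape(3, 2)                  # a 2-D view on the same memory, not a copy

matrix[1, 0] = 42.0                          # write through the 2-D view
shares = np.shares_memory(flat, matrix)      # True when both arrays use the same memory
```

This only covers the contiguous, row-major case; it says nothing yet about padding or column-major layouts.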

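I have a (2D) Eigen matrix in C++. I want to create a Python buffer that describes the contents of the matrix. Then I want to use this buffer to somehow initialize a NumPy array and make it available to my Python scripts. The goal is to pass the information to Python without copying the data, and to allow Python to modify the matrix (for example, to initialize it).

The C-API equivalent of numpy.frombuffer is PyArray_FromBuffer, which shares the single-dimension wording but has more documentation (emphasis mine):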

PyObject* PyArray_FromBuffer(PyObject* buf, PyArray_Descr* dtype, npy_intp count, npy_intp offset)

Construct a one-dimensional ndarray of a single type from an object, buf, that exports the (single-segment) buffer protocol (or has an attribute __buffer__ that returns an object that exports the buffer protocol). A writeable buffer will be tried first followed by a read-only buffer. The NPY_ARRAY_WRITEABLE flag of the returned array will reflect which one was successful. The data is assumed to start at offset bytes from the start of the memory location for the object. The type of the data in the buffer will be interpreted depending on the data-type descriptor, dtype. If count is negative then it will be determined from the size of the buffer and the requested itemsize, otherwise, count represents how many elements should be converted from the buffer.

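Does "single-segment" mean that it cannot contain padding used, for example, to align the rows of a matrix? In that case I am out of luck, because my matrix may well use an alignment strategy that requires padding.

Back to the original question:

Is there a way for me to create a NumPy array that shares memory with a pre-existing buffer?

Remark: there is a project on GitHub called Eigen3ToPython, which aims to connect Eigen with Python, but it does not allow memory sharing (emphasis mine):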

This library allows to: [...] Convert to/from Numpy arrays (np.array) in a transparent manner (however, memory is not shared between both representations)

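Edit
Someone might point to the similarly titled question Numpy 2D- Array from Buffer?. Unfortunately, the solution given there does not seem to be a valid one for my case, because the resulting 2D array does not share memory with the original buffer.

Edit: How data is organized in Eigen

Eigen maps 2D matrices onto a one-dimensional memory buffer using strided access. For example, a 3x2 matrix of doubles needs 6 doubles, that is, 48 bytes. A 48-byte buffer is allocated. The first element in this buffer represents the [0, 0] entry of the matrix.

To access element [i, j], the following formula is used: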

double* v = matrix.data() + i*matrix.rowStride() + j*matrix.colStride()

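where matrix is the matrix object, and its member functions data(), rowStride() and colStride() return, respectively, the start address of the buffer, the distance between two consecutive rows, and the distance between two consecutive columns (in multiples of the floating-point type size).

By default, Eigen uses column-major format, so rowStride() == 1, but it can also be configured to use row-major format, with colStride() == 1.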
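The addressing formula can be sketched in plain Python. The helper element_index is a hypothetical name introduced only for illustration; it returns the position of entry [i, j] in elements relative to data(), assuming no padding:

```python
def element_index(i, j, row_stride, col_stride):
    """Index (in elements, not bytes) of entry [i, j] relative to data()."""
    return i * row_stride + j * col_stride


rows, cols = 3, 2

# Column-major without padding: rowStride == 1, colStride == rows.
col_major = [element_index(i, j, 1, rows) for i in range(rows) for j in range(cols)]

# Row-major without padding: rowStride == cols, colStride == 1.
row_major = [element_index(i, j, cols, 1) for i in range(rows) for j in range(cols)]
```

Visiting the entries row by row walks a row-major buffer in order, while a column-major buffer is visited with jumps of colStride elements.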

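Another important configuration option is alignment. The data buffer may well contain some unneeded values (that is, values that are not part of the matrix) so that columns or rows start at aligned addresses. This makes operations on the matrix vectorizable. In the example above, assuming column-major format and 16-byte alignment, the following matrix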

3   7
1  -2
4   5

could be stored in the following buffer:

0  0  3  1  4  0  7 -2  5  0

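The 0 values are called padding. The two 0s at the beginning may be necessary to ensure that the start of the actual data is aligned to the same boundary. (Note that the data() member function will return the address of the 3.) In this case, the row and column strides are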

rowStride: 1
colStride: 4

(In the unaligned case they would be 1 and 3, respectively.)
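As a quick check of these strides, the padded buffer can be indexed in plain Python. The variable names are illustrative, and data stands for the position that data() would point to (the value 3):

```python
# The padded buffer from the example, one list entry per double.
buffer = [0, 0, 3, 1, 4, 0, 7, -2, 5, 0]

data = 2                        # position of the first real entry (the value 3)
row_stride, col_stride = 1, 4   # strides in elements, as listed above

# Rebuild the 3x2 matrix entry by entry with the strided formula.
matrix = [
    [buffer[data + i * row_stride + j * col_stride] for j in range(2)]
    for i in range(3)
]
```

With these strides the padding values are never touched, and the original 3x2 matrix is recovered.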

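Numpy expects a C-contiguous buffer, that is, a row-major structure without padding. If Eigen inserts no padding, the row-major requirement is easy to work around for a column-major Eigen matrix: the buffer is passed to a numpy array, and the resulting ndarray is reshaped and transposed. I managed to get this working perfectly.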
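A minimal sketch of the reshape-and-transpose technique, assuming an unpadded column-major buffer (a bytearray stands in for the Eigen storage here):

```python
import numpy as np

# Unpadded column-major storage of the 3x2 matrix [[3, 7], [1, -2], [4, 5]].
buf = bytearray(np.array([3, 1, 4, 7, -2, 5], dtype=np.float64).tobytes())

rows, cols = 3, 2
# Read the buffer as `cols` rows of `rows` values, then transpose the view.
a = np.frombuffer(buf, dtype=np.float64).reshape(cols, rows).T

a[2, 0] = 40.0  # modifies the third value stored in buf
```

Both reshape and .T return views here, so a still refers to the memory in buf.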

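However, if Eigen does insert padding, the problem cannot be solved with this technique, because the ndarray will still see the zeros in the data and treat them as part of the matrix, while dropping some values at the end of the array. And this is the problem I am asking to solve.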
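The failure can be reproduced with a small sketch. It assumes a padded column-major buffer that starts at data() and has one padding value after each column of three entries:

```python
import numpy as np

# Padded column-major storage starting at data(): colStride is 4, not 3.
padded = np.array([3, 1, 4, 0, 7, -2, 5, 0], dtype=np.float64)

rows, cols = 3, 2
# Naively treat the first rows * cols values as an unpadded column-major matrix.
wrong = np.frombuffer(padded, dtype=np.float64, count=rows * cols).reshape(cols, rows).T
```

Here wrong contains a padding zero as if it were data, and the last real entry (5) is dropped.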

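Now, as an aside, and since we are lucky enough to have @ggael in the loop, who may be able to shed some light: I have to admit that I have never had Eigen insert any padding into my matrices. And I can't seem to find any mention of padding in the Eigen documentation. However, I would expect the alignment strategy to align every column (or row), not just the first one. Is my expectation wrong? If it is, then the whole question does not apply to Eigen. But it does apply to other libraries I am using, which apply the alignment strategy I described above, so please disregard this last paragraph when answering the question.

Best answer

I am answering my own question here. Thanks to @user2357112 for pointing me in the right direction: what I needed was PyArray_NewFromDescr.

The following Python object is a wrapper around an Eigen matrix: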

struct PyEigenMatrix {
    PyObject_HEAD
    Eigen::Matrix<RealT, Eigen::Dynamic, Eigen::Dynamic> matrix;
};

RealT is the floating-point type I am using (float in my case).

To return an np.ndarray object, I added a member function to the class:

static PyObject*
PyEigenMatrix_as_ndarray(PyEigenMatrix* self, PyObject* args, PyObject* kwds)
{
    // Extract number of rows and columns from Eigen matrix
    npy_intp dims[] = { self->matrix.rows(), self->matrix.cols() };

    // Extract strides from Eigen Matrix (multiply by type size to get bytes)
    npy_intp strides[] = {
        self->matrix.rowStride() * (npy_intp)sizeof(RealT),
        self->matrix.colStride() * (npy_intp)sizeof(RealT)
    };

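    // Create the ndarray as a view on the matrix memory (no copy)
    PyObject* array = PyArray_NewFromDescr(
            &PyArray_Type,                  // Standard type
            PyArray_DescrFromType(typenum), // Numpy type id
            2,                              // Number of dimensions
            dims,                           // Dimension array
            strides,                        // Strides array
            self->matrix.data(),            // Pointer to data
            NPY_ARRAY_WRITEABLE,            // Flags
            (PyObject*)self                 // obj (?)
        );
    if (array == NULL)
        return NULL;

    // Keep the matrix alive while the array exists.
    // PyArray_SetBaseObject steals a reference to self.
    Py_INCREF(self);
    if (PyArray_SetBaseObject((PyArrayObject*)array, (PyObject*)self) < 0) {
        Py_DECREF(array);
        return NULL;
    }
    return array;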
}

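typenum is the numpy type id number.

This call creates a new numpy array, gives it a buffer (through the data parameter), describes the buffer using the dims and strides parameters (the former also sets the shape of the returned array), describes the data type, and sets the matrix to read-write (through the flags parameter).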
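The same idea can be sketched at the Python level. The np.ndarray constructor accepts a buffer, an offset, a shape and byte strides, much like the C call takes data, dims and strides. The sketch below assumes the padded buffer from the question's example, and the names raw and view are illustrative:

```python
import numpy as np

# Padded column-major buffer from the question, as float64 values (8 bytes each).
raw = np.array([0, 0, 3, 1, 4, 0, 7, -2, 5, 0], dtype=np.float64)

itemsize = raw.itemsize
view = np.ndarray(
    shape=(3, 2),
    dtype=np.float64,
    buffer=raw,
    offset=2 * itemsize,                   # skip the two leading padding values
    strides=(1 * itemsize, 4 * itemsize),  # rowStride and colStride, in bytes
)

view[1, 1] = 9.0  # writes through to raw[7]
```

Because the strides skip the padding, view shows exactly the 3x2 matrix while still sharing memory with raw.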

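I am not sure what the last parameter, obj, means though. The documentation only mentions it for cases where the type is different from PyArray_Type. Note that passing self as obj does not by itself make the array hold a reference to the matrix; the NumPy documentation suggests PyArray_SetBaseObject as one way to tie the lifetime of the data to the array.

To illustrate how this works in practice, let me show some Python code.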

In [3]: m = Matrix(7, 3)

In [4]: m
Out[4]: 
  0.680375  -0.211234   0.566198
   0.59688   0.823295  -0.604897
 -0.329554   0.536459  -0.444451
   0.10794 -0.0452059   0.257742
 -0.270431  0.0268018   0.904459
   0.83239   0.271423   0.434594
 -0.716795   0.213938  -0.967399

In [5]: a = m.as_ndarray()

In [6]: a
Out[6]: 
array([[ 0.68 , -0.211,  0.566],
       [ 0.597,  0.823, -0.605],
       [-0.33 ,  0.536, -0.444],
       [ 0.108, -0.045,  0.258],
       [-0.27 ,  0.027,  0.904],
       [ 0.832,  0.271,  0.435],
       [-0.717,  0.214, -0.967]], dtype=float32)

In [7]: a[2, 1] += 4

In [8]: a
Out[8]: 
array([[ 0.68 , -0.211,  0.566],
       [ 0.597,  0.823, -0.605],
       [-0.33 ,  4.536, -0.444],
       [ 0.108, -0.045,  0.258],
       [-0.27 ,  0.027,  0.904],
       [ 0.832,  0.271,  0.435],
       [-0.717,  0.214, -0.967]], dtype=float32)

In [9]: m
Out[9]: 
  0.680375  -0.211234   0.566198
   0.59688   0.823295  -0.604897
 -0.329554    4.53646  -0.444451
   0.10794 -0.0452059   0.257742
 -0.270431  0.0268018   0.904459
   0.83239   0.271423   0.434594
 -0.716795   0.213938  -0.967399

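Matrix is my PyEigenMatrix type. I added a __repr__ function that prints the matrix using Eigen's stream operators. I can have an ndarray a that corresponds exactly to the Eigen matrix. When I modify a (In[7]), not only is the numpy array modified (Out[8]), but the underlying Eigen matrix (Out[9]) is modified as well, showing that the two objects share the same memory.

Edit: @user2357112 was right twice. The second approach he proposed in the comments also works. If the type PyEigenMatrix exports the buffer interface (which my type does), then the solution is as simple as creating a memoryview object, either in Python or using the C-API, and passing this object to the np.array function, also specifying copy=False.

Here is how it works: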

In [2]: m = Matrix(7, 3)

In [3]: mv = memoryview(m)    

In [4]: a = np.array(mv, copy=False)

In [5]: m
Out[5]: 
  0.680375   0.536459   0.904459
 -0.211234  -0.444451    0.83239
  0.566198    0.10794   0.271423
   0.59688 -0.0452059   0.434594
  0.823295   0.257742  -0.716795
 -0.604897  -0.270431   0.213938
 -0.329554  0.0268018  -0.967399

In [6]: a
Out[6]: 
array([[ 0.68 ,  0.536,  0.904],
       [-0.211, -0.444,  0.832],
       [ 0.566,  0.108,  0.271],
       [ 0.597, -0.045,  0.435],
       [ 0.823,  0.258, -0.717],
       [-0.605, -0.27 ,  0.214],
       [-0.33 ,  0.027, -0.967]], dtype=float32)

In [7]: a[3, 1] += 2

In [8]: a
Out[8]: 
array([[ 0.68 ,  0.536,  0.904],
       [-0.211, -0.444,  0.832],
       [ 0.566,  0.108,  0.271],
       [ 0.597,  1.955,  0.435],
       [ 0.823,  0.258, -0.717],
       [-0.605, -0.27 ,  0.214],
       [-0.33 ,  0.027, -0.967]], dtype=float32)

In [9]: m
Out[9]: 
 0.680375  0.536459  0.904459
-0.211234 -0.444451   0.83239
 0.566198   0.10794  0.271423
  0.59688   1.95479  0.434594
 0.823295  0.257742 -0.716795
-0.604897 -0.270431  0.213938
-0.329554 0.0268018 -0.967399

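The advantage of this approach is that it does not require the numpy C-API. The matrix type only needs to support the buffer protocol, which is more general than an approach that depends directly on numpy.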
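For readers without the Matrix type at hand, the buffer-protocol route can be tried with any object that exports a multidimensional buffer. The sketch below uses a two-dimensional ctypes array as a hypothetical stand-in for Matrix, and np.asarray, which avoids a copy when none is needed:

```python
import ctypes
import numpy as np

# Hypothetical stand-in for the Matrix type: a 7x3 ctypes array of floats,
# which exports a two-dimensional buffer.
Matrix = (ctypes.c_float * 3) * 7
m = Matrix()

mv = memoryview(m)   # carries shape, strides and item format
a = np.asarray(mv)   # a view on m's memory, no copy

a[3, 1] += 2.0       # the change is visible through m as well
```

The shape and element type of a come from the memoryview, so no NumPy-specific code is needed on the exporting side.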

Regarding "python - How to initialize a NumPy array from a multidimensional buffer?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46692086/
