python - sklearn pipeline ValueError: all the input array dimensions except for the concatenation axis must match exactly

I have a sklearn pipeline that extracts three different features.

from sklearn.pipeline import FeatureUnion, Pipeline

# A_features, B_features, Z_features and segmentation are my own transformers.
manual_feats = Pipeline([
        ('FeatureUnion', FeatureUnion([
            ('segmenting_pip1', Pipeline([
                ('A_features', A_features()),
                ('segmentation', segmentation())
            ])),
            ('segmenting_pip2', Pipeline([
                ('B_features', B_features()),
                ('segmentation', segmentation())
            ])),
            ('segmenting_pip3', Pipeline([
                ('Z_features', Z_features()),
                ('segmentation', segmentation())
            ])),
        ])),
    ])

Features A and B each return an array of dims (# of records, 10, 20), while Z returns (# of records, 10, 15).

When I fit the pipeline with all the features, I get this error:

 File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 451, in _transform
    Xt = transform.transform(Xt)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 829, in transform
    Xs = np.hstack(Xs)
  File "C:\Python35\lib\site-packages\numpy\core\shape_base.py", line 340, in hstack
    return _nx.concatenate(arrs, 1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

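However, if I exclude feature Z, the pipeline works, but the concatenation is applied on axis=1 and the result has dims (# of records, 20, 20). What I want is an array of dims (# of records, 10, 40), where the concatenation is applied on axis=2.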
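The behaviour described above can be reproduced with plain NumPy. This is a minimal sketch, assuming zero-filled stand-in arrays and a made-up record count of 4; it only illustrates that np.hstack joins 3D arrays along axis=1 and fails when another axis differs.

```python
import numpy as np

n_records = 4  # hypothetical record count, for illustration only

a = np.zeros((n_records, 10, 20))  # stands in for the output of A
b = np.zeros((n_records, 10, 20))  # stands in for the output of B
z = np.zeros((n_records, 10, 15))  # stands in for the output of Z

# For arrays with two or more dimensions, np.hstack joins along axis=1.
ab_shape = np.hstack([a, b]).shape

# With Z included, axis=2 differs (20 vs 15), so the join raises ValueError.
try:
    np.hstack([a, b, z])
    hstack_failed = False
except ValueError:
    hstack_failed = True

# Joining along axis=2 directly gives the layout the question asks for.
wanted_shape = np.concatenate([a, b], axis=2).shape

print(ab_shape, hstack_failed, wanted_shape)
```

FeatureUnion calls np.hstack internally (see the traceback), so the axis cannot be chosen from outside without some workaround.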

How can I get what I want using Pipeline, without editing the library's source code?

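EDIT: I mentioned that concatenating A and B produces an array of dims (# of records, 10, 40). That is incorrect; it produces an array of dims (# of records, 20, 20). I will edit the question.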

Best answer

I solved this by creating a transformer that handles the concatenation process.

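import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class append_split_3D(BaseEstimator, TransformerMixin):
    def __init__(self, segments_number=20, max_len=50, mode='append'):
        self.segments_number = segments_number
        self.max_len = max_len
        self.mode = mode
        # Sentinel used for padding; it must never occur in the real feature values.
        self.appending_value = -5.123

    def fit(self, X, y=None):
        return self

    def transform(self, data):
        if self.mode == 'append':
            # Pad axis=2 up to max_len so that FeatureUnion's np.hstack can stack the blocks.
            # A local variable is used here: overwriting self.max_len would break a second transform() call.
            pad_len = self.max_len - data.shape[2]
            appending = np.full((data.shape[0], data.shape[1], pad_len), self.appending_value)
            new = np.concatenate([data, appending], axis=2)
            return new
        elif self.mode == 'split':
            # Cut the stacked array back into blocks of segments_number rows along axis=1.
            tmp = []
            for item in range(0, data.shape[1], self.segments_number):
                tmp.append(data[:, item:(item + self.segments_number), :])
            # Drop the padding from each block, then join the blocks along axis=2.
            tmp = [item[item != self.appending_value].reshape(data.shape[0], self.segments_number, -1) for item in tmp]
            new = np.concatenate(tmp, axis=2)
            return new
        else:
            raise ValueError('Mode value is not defined: {!r}'.format(self.mode))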

The complete pipeline becomes:

from sklearn.pipeline import FeatureUnion, Pipeline

manual_feats = Pipeline([
        ('FeatureUnion', FeatureUnion([
            ('segmenting_pip1', Pipeline([
                ('A_features', A_features()),
                ('segmentation', segmentation()),
                ('append', append_split_3D(max_len=50, mode='append')),
            ])),
            ('segmenting_pip2', Pipeline([
                ('B_features', B_features()),
                ('segmentation', segmentation()),
                ('append', append_split_3D(max_len=50, mode='append')),
            ])),
            ('segmenting_pip3', Pipeline([
                ('Z_features', Z_features()),
                ('segmentation', segmentation()),
                ('append', append_split_3D(max_len=50, mode='append')),
            ])),
        ])),
        # Final step: undo the padding and join the three blocks along axis=2.
        ('split', append_split_3D(segments_number=10, mode='split')),
    ])

Here is what the transformer does. As an example, my features A, B, and Z return the following arrays:

A: (# of records, 10, 20)
B: (# of records, 10, 20)
Z: (# of records, 10, 15)

In mode='append', I pad every array with an extra fixed value up to a maximum length of 50 (as an example), so that they all have the same size along axis=2 and Xs = np.hstack(Xs) can work.
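The padding step can also be written with np.pad. The following is only a sketch: the helper name pad_last_axis, the record count of 4, and the zero-filled stand-in arrays are assumptions for illustration and are not part of the transformer above.

```python
import numpy as np

PAD_VALUE = -5.123  # same sentinel value as in the transformer
MAX_LEN = 50        # example maximum length along axis=2


def pad_last_axis(data, max_len=MAX_LEN, pad_value=PAD_VALUE):
    """Pad a 3D array along axis=2 with pad_value until it is max_len wide."""
    pad_len = max_len - data.shape[2]
    if pad_len < 0:
        raise ValueError("max_len is smaller than the array's last dimension")
    # Pad only at the end of axis=2; axes 0 and 1 are left untouched.
    return np.pad(data, ((0, 0), (0, 0), (0, pad_len)),
                  mode="constant", constant_values=pad_value)


n_records = 4  # hypothetical record count
a = np.zeros((n_records, 10, 20))
b = np.zeros((n_records, 10, 20))
z = np.zeros((n_records, 10, 15))

padded = [pad_last_axis(x) for x in (a, b, z)]
stacked = np.hstack(padded)  # the same call FeatureUnion makes internally

print([p.shape for p in padded], stacked.shape)
```

Once every block has the same width on axis=2, the stacking on axis=1 no longer raises.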

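The FeatureUnion step therefore returns an array of dims (# of records, 30, 50).

Then, in mode='split', which I add at the end of the pipeline, I split that final array of dims (# of records, 30, 50) back into 3 feature arrays of dims (# of records, 10, 50).

Then I remove the extra fixed values and concatenate along the last dim.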

The final array has dims (# of records, 10, 55). 55 is the concatenation of the arrays' third dimensions (20+20+15), which is what I wanted.
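The whole round trip can be checked with NumPy alone. This is a rough sketch, assuming random stand-in features with values in [0, 1) (so they can never equal the sentinel) and a made-up record count of 3; it compares the result against a direct concatenation on axis=2.

```python
import numpy as np

PAD_VALUE = -5.123  # sentinel; assumed never to appear in real feature values
rng = np.random.default_rng(0)
n = 3  # hypothetical record count

# Stand-ins for the outputs of A, B and Z.
feats = [rng.random((n, 10, 20)), rng.random((n, 10, 20)), rng.random((n, 10, 15))]


def pad(x, max_len=50):
    """Append PAD_VALUE along axis=2 until the array is max_len wide."""
    filler = np.full((x.shape[0], x.shape[1], max_len - x.shape[2]), PAD_VALUE)
    return np.concatenate([x, filler], axis=2)


# 'append' mode followed by FeatureUnion's hstack: shape (n, 30, 50).
stacked = np.hstack([pad(x) for x in feats])

# 'split' mode: cut into blocks of 10 rows, drop the padding, join on axis=2.
blocks = [stacked[:, i:i + 10, :] for i in range(0, stacked.shape[1], 10)]
stripped = [blk[blk != PAD_VALUE].reshape(n, 10, -1) for blk in blocks]
result = np.concatenate(stripped, axis=2)

print(stacked.shape, result.shape)
```

The boolean mask only works because every row of a block carries the same number of real values and none of them equals the sentinel; if a real feature could take the value -5.123, the reshape would fail or silently drop data.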

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/58527800/
