python - ValueError: Input array dimensions not right for CountVectorizer()

I get an error when I try to use CountVectorizer with make_column_transformer() in an sklearn pipeline.

My DataFrame has two columns, 'desc-title' and 'SPchangeHigh'. Here is a two-row snippet:

import pandas as pd

features = pd.DataFrame([["T. Rowe Price sells most of its Tesla shares", 0.002152],
                         ["Gannett to retain all seats in MNG proxy fight", 0.002152]],
                        columns=["desc-title", "SPchangeHigh"])

I can run the following pipeline without any problem:

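from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = make_column_transformer(
    (StandardScaler(), ['SPchangeHigh']),
    (OneHotEncoder(), ['desc-title'])
)
preprocess.fit_transform(features.head(2))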

However, when I replace OneHotEncoder() with CountVectorizer(tokenizer=tokenize), it fails:

# tokenize is a user-defined function that is not shown in the question
preprocess = make_column_transformer(
    (StandardScaler(), ['SPchangeHigh']),
    (CountVectorizer(tokenizer=tokenize), ['desc-title'])
)
preprocess.fit_transform(features.head(2))

This is the error I get:
ValueError                                Traceback (most recent call last)
<ipython-input-71-d77f136b9586> in <module>()
      3     ( CountVectorizer(tokenizer=tokenize),['desc-title'])
      4 )
----> 5 preprocess.fit_transform(features.head(2))

C:\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
    488         self._validate_output(Xs)
    489
--> 490         return self._hstack(list(Xs))
    491
    492     def transform(self, X):

C:\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _hstack(self, Xs)
    545         else:
    546             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 547             return np.hstack(Xs)
    548
    549

C:\anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
    338         return _nx.concatenate(arrs, 0)
    339     else:
--> 340         return _nx.concatenate(arrs, 1)
    341
    342

ValueError: all the input array dimensions except for the concatenation axis must match exactly

I would appreciate it if someone could help me.

Best answer

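Remove the brackets around 'desc-title'. CountVectorizer expects a one-dimensional iterable of strings, so you need a one-dimensional array, not a column vector.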

preprocess = make_column_transformer(
    (StandardScaler(), ['SPchangeHigh']),
    (CountVectorizer(), 'desc-title')
)
preprocess.fit_transform(features.head(2))

The sklearn documentation describes this nuance:

The difference between specifying the column selector as 'column' (as a simple string) and ['column'] (as a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one dimensional array will be passed, while in the second case it will be a 2-dimensional array with one column, i.e. a column vector

...

Be aware that some transformers expect a 1-dimensional input (the label-oriented ones) while some others, like OneHotEncoder or Imputer, expect 2-dimensional input, with the shape [n_samples, n_features].
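
To make the shape difference concrete, here is a minimal sketch. It assumes pandas and scikit-learn are installed, and it uses the default tokenizer rather than the asker's custom tokenize function, which is not shown. Iterating over a DataFrame yields its column labels, so a 2-dimensional selection is treated by CountVectorizer as a single document named after the column. Its output then has one row while the StandardScaler output has two, which appears to be what makes np.hstack fail.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame(
    [["T. Rowe Price sells most of its Tesla shares", 0.002152],
     ["Gannett to retain all seats in MNG proxy fight", 0.002152]],
    columns=["desc-title", "SPchangeHigh"],
)

one_d = features["desc-title"]    # Series: what the selector 'desc-title' passes
two_d = features[["desc-title"]]  # DataFrame: what the selector ['desc-title'] passes

# Iterating a Series yields its values: one document per row.
good = CountVectorizer().fit_transform(one_d)

# Iterating a DataFrame yields column labels: a single document, "desc-title".
bad = CountVectorizer().fit_transform(two_d)

print("1-D input ->", good.shape[0], "row(s)")
print("2-D input ->", bad.shape[0], "row(s)")

# With the string selector, both transformers return two rows and can be stacked.
preprocess = make_column_transformer(
    (StandardScaler(), ["SPchangeHigh"]),
    (CountVectorizer(), "desc-title"),
)
combined = preprocess.fit_transform(features)
print("combined rows:", combined.shape[0])
```

OneHotEncoder worked with ['desc-title'] because it expects the 2-dimensional [n_samples, n_features] layout, whereas CountVectorizer consumes a flat sequence of strings.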

Regarding python - ValueError: Input array dimensions not right for CountVectorizer(), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56298242/
