python - Problems fitting the vocabulary in scikit-learn?

Tags: python machine-learning nlp scikit-learn

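I have a directory full of .txt files (documents). First I load the documents, strip some brackets, and remove some quotes, so the documents look like this, for example: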

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods

So I load the files from the directory like this:

preprocessDocuments =[[' '.join(x) for x in sample[:-1]] for sample in load(directory)]


documents = ''.join( i for i in ''.join(str(v) for v
                                              in preprocessDocuments) if i not in "',()")

Then I try to vectorize document1 and document2 to create a training matrix, as follows:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()

And this is the output:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

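Given this, how can I create a vector representation? I thought the loaded files were held in documents, but it seems the vectorizer cannot be fit on them.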

Best answer

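What does documents contain? It looks like it should be a list of file names or a list of strings containing the tokens. Also, you should call fit_transform on the object rather than as if it were a static method, i.e. vectorizer.fit_transform(documents).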
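To illustrate the first point, here is a minimal sketch in plain Python. The nested list `preprocessed` is a hypothetical stand-in for what the asker's `preprocessDocuments` might hold (the real `load()` is not shown in the question), so treat the data as an assumption. It shows that the original joining expression collapses everything into one string, while joining per document keeps a list with one string per file.

```python
# Hypothetical stand-in for preprocessDocuments: one inner list of
# text chunks per file (the real load() output is an assumption).
preprocessed = [
    ["is a scientific discipline", "that explores the construction"],
    ["Machine learning can be considered", "a subfield of computer science"],
]

# The expression from the question: everything ends up in ONE string,
# so the boundaries between documents are lost.
as_single_string = "".join(
    ch for ch in "".join(str(v) for v in preprocessed) if ch not in "',()"
)

# Joining each inner list separately keeps one string per document,
# which is the shape a vectorizer expects.
documents = [" ".join(chunks) for chunks in preprocessed]

print(type(as_single_string).__name__)  # str
print(len(documents))                   # 2
```

A list like `documents` here is what should be passed to the vectorizer.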

For example, this works here:

from sklearn.feature_extraction.text import HashingVectorizer
documents=['this is a test', 'another test']
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)
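The same pattern can be applied to text resembling the two documents from the question. This is only a sketch: the strings are shortened copies of the sample text, and `n_features=2**10` is an assumption made so the dense array stays small (the default is 2**20 columns).

```python
from sklearn.feature_extraction.text import HashingVectorizer

# One string per document, not a single concatenated string.
documents = [
    "is a scientific discipline that explores the construction and study of algorithms",
    "Machine learning can be considered a subfield of computer science and statistics",
]

# Small n_features chosen only for this demo (assumption).
vectorizer = HashingVectorizer(analyzer="word", n_features=2**10)

# Called on the instance, not on the class. HashingVectorizer is
# stateless, so transform() would work equally well here.
X = vectorizer.fit_transform(documents)

print(X.shape)  # (2, 1024)
```

Each row of `X` corresponds to one document, so the number of rows is a quick check that the input really was a list of documents.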

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27631797/
