python - Scikit Learn - Computing TF-IDF from a corpus of feature arrays instead of a corpus of raw documents

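Scikit-Learn's TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features. Instead of raw documents, I would like to convert a matrix of feature names into TF-IDF features.

The corpus you feed to fit_transform() is supposed to be an array of raw documents, but I would like to be able to feed it (or a comparable function) an array of feature arrays, one per document. For example: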

corpus = [
    ['orange', 'red', 'blue'],
    ['orange', 'yellow', 'red'],
    ['orange', 'green', 'purple (if you believe in purple)'],
    ['orange', 'reddish orange', 'black and blue']
]

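... as opposed to a one-dimensional array of strings.

I know that I can define my own vocabulary for TfidfVectorizer to use, so I could easily build a mapping from the unique features in my corpus to their indices in the feature vectors. But the function still expects raw documents, and since my features are of varying lengths and occasionally overlap (for example, 'orange' and 'reddish orange'), I can't just concatenate my features into single strings and use n-grams.

Is there a different Scikit-Learn function I can use for this that I'm not finding? Is there a way to use TfidfVectorizer that I'm not seeing? Or will I have to write my own TF-IDF function to do this?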

Best answer

You can write custom functions to override the built-in preprocessor and tokenizer.

From the documentation:

Preprocessor - A callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.

Tokenizer - A callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.
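To make these two hooks concrete, here is a small sketch of what the default preprocessor and tokenizer do to a multi-word feature. It relies on the vectorizer's build_preprocessor() and build_tokenizer() methods; the sample string 'Reddish Orange' is made up for illustration, and details may differ slightly between scikit-learn versions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A vectorizer with default settings (lowercase=True, default token_pattern).
vectorizer = TfidfVectorizer()
preprocess = vectorizer.build_preprocessor()
tokenize = vectorizer.build_tokenizer()

# Hypothetical sample feature, not taken from the question's corpus.
feature = 'Reddish Orange'

preprocessed = preprocess(feature)  # the default preprocessor lowercases the string
tokens = tokenize(preprocessed)     # the default tokenizer splits it into words

print(preprocessed)  # reddish orange
print(tokens)        # ['reddish', 'orange']
```

With the defaults, a feature such as 'reddish orange' is broken into two separate tokens, which is why both steps need to be replaced when the features must stay intact.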

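In this case there is no preprocessing to perform (because there are no raw documents). Tokenization is also unnecessary, because we already have arrays of features. Therefore, we can do the following: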

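from sklearn.feature_extraction.text import TfidfVectorizer

# Identity functions: each document is already a list of features.
tfidf = TfidfVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)
tfidf_matrix = tfidf.fit_transform(corpus)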

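We skip both the preprocessor and the tokenizer steps by passing each document through unchanged with lambda x: x. Once the built-in analyzer receives the arrays of features, it builds the vocabulary itself and performs TF-IDF on the "tokenized" corpus as usual.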
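As a quick check, the following sketch runs this approach on the corpus from the question and inspects the learned vocabulary. The helper name identity is my own choice, not part of scikit-learn. Passing token_pattern=None is an assumption on my part: it is meant to avoid the warning that recent scikit-learn versions emit when a custom tokenizer is supplied. The second vectorizer, which passes a callable as analyzer, is an alternative I am adding for comparison and was not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    ['orange', 'red', 'blue'],
    ['orange', 'yellow', 'red'],
    ['orange', 'green', 'purple (if you believe in purple)'],
    ['orange', 'reddish orange', 'black and blue'],
]


def identity(doc):
    """Hypothetical helper: return the list of features unchanged."""
    return doc


# Approach from the answer: bypass preprocessing and tokenization.
# token_pattern=None is assumed here only to silence the "token_pattern
# will not be used" warning in recent versions.
tfidf = TfidfVectorizer(preprocessor=identity, tokenizer=identity,
                        token_pattern=None)
tfidf_matrix = tfidf.fit_transform(corpus)

# Multi-word features are kept whole as vocabulary entries.
print(sorted(tfidf.vocabulary_))
print(tfidf_matrix.shape)  # (4, 8): 4 documents, 8 distinct features

# Alternative sketch: a callable analyzer replaces the whole analysis step.
tfidf_alt = TfidfVectorizer(analyzer=identity)
alt_matrix = tfidf_alt.fit_transform(corpus)
```

A named function is used instead of a lambda mainly so that the fitted vectorizer can be pickled, since lambdas cannot be serialized with the standard pickle module. Note also that a custom preprocessor replaces the default lowercasing, so 'Orange' and 'orange' would be treated as different features.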

For more on this topic, see the original question on Stack Overflow: https://stackoverflow.com/questions/32591629/
