python - Confusion about a TF-IDF calculation

Tags: python data-mining text-processing information-retrieval tf-idf

I found the following code for computing TF-IDF online:

https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1 +" to the denominator in the function def idf(word, documentList) so that I would not get a division-by-zero error:

return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things:

1. In some cases I get negative values. Is that correct?
2. I am confused by lines 62, 63, and 64.

Code:

documentNumber = 0
for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word, documentList[documentNumber], documentList)

Should TF-IDF be computed only on the first document?

Best answer

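No. tf-idf is tf, a non-negative value, multiplied by idf, a non-negative value, so it can never be negative. This code appears to implement the erroneous definition of tf-idf that stood on Wikipedia for years (it has since been fixed).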
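To make the sign issue concrete, here is a minimal sketch, not the linked script itself. It assumes whitespace tokenization and a tiny made-up corpus, and the helper names (num_docs_containing, idf_plus_one, idf_standard, tf) are hypothetical. With "1 +" in the denominator, a word that occurs in every one of N documents gets log(N / (1 + N)), which is below zero; with the plain log(N / df) form the same word gets exactly 0. The last part scores every document rather than only documentList[0], which is all the snippet in the question does because documentNumber is fixed at 0.

```python
import math


def num_docs_containing(word, documents):
    """Count documents whose whitespace-split tokens include `word`."""
    return sum(1 for doc in documents if word in doc.split())


def idf_plus_one(word, documents):
    """Variant from the question: 1 is added to the denominator."""
    return math.log(len(documents) / (1 + num_docs_containing(word, documents)))


def idf_standard(word, documents):
    """Plain log(N / df), which is never negative because df <= N."""
    df = num_docs_containing(word, documents)
    if df == 0:
        # Assumption for this sketch: a word seen in no document gets weight 0.
        return 0.0
    return math.log(len(documents) / df)


def tf(word, document):
    """Relative frequency of `word` in a non-empty document."""
    tokens = document.split()
    return tokens.count(word) / len(tokens)


# Made-up corpus: "the" occurs in all three documents.
documents = ["the cat sat", "the dog ran", "the bird flew"]

print(idf_plus_one("the", documents))   # log(3/4), a negative number
print(idf_standard("the", documents))   # 0.0

# Score every document, not just the first one.
scores = [
    {word: tf(word, doc) * idf_standard(word, documents) for word in set(doc.split())}
    for doc in documents
]
for number, doc_scores in enumerate(scores):
    print(number, doc_scores)
```

Guarding the df == 0 case explicitly is one way to avoid the division by zero without shifting every idf value; other smoothing schemes exist and change the numbers differently.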

The original question and answer are on Stack Overflow: https://stackoverflow.com/questions/16648599/
