python - Equivalent of R's removeSparseTerms in Python

Tags: python r machine-learning scikit-learn tm

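We are working on a data mining project and use the removeSparseTerms function from the tm package in R to reduce the number of features in our document-term matrix.

However, we would like to port the code to Python. Is there a function in sklearn, nltk, or some other package that provides the same functionality?

Thanks!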

Best answer

If your data is plain text, you can use CountVectorizer to get this job done.

For example:

from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 drops every term that appears in fewer than 2 documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# maps each kept term to its column index:
# 'document': 0, 'first': 1, 'is': 2, 'the': 3, 'this': 4
X = vectorizer.transform(corpus)

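X is now the document-term matrix. (If you work in information retrieval, you will also want to consider Tf–idf term weighting.)

CountVectorizer gets you a document-term matrix in just a few lines of code.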
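
As a minimal sketch of the Tf–idf option mentioned above: scikit-learn's TfidfVectorizer accepts the same min_df parameter as CountVectorizer, so the rare-term filter and the weighting can happen in one step. The corpus is the one from the example above; the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Same filtering rule as CountVectorizer(min_df=2), but the matrix
# holds Tf-idf weights instead of raw counts.
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)

print(sorted(tfidf.vocabulary_))  # the terms that survived min_df
print(X_tfidf.shape)              # (documents, kept terms)
```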

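As for sparsity, you can control it with these parameters:

min_df - the minimum document frequency a term must have to be kept in the document-term matrix.
max_features - the maximum number of features allowed in the document-term matrix.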
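
The two parameters can be sketched as follows, assuming the same four-document corpus as above. min_df accepts either an absolute document count (int) or a proportion of documents (float), and max_features keeps the terms with the highest total counts. The variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df as an int: keep terms found in at least 3 documents.
by_count = CountVectorizer(min_df=3).fit(corpus)

# min_df as a float: keep terms found in at least 50% of the documents.
by_share = CountVectorizer(min_df=0.5).fit(corpus)

# max_features: keep only the 2 terms with the highest total count.
top_terms = CountVectorizer(max_features=2).fit(corpus)

print(sorted(by_count.vocabulary_))
print(sorted(by_share.vocabulary_))
print(sorted(top_terms.vocabulary_))
```

The float form is the closer match to a sparsity-style threshold, because it scales with the number of documents.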

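Alternatively, if you already have a document-term matrix or a Tf-idf matrix and you know what counts as sparse for your data, define MIN_VAL_ALLOWED and then do the following: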

import numpy as np
from scipy.sparse import csr_matrix
MIN_VAL_ALLOWED = 2

X = csr_matrix([[7, 8, 0],
                [2, 1, 1],
                [5, 5, 0]])

# z is a boolean mask of the non-sparse terms (column sum above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))

print(X[:, z].toarray())
# prints X without the third term (as it is sparse)
# [[7 8]
#  [2 1]
#  [5 5]]

(Use X = X[:,z] so that X remains a csr_matrix.)
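
Once columns are dropped this way, the column indices no longer match the vectorizer's vocabulary. One way to keep the term names aligned, sketched here on the assumption that the matrix came from a CountVectorizer, is to apply the same mask z to an array of terms ordered by column index. The names terms, X_kept and terms_kept are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

MIN_VAL_ALLOWED = 2

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps term -> column index, so sorting the terms by that
# index gives an array where terms[i] names column i of X.
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# Same mask as above: True for columns whose total count exceeds the threshold.
z = np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED).ravel()

X_kept = X[:, z]       # still a sparse matrix
terms_kept = terms[z]  # names of the surviving columns, in column order

print(terms_kept.tolist())
```

With this corpus and threshold, the surviving terms are document, is, the and this.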

If what you want to threshold is the minimum document frequency, binarize the matrix first and then apply the same approach:

import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0,   0]])

# Binarize a copy of the data so that X itself keeps its weights
B = csr_matrix(X, copy=True)
B[B > 0] = 1
# Column sums of B are document frequencies; z marks the terms to keep
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# prints
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]

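In this example the third and fourth terms (columns) are dropped: they appear in only two documents and one document (rows) respectively, which does not exceed MIN_DF_ALLOWED. Adjust MIN_DF_ALLOWED to set the threshold.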
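
To get closer to the R call itself, the document-frequency filter can be wrapped in a helper that takes a sparsity level instead of an absolute count. This is only a sketch: remove_sparse_terms is a hypothetical name, not a function in sklearn or scipy, and it assumes that tm's removeSparseTerms(x, sparse) keeps the terms whose document frequency is greater than n_docs * (1 - sparse). Check that rule against the tm documentation before relying on it.

```python
import numpy as np
from scipy.sparse import csr_matrix


def remove_sparse_terms(X, sparse):
    """Hypothetical helper mimicking tm's removeSparseTerms.

    Assumption: a term is kept when its document frequency is greater
    than n_docs * (1 - sparse). Returns the filtered matrix and the
    boolean column mask that was applied.
    """
    if not 0 < sparse < 1:
        raise ValueError("sparse must be strictly between 0 and 1")
    X = csr_matrix(X)
    n_docs = X.shape[0]
    # Number of documents (rows) in which each term (column) is positive.
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
    keep = doc_freq > n_docs * (1 - sparse)
    return X[:, keep], keep


X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0,   0]])

# Document frequencies of the four columns are 3, 3, 2 and 1.
X_strict, mask_strict = remove_sparse_terms(X, sparse=0.2)  # needs df > 2.4
X_loose, mask_loose = remove_sparse_terms(X, sparse=0.5)    # needs df > 1.5

print(mask_strict)
print(mask_loose)
```

Returning the mask alongside the matrix lets the caller filter a list of term names with the same selection.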

The original question, "python - Equivalent of R's removeSparseTerms in Python", can be found on Stack Overflow: https://stackoverflow.com/questions/31109464/
