python - Clustering text documents using scikit-learn kmeans in Python

Tags: python python-2.7 scikit-learn cluster-analysis k-means

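I need to use scikit-learn's KMeans to cluster text documents. The example code works fine as it is, but it takes some 20newsgroups data as input. I want to use the same code to cluster a list of documents like the one below: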

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

What changes do I need to make to the KMeans example code so that it uses this list as input? (Simply setting "dataset = documents" does not work.)

Best answer

Here is a simpler example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features:

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
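To see what the vectorization step produces, the sketch below inspects the resulting matrix. It assumes the same nine documents and the same TfidfVectorizer settings as above; the variable names n_docs and n_terms are only illustrative.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# X has one row per document and one column per vocabulary term
n_docs, n_terms = X.shape
print("documents:", n_docs, "terms:", n_terms)

# The result is a SciPy sparse matrix, which KMeans accepts directly
print("sparse:", sparse.issparse(X))

# vocabulary_ maps each (lowercased) term to its column index in X
print(sorted(vectorizer.vocabulary_)[:5])
```

This is also why passing the raw list of strings to KMeans does not work: the estimator expects a numeric feature matrix, not text.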

Cluster the documents:

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
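Once the model is fitted, the cluster assignments can be read from labels_, and unseen text can be assigned to a cluster. The following is a minimal sketch under a few assumptions: random_state=0 is added only to make runs repeatable, and the new_docs sentence is a made-up example that is not part of the original question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# random_state is an assumption added here for reproducibility
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1,
               random_state=0)
model.fit(X)

# labels_ holds the cluster index assigned to each training document
for label, doc in zip(model.labels_, documents):
    print(label, doc)

# New text must pass through the already fitted vectorizer
# (transform, not fit_transform) so the columns match X
new_docs = ["A survey of graph trees"]  # hypothetical example sentence
Y = vectorizer.transform(new_docs)
prediction = model.predict(Y)
print("Predicted cluster:", prediction[0])
```

Cluster indices are arbitrary, so which group ends up as 0 or 1 can change with the initialization.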

Print the top terms per cluster:

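print("Top terms per cluster:")
# Sort the term indices of each centroid by descending weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onward
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()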
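The example imports adjusted_rand_score but never uses it. If reference labels are available, it can be used to check how well the clustering matches them. The sketch below assumes a labeling that is not given in the original post: the first five documents are treated as one topic (human-computer interaction) and the last four as another (graph theory). The settings n_init=10 and random_state=0 are also assumptions chosen for a more stable result.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Assumed reference labels (hypothetical): 0 = HCI topic, 1 = graph topic
true_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

# fit_predict fits the model and returns one cluster index per document
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10,
               random_state=0)
predicted = model.fit_predict(X)

# The adjusted Rand index ignores how clusters are numbered:
# 1.0 means identical groupings, values near 0 mean chance-level agreement
score = adjusted_rand_score(true_labels, predicted)
print("Adjusted Rand index: %.3f" % score)
```

With only nine short documents the score can vary noticeably between initializations, so it is best read as a rough sanity check.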

If you want a more visual idea of what this looks like, see this answer.

A similar question about clustering text documents using scikit-learn kmeans in Python can be found on Stack Overflow: https://stackoverflow.com/questions/27889873/
