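python - How to cluster a list of lists of (tag, probability) tuples?

Tags: python numpy machine-learning scikit-learn

I have a bunch of texts that have been classified into categories; each document is tagged 0, 1, or 2, with a probability for each tag.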

[ "this is a foo bar",
  "bar bar black sheep",
  "sheep is an animal",
  "foo foo bar bar",
  "bar bar sheep sheep" ]

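The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list is one document. All I have to work with is the knowledge that each document is tagged 0, 1, or 2, together with the probabilities: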

[ [(0,0.3), (1,0.5), (2,0.1)],
  [(0,0.5), (1,0.3), (2,0.3)],
  [(0,0.4), (1,0.4), (2,0.5)],
  [(0,0.3), (1,0.7), (2,0.2)],
  [(0,0.2), (1,0.6), (2,0.1)] ]

I need to determine which tag is most probable in each list of tuples and produce:

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
  [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
  [[(0,0.4), (1,0.4), (2,0.5)]] ]

Another example:

[in]:

[ [(0,0.7), (1,0.2), (2,0.4)],
  [(0,0.5), (1,0.9), (2,0.3)],
  [(0,0.3), (1,0.8), (2,0.4)],
  [(0,0.8), (1,0.2), (2,0.2)],
  [(0,0.1), (1,0.7), (2,0.5)] ]

[out]:

 [[[(0,0.7), (1,0.2), (2,0.4)],
 [(0,0.8), (1,0.2), (2,0.2)]] ,

 [[(0,0.5), (1,0.9), (2,0.3)],
 [(0,0.1), (1,0.7), (2,0.5)],
 [(0,0.3), (1,0.8), (2,0.4)]] ,

 []]

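Note: by the time the data reaches my part of the pipeline, I no longer have access to the original text.

How can I cluster a list of lists of tuples with tags and probabilities? Is there something in numpy, scipy, sklearn, or any other Python ML suite that can do this? Or even NLTK?

Assume the number of clusters is fixed, but the cluster sizes are not.

I have only tried finding the maximum value as a centroid, but that gives me just the first value in each cluster: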

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]

# Find centroid.  
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])

[out] (the top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]

Best answer

If I understand correctly, this is what you want.

import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)

# drop the tags; only the probabilities matter here
values = [[pair[1] for pair in doc] for doc in instream]

# assign each document to the cluster with the highest probability
# (list comprehensions instead of map(), which is lazy in Python 3)
belongs_to = np.array([np.argmax(probs) for probs in values])

# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]   

The output, out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],

 [[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
  [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
  [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],

 [[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
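Note that the output above contains nested lists with float tags, because converting to a numpy array discards the tuples and the integer type of the tags. If the original (tag, probability) tuples need to be preserved, a plain-Python sketch along the following lines may be enough. The function name group_by_top_tag is hypothetical, and the sketch assumes that the tags are exactly the integers 0 to n_types - 1; ties go to the tag listed first.

```python
def group_by_top_tag(docs, n_types):
    """Group documents by the tag that has the highest probability.

    Assumes tags are the integers 0 .. n_types - 1 (an assumption of
    this sketch). On a tie, the tag that appears first in the document wins.
    """
    clusters = [[] for _ in range(n_types)]
    for doc in docs:
        # pick the (tag, probability) pair with the largest probability
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


instream = [
    [(0, 0.3), (1, 0.5), (2, 0.1)],
    [(0, 0.5), (1, 0.3), (2, 0.3)],
    [(0, 0.4), (1, 0.4), (2, 0.5)],
    [(0, 0.3), (1, 0.7), (2, 0.2)],
    [(0, 0.2), (1, 0.6), (2, 0.1)],
]

out = group_by_top_tag(instream, 3)
for cluster in out:
    print(cluster)
```

A cluster that receives no documents stays an empty list, which matches the [] in the second example of the question.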

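I used numpy arrays because they support convenient searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns an array of the indices into belongs_to at which the value is 1. An example of indexing with such an array is instream[cluster_indices[2]].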
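To make the indexing step concrete, here is a small stand-alone sketch. The belongs_to values are hard-coded to what the argmax step yields for the example data (an assumption made so the snippet runs on its own), and the variable names mask and indices are illustrative only.

```python
import numpy as np

# cluster assignment of the five example documents (hard-coded here)
belongs_to = np.array([1, 0, 2, 1, 1])

# element-wise comparison gives a boolean mask, one entry per document
mask = belongs_to == 1

# nonzero() returns a tuple with one index array per dimension;
# belongs_to is 1-D, so [0] selects the only index array
indices = mask.nonzero()[0]

print(mask.tolist())     # [False, True, False, False, False] would mean cluster 0; here we test cluster 1
print(indices.tolist())
```

The resulting index array can be passed directly to another numpy array, as in instream[cluster_indices[k]], to pull out all documents of cluster k at once.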

Regarding "python - How to cluster a list of lists of (tag, probability) tuples?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20990538/
