python - How to cluster a list of lists of (tag, probability) tuples?

I have a bunch of texts that have been classified into categories; each document is tagged 0, 1, or 2, and each tag comes with a probability.

[ "this is a foo bar",
  "bar bar black sheep",
  "sheep is an animal",
  "foo foo bar bar",
  "bar bar sheep sheep" ]

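The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list is one document. All I have to work with is the fact that each document is tagged 0, 1, or 2, together with the probabilities: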

[ [(0,0.3), (1,0.5), (2,0.1)],
  [(0,0.5), (1,0.3), (2,0.3)],
  [(0,0.4), (1,0.4), (2,0.5)],
  [(0,0.3), (1,0.7), (2,0.2)],
  [(0,0.2), (1,0.6), (2,0.1)] ]

I need to find which tag is most probable in each list of tuples and end up with:

[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
  [[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
  [[(0,0.4), (1,0.4), (2,0.5)]] ]

Another example:

[in]:

[ [(0,0.7), (1,0.2), (2,0.4)],
  [(0,0.5), (1,0.9), (2,0.3)],
  [(0,0.3), (1,0.8), (2,0.4)],
  [(0,0.8), (1,0.2), (2,0.2)],
  [(0,0.1), (1,0.7), (2,0.5)] ]

[out]:

 [[[(0,0.7), (1,0.2), (2,0.4)],
 [(0,0.8), (1,0.2), (2,0.2)]] ,

 [[(0,0.5), (1,0.9), (2,0.3)],
 [(0,0.1), (1,0.7), (2,0.5)],
 [(0,0.3), (1,0.8), (2,0.4)]] ,

 []]

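Note: I don't have access to the original text by the time the data reaches my part of the pipeline.

How do I cluster a list of lists of tuples by tag and probability? Is there something in numpy, scipy, sklearn, or any other Python-capable ML suite that can do this? Or even NLTK?

Assume the number of clusters is fixed, but the cluster sizes are not.

So far I have only tried taking the maximum as the centroid, but that only gives me the first value in each cluster: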

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]

# Find centroid.  
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]

c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]

print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])

[out] (the top element in each cluster):

[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]

Best answer

If I understood correctly, this is what you want.

import numpy as np

N_TYPES = 3

instream = [ [(0,0.3), (1,0.5), (2,0.1)],
             [(0,0.5), (1,0.3), (2,0.3)],
             [(0,0.4), (1,0.4), (2,0.5)],
             [(0,0.3), (1,0.7), (2,0.2)],
             [(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)

# this removes document tags because we only consider probabilities here
values = instream[:, :, 1]

# determine the cluster of each document by using maximum probability
belongs_to = np.argmax(values, axis=1)

# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]

# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]

The output, out:

[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],

 [[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
  [[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
  [[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],

 [[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
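
One thing to note about the output above: converting the input to a NumPy array turns every tuple into a list of floats, so tag 0 comes back as 0.0. If the original tuples should survive untouched, a plain-Python sketch along the same lines might look like the following. The helper name group_by_top_tag is hypothetical, and the sketch assumes the tags are the integers 0 to n_types - 1 so that they can be used directly as cluster indices.

```python
def group_by_top_tag(docs, n_types):
    """Group documents by the tag that has the highest probability."""
    # Assumption: tags are the integers 0 .. n_types - 1.
    clusters = [[] for _ in range(n_types)]
    for doc in docs:
        # max() returns the first (tag, probability) pair with the largest
        # probability, so a tie goes to the pair that appears first.
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


# Second example from the question: no document is most likely tag 2,
# so the last cluster stays empty.
docs = [
    [(0, 0.7), (1, 0.2), (2, 0.4)],
    [(0, 0.5), (1, 0.9), (2, 0.3)],
    [(0, 0.3), (1, 0.8), (2, 0.4)],
    [(0, 0.8), (1, 0.2), (2, 0.2)],
    [(0, 0.1), (1, 0.7), (2, 0.5)],
]
clusters = group_by_top_tag(docs, 3)
```

Documents keep their input order inside each cluster, and the tuples are the same objects that came in, so nothing is converted to float.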

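I used numpy arrays because they support nice searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns an array of the indices into belongs_to where the value is 1. An example of indexing is instream[cluster_indices[2]].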
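
To make the indexing step concrete, here is a small sketch that breaks the expression into its parts. The belongs_to values are the cluster assignments for the five example documents; the docs array of labels is a hypothetical stand-in for instream, used only to keep the example short.

```python
import numpy as np

# Cluster assignment of the five example documents (argmax of each row).
belongs_to = np.array([1, 0, 2, 1, 1])

# Comparing an array with a scalar gives a boolean array of the same shape.
mask = belongs_to == 1

# nonzero() returns a tuple with one index array per dimension;
# [0] picks the index array for the only dimension here.
indices = mask.nonzero()[0]

# Indexing with an integer array selects those positions.
# docs is a stand-in for instream, not part of the original answer.
docs = np.array(["d0", "d1", "d2", "d3", "d4"])
selected = docs[indices]
```

Here indices holds the positions 0, 3 and 4, which are the three documents whose most probable tag is 1.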

This is based on a similar question found on Stack Overflow about how to cluster a list of lists of (tag, probability) tuples in Python: https://stackoverflow.com/questions/20990538/
