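I have a collection of texts that have been classified into categories. Each document is tagged 0, 1, or 2, and each tag comes with a probability.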
[ "this is a foo bar",
"bar bar black sheep",
"sheep is an animal",
"foo foo bar bar",
"bar bar sheep sheep" ]
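The previous tool in the pipeline returns a list of lists of tuples, where each element of the outer list is a document. All I have to work with is the fact that each document is tagged 0, 1, or 2, together with the probability of each tag: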
[ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
I need to find which tag is most probable in each list of tuples and group the documents accordingly, to get:
[ [[(0,0.5), (1,0.3), (2,0.3)], [(0,0.4), (1,0.4), (2,0.5)]] ,
[[(0,0.3), (1,0.7), (2,0.2)], [(0,0.2), (1,0.6), (2,0.1)]] ,
[[(0,0.4), (1,0.4), (2,0.5)]] ]
Another example:
[in]:
[ [(0,0.7), (1,0.2), (2,0.4)],
[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.3), (1,0.8), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)],
[(0,0.1), (1,0.7), (2,0.5)] ]
[out]:
[[[(0,0.7), (1,0.2), (2,0.4)],
[(0,0.8), (1,0.2), (2,0.2)]] ,
[[(0,0.5), (1,0.9), (2,0.3)],
[(0,0.1), (1,0.7), (2,0.5)],
[(0,0.3), (1,0.8), (2,0.4)]] ,
[]]
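Note: by the time the data reaches my part of the pipeline, I no longer have access to the original text.
How do I cluster a list of lists of (tag, probability) tuples? Is there something in numpy, scipy, sklearn, or any other Python-capable ML suite that can do this? Or even NLTK?
Assume the number of clusters is fixed, but the cluster sizes are not.
So far I have only tried finding the maximum for each centroid, but that gives me only the top element of each cluster: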
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
# Find centroid.
c1_centroid_value = sorted([i[0] for i in instream], reverse=True)[0]
c2_centroid_value = sorted([i[1] for i in instream], reverse=True)[0]
c3_centroid_value = sorted([i[2] for i in instream], reverse=True)[0]
c1_centroid = [i for i,j in enumerate(instream) if j[0] == c1_centroid_value][0]
c2_centroid = [i for i,j in enumerate(instream) if j[1] == c2_centroid_value][0]
c3_centroid = [i for i,j in enumerate(instream) if j[2] == c3_centroid_value][0]
print(instream[c1_centroid])
print(instream[c2_centroid])
print(instream[c3_centroid])
[out] (the top element of each cluster):
[(0, 0.5), (1, 0.3), (2, 0.3)]
[(0, 0.3), (1, 0.7), (2, 0.2)]
[(0, 0.4), (1, 0.4), (2, 0.5)]
Best answer
If I understand correctly, this is what you want.
import numpy as np
N_TYPES = 3
instream = [ [(0,0.3), (1,0.5), (2,0.1)],
[(0,0.5), (1,0.3), (2,0.3)],
[(0,0.4), (1,0.4), (2,0.5)],
[(0,0.3), (1,0.7), (2,0.2)],
[(0,0.2), (1,0.6), (2,0.1)] ]
instream = np.array(instream)
# drop the tags and keep only the probabilities of each document
# (list comprehensions instead of map(), so this also works on Python 3)
values = [[prob for _, prob in doc] for doc in instream]
# assign each document to the cluster with the maximum probability
belongs_to = np.array([np.argmax(probs) for probs in values])
# construct clusters of indices to your instream
cluster_indices = [(belongs_to == k).nonzero()[0] for k in range(N_TYPES)]
# apply the indices to obtain full output
out = [instream[cluster_indices[k]].tolist() for k in range(N_TYPES)]
Output of out:
[[[[0.0, 0.5], [1.0, 0.3], [2.0, 0.3]]],
[[[0.0, 0.3], [1.0, 0.5], [2.0, 0.1]],
[[0.0, 0.3], [1.0, 0.7], [2.0, 0.2]],
[[0.0, 0.2], [1.0, 0.6], [2.0, 0.1]]],
[[[0.0, 0.4], [1.0, 0.4], [2.0, 0.5]]]]
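Note that the tags in this output are floats (0.0, 1.0, 2.0) because np.array stores tags and probabilities together in one float array. If the original (int, float) tuples need to survive unchanged, a plain-Python variant may be enough. The following is only a sketch and not part of the original answer: the function name cluster_by_top_tag and its n_types parameter are hypothetical, and it assumes every tag is an integer in range(n_types).

```python
def cluster_by_top_tag(docs, n_types=3):
    """Group documents by the tag that has the highest probability.

    Assumes each document is a list of (tag, probability) tuples and that
    every tag is an int in range(n_types). On ties, the first maximum wins.
    """
    clusters = [[] for _ in range(n_types)]
    for doc in docs:
        # max() compares by probability only and returns the first maximum
        top_tag, _ = max(doc, key=lambda pair: pair[1])
        clusters[top_tag].append(doc)
    return clusters


instream = [[(0, 0.3), (1, 0.5), (2, 0.1)],
            [(0, 0.5), (1, 0.3), (2, 0.3)],
            [(0, 0.4), (1, 0.4), (2, 0.5)],
            [(0, 0.3), (1, 0.7), (2, 0.2)],
            [(0, 0.2), (1, 0.6), (2, 0.1)]]

clusters = cluster_by_top_tag(instream)
for tag, members in enumerate(clusters):
    print(tag, members)
```

A cluster that no document falls into simply stays an empty list, which matches the empty third cluster in the second example of the question.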
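I used numpy arrays because they support convenient searching and indexing. For example, the expression (belongs_to == 1).nonzero()[0] returns an array of the indices into belongs_to where the value is 1. An example of indexing with such an array is instream[cluster_indices[2]].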
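To make the two indexing steps concrete, here is a small standalone sketch. The belongs_to labels are the ones argmax produces for the five example documents; the names mask, indices, probs and selected are introduced here for illustration only.

```python
import numpy as np

# cluster label of each of the five example documents
belongs_to = np.array([1, 0, 2, 1, 1])

mask = belongs_to == 1        # boolean array, True where the label is 1
indices = mask.nonzero()[0]   # positions of the True entries
print(mask.tolist())          # [True, False, False, True, True]
print(indices.tolist())       # [0, 3, 4]

# Indexing an array with an integer index array selects those rows.
probs = np.array([[0.3, 0.5, 0.1],
                  [0.5, 0.3, 0.3],
                  [0.4, 0.4, 0.5],
                  [0.3, 0.7, 0.2],
                  [0.2, 0.6, 0.1]])
selected = probs[indices]
print(selected.tolist())
```

nonzero() returns a tuple with one index array per dimension, which is why the [0] is needed for a one-dimensional array.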
Regarding "python - How to cluster a list of lists of (tag, probability) tuples?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20990538/