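python - How to compare clusters?

Hopefully this can be done with Python! I ran two clustering programs on the same data, and each produced a cluster file. I reformatted the files so that they look like this: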

Cluster 0:
Brucellaceae(10)
    Brucella(10)
        abortus(1)
        canis(1)
        ceti(1)
        inopinata(1)
        melitensis(1)
        microti(1)
        neotomae(1)
        ovis(1)
        pinnipedialis(1)
        suis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
            albaduncus(1)
            anthocyanicus(1)

etc.

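These files contain bacterial species information. Each entry starts with the cluster number (Cluster 0), followed directly below by the family (Brucellaceae) and the number of bacteria in that family (10). Below that are the genera found in the family (name followed by a count, e.g. Brucella(10)), and finally the species in each genus (abortus(1), etc.).

My problem: I have two files formatted this way, and I want to write a program that finds the differences between them. The catch is that the two programs number their clusters differently, so two clusters may be identical even though their cluster numbers differ (the contents of Cluster 1 in one file might match Cluster 43 in the other, with the cluster number being the only difference). So I need something that ignores the cluster numbers and focuses on the cluster contents.

Is there a way to compare these two files and check for differences? Is it even possible? Any ideas would be greatly appreciated!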

Best answer

Given:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

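Is this what you need?

def parse_file(open_file):
    """Return one 'family.genus.species' string per species line."""
    result = []
    levels = ['', '', '']

    for line in open_file:
        # Leading spaces give the depth: 0 = cluster header, 1-3 = family/genus/species.
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            # New cluster (or blank line): reset the path; the cluster number is ignored.
            levels = ['', '', '']
            continue
        item = line.strip().split('(', 1)[0]  # drop the "(count)" suffix
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result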

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]
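
The parser above relies on exactly one space of indentation per level, as in the `file1`/`file2` samples. The listing in the question uses wider and not entirely consistent indentation, so a variant that infers depth from relative indentation may be more forgiving. The following is only a minimal sketch under that assumption: `parse_paths` and the sample data are hypothetical names introduced for illustration, and lines starting with `Cluster ` are assumed to mark cluster boundaries.

```python
def parse_paths(lines):
    """Return 'family.genus.species' strings, inferring depth from relative indentation."""
    paths = []
    stack = []  # (indent, name) pairs for the current family/genus/species chain
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.startswith('Cluster '):
            stack = []  # new cluster: its number is deliberately ignored
            continue
        indent = len(raw) - len(raw.lstrip())
        # Drop entries that sit at the same depth as this line, or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, text.split('(', 1)[0]))
        if len(stack) == 3:  # family, genus, species
            paths.append('.'.join(name for _, name in stack))
    return paths


# Hypothetical sample mimicking the question's layout (4-space steps,
# with the first family starting at column 0 and the second at column 4).
sample = '''Cluster 0:
Brucellaceae(2)
    Brucella(2)
        abortus(1)
        canis(1)
Cluster 1:
    Streptomycetaceae(1)
        Streptomyces(1)
            albaduncus(1)'''.splitlines()

paths = parse_paths(sample)
```

The resulting list can be wrapped in `set(...)` and compared with `&` and `-` in the same way as the output of `parse_file`.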

To view the differences:

for desc, items in differences:
    print(desc)
    print()
    for item in items:
        print('\t' + item)
    print()

This prints (sets are unordered, so the order within each group may vary):

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato
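
The comparison above works on individual species paths, so it does not show whether the two programs grouped those species into the same clusters. One possible way to compare cluster contents while ignoring cluster numbers is to store each cluster as a `frozenset` of its species paths and then compare the sets of frozensets. This is a sketch rather than part of the original answer: `parse_clusters`, `compare_clusterings`, `best_match`, and the sample data are hypothetical, and the one-space-per-level format of the answer's samples is assumed.

```python
def parse_clusters(lines):
    """Map each cluster label to a frozenset of 'family.genus.species' paths."""
    clusters = {}
    label, members, levels = None, set(), ['', '', '']
    for line in lines:
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip())  # assumes one space per level
        if depth == 0:  # cluster header: store the previous cluster, start a new one
            if label is not None:
                clusters[label] = frozenset(members)
            label, members, levels = line.strip().rstrip(':'), set(), ['', '', '']
            continue
        levels[depth - 1] = line.strip().split('(', 1)[0]
        if depth == 3:
            members.add('.'.join(levels))
    if label is not None:
        clusters[label] = frozenset(members)
    return clusters


def compare_clusterings(a, b):
    """Return (contents present in both, clusters only in a, clusters only in b)."""
    shared = set(a.values()) & set(b.values())
    only_a = {k: v for k, v in a.items() if v not in shared}
    only_b = {k: v for k, v in b.items() if v not in shared}
    return shared, only_a, only_b


def best_match(members, others):
    """Return (label, Jaccard similarity) of the most similar cluster in `others`."""
    if not others:
        return None
    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 1.0
    return max(((lab, jaccard(members, m)) for lab, m in others.items()),
               key=lambda pair: pair[1])


# Hypothetical inputs: "Cluster 1" and "Cluster 43" have identical contents.
first = '''Cluster 1:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 2:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.splitlines()
second = '''Cluster 43:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 44:
 tiny(1)
  blue(1)
   flower(1)'''.splitlines()

a, b = parse_clusters(first), parse_clusters(second)
shared, only_a, only_b = compare_clusterings(a, b)
for name, members in only_a.items():
    print(name, 'has no exact match; closest:', best_match(members, b))
```

Here `shared` holds the cluster contents that occur in both files regardless of numbering, while `only_a` and `only_b` keep the original labels of clusters without an exact counterpart. The Jaccard score is just one reasonable similarity measure for finding near matches.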

Regarding "python - How to compare clusters?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17866915/
