python - 查找 "best"个完整的子图

在优化我的应用程序的性能时，我在几行 (Python) 代码中遇到了巨大的性能瓶颈。

我有 N 个 token 。每个 token 都有一个分配给它的值。一些标记相互矛盾(例如标记 8 和 12 不能“共存”)。我的工作是找到 k-best token-groups。一组 token 的值只是其中 token 值的总和。

朴素算法(我已经实现...):

找到标记的所有 2^N 个标记组排列
消除其中有矛盾的token-groups
计算所有剩余 token 组的值
按值对 token 组排序
选择前 K 个 token 组

真实世界的数字——我需要从一组 20 个标记中选出前 10 个标记组(我为此计算了 1,000,000 个排列 (!))，缩小到 3500 个不矛盾的标记组。这在我的笔记本电脑上花了 5 秒钟......

我确信我可以通过仅生成不矛盾的标记组来以某种方式优化步骤 1+2。

我也很确定我能以某种方式神奇地在一次搜索中找到最好的标记组，并找到一种通过递减值来遍历标记组的方法，从而找到我正在寻找的 10 个最佳标记组。 ..

我的实际代码:

all_possibilities = sum((list(itertools.combinations(token_list, i)) for i in xrange(len(token_list)+1)), [])
all_possibilities = [list(option) for option in all_possibilities if self._no_contradiction(option)] 
all_possibilities = [(option, self._probability(option)) for option in all_possibilities]
all_possibilities.sort(key = lambda result: -result[1]) # sort by descending probability

请帮忙？

塔尔。

最佳答案

步骤 1+2 的简单方法可能如下所示:首先，定义一个标记列表和一个矛盾字典(每个键是一个标记，每个值是一组标记)。然后，对每个 token 执行两个操作:

将其添加到 result(如果它尚未矛盾)，并增加 conflicting 集，其中包含与当前添加的标记相矛盾的标记
不要将它添加到结果(选择忽略它)并移动到下一个标记。

所以这是一个示例代码:

token_list = ['a', 'b', 'c']

contradictions = {
    'a': set(['b']),
    'b': set(['a']),
    'c': set()
}

class Generator(object):
    def __init__(self, token_list, contradictions):
        self.list = token_list
        self.contradictions = contradictions
        self.max_start = len(self.list) - 1

    def add_no(self, start, result, conflicting):
        if start < self.max_start:
            for g in self.gen(start + 1, result, conflicting):
                yield g
        else:
            yield result[:]

    def add_yes(self, token, start, result, conflicting):
        result.append(token)
        new_conflicting = conflicting | self.contradictions[token]
        for g in self.add_no(start, result, new_conflicting):
            yield g
        result.pop()

    def gen(self, start, result, conflicting):
        token = self.list[start]
        if token not in conflicting:
            for g in self.add_yes(token, start, result, conflicting):
                yield g
        for g in self.add_no(start, result, conflicting):
            yield g

    def go(self):
        return self.gen(0, [], set())

示例用法:

g = Generator(token_list, contradictions)
for x in g.go():
    print x

这是一种递归算法，因此它不适用于超过几千个标记(由于 Python 的堆栈限制)，但您可以轻松地创建一个非递归算法。

关于python - 查找 "best"个完整的子图，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/3132802/

python - 查找 "best"个完整的子图

上一篇：Python 代码重构问题。简化

下一篇：python - 如果在 Python 中将相同的字符串添加到两个不同的列表或集合中，是否使用了两倍的内存？