python - 二叉解析树表示的字符串列表

假设我有一个用括号表示的二叉树，如下所示:

best_parse = "(ROOT (S (S (S (S (S Our) (S intent)) (S is)) (S (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))))) (S (S he) (S says))))"

我可以使用 tree_to_spans 方法从中获取跨度(成分)，该方法为我提供 [(0, 2), (0, 3), (5 , 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]

import nltk
import collections

def tree_to_spans(tree):
    if isinstance(tree, str):
        tree = nltk.Tree.fromstring(tree)
    length = len(tree.pos())
    queue = collections.deque(tree.treepositions())
    stack = [(queue.popleft(), 0)]
    j = 0
    spans = []
    while stack != []:
        (p, i) = stack[-1]
        if not queue or queue[0][:-1] != p:
            if isinstance(tree[p], nltk.tree.Tree):
                if j - i > 1:
                    spans.append((tree[p].label(), (i, j)))
            else:
                j = i + 1
            stack.pop()
        else:
            q = queue.popleft()
            stack.append((q, j))
    return spans

我还可以使用 get_constituents 方法获取与跨度相对应的字符串，该方法为我提供了树的所有成分。

['我们的意图', '我们的意图是', '最好的', '最佳替代方案', '促进最佳替代方案', '促进最佳替代方案', '我们的意图是推广最佳替代方案”，“他说”，“我们的目的是推广他所说的最佳替代方案”]

def get_constituents(sample_string):
    t = nltk.Tree.fromstring(sample_string)
    spans = evaluate.tree_to_spans(t)
    sentence = " ".join(item[0] for item in t.pos()).split()
    constituents = [" ".join(sentence[span[0]: span[1]])for span in spans]
    # Add original sentence
    constituents = constituents + [" ".join(sentence)]
    return constituents

问题:我坚持将其转换回给定所有成分列表的树(如开头所示的树)。基本上，逆向工程。假设给出句子:

s = "Our intent is to promote the best alternative he says"

编辑

假设可以从列表(部分)中删除某些元素。例如 -

parts = [
    'Our intent', 
    'the best', 
    'the best alternative', 
    'promote the best alternative', 
    'to promote the best alternative', 
    'Our intent is to promote the best alternative', 
    'Our intent is to promote the best alternative he says'
]

此处，“他说” 和 “我们的意图是” 已从列表中删除。我想要获得与上面相同格式的树结构，除了已删除成分的树结构。

另一种看待它的方式是:

考虑一下，我们有跨度 -

spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]

我删除了 (0, 3) 和 (8, 10)。

我想把括号放在上面，如下所示:

(((0  1  2)  (3  (4  ((5  6  7)  8))))  9  10)

然后，我们可以将索引与相应的字符串进行映射。

EDIT-2

例如，如果我们只从部件中删除“他说”和“我们的意图是”，那么我们最终的括号形式的树应该如下所示:

"(ROOT (S (S (S (S (S Our) (S intent)) (S is) (S (S to) (S (S promote) (S (S (S the) (S best)) (S alternative)))))(S he) (S says))))")

另一个例子，如果我们只从部件中删除“促进最佳替代方案”，和“我们的意图是促进最佳替代方案”，我们的括号中的最终树应如下所示:

"(ROOT (S (S (S Our) (S intent)) (S is)) (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))) (S (S he) (S says)))"

我们可以假设完整的句子“我们的意图是推广他所说的最佳替代方案”永远不会被删除。对于句子中的单个单词来说也是如此，只是为了给您提供背景信息。

最佳答案

我将以相反的顺序迭代成分，以便获得树的有序遍历(右侧先于左侧遍历)。因此，在此解决方案中，我假设成分的顺序与您从代码中获取成分的顺序相同。

通过递归，您可以递归地重建每个子树:

def to_tree(parts):
    i = len(parts) - 1

    def recur(part, expect=False):
        nonlocal i
        words = part.split()
        if len(words) == 1:  # leaf
            return "(S {})".format(words[0])
        if expect and i > 0 and parts[i-1] == part:
            i -= 1
        if len(words) == 2:  # 2 leaves
            return "(S (S {}) (S {}))".format(*words)
        i -= 1
        nextpart = parts[i]
        if part.endswith(" " + nextpart):
            right = recur(nextpart)
            left = recur(part[0:-len(nextpart)-1], True)
        elif part.startswith(nextpart + " "):
            right = recur(part[len(nextpart)+1:], True)
            left = recur(nextpart)
        else: 
            sides = part.split(" " + nextpart + " ")
            assert len(sides) == 2, "Could not parse input" 
            right = recur(sides[1], True)
            left = recur(sides[0], True)
        return "(S {} {})".format(left, right)
            
    return "(ROOT {})".format(recur(parts[i]))

该示例可以运行为:

parts = [
    'Our intent', 
    'Our intent is', 
    'the best', 
    'the best alternative', 
    'promote the best alternative', 
    'to promote the best alternative', 
    'Our intent is to promote the best alternative', 
    'he says', 
    'Our intent is to promote the best alternative he says'
]

print(to_tree(parts))

...这将输出您的原始字符串。

根据您的编辑，上述代码可以在输入中的某些删除中“幸存”。例如，可以从输入中删除“我们的意图是”和“他说”，并且输出仍然相同。但也有局限性。如果删除太多，树就无法再重建。

关于python - 二叉解析树表示的字符串列表，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/65469508/

python - 二叉解析树表示的字符串列表

上一篇：ethereum - ganache-cli 如何从帐户 json 文件中读取私钥

下一篇：maven - 集成测试为打包的 jar 创建文件