Suppose I have a tree string for a sentence:
s = "(TOP (S (NP-TMP (NP (DT This) (NN time)) (ADVP (RP around))) (NP-SBJ (PRP they)) (VP (VBP 're) (VP (VBG moving) (ADVP (RB even) (RBR faster))))))"
I want to convert it into a bracketed structure like this:
"(((This time)(around))(they)(('re)((moving)(even faster))))"
I tried the following:
import nltk

s = "(TOP (S (NP-TMP (NP (DT This) (NN time)) (ADVP (RP around))) (NP-SBJ (PRP they)) (VP (VBP 're) (VP (VBG moving) (ADVP (RB even) (RBR faster))))))"
tree = nltk.Tree.fromstring(s)

out = "("
for subtrees in tree:
    # there are three subtrees
    # print(len(subtrees))
    for i, subtree in enumerate(subtrees):
        if len(subtree) > 1:
            out += "("
        for bracketing in range(len(subtree)):
            # print(subtree[bracketing])
            flattened_tree = subtree[bracketing].flatten()
            flattened_string = str(flattened_tree)
            # drop the node label, e.g. "(NP This time)" -> "(This time)"
            flattened_string = flattened_string.replace(flattened_tree.label() + " ", "")
            print(flattened_string)
            out += flattened_string
        if len(subtree) > 1:
            out += ")"
    # break
out += ")"
print(out)
# Output: (((This time)(around))(they)(('re)(moving even faster)))
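Edit:
As you can see, "This" and "time" are part of the same parent, "NP", so they become one contiguous constituent, i.e. (This time).
However, "around" is a single-word constituent, although it belongs to the same left subtree, so that subtree becomes ((This time)(around)).
Similarly, in the right subtree, with "'re" and "moving even faster", we see that "moving" and "even faster" share the same parent, "VP", so it becomes (('re)((moving)(even faster))).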
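Best answer
If I understand correctly, you want to wrap an atomic string in parentheses whenever it has a sibling that is not an atomic string but an already parenthesised group, so that you never get a pattern like this: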
(moving (even faster))
In other words, a space may only appear as a separator between two atomic strings.
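The rule above can be checked mechanically. The following is only a rough sketch, not part of the original answer: `parse` and `follows_rule` are hypothetical helper names, the parser assumes balanced parentheses and space-separated words, and it does not use nltk at all.

```python
def parse(text):
    """Parse a bracket string into nested lists of word strings.

    Assumes the parentheses in `text` are balanced.
    """
    stack = [[]]
    token = ""
    for ch in text:
        if ch in "() ":
            if token:  # a word just ended
                stack[-1].append(token)
                token = ""
            if ch == "(":
                stack.append([])
            elif ch == ")":
                group = stack.pop()
                stack[-1].append(group)
        else:
            token += ch
    return stack[0][0]


def follows_rule(group):
    """True if no group mixes bare words with bracketed sub-groups."""
    words = sum(isinstance(item, str) for item in group)
    if 0 < words < len(group):
        return False
    return all(follows_rule(item) for item in group if isinstance(item, list))


print(follows_rule(parse("(('re)((moving)(even faster)))")))  # True
print(follows_rule(parse("(('re)(moving (even faster)))")))   # False
```

The second call fails the check because "moving" sits next to the group (even faster) inside the same parentheses, which is exactly the pattern to avoid.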
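I would do this with two recursive passes:
The first converts the tree to a nested list, which makes it easier to apply the rule above.
The second converts that nested list to the final bracketed string.
Code: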
import nltk

def tolist(tree):
    # a pre-terminal such as (DT This) collapses to its word
    if isinstance(tree[0], str):
        return tree[0]
    res = [tolist(subtree) for subtree in tree]
    leaves = sum(isinstance(child, str) for child in res)
    if 0 < leaves < len(res):  # if there is a mix of leaves and subtrees...
        # ...then wrap every leaf in a list
        return [[child] if isinstance(child, str) else child for child in res]
    return res

def tostr(lst):
    if isinstance(lst[0], str):  # assume all list members are strings
        return "(" + " ".join(lst) + ")"
    return "(" + "".join([tostr(sub) for sub in lst]) + ")"

# run on sample data
s = "(TOP (S (NP-TMP (NP (DT This) (NN time)) (ADVP (RP around))) (NP-SBJ (PRP they)) (VP (VBP 're) (VP (VBG moving) (ADVP (RB even) (RBR faster))))))"
tree = nltk.Tree.fromstring(s)
res = tostr(tolist(tree[0]))  # tree[0] skips the TOP node
print(res)
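To make the intermediate structure concrete, here is a small sketch of the second pass on its own. The nested list is written out by hand as what the first pass should yield for the sample sentence (it is not computed here), and `render` is a hypothetical stand-in that mirrors `tostr` without needing nltk.

```python
# Nested list expected from the first pass for the sample sentence.
# Written out by hand for illustration, not produced by running nltk.
nested = [
    [["This", "time"], ["around"]],
    ["they"],
    [["'re"], [["moving"], ["even", "faster"]]],
]


def render(group):
    """Join plain words with spaces; concatenate sub-groups with no separator."""
    if all(isinstance(item, str) for item in group):
        return "(" + " ".join(group) + ")"
    return "(" + "".join(render(item) for item in group) + ")"


print(render(nested))
# (((This time)(around))(they)(('re)((moving)(even faster))))
```

Note how "'re" and "moving" each appear as one-element lists: that wrapping is what the mixed-children check in the first pass is for, and it is why no group ends up with a bare word next to a bracketed sibling.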
Source: the Stack Overflow question "python - tree string to bracketed string": https://stackoverflow.com/questions/64756646/