Is there a way to get Universal Dependencies using Python or NLTK? I can only generate parse trees.
Example:
Input sentence:
My dog also likes eating sausage.
Output:
Universal dependencies
nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)
Best answer
Wordseer's stanford-corenlp-python fork is a good starting point, since it works with a recent CoreNLP release (3.5.2). However, it gives you the raw output, which you have to convert manually. For example, assuming the wrapper server is running:
>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.'))) # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
[u'nsubj',
u'loves',
u'2',
u'John',
u'1'],
[u'dobj', u'loves', u'2', u'Mary', u'3'],
[u'punct', u'loves', u'2', u'.', u'4']],
u'parsetree': [],
u'text': u'John loves Mary.',
u'words': [[u'John',
{u'CharacterOffsetBegin': u'0',
u'CharacterOffsetEnd': u'4',
u'Lemma': u'John',
u'PartOfSpeech': u'NNP'}],
[u'loves',
{u'CharacterOffsetBegin': u'5',
u'CharacterOffsetEnd': u'10',
u'Lemma': u'love',
u'PartOfSpeech': u'VBZ'}],
[u'Mary',
{u'CharacterOffsetBegin': u'11',
u'CharacterOffsetEnd': u'15',
u'Lemma': u'Mary',
u'PartOfSpeech': u'NNP'}],
[u'.',
{u'CharacterOffsetBegin': u'15',
u'CharacterOffsetEnd': u'16',
u'Lemma': u'.',
u'PartOfSpeech': u'.'}]]}]}
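Each entry in `dependencies` is a five-element list: `[relation, head word, head index, dependent word, dependent index]`. A minimal sketch (using the sample data from the output above, with a hypothetical `format_dep` helper) of converting these into the `rel(head-i, dep-j)` notation shown in the question:

```python
# Raw dependency entries, copied from the JSON output above.
# Format: [relation, head word, head index, dependent word, dependent index]
deps = [
    ['root', 'ROOT', '0', 'loves', '2'],
    ['nsubj', 'loves', '2', 'John', '1'],
    ['dobj', 'loves', '2', 'Mary', '3'],
    ['punct', 'loves', '2', '.', '4'],
]

def format_dep(rel, head, head_idx, dep, dep_idx):
    # Reproduce CoreNLP's plain-text "rel(head-i, dep-j)" notation.
    return '%s(%s-%s, %s-%s)' % (rel, head, head_idx, dep, dep_idx)

for entry in deps:
    print(format_dep(*entry))
# prints:
# root(ROOT-0, loves-2)
# nsubj(loves-2, John-1)
# dobj(loves-2, Mary-3)
# punct(loves-2, .-4)
```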
If you want to work with the dependency parses, with a bit of effort you can reuse NLTK's DependencyGraph:
>>> import json, jsonrpclib
>>> from pprint import pprint
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
... server.parse(
... 'John loves Mary. '
... 'I saw a man with a telescope. '
... 'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
... )
... )['sentences']
>>>
>>> def transform(sentence):
... for rel, _, head, word, n in sentence['dependencies']:
... n = int(n)
...
... word_info = sentence['words'][n - 1][1]
... tag = word_info['PartOfSpeech']
... lemma = word_info['Lemma']
... if rel == 'root':
... # NLTK expects that the root relation is labelled as ROOT!
... rel = 'ROOT'
...
... # Hack: Return values we don't know as '_'.
... # Also, consider tag and ctag to be equal.
... # n is used to sort words as they appear in the sentence.
... yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
... DependencyGraph(
... ' '.join(items) # NLTK expects an iterable of strings...
... for n, *items in sorted(transform(parse))
... )
... for parse in parses
... ]
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
(('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
(('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4)) # doctest: +NORMALIZE_WHITESPACE
Ballmer NNP 4 nsubj
has VBZ 4 aux
been VBN 4 cop
vocal JJ 0 ROOT
in IN 4 prep
the DT 8 det
past JJ 8 amod
warning NN 5 pobj
that WDT 13 dobj
Linux NNP 13 nsubj
is VBZ 13 cop
a DT 13 det
threat NN 8 rcmod
to TO 13 prep
Microsoft NNP 14 pobj
. . 4 punct
<BLANKLINE>
Setting up CoreNLP is not that hard; see http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for more details.
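For reference, a typical setup of the wrapper looks roughly like this. The repository URL, archive name, and start command are assumptions based on the fork's name and the CoreNLP 3.5.2 release date; check the fork's README for the exact steps:

```shell
# Clone the Wordseer fork of the Python wrapper (URL assumed from the fork name).
git clone https://github.com/Wordseer/stanford-corenlp-python.git
cd stanford-corenlp-python

# Download and unpack CoreNLP 3.5.2 next to the wrapper,
# so the wrapper can find the CoreNLP jars.
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-04-20.zip
unzip stanford-corenlp-full-2015-04-20.zip

# Start the JSON-RPC server on the default port 8080
# (the port the examples above connect to).
python corenlp.py
```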
This answer is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/32153627/