python - Stanford Universal Dependencies using Python NLTK

Tags: python nlp nltk stanford-nlp

Is there any way to get the Universal Dependencies using Python, or NLTK? I can only produce the parse tree.

Example:

Input sentence:

My dog also likes eating sausage.

Output:

Universal dependencies

nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)

Best answer

Wordseer's stanford-corenlp-python fork is a good starting point, because it works with a recent CoreNLP release (3.5.2). However, it gives you raw output that you need to convert manually. For example, assuming you have the wrapper running:

>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.')))  # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
                                   [u'nsubj',
                                    u'loves',
                                    u'2',
                                    u'John',
                                    u'1'],
                                   [u'dobj', u'loves', u'2', u'Mary', u'3'],
                                   [u'punct', u'loves', u'2', u'.', u'4']],
                 u'parsetree': [],
                 u'text': u'John loves Mary.',
                 u'words': [[u'John',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'John',
                              u'PartOfSpeech': u'NNP'}],
                            [u'loves',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'love',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'Mary',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'15',
                              u'Lemma': u'Mary',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'15',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'.',
                              u'PartOfSpeech': u'.'}]]}]}
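
If all you need is the bracketed rel(head-index, dep-index) lines from the question, you can format those raw five-element lists directly. A minimal sketch, assuming the server from the snippet above is still running and returns the structure shown:

>>> result = json.loads(server.parse('John loves Mary.'))
>>> for rel, head, head_i, dep, dep_i in result['sentences'][0]['dependencies']:
...     # Each entry is [relation, head word, head index, dependent word, dependent index].
...     print('%s(%s-%s, %s-%s)' % (rel, head, head_i, dep, dep_i))
...
root(ROOT-0, loves-2)
nsubj(loves-2, John-1)
dobj(loves-2, Mary-3)
punct(loves-2, .-4)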

If you want to work with the dependency parse, you can, with a little effort, reuse NLTK's DependencyGraph:

>>> import jsonrpclib, json
>>> from pprint import pprint
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
...    server.parse(
...       'John loves Mary. '
...       'I saw a man with a telescope. '
...       'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
...    )
... )['sentences']
>>>
>>> def transform(sentence):
...     for rel, _, head, word, n in sentence['dependencies']:
...         n = int(n)
...
...         word_info = sentence['words'][n - 1][1]
...         tag = word_info['PartOfSpeech']
...         lemma = word_info['Lemma']
...         if rel == 'root':
...             # NLTK expects that the root relation is labelled as ROOT!
...             rel = 'ROOT'
...
...         # Hack: Return values we don't know as '_'.
...         #       Also, consider tag and ctag to be equal.
...         # n is used to sort words as they appear in the sentence.
...         yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
...     DependencyGraph(
...         ' '.join(items)  # NLTK expects an iterable of strings...
...         for n, *items in sorted(transform(parse))
...     )
...     for parse in parses
... ]
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
 (('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
 (('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4))  # doctest: +NORMALIZE_WHITESPACE
Ballmer     NNP     4       nsubj
has         VBZ     4       aux
been        VBN     4       cop
vocal       JJ      0       ROOT
in          IN      4       prep
the         DT      8       det
past        JJ      8       amod
warning     NN      5       pobj
that        WDT     13      dobj
Linux       NNP     13      nsubj
is          VBZ     13      cop
a           DT      13      det
threat      NN      8       rcmod
to          TO      13      prep
Microsoft   NNP     14      pobj
.           .       4       punct
<BLANKLINE>
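
To reproduce the exact output layout from the question, you can also walk the graph's nodes dict. A sketch, assuming NLTK 3.x, where each DependencyGraph node carries 'word', 'head', 'rel' and 'address' entries and node 0 is the artificial root:

>>> def question_style(dg):
...     for address, node in sorted(dg.nodes.items()):
...         if address == 0:
...             continue  # skip the artificial root node itself
...         head = dg.nodes[node['head']]
...         # Note: the root relation was relabelled 'ROOT' in transform() above.
...         yield '%s(%s-%d, %s-%d)' % (node['rel'],
...                                     head['word'] or 'ROOT', node['head'],
...                                     node['word'], address)
...
>>> for line in question_style(dgs[0]):
...     print(line)
...
nsubj(loves-2, John-1)
ROOT(ROOT-0, loves-2)
dobj(loves-2, Mary-3)
punct(loves-2, .-4)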

Setting up CoreNLP is not that hard; see http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for more details.

Regarding python - Stanford Universal Dependencies using Python NLTK, this is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/32153627/
