python - Speeding up Stanford dependency parsing in Python

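Is there a faster way to use CoreNLPParser, or should I interact with the API through another library? Or should I dust off my Java books?

I have a corpus of 6,500 sentences that I am running through the CoreNLPParser class in nltk.parse.corenlp. I have isolated everything else I am doing from my main project in order to test the tree_height function I wrote earlier. However, the speed is the same: the process takes more than 15 minutes to complete.

Here is my tree_height function: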

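from nltk.parse.corenlp import CoreNLPParser

# Client for the CoreNLP server that was started separately on port 9000
parser = CoreNLPParser(url='http://localhost:9000')

def tree_height(tokenized_sent):
    # raw_parse takes one sentence as a plain string and returns an
    # iterator over its parse trees; a single sentence yields one tree
    tree = next(parser.raw_parse(tokenized_sent))
    return tree.height()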

I am parsing Spanish sentences and previously started the CoreNLP server with the following command:

java -mx10g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-spanish.properties -port 9000 -timeout 15000

I have also tried changing the mx3g part to mx5g, which does not seem to make much difference.

I have seen this discussion on GitHub and am running the latest version of StanfordCoreNLP.

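---UPDATE---

I was worried that my script was slow because of inefficient or poorly written code, so I tried to locate the inefficiency as follows:

Iterating over all the data (from a Pandas DataFrame) without calling any NLP functions takes about 20 seconds.
Iterating over all the data and only sentence-tokenizing it takes about 30 seconds.
In my most recent attempt, I collected all the tokenized sentences in a variable and called tree_height on each sentence in a loop, and found no difference in speed (it took as long as it did before I started isolating the code).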
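
The isolation steps above can be made repeatable with a small timing helper. This is only a sketch: time_stage is a hypothetical name that does not come from NLTK or CoreNLP, and it assumes each stage can be written as a function applied to one item at a time.

```python
import time


def time_stage(label, func, items):
    """Apply func to every item and report the wall-clock time taken.

    Hypothetical helper for illustration; `items` must be a list.
    """
    start = time.perf_counter()
    results = [func(item) for item in items]
    elapsed = time.perf_counter() - start
    # Guard against division by zero when the list is empty
    per_item_ms = elapsed / max(len(items), 1) * 1000
    print(f"{label}: {elapsed:.2f}s total, {per_item_ms:.1f} ms per item")
    return results, elapsed
```

Running something like time_stage("parse", tree_height, sentences[:100]) on a small slice gives a per-sentence figure, which shows whether the time is spent in the parser round trip rather than in the surrounding loop.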

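Best answer

OK, below is a description of the Python interface we are developing. To get the latest version you have to download it from GitHub and follow the installation instructions (which are easy to follow!).

Go to GitHub and clone the Python interface repository: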

https://github.com/stanfordnlp/python-stanford-corenlp

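cd into the directory and run python setup.py install

(Soon we will set this up with conda, pip, and so on, but for now it is still under development. An older version is currently available on pip.)

In a separate terminal window, start the Java server: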

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-spanish.properties -port 9000 -timeout 15000

Note: make sure all the required jars are on your CLASSPATH, or run with the -cp "*" option from a directory that contains all the appropriate jars.

Run this Python code:

import corenlp
client = corenlp.CoreNLPClient(start_server=False, annotators=["tokenize", "ssplit", "pos", "depparse"])
# there are other options for "output_format" such as "json"
# "conllu", "xml" and "serialized"
ann = client.annotate(u"...", output_format="text")

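ann will contain the final annotation information (including the dependency parse). This should be considerably faster than what you are reporting. Please try it out and let me know.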
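
If the goal is a per-sentence height like the one in the question, one possible route is to request JSON output from the server and walk the dependency edges. The sketch below is not part of the interface described above: fetch_sentences and dependency_height are hypothetical names, the request goes straight to the server's HTTP endpoint using only the standard library, and the code assumes the JSON response lists each sentence's edges under "basicDependencies" with integer "governor" and "dependent" fields, where governor 0 is the artificial root. It also measures the dependency tree, which is a different quantity from the height() of the tree returned by CoreNLPParser.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict


def fetch_sentences(text, url="http://localhost:9000"):
    """POST text to a running CoreNLP server and return its sentence list.

    Assumes the server accepts a `properties` query parameter holding JSON.
    """
    props = {"annotators": "tokenize,ssplit,pos,depparse", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    request = urllib.request.Request(f"{url}/?{query}", data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.load(response)["sentences"]


def dependency_height(sentence):
    """Count the edges on the longest path from the artificial root (0) to a leaf."""
    children = defaultdict(list)
    for edge in sentence["basicDependencies"]:
        children[edge["governor"]].append(edge["dependent"])

    height = 0
    stack = [(0, 0)]  # (token index, depth measured in edges from the root)
    while stack:
        node, depth = stack.pop()
        height = max(height, depth)
        for child in children.get(node, []):
            stack.append((child, depth + 1))
    return height
```

With the server running, [dependency_height(s) for s in fetch_sentences(text)] would give one height per sentence. Sending several sentences in one request avoids paying for an HTTP round trip per sentence.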

Regarding python - Speeding up Stanford dependency parsing in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51366811/
