nlp - Splitting a sentence into clauses with CoreNLP

Tags: nlp stanford-nlp dependency-parsing natural-language-processing pycorenlp

I am working on the following problem: I would like to split sentences into clauses using Stanford CoreNLP. An example sentence could be:

"Richard is working with CoreNLP, but does not really understand what he is doing"

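I now want to split my sentence at the individual "S" nodes, as shown in the tree diagram below:

I would like the output to be a list with one entry per "S", like this: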

['Richard is working with CoreNLP', ', but', 'does not really understand what', 'he is doing']

Any help would be greatly appreciated :)

Best answer

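I suspect the tool you are looking for is Tregex, which is described in more detail in the Tregex PowerPoint slides or in the Javadoc of the class itself.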

In your case, I believe the pattern you are looking for is simply S. So, something like:

tregex.sh "S" <path_to_file>

where the file contains trees in Penn Treebank format, that is, something like (ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats))))).
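To make concrete what the pattern S selects, here is a minimal, dependency-free Python sketch. It is not Tregex and does not call CoreNLP: `parse_tree`, `leaves`, and `subtrees` are hypothetical helper names introduced only for this illustration. It reads the bracketed tree from above and prints the words under every node labelled S.

```python
import re

def parse_tree(text):
    """Parse one Penn Treebank bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]          # the token right after "(" is the node label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    return read_node()

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words

def subtrees(node):
    """Yield every labelled node of the tree, top-down."""
    if isinstance(node, str):
        return
    yield node
    for child in node[1]:
        yield from subtrees(child)

tree = parse_tree("(ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats)))))")
s_matches = [" ".join(leaves(n)) for n in subtrees(tree) if n[0] == "S"]
print(s_matches)  # ['dogs chase cats']
```

This tree has a single S node, so there is exactly one match. With nested clauses, every S node is reported, including the one that spans the whole sentence.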

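As an aside: I believe the fragment ", but" is not actually a sentence, the way you have highlighted it in the figure. Rather, the node you have highlighted subsumes the entire sentence "Richard is working with CoreNLP, but does not really understand what he is doing". Tregex would then print this whole sentence as one of the matches. Similarly, "does not really understand what" is not a sentence unless it spans the entire SBAR: "does not really understand what he is doing".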

If you want only the "leaf" sentences (i.e., a sentence that is not contained in another sentence), you can try a pattern more like this:

S !>> S

Note: I have not tested these patterns, so use them at your own risk!
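In Tregex, A >> B reads "A is dominated by B" and A << B reads "A dominates B". So S !>> S selects S nodes with no S ancestor (the outermost clauses), while S nodes that contain no further S would correspond to S !<< S. The sketch below imitates both filters in plain Python. It is not Tregex itself, the helper names are hypothetical, and the tree is a hand-written, assumed parse of the example sentence rather than real CoreNLP output.

```python
import re

# Hand-written parse of the example sentence (an assumption, not CoreNLP output).
TREE = """
(ROOT
  (S
    (NP (NNP Richard))
    (VP
      (VP (VBZ is) (VP (VBG working) (PP (IN with) (NP (NNP CoreNLP)))))
      (, ,)
      (CC but)
      (VP (VBZ does) (RB not) (ADVP (RB really))
        (VP (VB understand)
          (SBAR (WHNP (WP what))
            (S (NP (PRP he)) (VP (VBZ is) (VP (VBG doing)))))
          )
        )
      )
    )
  )
"""

def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples using a stack."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [("TOP", [])]            # sentinel that collects the real root
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            node = (tokens[i + 1], [])
            stack[-1][1].append(node)
            stack.append(node)
            i += 2
        elif tokens[i] == ")":
            stack.pop()
            i += 1
        else:
            stack[-1][1].append(tokens[i])   # a leaf word
            i += 1
    return stack[0][1][0]

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def walk(node, has_s_ancestor=False):
    """Yield (node, has_s_ancestor) for every labelled node, top-down."""
    if isinstance(node, str):
        return
    yield node, has_s_ancestor
    for child in node[1]:
        yield from walk(child, has_s_ancestor or node[0] == "S")

def contains_s(node):
    """True if some proper descendant of node is labelled S."""
    return any(n[0] == "S" for child in node[1] for n, _ in walk(child))

tree = parse_tree(TREE)

# Analogue of "S !>> S": S nodes that have no S ancestor.
outermost = [" ".join(leaves(n)) for n, inside in walk(tree)
             if n[0] == "S" and not inside]

# Analogue of "S !<< S": S nodes that contain no other S.
innermost = [" ".join(leaves(n)) for n, _ in walk(tree)
             if n[0] == "S" and not contains_s(n)]

print(outermost)
print(innermost)
```

On this assumed tree the first filter returns the whole sentence and the second returns only "he is doing", which is worth keeping in mind when deciding which of the two patterns matches the intended "leaf" clauses.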

Regarding "nlp - Splitting a sentence into clauses with CoreNLP", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53155057/
