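java - Lazy parsing with Stanford CoreNLP to get the sentiment of specific sentences only

I am looking for ways to optimize the performance of my Stanford CoreNLP sentiment pipeline. I want the sentiment of sentences, but only of those that contain specific keywords given as input.

I have tried two approaches:

Approach 1: the StanfordCoreNLP pipeline annotates the entire text with sentiment

I defined a pipeline with the annotators tokenize, ssplit, parse, sentiment. I run it on the whole article, then look for the keywords in each sentence and, if they are present, run a method that returns the sentiment value for that sentence. The processing takes a couple of seconds, though, which I am not satisfied with.

This is the code: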

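import java.util.*;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

List<String> keywords = ...;
String text = ...;
// maps sentence index -> predicted sentiment class
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = pipeline.process(text); // takes 2 seconds!!!!
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i = 0; i < sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key by sentence index to match Map<Integer,Integer>
    }
}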
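
The snippets in this question call sentenceContainsKeywords without showing it. As a minimal sketch only, and assuming a plain case-insensitive substring match is what is wanted, such a helper could look like the following. The class name KeywordFilter and the String-based signature are hypothetical; in the question's code the helper receives a CoreMap, whose text can be read with sentence.get(CoreAnnotations.TextAnnotation.class).

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the original post does not show its implementation.
final class KeywordFilter {
    private KeywordFilter() {}

    /** Returns true if the sentence text contains at least one non-empty keyword, ignoring case. */
    static boolean sentenceContainsKeywords(String sentenceText, List<String> keywords) {
        String haystack = sentenceText.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            // Skip empty keywords, which would otherwise match every sentence.
            if (!keyword.isEmpty() && haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Because this is a substring match, a keyword such as "art" would also match "party"; matching on tokens instead would avoid that, at the cost of reading the token annotations.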

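Approach 2: the StanfordCoreNLP pipeline only splits the entire text into sentences; separate annotators run on the sentences of interest

Because of the poor performance of the first approach, I defined a second one. I defined a pipeline with the annotators tokenize, ssplit. I look for the keywords in each sentence and, if they are present, create an annotation for that sentence alone and run the remaining annotators on it: ParserAnnotator, BinarizerAnnotator and SentimentAnnotator.

The results were really unsatisfactory because of ParserAnnotator, even though I initialized it with the same properties. Sometimes it took even longer than running the entire pipeline on the document in approach 1.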

List<String> keywords = ...;
String text = ...;
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit"); // parsing, sentiment removed
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// initiation of annotators to be run on sentences
ParserAnnotator parserAnnotator = new ParserAnnotator("pa", props);
BinarizerAnnotator  binarizerAnnotator = new BinarizerAnnotator("ba", props);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", props);

Annotation annotation = pipeline.process(text); // takes <100 ms
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i=0; i<sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        // code required to perform annotation on one sentence
        List<CoreMap> listWithSentence = new ArrayList<CoreMap>();
        listWithSentence.add(sentence);
        Annotation sentenceAnnotation  = new Annotation(listWithSentence);

        parserAnnotator.annotate(sentenceAnnotation); // takes 50 ms up to 2 seconds!!!!
        binarizerAnnotator.annotate(sentenceAnnotation);
        sentimentAnnotator.annotate(sentenceAnnotation);
        sentence = sentenceAnnotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);

        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment); // key by sentence index to match Map<Integer,Integer>
    }
}

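Questions

I wonder why parsing in CoreNLP is not "lazy"? (In my example that would mean: performed only when the sentiment of a sentence is requested.) Is it for performance reasons?
How come the parser for a single sentence can take almost as long as the parser for the entire article (my article had 7 sentences)? Is it possible to configure it so that it runs faster?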

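Best answer

If you want to speed up constituency parsing, the best improvement is to use the new shift-reduce constituency parser. It is orders of magnitude faster than the default PCFG parser.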
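
As a rough sketch of what switching parsers could look like in the question's setup, the properties below point the parse annotator at a shift-reduce model. The class name ShiftReduceConfig is hypothetical. The model path is an assumption based on the English model distributed in the separate shift-reduce parser models jar, so check it against the CoreNLP release you use. The pos annotator is added because the shift-reduce parser relies on part-of-speech tags.

```java
import java.util.Properties;

// Hypothetical helper class; only java.util.Properties is used here.
final class ShiftReduceConfig {
    // Assumption: location of the English shift-reduce model inside the
    // separate srparser models jar; verify against your CoreNLP release.
    static final String SR_MODEL = "edu/stanford/nlp/models/srparser/englishSR.ser.gz";

    private ShiftReduceConfig() {}

    /** Builds pipeline properties that select the shift-reduce model for parsing. */
    static Properties sentimentProps() {
        Properties props = new Properties();
        // "pos" is included because the shift-reduce parser needs POS tags as input.
        props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
        props.setProperty("parse.model", SR_MODEL);
        props.setProperty("tokenize.options", "untokenizable=noneDelete");
        return props;
    }
}
```

The result would be passed to the pipeline exactly as in the question, for example new StanfordCoreNLP(ShiftReduceConfig.sentimentProps()), with the shift-reduce models jar on the classpath.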

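Answers to your later questions:

Why isn't CoreNLP parsing lazy? This is certainly possible, but it is not something we have implemented in the pipeline yet. We probably haven't seen many use cases in-house that call for it. If you are interested in building a "lazy annotator wrapper", we would happily accept a contribution!
How come the parser for a single sentence can take almost as long as the parser for the entire article? The default Stanford PCFG parser has cubic time complexity with respect to sentence length. This is why we usually recommend restricting the maximum sentence length for performance reasons. The shift-reduce parser, on the other hand, runs in time linear in the sentence length.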
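
The "lazy annotator wrapper" idea from the first answer can be illustrated without CoreNLP at all. The sketch below is not part of CoreNLP: Lazy, LazySentimentDemo and expensiveSentiment are hypothetical names, and expensiveSentiment merely stands in for the parse, binarize and sentiment steps. It shows the memoization pattern such a wrapper would probably rely on, where the expensive step runs only for sentences whose result is actually requested.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Computes a value on first access and caches it. Not thread-safe. */
final class Lazy<T> implements Supplier<T> {
    private Supplier<T> compute;
    private T value;
    private boolean done;

    Lazy(Supplier<T> compute) { this.compute = compute; }

    @Override
    public T get() {
        if (!done) {
            value = compute.get();
            done = true;
            compute = null; // release the closure once the value is cached
        }
        return value;
    }
}

final class LazySentimentDemo {
    static int parseCalls = 0; // counts how often the expensive step actually ran

    // Stand-in for parser + binarizer + sentiment; returns a dummy class in 0..4.
    static int expensiveSentiment(String sentence) {
        parseCalls++;
        return sentence.length() % 5;
    }

    /** Returns sentence index -> sentiment, computed only for sentences containing the keyword. */
    static Map<Integer, Integer> sentimentForKeyword(List<String> sentences, String keyword) {
        List<Lazy<Integer>> lazySentiments = new ArrayList<>();
        for (String s : sentences) {
            lazySentiments.add(new Lazy<>(() -> expensiveSentiment(s)));
        }
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < sentences.size(); i++) {
            if (sentences.get(i).contains(keyword)) {
                result.put(i, lazySentiments.get(i).get()); // first access triggers the work
            }
        }
        return result;
    }
}
```

In a real wrapper the supplier would call the parser and sentiment annotators on a single sentence; the caching part would stay the same.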

Regarding java - Lazy parsing with Stanford CoreNLP to get the sentiment of specific sentences only, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30714693/
