java - Finding features in a large dataset with Stanford CoreNLP

Tags: java nlp stanford-nlp

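I am new to Stanford NLP and cannot find any good, complete documentation or tutorial. My task is sentiment analysis. I have a very large dataset of product reviews, which I have already split into positive and negative according to the star ratings given by users. Now I need to find the most frequent positive and negative adjectives to use as features for my algorithm. I learned how to do tokenization, lemmatization, and POS tagging from here. I got files like this.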

The review is:

Don't waste your money. This is a short DVD and the host is boring and offers information that is common sense to any idiot. Pass on this and buy something else. Very generic

The output is:

Sentence #1 (6 tokens):
Don't waste your money.
[Text=Do CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=VBP Lemma=do]
[Text=n't CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=RB Lemma=not]
[Text=waste CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VB Lemma=waste]
[Text=your CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=PRP$ Lemma=you]
[Text=money CharacterOffsetBegin=17 CharacterOffsetEnd=22 PartOfSpeech=NN Lemma=money]
[Text=. CharacterOffsetBegin=22 CharacterOffsetEnd=23 PartOfSpeech=. Lemma=.]
Sentence #2 (21 tokens):
This is a short DVD and the host is boring and offers information that is common sense to any idiot.
[Text=This CharacterOffsetBegin=24 CharacterOffsetEnd=28 PartOfSpeech=DT Lemma=this]
[Text=is CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=VBZ Lemma=be]
[Text=a CharacterOffsetBegin=32 CharacterOffsetEnd=33 PartOfSpeech=DT Lemma=a]
[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]
[Text=and CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=CC Lemma=and]
[Text=the CharacterOffsetBegin=48 CharacterOffsetEnd=51 PartOfSpeech=DT Lemma=the]
[Text=host CharacterOffsetBegin=52 CharacterOffsetEnd=56 PartOfSpeech=NN Lemma=host]
[Text=is CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=VBZ Lemma=be]
[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]
[Text=and CharacterOffsetBegin=67 CharacterOffsetEnd=70 PartOfSpeech=CC Lemma=and]
[Text=offers CharacterOffsetBegin=71 CharacterOffsetEnd=77 PartOfSpeech=VBZ Lemma=offer]
[Text=information CharacterOffsetBegin=78 CharacterOffsetEnd=89 PartOfSpeech=NN Lemma=information]
[Text=that CharacterOffsetBegin=90 CharacterOffsetEnd=94 PartOfSpeech=WDT Lemma=that]
[Text=is CharacterOffsetBegin=95 CharacterOffsetEnd=97 PartOfSpeech=VBZ Lemma=be]
[Text=common CharacterOffsetBegin=98 CharacterOffsetEnd=104 PartOfSpeech=JJ Lemma=common]
[Text=sense CharacterOffsetBegin=105 CharacterOffsetEnd=110 PartOfSpeech=NN Lemma=sense]
[Text=to CharacterOffsetBegin=111 CharacterOffsetEnd=113 PartOfSpeech=TO Lemma=to]
[Text=any CharacterOffsetBegin=114 CharacterOffsetEnd=117 PartOfSpeech=DT Lemma=any]
[Text=idiot CharacterOffsetBegin=118 CharacterOffsetEnd=123 PartOfSpeech=NN Lemma=idiot]
[Text=. CharacterOffsetBegin=123 CharacterOffsetEnd=124 PartOfSpeech=. Lemma=.]
Sentence #3 (8 tokens):
Pass on this and buy something else.
[Text=Pass CharacterOffsetBegin=125 CharacterOffsetEnd=129 PartOfSpeech=VB Lemma=pass]
[Text=on CharacterOffsetBegin=130 CharacterOffsetEnd=132 PartOfSpeech=IN Lemma=on]
[Text=this CharacterOffsetBegin=133 CharacterOffsetEnd=137 PartOfSpeech=DT Lemma=this]
[Text=and CharacterOffsetBegin=138 CharacterOffsetEnd=141 PartOfSpeech=CC Lemma=and]
[Text=buy CharacterOffsetBegin=142 CharacterOffsetEnd=145 PartOfSpeech=VB Lemma=buy]
[Text=something CharacterOffsetBegin=146 CharacterOffsetEnd=155 PartOfSpeech=NN Lemma=something]
[Text=else CharacterOffsetBegin=156 CharacterOffsetEnd=160 PartOfSpeech=RB Lemma=else]
[Text=. CharacterOffsetBegin=160 CharacterOffsetEnd=161 PartOfSpeech=. Lemma=.]
Sentence #4 (2 tokens):
Very generic
[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]
[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]

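I have processed 10,000 positive and 10,000 negative files like this. How can I now easily find the most frequent positive and negative features (adjectives)? Do I need to read all the processed output files and build a frequency count of the adjectives myself, or does Stanford CoreNLP offer an easier way?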

Best answer

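Here is an example that processes an annotated review and stores its adjectives in a Counter.

In the example, the movie review "The movie was great!  It was a great film." has the sentiment "positive".

I suggest changing my code so that it loads each file, builds an Annotation from the file's text, and records that file's sentiment.

You can then go through every file and build up a Counter of positive and negative counts for each adjective.

In this example, the final Counter contains the adjective "great" with a count of 2.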

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class AdjectiveSentimentExample {

    public static void main(String[] args) throws Exception {

        // One counter per sentiment class, keyed by the adjective's surface form.
        Counter<String> adjectivePositiveCounts = new ClassicCounter<String>();
        Counter<String> adjectiveNegativeCounts = new ClassicCounter<String>();

        // A single hard-coded review; for real data, build the Annotation from
        // each file's text and set the sentiment from that file's label.
        Annotation review = new Annotation("The movie was great!  It was a great film.");
        String sentiment = "positive";

        // Build the pipeline once and reuse it for every review.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(review);
        for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Exact match on "JJ": comparative (JJR) and superlative (JJS)
                // adjectives are not counted.
                if (cl.get(CoreAnnotations.PartOfSpeechAnnotation.class).equals("JJ")) {
                    // cl.word() is the surface form; cl.lemma() could be used
                    // instead to count by lemma.
                    if (sentiment.equals("positive")) {
                        adjectivePositiveCounts.incrementCount(cl.word());
                    } else if (sentiment.equals("negative")) {
                        adjectiveNegativeCounts.incrementCount(cl.word());
                    }
                }

            }
        }

        System.out.println("---");
        System.out.println("positive adjective counts");
        System.out.println(adjectivePositiveCounts);
    }
}
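
If the processed output files from the question already exist, another option might be to count adjectives directly from those files instead of re-running the pipeline. The following is a minimal sketch that uses only the Java standard library; it is not a CoreNLP feature. The class name `AdjectiveCountSketch` and the helpers `countAdjectives` and `topN` are hypothetical, and the regular expression assumes the `[Text=... PartOfSpeech=... Lemma=...]` line layout shown in the question, which other CoreNLP versions or output settings may not produce. Counting lemmas for every tag that starts with `JJ` (so JJR and JJS as well) is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: counts adjectives in CoreNLP text output lines.
class AdjectiveCountSketch {

    // Assumes token lines shaped like the ones in the question, e.g.
    // [Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
    private static final Pattern POS_AND_LEMMA =
            Pattern.compile("PartOfSpeech=(\\S+)\\s+Lemma=([^\\s\\]]+)");

    /** Counts the lemma of every token whose POS tag starts with "JJ". */
    static Map<String, Integer> countAdjectives(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("[Text=")) {
                continue; // skip "Sentence #n" headers and raw sentence text
            }
            Matcher m = POS_AND_LEMMA.matcher(trimmed);
            if (m.find() && m.group(1).startsWith("JJ")) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to n entries, highest count first; ties are ordered alphabetically. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // A few lines copied from the processed negative review in the question.
        List<String> negativeLines = Arrays.asList(
                "Sentence #2 (21 tokens):",
                "[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]",
                "[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]",
                "[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]",
                "[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]");

        Map<String, Integer> negativeCounts = countAdjectives(negativeLines);
        for (Map.Entry<String, Integer> e : topN(negativeCounts, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the real data, the lines of each processed file could be read with `java.nio.file.Files.readAllLines` and the resulting per-file maps merged into one map per sentiment class before calling `topN`.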

This question (java - Finding features in a large dataset with Stanford CoreNLP) is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/34252507/
