stanford-nlp - TokensRegex rules to get correct output for named entities

Tags: stanford-nlp

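I finally got my TokensRegex code to produce some kind of output for named entities, but the output is not exactly what I want. I think the rules need some tweaking.

The code is as follows: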

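    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;

    import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
    import edu.stanford.nlp.ling.tokensregex.Env;
    import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
    import edu.stanford.nlp.ling.tokensregex.NodePattern;
    import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
    import edu.stanford.nlp.util.CoreMap;

    // The imports and class declaration were not in the original post; the class name is a placeholder.
    public class TokensRegexNerExample {

    public static void main(String[] args)
    {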
        String  rulesFile = "D:\\Workspace\\resource\\NERRulesFile.rules.txt";
        String dataFile = "D:\\Workspace\\data\\GoldSetSentences.txt";

        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("ner.useSUTime", "0");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
        String inputText = "Bill Edelman, CEO and chairman of Paragonix Inc. announced that the company is expanding it's operations in China.";

        Annotation document = new Annotation(inputText);
        pipeline.annotate(document);
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE); 
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);

        /* Go over the annotated sentences and extract the matched tokens
         as CoreLabel objects. */
        for (CoreMap sentence : sentences)
        {

            List<MatchedExpression> matched = extractor.extractExpressions(sentence);

            for(MatchedExpression phrase : matched){

                // Print out matched text and value
                System.out.println("matched: " + phrase.getText() + " with value: " + phrase.getValue());
                // Print out token information
                CoreMap cm = phrase.getAnnotation();
                for (CoreLabel token : cm.get(TokensAnnotation.class))
                {
                    if (token.tag().equals("NNP")){
                        String leftContext = token.before();
                        String rightContext = token.after();
                        System.out.println(leftContext);
                        System.out.println(rightContext);


                        String word = token.get(TextAnnotation.class);
                        String lemma = token.get(LemmaAnnotation.class);
                        String pos = token.get(PartOfSpeechAnnotation.class);
                        String ne = token.get(NamedEntityTagAnnotation.class);
                        System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", ne=" + ne);
                    }

                }
            }
        }
    }
    }

Here is the rules file:

    $TITLES_CORPORATE  = (/chief/ /administrative/ /officer/|cao|ceo|/chief/ /executive/ /officer/|/chairman/|/vice/ /president/)
    $ORGANIZATION_TITLES = (/International/|/inc\./|/corp/|/llc/)

    # For detecting organization names like 'Paragonix Inc.'

    {    ruleType: "tokens",
         pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES),
         action: ( Annotate($0, ner, "ORGANIZATION"),Annotate($1, ner, "ORGANIZATION") )
    }

    # For extracting organization names from a pattern - 'Genome International is planning to expand its operations in China.'
    #(in the sentence given above the words planning and expand are part of the $OrgContextWords macros )
    {
      ruleType: "tokens",
      pattern: (([{tag:/NNP.*/}]+) /,/*? /is|had|has|will|would/*? /has|had|have|will/*? /be|been|being/*? (?:[]{0,5}[{lemma:$OrgContextWords}]) /of|in|with|for|to|at|like|on/*?),
      result:  ( Annotate($1, ner, "ORGANIZATION") )
    }

    # For sentence like - Bill Edelman, Chairman and CEO of Paragonix Inc./ Zuckerberg CEO Facebook said today....

    ENV.defaults["stage"] = 1
    {
      pattern: ( $TITLES_CORPORATE ),
      action: ( Annotate($1, ner, "PERSON_TITLE"))
    }

    ENV.defaults["stage"] = 2
    {
      ruleType: "tokens",
      pattern:  ( ([ { pos:NNP} ]+) /,/*? (?:TITLES_CORPORATE)? /and|&/*? (?:TITLES_CORPORATE)? /,/*? /of|for/? /,/*? [ { pos:NNP } ]+ ),
      result: (Annotate($1, ner, "PERSON"),Annotate($2, ner, "ORGANIZATION"))
    }

The output I get is:

    matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
    matched token: word=Paragonix, lemma=Paragonix, pos=NNP, ne=PERSON
    matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

The output I expect is:

    matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
    matched token: word=Paragonix, lemma=Paragonix, pos=NNP, ne=ORGANIZATION
    matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

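Also, Bill Edelman is not identified as a person here. The phrase containing Bill Edelman is not matched even though I have a rule in place for it. Do I need to stage my rules so that the entire phrase is matched against each rule and no entity is missed?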

Best answer

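  1. I built a jar from the latest Stanford CoreNLP code on the main GitHub page (as of April 14).

  2. This command (with the latest code) should work for running the TokensRegexAnnotator. Alternatively, if you use the Java API, the tokensregex settings can be passed in through the Properties object: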

    java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
    
  3. Here is a rules file I wrote that shows matching based on a sentence pattern:

    ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
    
    $ORGANIZATION_TITLES = "/inc\.|corp\./"
    
    $COMPANY_INDICATOR_WORDS = "/company|corporation/"
    
    { pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
    
    { pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
    

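Note that $0 refers to the entire pattern and $1 to the first capture group. That is why, in this example, I put an extra pair of parentheses around the part of the pattern that represents what I want to match.

I ran this on the example: Paragonix Inc. is a company that Joe Smith works for.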
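
The `$0` / `$1` numbering works the same way as capture groups in ordinary Java regular expressions. As an analogy only, the sketch below uses `java.util.regex` on raw characters rather than TokensRegex on tokens. The class name `CaptureGroupAnalogy` and the character-level pattern are hypothetical stand-ins for the token rule, not part of the original answer. It shows group 0 covering the whole match and group 1 covering only the part inside the extra parentheses:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: a character-level analogy, not TokensRegex itself.
final class CaptureGroupAnalogy {

    // Rough stand-in for:
    //   ([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS
    // Capitalized words approximate NNP tokens here (an assumption for illustration).
    static final Pattern ORG = Pattern.compile(
        "([A-Z][a-z]+(?: [A-Z][a-z]+)* (?:Inc\\.|Corp\\.)) is a (?:company|corporation)");

    // Returns the requested group of the first match, or null when nothing matches.
    static String group(String text, int index) {
        Matcher m = ORG.matcher(text);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String s = "Paragonix Inc. is a company that Joe Smith works for.";
        System.out.println("$0 analogue: " + group(s, 0)); // whole match
        System.out.println("$1 analogue: " + group(s, 1)); // first capture group only
    }
}
```

Annotating `$0` in the real rule would therefore tag the context words ("is a company") as well, which is why the answer annotates `$1`.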

This example shows how to use the extraction from a first round in a second round:

    ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
    
    $ORGANIZATION_TITLES = "/inc\.|corp\./"
    
    $COMPANY_INDICATOR_WORDS = "/company|corporation/"
    
    ENV.defaults["stage"] = 1
    
    { pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
    
    ENV.defaults["stage"] = 2
    
    { pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
    

This example should work on the sentence: Joe Smith works for Paragonix Inc.
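
The answer notes that the tokensregex settings can also be passed through a `Properties` object when using the Java API. A minimal sketch of that follows, assuming the property keys mirror the command-line flags in the answer's command (`tokensregex.rules`, `tokensregex.caseInsensitive`) and that the bare `-tokensregex.caseInsensitive` flag corresponds to the value `"true"`. The class and method names are hypothetical, and the pipeline construction is left as a comment so the snippet compiles without CoreNLP on the classpath:

```java
import java.util.Properties;

// Hypothetical helper; the names are illustrative and not from the original answer.
final class TokensRegexProps {

    // Builds pipeline properties that mirror the command-line flags in the answer.
    static Properties build(String rulesFile) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", rulesFile);
        // Assumption: the bare -tokensregex.caseInsensitive flag means "true".
        props.setProperty("tokensregex.caseInsensitive", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("example.rules");
        // With CoreNLP on the classpath, the next steps would be:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //   Annotation document = new Annotation(text);
        //   pipeline.annotate(document);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Listing `tokensregex` in `annotators` lets the pipeline create the annotator itself, so the separate `pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile))` call from the question should not be needed in that setup.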

For more on stanford-nlp - TokensRegex rules to get correct output for named entities, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/43447585/
