stanford-nlp - TokensRegex rules to get the correct output for named entities

I finally got my TokensRegex code to produce some output for named entities, but the output is not exactly what I want. I think the rules need some adjustment.

Here is the code:

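    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;

    import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
    import edu.stanford.nlp.ling.tokensregex.Env;
    import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
    import edu.stanford.nlp.ling.tokensregex.NodePattern;
    import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
    import edu.stanford.nlp.util.CoreMap;

    // Placeholder class name: the original post did not include the class declaration.
    public class TokensRegexNerDemo {

        public static void main(String[] args) {
            String rulesFile = "D:\\Workspace\\resource\\NERRulesFile.rules.txt";
            String dataFile = "D:\\Workspace\\data\\GoldSetSentences.txt"; // not used below

            // Standard pipeline up to NER, with SUTime switched off
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
            props.setProperty("ner.useSUTime", "0");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            // Run the TokensRegex rules as an extra annotator after NER
            pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
            String inputText = "Bill Edelman, CEO and chairman of Paragonix Inc. announced that the company is expanding it's operations in China.";

            Annotation document = new Annotation(inputText);
            pipeline.annotate(document);

            // Separate, case-insensitive environment for the expression extractor
            Env env = TokenSequencePattern.getNewEnv();
            env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
            env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
            List<CoreMap> sentences = document.get(SentencesAnnotation.class);
            CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);

            // Go over the annotated sentences and print the matched expressions
            for (CoreMap sentence : sentences) {
                List<MatchedExpression> matched = extractor.extractExpressions(sentence);

                for (MatchedExpression phrase : matched) {
                    // Print out matched text and value
                    System.out.println("matched: " + phrase.getText() + " with value: " + phrase.getValue());

                    // Print out token information for proper nouns only
                    CoreMap cm = phrase.getAnnotation();
                    for (CoreLabel token : cm.get(TokensAnnotation.class)) {
                        if (token.tag().equals("NNP")) {
                            // Whitespace before and after the token
                            String leftContext = token.before();
                            String rightContext = token.after();
                            System.out.println(leftContext);
                            System.out.println(rightContext);

                            String word = token.get(TextAnnotation.class);
                            String lemma = token.get(LemmaAnnotation.class);
                            String pos = token.get(PartOfSpeechAnnotation.class);
                            String ne = token.get(NamedEntityTagAnnotation.class);
                            // Separator before "ne=" was missing in the original
                            System.out.println("matched token: word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", ne=" + ne);
                        }
                    }
                }
            }
        }
    }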

Here is the rules file:

$TITLES_CORPORATE  = (/chief/ /administrative/ /officer/|cao|ceo|/chief/ /executive/ /officer/|/chairman/|/vice/ /president/)
$ORGANIZATION_TITLES = (/International/|/inc\./|/corp/|/llc/)

# For detecting organization names like 'Paragonix Inc.' 

{    ruleType: "tokens",
     pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES),
     action: ( Annotate($0, ner, "ORGANIZATION"),Annotate($1, ner, "ORGANIZATION") ) 
}

# For extracting organization names from a pattern - 'Genome International is planning to expand its operations in China.' 
#(in the sentence given above the words planning and expand are part of the $OrgContextWords macros )
{
  ruleType: "tokens",
  pattern: (([{tag:/NNP.*/}]+) /,/*? /is|had|has|will|would/*? /has|had|have|will/*? /be|been|being/*? (?:[]{0,5}[{lemma:$OrgContextWords}]) /of|in|with|for|to|at|like|on/*?),
  result:  ( Annotate($1, ner, "ORGANIZATION") ) 
}

# For sentence like - Bill Edelman, Chairman and CEO of Paragonix Inc./ Zuckerberg CEO Facebook said today....  

ENV.defaults["stage"] = 1
{
  pattern: ( $TITLES_CORPORATE ), 
  action: ( Annotate($1, ner, "PERSON_TITLE")) 
}

ENV.defaults["stage"] = 2 
{
  ruleType: "tokens",
  pattern:  ( ([ { pos:NNP} ]+) /,/*? (?:TITLES_CORPORATE)? /and|&/*? (?:TITLES_CORPORATE)? /,/*? /of|for/? /,/*? [ { pos:NNP } ]+ ),
  result: (Annotate($1, ner, "PERSON"),Annotate($2, ner, "ORGANIZATION"))
}

The output I get is:

matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNP, ne=PERSON
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

The output I expect is:

matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNP, ne=ORGANIZATION
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

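Also, Bill Edelman is not recognized as a person here. Even though I have written a rule for it, the phrase containing Bill Edelman is not matched. Do I need to write rules that cover the whole phrase for each pattern so that no entity is missed?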

Best answer

I built a jar that represents the latest Stanford CoreNLP on the main GitHub page (as of April 14).

This command (with the latest code) should work for running the TokensRegexAnnotator (alternatively, if you use the Java API, you can pass the tokensregex settings in the Properties object):

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
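For the Java API route mentioned above, the same settings can go into a `Properties` object. This is a minimal sketch, assuming the property keys mirror the command-line flags shown (`tokensregex.rules`, `tokensregex.caseInsensitive`) and that the bare `-tokensregex.caseInsensitive` flag corresponds to the value `"true"`. The class name `TokensRegexProps` and the helper `build` are hypothetical. The pipeline construction is left as a comment because it needs the CoreNLP jar and models on the classpath.

```java
import java.util.Properties;

public class TokensRegexProps {

    // Hypothetical helper: mirrors the command-line flags above as pipeline properties.
    static Properties build(String rulesFile) {
        Properties props = new Properties();
        // tokensregex comes last so the rules can read the pos and ner annotations.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", rulesFile);
        // Assumption: the bare command-line flag is equivalent to "true".
        props.setProperty("tokensregex.caseInsensitive", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("example.rules");
        props.forEach((key, value) -> System.out.println(key + " = " + value));
        // With CoreNLP on the classpath, the pipeline would then be created with:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

With this setup the rules run inside the pipeline itself, so there should be no need to call `addAnnotator` separately as the question's code does.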

Here is a rules file I wrote that shows matching based on a sentence pattern:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

$COMPANY_INDICATOR_WORDS = "/company|corporation/"

{ pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }

{ pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }

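Note that $0 refers to the whole pattern and $1 refers to the first capture group. So in this example I put an extra pair of parentheses around the part of the text that I actually want to annotate.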

I ran these rules on the example sentence: Paragonix Inc. is a company that Joe Smith works for.
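The $0 versus $1 distinction follows the same numbering as capture groups in `java.util.regex`. The sketch below is only an analogy: it matches characters rather than tokens, and the pattern is a rough stand-in for the first rule, not TokensRegex syntax. The class name `GroupNumbering` and the helper `groups` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {

    // Returns {whole match, first capture group}, or null when nothing matches.
    static String[] groups(String text) {
        // The outer capturing parentheses wrap only the company name, like $1 in the rule.
        Pattern p = Pattern.compile("((?:[A-Z]\\w+ )+(?:Inc\\.|Corp\\.)) is a company");
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("Paragonix Inc. is a company that Joe Smith works for.");
        System.out.println("group 0: " + g[0]); // whole match, including "is a company"
        System.out.println("group 1: " + g[1]); // only the company name
    }
}
```

Annotating group 0 would label the context words as well, which is why the rules above annotate $1.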

This example shows how to use the extractions from the first stage in the second stage:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

$COMPANY_INDICATOR_WORDS = "/company|corporation/"

ENV.defaults["stage"] = 1

{ pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }

ENV.defaults["stage"] = 2

{ pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }

This example should work on the sentence: Joe Smith works for Paragonix Inc.
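To make the staging idea concrete, here is a plain-Java imitation of the two rules above. It is not TokensRegex and does not use CoreNLP: the `Tok` class, the hand-assigned POS tags, and the method names are all assumptions made for illustration. It only shows why stage 2 can rely on a label written in stage 1.

```java
import java.util.ArrayList;
import java.util.List;

public class StagedRulesSketch {

    // Minimal stand-in for a token: word, POS tag, and a mutable NER label.
    static final class Tok {
        final String word;
        final String pos;
        String ner = "O";
        Tok(String word, String pos) { this.word = word; this.pos = pos; }
    }

    // True when tokens i and i+1 are "works for" (case-insensitive).
    static boolean worksForAt(List<Tok> toks, int i) {
        return i + 1 < toks.size()
                && toks.get(i).word.equalsIgnoreCase("works")
                && toks.get(i + 1).word.equalsIgnoreCase("for");
    }

    // Stage 1: "works for" followed by an NNP run ending in inc./corp. -> RULE_FOUND_ORG.
    static void stage1(List<Tok> toks) {
        for (int i = 0; i < toks.size(); i++) {
            if (!worksForAt(toks, i)) continue;
            int start = i + 2;
            int lastTitle = -1;
            for (int j = start; j < toks.size() && toks.get(j).pos.equals("NNP"); j++) {
                // Require at least one NNP token before the title, as [{pos: NNP}]+ does.
                if (j > start && toks.get(j).word.matches("(?i)(?:inc|corp)\\.")) lastTitle = j;
            }
            for (int j = start; j <= lastTitle; j++) toks.get(j).ner = "RULE_FOUND_ORG";
        }
    }

    // Stage 2: an NNP run before "works for" + a stage-1 organization -> RULE_FOUND_PERSON.
    static void stage2(List<Tok> toks) {
        for (int i = 0; i < toks.size(); i++) {
            if (!worksForAt(toks, i)) continue;
            if (i + 2 >= toks.size() || !toks.get(i + 2).ner.equals("RULE_FOUND_ORG")) continue;
            for (int j = i - 1; j >= 0 && toks.get(j).pos.equals("NNP"); j--) {
                toks.get(j).ner = "RULE_FOUND_PERSON";
            }
        }
    }

    // Builds the example sentence with hand-assigned POS tags and runs both stages in order.
    static List<Tok> run() {
        String[][] sent = { {"Joe", "NNP"}, {"Smith", "NNP"}, {"works", "VBZ"},
                            {"for", "IN"}, {"Paragonix", "NNP"}, {"Inc.", "NNP"} };
        List<Tok> toks = new ArrayList<>();
        for (String[] t : sent) toks.add(new Tok(t[0], t[1]));
        stage1(toks);
        stage2(toks);
        return toks;
    }

    public static void main(String[] args) {
        for (Tok t : run()) System.out.println(t.word + "\t" + t.ner);
    }
}
```

If the order of the two calls in `run` is swapped, stage 2 finds no RULE_FOUND_ORG label yet and tags nothing, which is the dependency that the `ENV.defaults["stage"]` values express in the rules file.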

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/43447585/
