java - StanfordNLP - ArrayIndexOutOfBoundsException at TokensRegexNERAnnotator.readEntries (:696)

Tags: java nlp stanford-nlp

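I want to use StanfordNLP's TokensRegexNERAnnotator to recognize the following phrases as SKILL.

Areas of Expertise, Areas of Knowledge, Computer Skills, Technical Experience, Technical Skills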


Code -

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
    String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
    List<CoreLabel> tokens = new ArrayList<>();

    // Iterate over each test sentence in the array.
    for (String txt : tests) {
         System.out.println("String is : " + txt);

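         // create an empty Annotation just with the given text
         Annotation document = new Annotation(txt);

         // run all annotators on this text; without this call,
         // document.get(SentencesAnnotation.class) returns null
         pipeline.annotate(document);

         List<CoreMap> sentences = document.get(SentencesAnnotation.class);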

         // Next, go over the annotated sentences and extract the annotated words
         // using the CoreLabel object.
         for (CoreMap sentence : sentences) {
             for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                 System.out.println("annotated coreMap sentences : " + token);
                 // Extracting NER tag for current token
                 String ne = token.get(NamedEntityTagAnnotation.class);
                 String word = token.get(CoreAnnotations.TextAnnotation.class);
                 System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
                 System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
                 System.out.println("Named Entity : " + ne);
             }
         }
    }

My regex rules file is -

    $SKILL_FIRST_KEYWORD = "/area/|/areas/|/technical/|/computer/|/professional/"
    $SKILL_KEYWORD = "/knowledge/|/skill/|/skills/|/expertise/|/experience/"

    tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

    { ruleType: "tokens", pattern: ($SKILL_FIRST_KEYWORD + $SKILL_KEYWORD), result: "SKILL" }

I get an ArrayIndexOutOfBoundsException. I suspect something is wrong with my rules file. Can someone point out where I went wrong?

Desired output -

Areas of Expertise - SKILL

Areas of Knowledge - SKILL

Computer Skills - SKILL




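You should use TokensRegexAnnotator, not TokensRegexNERAnnotator. TokensRegexNERAnnotator expects a tab-delimited mapping file (a token pattern, a tab, then the entity type), whereas your file is written in TokensRegex rule syntax, which TokensRegexAnnotator reads.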
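To see why a rules file can trigger this exception, here is a minimal sketch of what a tab-delimited mapping reader does with each line. It is a simplified illustration only, not CoreNLP's actual source: the class `MappingLineDemo` and the method `entityTypeOf` are hypothetical names, and the assumption is that the reader splits each line on a tab and takes the second field as the entity type.

```java
import java.util.Arrays;
import java.util.List;

public class MappingLineDemo {

    // Hypothetical stand-in for a tab-delimited mapping reader (NOT CoreNLP's code).
    // Assumes each line looks like: <token pattern> TAB <entity type>
    static String entityTypeOf(String line) {
        String[] fields = line.split("\t");
        // A line without a tab yields a one-element array, so index 1 is out of bounds.
        return fields[1];
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            // A line in the tab-delimited mapping format
            "Bachelor of Arts\tDEGREE",
            // A line in TokensRegex rule syntax: it contains no tab
            "{ ruleType: \"tokens\", pattern: ($SKILL_FIRST_KEYWORD + $SKILL_KEYWORD), result: \"SKILL\" }"
        );
        for (String line : lines) {
            try {
                System.out.println("type = " + entityTypeOf(line));
            } catch (ArrayIndexOutOfBoundsException e) {
                System.out.println("no tab-separated type column: " + e.getClass().getSimpleName());
            }
        }
    }
}
```

The first line parses and the second throws, which matches the symptom in the question: the file itself may be a valid rules file, but it is being handed to a reader that expects a different format.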

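One way to load a rules file with TokensRegexAnnotator is through the pipeline properties. The sketch below only builds the properties, so it runs without CoreNLP on the classpath. It assumes the annotator is registered under the name `tokensregex` and reads its rules path from the `tokensregex.rules` property, as described in the CoreNLP TokensRegex documentation; please verify both against your CoreNLP version. The class `TokensRegexProps` and the method `build` are hypothetical helper names.

```java
import java.util.Properties;

public class TokensRegexProps {

    // Builds pipeline properties that append the tokensregex annotator.
    // Assumption: "tokensregex" and "tokensregex.rules" are the names your CoreNLP version uses.
    static Properties build(String rulesPath) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, tokensregex");
        props.setProperty("tokensregex.rules", rulesPath);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("./mapping/test_degree.rule");
        // With CoreNLP on the classpath, these would be passed to: new StanfordCoreNLP(props)
        System.out.println("annotators = " + props.getProperty("annotators"));
        System.out.println("tokensregex.rules = " + props.getProperty("tokensregex.rules"));
    }
}
```

Note that a rule with only a `result` field does not by itself set the NER tag on the matched tokens. To have `NamedEntityTagAnnotation` return SKILL, the rule would likely also need an `action` such as `Annotate(...)` targeting the NER annotation key.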

