java - Stanford NLP API for Java: how to get a name as a whole instead of in parts

Tags: java stanford-nlp

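My code takes a document (a PDF or DOC file), extracts all of its text, and passes that text to Stanford NLP for analysis. The code works fine. But suppose the document contains a name such as "Pardeep Kumar". The output I get is: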

Pardeep NNP PERSON

Kumar NNP PERSON

But I want it to look like this:

Pardeep Kumar NNP PERSON

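How can I do that? How do I check whether two adjacent words actually form one name or a similar entity? How can I keep them from being split into separate words?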

Here is my code:

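import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
import java.util.Properties;
import javax.swing.JFileChooser;
import javax.xml.transform.TransformerConfigurationException;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Apache POI (Word documents) and iText 5 (PDF); the iText package names assume iText 5
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class readstuff {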

      public static void analyse(String data) {

            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);


            // create an empty Annotation just with the given text
            Annotation document = new Annotation(data);

            // run all Annotators on this text
            pipeline.annotate(document);

            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

            // System.out.println("word"+"\t"+"POS"+"\t"+"NER");
            for (CoreMap sentence : sentences) {

                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods

                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

                    if(ne.equals("PERSON") || ne.equals("LOCATION") || ne.equals("DATE") )
                    {

                        System.out.format("%32s%10s%16s",word,pos,ne);
                        System.out.println();
                    //System.out.println(word +"       \t"+pos +"\t"+ne);
                    }

                }
            }
        }

    public static void main(String[] args) throws FileNotFoundException, IOException, TransformerConfigurationException{

        JFileChooser window=new JFileChooser();
        int a=window.showOpenDialog(null);

        if(a==JFileChooser.APPROVE_OPTION){
            String name=window.getSelectedFile().getName();
            String extension = name.substring(name.lastIndexOf(".") + 1, name.length());
            String data = null;

            if(extension.equals("docx")){
                XWPFDocument doc=new XWPFDocument(new FileInputStream(window.getSelectedFile()));
                XWPFWordExtractor extract= new XWPFWordExtractor(doc);
                //System.out.println("docx file reading...");
                data=extract.getText();
                //extract.getMetadataTextExtractor();
            }
            else if(extension.equals("doc")){
                HWPFDocument doc=new HWPFDocument(new FileInputStream(window.getSelectedFile()));
                WordExtractor extract= new WordExtractor(doc);
                //System.out.println("doc file reading...");
                data=extract.getText();
            }
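            else if(extension.equals("pdf")){
                PdfReader reader=new PdfReader(new FileInputStream(window.getSelectedFile()));
                int n=reader.getNumberOfPages();
                // collect the text in a StringBuilder so the result does not start with "null"
                StringBuilder text=new StringBuilder();
                // pages are numbered 1..n, so use i<=n to include the last page
                for(int i=1;i<=n;i++)
                {
                    text.append(PdfTextExtractor.getTextFromPage(reader,i));
                }
                reader.close();
                data=text.toString();
            }
            else{
                System.out.println("format not supported");
            }

            // skip the analysis when no text was extracted (unsupported format)
            if(data!=null){
                analyse(data);
            }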
        }
    }



}

Best answer

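You want to use the entitymentions annotator. It runs after ner and combines adjacent tokens that carry the same named-entity tag into a single mention, so "Pardeep Kumar" comes back as one PERSON entity instead of two tokens.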

package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

  public static void main(String[] args) {
    Annotation document =
        new Annotation("John Smith visited Los Angeles on Tuesday. He left Los Angeles on Wednesday.");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);

    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        System.out.println(entityMention);
        System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
      }
    }
  }
}
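
To illustrate what grouping adjacent tokens means, here is a minimal sketch in plain Java that merges runs of consecutive tokens sharing the same NER tag. It is not CoreNLP's implementation: the class name MentionGrouper and the method groupMentions are hypothetical, and the words and tags are assumed to have been read from the tokens already (as in the loop in the question). It also assumes that "O" marks a token that is not part of any entity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Stanford CoreNLP.
final class MentionGrouper {

    // Merges runs of adjacent tokens with the same NER tag into one entry
    // of the form "text<TAB>tag". Tokens tagged "O" are skipped.
    static List<String> groupMentions(List<String> words, List<String> nerTags) {
        if (words.size() != nerTags.size()) {
            throw new IllegalArgumentException("words and nerTags must have the same length");
        }
        List<String> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = nerTags.get(i);
            // a change of tag closes the mention collected so far
            if (!tag.equals(currentTag)) {
                flush(mentions, current, currentTag);
                currentTag = tag;
            }
            if (!"O".equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        flush(mentions, current, currentTag);
        return mentions;
    }

    // Adds the pending mention (if any) to the result and resets the buffer.
    private static void flush(List<String> mentions, StringBuilder current, String tag) {
        if (current.length() > 0) {
            mentions.add(current + "\t" + tag);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Pardeep", "Kumar", "visited", "Los", "Angeles");
        List<String> tags = Arrays.asList("PERSON", "PERSON", "O", "LOCATION", "LOCATION");
        for (String mention : groupMentions(words, tags)) {
            System.out.println(mention);
        }
    }
}
```

One limitation of this sketch: two different entities of the same type that stand directly next to each other, with no other token between them, are merged into one mention. The entitymentions annotator is the better choice when it is available.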

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/46787542/
