java - How to read a document for Named Entity Recognition in OpenNLP

Tags: java opennlp named-entity-recognition

I am new to Java, and my requirement is to read a document and perform named entity recognition on it. For a simple string I did the following:

InputStream is = new FileInputStream("data/en-ner-person.bin");
TokenNameFinderModel model = new TokenNameFinderModel(is);
is.close();
NameFinderME nameFinder = new NameFinderME(model);
String []sentence = new String[]{"Smith",
                "Smithosian",
                "is",
                "a",
                "person"
                };

Span nameSpans[] = nameFinder.find(sentence);

However, I need to actually read the stream from a document and then generate XML. Can anyone tell me how to do this?

Thanks

Best Answer

No one ever answered this one, so I hope it's not too late.

For entity extraction you need the document text as a String. Check Stack Overflow for the many ways to get document text into a String (the short answer here is to use a BufferedInputStream for text files, or Apache Tika for MS Office and PDF files).
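As a quick reference, here is a minimal sketch for the plain-text case, assuming a UTF-8 file; the class name and path parameter are only placeholders, not anything from OpenNLP:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DocReader
{
    //reads an entire plain-text file into one String (UTF-8 assumed);
    //for MS Office or PDF files, hand the file to Apache Tika instead
    public static String readTextFile(String path) throws IOException
    {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }
}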

Once you have the document text in memory, the code below should take care of sentence boundary detection, tokenization, and NER for you. Then take that result and generate your xmlDoc however you like, using the docname/docid, possibly some file metadata, the actual entity string, the type, and the Span (the position of the NE hit in the text). A sketch of that XML step follows the class below.

This class should get you started:

package processors;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class OpenNLPNER implements Runnable
{

    static TokenizerModel tm = null;
    static TokenNameFinderModel locModel = null;
    String doc;
    NameFinderME myNameFinder;
    TokenizerME wordBreaker;
    SentenceDetector sd;

    public OpenNLPNER()
    {
    }

    public OpenNLPNER(String document, SentenceDetector sd, NameFinderME mf, TokenizerME wordBreaker)
    {
        System.out.println("got doc");
        this.sd = sd;
        this.myNameFinder = mf;
        this.wordBreaker = wordBreaker;
        doc = document;
    }

    private static List<String> getMyDocsFromSomewhere()
    {
        //this should return an object that has all the info about the doc you want
        return new ArrayList<String>();
    }

    public static void main(String[] args)
    {
        try
        {
            String modelPath = "c:\\temp\\opennlpmodels\\";

            if (tm == null)
            {
                //user does normal namefinder instantiations...
                InputStream stream = new FileInputStream(new File(modelPath + "en-token.zip"));
                // new SentenceDetectorME(new SentenceModel(new FileInputStream(new File(modelPath + "en-sent.zip"))));
                tm = new TokenizerModel(stream);
                // new TokenizerME(tm);
                locModel = new TokenNameFinderModel(new FileInputStream(new File(modelPath + "en-ner-location.bin")));
                //  new NameFinderME(locModel);
            }


            System.out.println("getting data");
            List<String> docs = getMyDocsFromSomewhere();
            System.out.println("\tdone getting data");
            // FileWriter fw = new FileWriter("C:\\apache\\modelbuilder\\sentences.txt");




            for (String docu : docs)
            {
                //you could also use the runnable here and launch in a diff thread
                new OpenNLPNER(docu,
                        new SentenceDetectorME(new SentenceModel(new FileInputStream(new File(modelPath + "en-sent.zip")))),
                        new NameFinderME(locModel), new TokenizerME(tm)).run();

            }

            System.out.println("done");


        } catch (Exception ex)
        {
            System.out.println(ex);
        }


    }

    @Override
    public void run()
    {
        try
        {
            process(doc);
        } catch (Exception ex)
        {
            System.out.println(ex);
        }
    }

    public void process(String document) throws Exception
    {

        //  System.out.println(document);
        //user instantiates the non static entitylinkerproperty object and constructs it with a pointer to the prop file they need to use
        String modelPath = "C:\\apache\\entitylinker\\";


        //input document
        myNameFinder.clearAdaptiveData();
        //user splits doc to sentences
        String[] sentences = sd.sentDetect(document);
        //get the sentence spans
        Span[] sentenceSpans = sd.sentPosDetect(document);
        Span[][] allnamesInDoc = new Span[sentenceSpans.length][];
        String[][] allTokensInDoc = new String[sentenceSpans.length][];

        for (int sentenceIndex = 0; sentenceIndex < sentences.length; sentenceIndex++)
        {
            String[] stringTokens = wordBreaker.tokenize(sentences[sentenceIndex]);
            Span[] tokenSpans = wordBreaker.tokenizePos(sentences[sentenceIndex]);
            Span[] spans = myNameFinder.find(stringTokens);
            allnamesInDoc[sentenceIndex] = spans;
            allTokensInDoc[sentenceIndex] = stringTokens;
        }

        //now access the data like this...
        for (int s = 0; s < sentenceSpans.length; s++)
        {
            Span[] namesInSentence = allnamesInDoc[s];
            String[] tokensInSentence = allTokensInDoc[s];
            String[] entities = Span.spansToStrings(namesInSentence, tokensInSentence);
            for (int n = 0; n < entities.length; n++)
            {
                //start building up the XML here....
                System.out.println(entities[n] + " was in sentence " + s + " @ " + namesInSentence[n].toString());
            }
        }

    }
}
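The class above stops at the "//start building up the XML here...." comment, so here is a minimal sketch of one way to finish that step with the JDK's built-in DOM API. The element and attribute names (doc, entity, type, start, end) are only illustrative, not part of OpenNLP or of the original answer:

import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import opennlp.tools.util.Span;

public class NERXmlWriter
{
    //builds a simple XML document from the entity strings and their spans;
    //the schema here is arbitrary, adapt it to whatever output you need
    public static String toXml(String docName, String[] entities, Span[] spans) throws Exception
    {
        Document xml = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = xml.createElement("doc");
        root.setAttribute("name", docName);
        xml.appendChild(root);

        for (int i = 0; i < entities.length; i++)
        {
            Element e = xml.createElement("entity");
            e.setAttribute("type", spans[i].getType() == null ? "" : spans[i].getType());
            e.setAttribute("start", String.valueOf(spans[i].getStart()));
            e.setAttribute("end", String.valueOf(spans[i].getEnd()));
            e.setTextContent(entities[i]);
            root.appendChild(e);
        }

        //serialize the DOM tree to a String
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer().transform(new DOMSource(xml), new StreamResult(out));
        return out.toString();
    }
}

You could call something like this inside the inner loop of process() (per sentence), or collect all the entities and spans for the whole document first and write the returned String out once.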

Regarding java - how to read a document for Named Entity Recognition in OpenNLP, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19293425/
