java - Stanford NER 3.4.1 questions

Tags: java stanford-nlp text-extraction

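I downloaded NER 3.4.1 (released on 08-27-14) to train on domain-specific articles (highly technical).

I would like to know the following:

(1) Is it possible to output the character offsets of each extracted entity?

(2) Is it possible to output a confidence score for each extracted entity?

(3) I have trained more than one CRF model with NER 3.4.1, but the Stanford GUI seems to display only a single CRF model. Is there a way to display multiple CRF models without writing a wrapper?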

Best answer

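(1) Yes, absolutely. Each token (class: CoreLabel) stores its begin and end character offsets. The easiest way to get the offsets of whole entities is the classifyToCharacterOffsets() method. See the NERDemo example below.

(2) Yes, but there is some subtlety in interpreting these scores. Much of the uncertainty is often not over whether these three words should be a person or an organization, but over whether the organization should be two words long or three words long, and so on. Really, the NER classifier is putting probabilities (strictly, unnormalized clique potentials) over assignments of labels and label sequences at each point. There are several methods you can use to query these scores. The NERDemo example below illustrates a couple of the simpler ones, where they are rendered as probabilities. If you want more and know how to interpret a CRF, you can get the CliqueTree for a sentence and do what you want with it. In practice, rather than doing any of this, it is often easier to work with just a k-best list of labelings, each with a full-sentence probability assigned to it. The NERDemo example shows this too.

(3) Sorry, not with the current code. It is just a simple demo. If you want to extend its capabilities, you are welcome to. We would be happy to get code contributions back!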
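To make point (1) concrete, here is a minimal, library-free sketch of how per-token offsets can be collapsed into entity-level offsets: adjacent tokens that share the same non-background label are merged into one span. The names Token, Span and mergeSpans are hypothetical, the labels are hand-assigned rather than produced by a model, and this is only an illustration of the idea, not the library's actual implementation of classifyToCharacterOffsets().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EntitySpanSketch {

  /** Hypothetical stand-in for a labeled token: [begin, end) character offsets plus a label. */
  public static final class Token {
    public final int begin;
    public final int end;
    public final String label;

    public Token(int begin, int end, String label) {
      this.begin = begin;
      this.end = end;
      this.label = label;
    }
  }

  /** An entity: its label and the [begin, end) character offsets it covers. */
  public static final class Span {
    public final String label;
    public final int begin;
    public final int end;

    public Span(String label, int begin, int end) {
      this.label = label;
      this.begin = begin;
      this.end = end;
    }
  }

  /** Merges runs of adjacent tokens that share the same non-background label. */
  public static List<Span> mergeSpans(List<Token> tokens, String background) {
    List<Span> spans = new ArrayList<>();
    String current = null;  // label of the entity being built, or null if none
    int start = -1;
    int end = -1;
    for (Token t : tokens) {
      if (t.label.equals(current)) {
        end = t.end;        // same entity continues: extend its right edge
        continue;
      }
      if (current != null) {
        spans.add(new Span(current, start, end));  // close the previous entity
      }
      if (t.label.equals(background)) {
        current = null;
      } else {
        current = t.label;  // a new entity starts at this token
        start = t.begin;
        end = t.end;
      }
    }
    if (current != null) {
      spans.add(new Span(current, start, end));    // entity that runs to the last token
    }
    return spans;
  }

  public static void main(String[] args) {
    String text = "I go to school at Stanford University.";
    // Offsets and labels are written by hand for this illustration.
    List<Token> tokens = Arrays.asList(
        new Token(0, 1, "O"), new Token(2, 4, "O"), new Token(5, 7, "O"),
        new Token(8, 14, "O"), new Token(15, 17, "O"),
        new Token(18, 26, "ORGANIZATION"), new Token(27, 37, "ORGANIZATION"),
        new Token(37, 38, "O"));
    for (Span s : mergeSpans(tokens, "O")) {
      System.out.println(s.label + " [" + s.begin + ", " + s.end + "): "
          + text.substring(s.begin, s.end));
    }
  }
}
```

Because the spans are half-open character ranges into the original string, they can be passed directly to String.substring(), which is the same pattern the demo uses with the triples returned by classifyToCharacterOffsets().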

Below is an extended version of NERDemo.java from the distribution, which illustrates some of these options.

package edu.stanford.nlp.ie.demo;

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.*;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.sequences.DocumentReaderAndWriter;
import edu.stanford.nlp.sequences.PlainTextDocumentReaderAndWriter;
import edu.stanford.nlp.util.Triple;

import java.util.List;


/** This is a demo of calling CRFClassifier programmatically.
 *  <p>
 *  Usage: {@code java -mx400m -cp "stanford-ner.jar:." NERDemo [serializedClassifier [fileName]] }
 *  <p>
 *  If arguments aren't specified, they default to
 *  classifiers/english.all.3class.distsim.crf.ser.gz and some hardcoded sample text.
 *  <p>
 *  To use CRFClassifier from the command line:
 *  </p><blockquote>
 *  {@code java -mx400m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier [classifier] -textFile [file] }
 *  </blockquote><p>
 *  Or if the file is already tokenized and one word per line, perhaps in
 *  a tab-separated value format with extra columns for part-of-speech tag,
 *  etc., use the version below (note the 's' instead of the 'x'):
 *  </p><blockquote>
 *  {@code java -mx400m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier [classifier] -testFile [file] }
 *  </blockquote>
 *
 *  @author Jenny Finkel
 *  @author Christopher Manning
 */

public class NERDemo {

  public static void main(String[] args) throws Exception {

    String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";

    if (args.length > 0) {
      serializedClassifier = args[0];
    }

    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(serializedClassifier);

    /* For either a file to annotate or for the hardcoded text example, this
       demo file shows several ways to process the input, for teaching purposes.
    */

    if (args.length > 1) {

      /* For the file, it shows (1) how to run NER on a String, (2) how
         to run NER on a whole file (without loading it into a String),
         (3) how to get the entities in the String with character offsets,
         and (4) how to print a k-best list of labelings and per-token
         label probabilities for the file.
      */

      String fileContents = IOUtils.slurpFile(args[1]);
      List<List<CoreLabel>> out = classifier.classify(fileContents);
      for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
          System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
        }
        System.out.println();
      }

      System.out.println("---");
      out = classifier.classifyFile(args[1]);
      for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
          System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
        }
        System.out.println();
      }

      System.out.println("---");
      List<Triple<String, Integer, Integer>> list = classifier.classifyToCharacterOffsets(fileContents);
      for (Triple<String, Integer, Integer> item : list) {
        System.out.println(item.first() + ": " + fileContents.substring(item.second(), item.third()));
      }
      System.out.println("---");
      System.out.println("Ten best");
      DocumentReaderAndWriter<CoreLabel> readerAndWriter = classifier.makePlainTextReaderAndWriter();
      classifier.classifyAndWriteAnswersKBest(args[1], 10, readerAndWriter);

      System.out.println("---");
      System.out.println("Probabilities");
      classifier.printProbs(args[1], readerAndWriter);


      System.out.println("---");
      System.out.println("First Order Clique Probabilities");
      ((CRFClassifier<CoreLabel>) classifier).printFirstOrderProbs(args[1], readerAndWriter);

    } else {

      /* For the hard-coded String, it shows how to run it on a single
         sentence, and how to do this and produce several formats, including
         slash tags and an inline XML output format. It also shows the full
         contents of the {@code CoreLabel}s that are constructed by the
         classifier.
      */

      String[] example = {"Good afternoon Rajat Raina, how are you today?",
                          "I go to school at Stanford University, which is located in California." };
      for (String str : example) {
        System.out.println(classifier.classifyToString(str));
      }
      System.out.println("---");

      for (String str : example) {
        // This one puts in spaces and newlines between tokens, so just print not println.
        System.out.print(classifier.classifyToString(str, "slashTags", false));
      }
      System.out.println("---");

      for (String str : example) {
        System.out.println(classifier.classifyWithInlineXML(str));
      }
      System.out.println("---");

      for (String str : example) {
        System.out.println(classifier.classifyToString(str, "xml", true));
      }
      System.out.println("---");

      int i=0;
      for (String str : example) {
        for (List<CoreLabel> lcl : classifier.classify(str)) {
          for (CoreLabel cl : lcl) {
            System.out.print(i++ + ": ");
            System.out.println(cl.toShorterString());
          }
        }
      }

      System.out.println("---");

    }
  }

}
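To illustrate the distinction made in point (2) between unnormalized potentials, full-sentence probabilities, and per-token probabilities, here is a toy linear-chain model in plain Java. It is only a sketch under loud assumptions: the class TinyCrfSketch and its methods are hypothetical, all potentials are made-up numbers rather than values from a trained model, and sequences are enumerated by brute force (feasible only for toy sizes), which is not how Stanford NER computes its scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TinyCrfSketch {

  /** One complete labeling of the sentence with its normalized probability. */
  public static final class Scored {
    public final int[] labels;
    public final double prob;

    public Scored(int[] labels, double prob) {
      this.labels = labels;
      this.prob = prob;
    }
  }

  /**
   * Enumerates every label sequence, scores it as exp(sum of unary and pairwise
   * log-potentials), and divides by the partition function Z.
   * unary[t][y] scores label y at position t; pair[a][b] scores label a followed by b.
   * The result is sorted by descending probability, so its prefix is a k-best list.
   */
  public static List<Scored> allSequences(double[][] unary, double[][] pair) {
    int n = unary.length;
    int k = pair.length;
    int total = 1;
    for (int t = 0; t < n; t++) {
      total *= k;
    }
    int[][] seqs = new int[total][n];
    double[] score = new double[total];
    double z = 0.0;
    for (int code = 0; code < total; code++) {
      int c = code;
      double logScore = 0.0;
      for (int t = 0; t < n; t++) {
        int y = c % k;  // decode position t of this sequence from the base-k code
        c /= k;
        seqs[code][t] = y;
        logScore += unary[t][y];
        if (t > 0) {
          logScore += pair[seqs[code][t - 1]][y];
        }
      }
      score[code] = Math.exp(logScore);  // unnormalized sequence score
      z += score[code];
    }
    List<Scored> out = new ArrayList<>();
    for (int code = 0; code < total; code++) {
      out.add(new Scored(seqs[code], score[code] / z));
    }
    out.sort(Comparator.comparingDouble((Scored s) -> s.prob).reversed());
    return out;
  }

  /** Marginal probability that position t has label y: sum over all agreeing sequences. */
  public static double marginal(List<Scored> all, int t, int y) {
    double p = 0.0;
    for (Scored s : all) {
      if (s.labels[t] == y) {
        p += s.prob;
      }
    }
    return p;
  }

  public static void main(String[] args) {
    String[] names = {"O", "PERSON", "ORGANIZATION"};
    // Hypothetical log-potentials for a three-token sentence.
    double[][] unary = {{0.2, 1.0, 0.9}, {0.1, 1.1, 1.0}, {0.3, 0.2, 1.2}};
    double[][] pair = {{0.5, 0.0, 0.0}, {0.0, 0.8, -0.5}, {0.0, -0.5, 0.8}};
    List<Scored> all = allSequences(unary, pair);
    System.out.println("Three best labelings:");
    for (Scored s : all.subList(0, 3)) {
      StringBuilder sb = new StringBuilder();
      for (int y : s.labels) {
        sb.append(names[y]).append(' ');
      }
      System.out.println(sb.toString().trim() + "  p=" + s.prob);
    }
    System.out.println("Per-token marginals at position 0:");
    for (int y = 0; y < names.length; y++) {
      System.out.println(names[y] + "  p=" + marginal(all, 0, y));
    }
  }
}
```

The sorted list plays the role of the k-best output with full-sentence probabilities, while marginal() corresponds to a per-token label probability. The two can disagree about the most likely label for a token, which is one reason the scores need careful interpretation.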

Regarding java - Stanford NER 3.4.1 questions, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27136472/
