java - Adding training data to an existing model (bin file)

Tags: java text-mining training-data opennlp

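I am trying to add extra training data to my nl-personTest.bin file with OpenNLP. My problem is that when I run the code to add the extra training data, it deletes the existing data and keeps only the new data.

How can I add extra training data instead of replacing what is already there?

I used the following code (taken from "Open NLP NER is not properly trained"):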

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainNames {

    public static void main(String[] args) {
        System.out.println(train("nl", "person", "namen.txt", "nl-ner-personTest.bin"));
    }

    // Trains a brand-new model from the given samples and writes it to modelStream.
    public static String train(String lang, String entity, InputStreamFactory inputStream,
                               FileOutputStream modelStream) {
        // try-with-resources closes both streams, even when training fails
        try (OutputStream modelOut = new BufferedOutputStream(modelStream);
             ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                     new PlainTextByLineStream(inputStream, StandardCharsets.UTF_8))) {
            TokenNameFinderModel model = NameFinderME.train(lang, entity, sampleStream,
                    TrainingParameters.defaultParams(), new TokenNameFinderFactory());
            model.serialize(modelOut);
            return "Model trained and saved.";
        } catch (IOException io) {
            io.printStackTrace();
            return "Something goes wrong with training module.";
        }
    }

    public static String train(String lang, String entity, String taggedCorpusFile,
                               String modelFile) {
        try {
            // Open a fresh stream on every call, because the sample stream can be reset and re-read
            InputStreamFactory inputStream = () -> new FileInputStream(taggedCorpusFile);

            // FileOutputStream truncates modelFile, so any existing model is replaced
            return train(lang, entity, inputStream, new FileOutputStream(modelFile));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "Something goes wrong with training module.";
    }
}

Does anyone have an idea how to solve this?

According to the documentation, I need at least 15K sentences if I want an accurate training set.
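To see how far a corpus is from that figure, one could count the sentences in the training files. The following is a minimal sketch, assuming the usual OpenNLP name finder training format of one sentence per line with blank lines separating documents. `SentenceCount` and `countSentences` are hypothetical names chosen for illustration and are not part of OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of OpenNLP.
public class SentenceCount {

    // Counts non-blank lines across all files.
    // Assumption: one training sentence per line, blank lines are only separators.
    static long countSentences(List<Path> files) throws IOException {
        long total = 0;
        for (Path file : files) {
            try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
                total += lines.filter(line -> !line.trim().isEmpty()).count();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        System.out.println("Training sentences: " + countSentences(files));
    }
}
```

Running it with the training files as arguments (for example `namen.txt`) prints the total number of sentences they contain.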

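Best answer

I don't think OpenNLP supports extending an existing binary NLP model.

If you still have all of the training data, collect it and train on everything in one go. You can use a SequenceInputStream for that. I modified your example to use another InputStreamFactory: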

public String train(String lang, String entity, InputStreamFactory inputStream, FileOutputStream modelStream) {

    // ....
    try {
        ObjectStream<String> lineStream = new PlainTextByLineStream(trainingDataInputStreamFactory(Arrays.asList(
                new File("trainingdata1.txt"),
                new File("trainingdata2.txt"),
                new File("trainingdata3.txt")
        )), charset);

        // ...
    } 

    // ...
}

private InputStreamFactory trainingDataInputStreamFactory(List<File> trainingFiles) {
    return new InputStreamFactory() {
        @Override
        public InputStream createInputStream() throws IOException {
            // Open every training file; a file that cannot be opened is logged and skipped
            List<InputStream> inputStreams = trainingFiles.stream()
                    .map(f -> {
                        try {
                            return new FileInputStream(f);
                        } catch (FileNotFoundException e) {
                            e.printStackTrace();
                            return null;
                        }
                    })
                    .filter(Objects::nonNull)
                    .collect(Collectors.toList());

            // Chain the streams so they are read back to back as one input
            return new SequenceInputStream(new Vector<>(inputStreams).elements());
        }
    };
}
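One caveat when chaining files this way: if a file does not end with a line break, its last sentence runs straight into the first sentence of the next file. The following is a minimal sketch of one way to guard against that, using only the JDK; `CorpusConcat` and `openAll` are hypothetical names, not OpenNLP API. It inserts a newline after every file. When a file already ends with a line break this produces an empty line, which, as far as I know, the name finder training format treats as a document boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
public class CorpusConcat {

    // Opens all files as one continuous stream, with "\n" inserted after each file
    // so that sentences from neighbouring files can never be merged.
    static InputStream openAll(List<Path> files) throws IOException {
        List<InputStream> parts = new ArrayList<>();
        try {
            for (Path file : files) {
                parts.add(Files.newInputStream(file));
                parts.add(new ByteArrayInputStream("\n".getBytes(StandardCharsets.UTF_8)));
            }
        } catch (IOException e) {
            // Close whatever was opened before the failure, then report it
            for (InputStream part : parts) {
                part.close();
            }
            throw e;
        }
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    public static void main(String[] args) throws IOException {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        // Copy the combined corpus to standard output
        try (InputStream in = openAll(files)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, read);
            }
            System.out.flush();
        }
    }
}
```

Because `InputStreamFactory` declares a single `createInputStream()` method, something like `() -> CorpusConcat.openAll(files)` should be usable in place of `trainingDataInputStreamFactory(...)` above. Unlike the version in the answer, this sketch fails on a missing file instead of skipping it, which seems safer when the goal is to retrain on the complete corpus.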

Regarding "java - Adding training data to an existing model (bin file)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46708048/
