java - OpenNLP Document Categorizer: how to classify documents by status when the documents are not in English, and what about the default features?

I want to use OpenNLP's Document Categorizer to classify documents by their status: pre-opened, opened, locked, closed, etc.

I have 5 classes and I am using the Naive Bayes algorithm. The training set has 60 documents, and I trained on it with 1000 iterations and a cutoff parameter of 1.

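But it did not work: when I test the model, I do not get good results. I was thinking it might be because of the language of the documents (not English), or that I should somehow add the status as a feature. I have left the default features set in the categorizer, and I am not very familiar with them either.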

The result should be locked, but it is classified as opened.

InputStreamFactory in = null;
try {
    in = new MarkableFileInputStreamFactory(
            new File("D:\\JavaNlp\\doccategorizer\\doccategorizer.txt"));
} catch (FileNotFoundException e2) {
    System.out.println("Creating new input stream");
    e2.printStackTrace();
}

ObjectStream<String> lineStream = null;
ObjectStream<DocumentSample> sampleStream = null;

try {
    // One training document per line: category first, then the text
    lineStream = new PlainTextByLineStream(in, "UTF-8");
    sampleStream = new DocumentSampleStream(lineStream);
} catch (IOException e1) {
    System.out.println("Document Sample Stream");
    e1.printStackTrace();
}


TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 1000 + "");
params.put(TrainingParameters.CUTOFF_PARAM, 1 + "");
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);


DoccatModel model = null;
try {
    model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
} catch (IOException e) {
    System.out.println("Training...");
    e.printStackTrace();
}


System.out.println("\nModel is successfully trained.");


BufferedOutputStream modelOut = null;

try {
    modelOut = new BufferedOutputStream(
            new FileOutputStream("D:\\JavaNlp\\doccategorizer\\classifier-maxent.bin"));
} catch (FileNotFoundException e) {
    System.out.println("Creating output stream");
    e.printStackTrace();
}
try {
    model.serialize(modelOut);
} catch (IOException e) {
    System.out.println("Serialize...");
    e.printStackTrace();
}
System.out.println("\nTrained model is kept in: "
        + "model" + File.separator + "en-cases-classifier-maxent.bin");

DocumentCategorizer doccat = new DocumentCategorizerME(model);
String[] docWords = "Some text here...".replaceAll("[^A-Za-z]", " ").split(" ");
double[] aProbs = doccat.categorize(docWords);

System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
for (int i = 0; i < doccat.getNumberOfCategories(); i++) {
    System.out.println(doccat.getCategory(i) + " : " + aProbs[i]);
}
System.out.println("---------------------------------");

System.out.println("\n" + doccat.getBestCategory(aProbs) + " : is the category for the given sentence");


Can anyone suggest how I should classify my documents, for example whether I should add a language detector first, or add new features?

Thanks in advance

Best answer

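By default, the document categorizer takes the document text and forms a bag of words. Each word in the bag becomes a feature. As long as the language can be tokenized by an English tokenizer (which, again by default, is a whitespace tokenizer), I would guess that the language is not your problem. I would check the format of the data you are using for training. It should be formatted like this: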

category<tab>document text

The text should fit on one line. The OpenNLP documentation for the document categorizer can be found at http://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training.tool.
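A quick way to check the format before training is to scan the file yourself. The sketch below is only an illustration and uses nothing from OpenNLP: the class `TrainingFormatCheck`, its methods, and the sample lines are all made up for this example. It assumes that the category is the first whitespace-delimited token on each line and that everything after it is the document text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of OpenNLP: sanity-checks doccat training lines.
class TrainingFormatCheck {

    // Returns one message per malformed line (1-based line numbers).
    static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                problems.add("line " + (i + 1) + ": empty line");
                continue;
            }
            // Assumption: category is everything before the first whitespace (tab or space).
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                problems.add("line " + (i + 1) + ": category without document text");
            }
        }
        return problems;
    }

    // Counts documents per category so that missing or tiny classes are easy to spot.
    static Map<String, Integer> countByCategory(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String category = trimmed.split("\\s+", 2)[0];
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines; real data would be read from the training file.
        List<String> lines = Arrays.asList(
                "locked\tcase is locked pending review",
                "opened\tcase was opened by the clerk",
                "closed",
                "locked\tlocked until further notice");
        System.out.println(findProblems(lines));
        System.out.println(countByCategory(lines));
    }
}
```

With 5 classes and 60 documents, the per-category counts are worth a look: a class with only a handful of lines is unlikely to be learned well.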

It would help if you could provide a line or two of your training data so the format can be checked.

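Edit: Another potential issue. 60 documents may not be enough to train a good classifier, especially if you have a large vocabulary. Also, even though the text is not English, please tell me it is not in multiple languages. Finally, is the document text the best way to classify the documents? Would metadata from the document itself make better features?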
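To make the bag-of-words idea concrete, here is a minimal sketch in plain Java. It only mimics whitespace tokenization and word counting; it is not OpenNLP's actual feature generator, and the class `BagOfWordsSketch`, its methods, and the sample sentence are made up for illustration. It also shows one thing that may be worth checking in the test code from the question: `replaceAll("[^A-Za-z]", " ")` keeps ASCII letters only, so words containing other letters are split into fragments before they reach the categorizer. The `\p{L}` variant is just one possible alternative, and whether it suits your data is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: whitespace tokenization plus word counts.
class BagOfWordsSketch {

    // Splits on whitespace and counts each token.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                bag.merge(token, 1, Integer::sum);
            }
        }
        return bag;
    }

    // The cleanup used in the question: every non-ASCII-letter becomes a space.
    static String asciiLettersOnly(String text) {
        return text.replaceAll("[^A-Za-z]", " ");
    }

    // A possible alternative: keeps letters from any script.
    static String unicodeLettersOnly(String text) {
        return text.replaceAll("[^\\p{L}]", " ");
    }

    public static void main(String[] args) {
        // Made-up non-English sample; \u010D is the letter c with caron.
        String doc = "predmet je zaklju\u010Dan";
        System.out.println(bagOfWords(asciiLettersOnly(doc)));
        System.out.println(bagOfWords(unicodeLettersOnly(doc)));
    }
}
```

In the first call the last word is broken into two fragments, so the features seen at prediction time no longer match the words seen during training, where no such cleanup was applied.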

Hope it helps.

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/53913787/
