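I wrote Java code with WEKA to train 4 classifiers. I saved the classifier models and want to use them to predict new, unseen instances (think of someone who wants to test whether a tweet is positive or negative).
I applied the StringToWordVector filter to the training data. To avoid the "Src and Dest differ in # of attributes" error, I use the following code to train the filter on the training data before applying it to the new instance, so that I can predict whether the new instance is positive or negative. But I just can't get it right.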
Classifier cls = (Classifier) weka.core.SerializationHelper.read("models/myModel.model"); //reading one of the trained classifiers
BufferedReader datafile = readDataFile("Tweets/tone1.ARFF"); //read training data
Instances data = new Instances(datafile);
data.setClassIndex(data.numAttributes() - 1);
Filter filter = new StringToWordVector(50);//keep 50 words
filter.setInputFormat(data);
Instances filteredData = Filter.useFilter(data, filter);
// rebuild classifier
cls.buildClassifier(filteredData);
String testInstance= "Text that I want to use as an unseen instance and predict whether it's positive or negative";
System.out.println(">create test instance");
FastVector attributes = new FastVector(2);
attributes.addElement(new Attribute("text", (FastVector) null));
// Add class attribute.
FastVector classValues = new FastVector(2);
classValues.addElement("Negative");
classValues.addElement("Positive");
attributes.addElement(new Attribute("Tone", classValues));
// Create dataset with initial capacity of 100, and set index of class.
Instances tests = new Instances("test istance", attributes, 100);
tests.setClassIndex(tests.numAttributes() - 1);
Instance test = new Instance(2);
// Set value for message attribute
Attribute messageAtt = tests.attribute("text");
test.setValue(messageAtt, messageAtt.addStringValue(testInstance));
test.setDataset(tests);
Filter filter2 = new StringToWordVector(50);
filter2.setInputFormat(tests);
Instances filteredTests = Filter.useFilter(tests, filter2);
System.out.println(">train Test filter using training data");
Standardize sfilter = new Standardize(); //Match the number of attributes between src and dest.
sfilter.setInputFormat(filteredData); // initializing the filter with training set
filteredTests = Filter.useFilter(filteredData, sfilter); // create new test set
ArffSaver saver = new ArffSaver(); //save test data to ARFF file
saver.setInstances(filteredTests);
File unseenFile = new File ("Tweets/unseen.ARFF");
saver.setFile(unseenFile);
saver.writeBatch();
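When I try to standardize the input data using the filtered training data, I get a new ARFF file (unseen.ARFF), but it contains 2000 instances (the same number as the training data), and most of the values are negative. I don't understand why, or how to remove those instances.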
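One likely reading of this symptom: the call `Filter.useFilter(filteredData, sfilter)` above passes the training data as the batch to transform, so the output is the standardized training set (2000 rows of z-scores, many of them below zero) rather than the single test instance. The sketch below illustrates the "fit on training, apply to a different batch" idea in plain Java. `ColumnStandardizer` is a hypothetical helper written for this illustration, not a Weka class, and it uses the population standard deviation as a simplifying assumption.

```java
// Hypothetical helper (not part of Weka): a z-score scaler that is fit on
// one batch of rows and can then be applied to any other batch.
class ColumnStandardizer {
    private final double[] mean;
    private final double[] std;

    // Learn the per-column mean and (population) standard deviation
    // from the training rows only.
    ColumnStandardizer(double[][] train) {
        int cols = train[0].length;
        mean = new double[cols];
        std = new double[cols];
        for (double[] row : train)
            for (int j = 0; j < cols; j++) mean[j] += row[j] / train.length;
        for (double[] row : train)
            for (int j = 0; j < cols; j++)
                std[j] += Math.pow(row[j] - mean[j], 2) / train.length;
        for (int j = 0; j < cols; j++) std[j] = Math.sqrt(std[j]);
    }

    // The output has exactly as many rows as the batch passed in,
    // so passing the training data back in returns the training data.
    double[][] apply(double[][] batch) {
        double[][] out = new double[batch.length][];
        for (int i = 0; i < batch.length; i++) {
            out[i] = new double[batch[i].length];
            for (int j = 0; j < batch[i].length; j++)
                out[i][j] = std[j] == 0 ? 0 : (batch[i][j] - mean[j]) / std[j];
        }
        return out;
    }
}
```

Applying the fitted scaler to a one-row batch yields one row. In Weka terms, the analogous step would be passing the test `Instances` (not `filteredData`) as the first argument of `Filter.useFilter` after the filter has been initialized on the training set.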
System.out.println(">Evaluation"); //without the following 2 lines I get ArrayIndexOutOfBoundException.
filteredData.setClassIndex(filteredData.numAttributes() - 1);
filteredTests.setClassIndex(filteredTests.numAttributes() - 1);
Evaluation eval = new Evaluation(filteredData);
eval.evaluateModel(cls, filteredTests);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
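When printing the evaluation results, I would like to see, for example, a percentage of how positive or negative the instance is, but instead I get the following. I also want to see 1 instance rather than 2000. Any help on how to do this would be great.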
> Results
======
Correlation coefficient 0.0285
Mean absolute error 0.8765
Root mean squared error 1.2185
Relative absolute error 409.4123 %
Root relative squared error 121.8754 %
Total Number of Instances 2000
Thanks
Best answer
Use eval.predictions(). It is a java.util.ArrayList<Prediction>. You can then use the Prediction.weight() method to get how positive or negative your test instance is...
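Getting a usable prediction for one tweet also requires that the new text be encoded with the vocabulary learned from the training data, which is why building a second StringToWordVector on the test text alone produces a different attribute set. The following is a minimal plain-Java sketch of that idea, not Weka code: `BagOfWords` and `ClassDistribution` are hypothetical names introduced here, and the `double[]` distribution stands in for the per-class probabilities that Weka's `Classifier.distributionForInstance` returns. As far as I know, `Prediction.weight()` reports the instance weight rather than a class probability, so the percentage step here works from a distribution instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a word-vector filter (not Weka's StringToWordVector).
class BagOfWords {
    private final Map<String, Integer> index = new LinkedHashMap<>();

    // Build the vocabulary once, from the training texts only.
    BagOfWords(List<String> trainingTexts) {
        for (String text : trainingTexts)
            for (String token : tokenize(text))
                if (!token.isEmpty()) index.putIfAbsent(token, index.size());
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\W+");
    }

    // Encode any text against the training vocabulary. Words that were not
    // seen during training are ignored, so the vector length never changes.
    double[] encode(String text) {
        double[] vector = new double[index.size()];
        for (String token : tokenize(text)) {
            Integer i = index.get(token);
            if (i != null) vector[i] += 1.0;
        }
        return vector;
    }

    int size() {
        return index.size();
    }
}

// Hypothetical helper: turn per-class scores into percentages.
class ClassDistribution {
    static double[] toPercent(double[] distribution) {
        double sum = 0;
        for (double d : distribution) sum += d;
        double[] percent = new double[distribution.length];
        for (int i = 0; i < distribution.length; i++)
            percent[i] = sum == 0 ? 0 : 100.0 * distribution[i] / sum;
        return percent;
    }
}
```

With a vocabulary built from "good movie" and "bad movie", encoding "good good film" gives a 3-element vector with a count of 2 for "good" and nothing for the unseen word "film". In Weka itself, wrapping the filter and the classifier in a `FilteredClassifier` is one way to have the same trained filter applied automatically to each new instance.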
Regarding "java - How to predict a new unseen instance in Weka using Java code?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33760145/