matlab - How to retrieve class values from WEKA using MATLAB

Tags: matlab machine-learning classification weka decision-tree

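I am trying to retrieve predicted classes from WEKA using MATLAB and the WEKA API. Everything looks fine, but the predicted class is always 0. Any ideas?

My dataset has 241 attributes, and running WEKA directly on this dataset gives correct results.

I first create the training and test objects, then build the classifier and call classifyInstance. But this gives the wrong result: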

train = [xtrain ytrain];
test =  [xtest];

save ('train.txt','train','-ASCII');
save ('test.txt','test','-ASCII');

%## paths
WEKA_HOME = 'C:\Program Files\Weka-3-7';
javaaddpath([WEKA_HOME '\weka.jar']);

fName = 'train.txt';

%## read file

loader = weka.core.converters.MatlabLoader();

loader.setFile( java.io.File(fName) );
train = loader.getDataSet();
train.setClassIndex( train.numAttributes()-1 );

% setting class as nominal

v(1) = java.lang.String('-R');
v(2) = java.lang.String('242');
options = cat(1,v(1:end));

filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions(options); 
filter.setInputFormat(train);   
train = filter.useFilter(train, filter);

fName = 'test.txt';

%## read file

loader = weka.core.converters.MatlabLoader();

loader.setFile( java.io.File(fName) );
test = loader.getDataSet();

%## dataset
relationName = char(test.relationName);
numAttr = test.numAttributes;
numInst = test.numInstances;

%## classification
classifier = weka.classifiers.trees.J48();
classifier.buildClassifier( train );
fprintf('Classifier: %s %s\n%s', ...
    char(classifier.getClass().getName()), ...
    char(weka.core.Utils.joinOptions(classifier.getOptions())), ...
    char(classifier.toString()) )

classes = [];
for i=1:numInst
    classes(i) = classifier.classifyInstance(test.instance(i-1));
end

Below is my new code, but it still does not work: classes is always 0. Weka's own output for the same algorithm and dataset is fine:

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.99      0.015      0.985     0.99      0.988      0.991    0
                 0.985     0.01       0.99      0.985     0.988      0.991    1
Weighted Avg.    0.988     0.012      0.988     0.988     0.988      0.991

=== Confusion Matrix ===

    a    b   <-- classified as
 1012   10 |    a = 0
   15 1003 |    b = 1

ytest1 = ones(size(xtest,1),1);

train = [xtrain ytrain];
test =  [xtest ytest1];

save ('train.txt','train','-ASCII');
save ('test.txt','test','-ASCII');

%## paths
WEKA_HOME = 'C:\Program Files\Weka-3-7';
javaaddpath([WEKA_HOME '\weka.jar']);

fName = 'train.txt';

%## read file

loader = weka.core.converters.MatlabLoader();

loader.setFile( java.io.File(fName) );
train = loader.getDataSet();
train.setClassIndex( train.numAttributes()-1 );

v(1) = java.lang.String('-R');
v(2) = java.lang.String('242');
options = cat(1,v(1:end));

filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions(options); 
filter.setInputFormat(train);   
train = filter.useFilter(train, filter);

fName = 'test.txt';

%## read file

loader = weka.core.converters.MatlabLoader();

loader.setFile( java.io.File(fName) );
test = loader.getDataSet();

filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions( weka.core.Utils.splitOptions('-R last') );
filter.setInputFormat(test);   
test = filter.useFilter(test, filter);


%## dataset
relationName = char(test.relationName);
numAttr = test.numAttributes;
numInst = test.numInstances;

%## classification
classifier = weka.classifiers.trees.J48();

classifier.buildClassifier( train );
fprintf('Classifier: %s %s\n%s', ...
    char(classifier.getClass().getName()), ...
    char(weka.core.Utils.joinOptions(classifier.getOptions())), ...
    char(classifier.toString()) )

classes = zeros(numInst,1);
for i=1:numInst   
     classes(i) = classifier.classifyInstance(test.instance(i-1));     
end

Here is a Java snippet that prints the class distribution:

// output predictions (cls is a trained classifier, test is the test set)
System.out.println("# - actual - predicted - error - distribution");
for (int i = 0; i < test.numInstances(); i++) {
  double pred = cls.classifyInstance(test.instance(i));
  double[] dist = cls.distributionForInstance(test.instance(i));
  System.out.print((i+1));
  System.out.print(" - ");
  System.out.print(test.instance(i).toString(test.classIndex()));
  System.out.print(" - ");
  System.out.print(test.classAttribute().value((int) pred));
  System.out.print(" - ");
  if (pred != test.instance(i).classValue())
    System.out.print("yes");
  else
    System.out.print("no");
  System.out.print(" - ");
  System.out.print(Utils.arrayToString(dist));
  System.out.println();
}

I converted it to MATLAB code like this:

classes = zeros(numInst,1);
for i=1:numInst
     pred = classifier.classifyInstance(test.instance(i-1));  
     classes(i) = str2num(char(test.classAttribute().value(( pred))));
end

But the class output is still incorrect.

Your answer does not show that pred contains the classes and predProbs the probabilities. Print them out!

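Best answer

The training and test data must have the same number of attributes. So in your case, even if you do not know the actual classes of the test data, just use dummy values: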

ytest = ones(size(xtest,1),1);    %# dummy class values for test data

train = [xtrain ytrain];
test =  [xtest ytest];

save ('train.txt','train','-ASCII');    
save ('test.txt','test','-ASCII');

Do not forget to convert the class attribute of the test dataset to nominal after loading it (just as you did for the training dataset):

filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions( weka.core.Utils.splitOptions('-R last') );
filter.setInputFormat(test);   
test = filter.useFilter(test, filter);

Finally, you can call the trained J48 classifier to predict the class values of the test instances:

classes = zeros(numInst,1);
for i=1:numInst
     classes(i) = classifier.classifyInstance(test.instance(i-1));
end

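EDIT

It is hard to tell what is wrong without knowing the data you are working with.

So let me illustrate with a complete example. I will create the datasets in MATLAB from the Fisher Iris data (4 attributes, 150 instances, 3 classes).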

%# load dataset (data + labels)
load fisheriris
X = meas;
Y = grp2idx(species);

%# partition the data into training/testing
c = cvpartition(Y, 'holdout',1/3);
xtrain = X(c.training,:);
ytrain = Y(c.training);
xtest = X(c.test,:);
ytest = Y(c.test);          %# or dummy values

%# save as space-delimited text file
train = [xtrain ytrain];
test =  [xtest ytest];
save train.txt train -ascii
save test.txt test -ascii

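I should mention here that, before applying the NumericToNominal filter, it is important to make sure the class values are fully represented in each of the two datasets. Otherwise the training and test sets may end up incompatible. That is, each dataset must contain at least one instance of every class value. So if you are using dummy values, you could do something like this: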

ytest = ones(size(xtest,1),1);
v = unique(Y);
ytest(1:numel(v)) = v;

Next, let us read the newly created files using the Weka API. We convert the last attribute from numeric to nominal (to enable classification):

%# read train/test files using Weka
fName = 'train.txt';
loader = weka.core.converters.MatlabLoader();
loader.setFile( java.io.File(fName) );
train = loader.getDataSet();
train.setClassIndex( train.numAttributes()-1 );

fName = 'test.txt';
loader = weka.core.converters.MatlabLoader();
loader.setFile( java.io.File(fName) );
test = loader.getDataSet();
test.setClassIndex( test.numAttributes()-1 );

%# convert last attribute (class) from numeric to nominal
filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions( weka.core.Utils.splitOptions('-R last') );
filter.setInputFormat(train);   
train = filter.useFilter(train, filter);

filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions( weka.core.Utils.splitOptions('-R last') );
filter.setInputFormat(test);   
test = filter.useFilter(test, filter);
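
As a quick sanity check on the compatibility point made earlier, one could compare the two headers before training. This is only a sketch: it assumes `train` and `test` are the filtered `weka.core.Instances` objects created above, with the class index set on both. `equalHeaders`, `classAttribute`, `numValues`, and `value` are methods of Weka's `Instances` and `Attribute` classes; the variable names `trainAttr`, `testAttr`, and `k` are introduced here for illustration.

```matlab
%# sanity check (illustrative): both sets should expose the same class labels
if ~train.equalHeaders(test)
    warning('Train/test headers differ; compare the class labels listed below.');
end

trainAttr = train.classAttribute();
testAttr  = test.classAttribute();
fprintf('train has %d class labels, test has %d\n', ...
    trainAttr.numValues(), testAttr.numValues());

%# list the index -> label mapping (Java indices are zero-based)
for k = 0:trainAttr.numValues()-1
    fprintf('  index %d -> label %s\n', k, char(trainAttr.value(k)));
end
```

If the test set was built from dummy values that do not cover every class, the two label counts printed here would be expected to differ.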

Now we train a J48 classifier and use it to predict the classes of the test instances:

%# train a J48 tree
classifier = weka.classifiers.trees.J48();
%# the class index is already set on train, so the command-line flag '-c last' is left out here
classifier.setOptions( weka.core.Utils.splitOptions('-C 0.25 -M 2') );
classifier.buildClassifier( train );

%# classify test instances
numInst = test.numInstances();
pred = zeros(numInst,1);
predProbs = zeros(numInst, train.numClasses());
for i=1:numInst
     pred(i) = classifier.classifyInstance( test.instance(i-1) );
     predProbs(i,:) = classifier.distributionForInstance( test.instance(i-1) );
end
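
Note that `classifyInstance` returns the zero-based index of the predicted label within the class attribute, not the label itself, which is why the Java snippet in the question looks the label up through `classAttribute().value(...)`. A sketch of the same lookup in MATLAB follows. It assumes the `pred`, `predProbs`, `train`, and `numInst` variables from the loop above, and that the class labels are numeric strings (as they are after NumericToNominal); `classAttr`, `predLabels`, and `nShow` are names introduced here.

```matlab
%# map predicted indices (0-based) back to the original class labels
classAttr = train.classAttribute();
predLabels = zeros(numInst,1);
for i=1:numInst
    predLabels(i) = str2double( char(classAttr.value(pred(i))) );
end

%# print the first few predictions: label followed by class probabilities
nShow = min(5, numInst);
disp([predLabels(1:nShow) predProbs(1:nShow,:)])
```

With the Iris labels 1, 2, 3, an index of 0 in `pred` corresponds to label 1, so a column of zeros in `pred` would mean "first class label", not necessarily the label 0.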

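Finally, we evaluate the performance of the trained model on the test data (this should look similar to what you see in the Weka Explorer). Obviously, this only makes sense if the test instances carry their true class values (not dummy values):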

eval = weka.classifiers.Evaluation(train);

eval.evaluateModel(classifier, test, javaArray('java.lang.Object',1));

fprintf('=== Run information ===\n\n')
fprintf('Scheme: %s %s\n', ...
    char(classifier.getClass().getName()), ...
    char(weka.core.Utils.joinOptions(classifier.getOptions())) )
fprintf('Relation: %s\n', char(train.relationName))
fprintf('Instances: %d\n', train.numInstances)
fprintf('Attributes: %d\n\n', train.numAttributes)

fprintf('=== Classifier model ===\n\n')
disp( char(classifier.toString()) )

fprintf('=== Summary ===\n')
disp( char(eval.toSummaryString()) )
disp( char(eval.toClassDetailsString()) )
disp( char(eval.toMatrixString()) )
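
If the numbers are needed in MATLAB rather than as printed text, the evaluation object can also be queried directly. A minimal sketch, assuming the `eval` object created above (note that this variable name shadows MATLAB's built-in `eval` function for the rest of the session); `pctCorrect` and `confusionMatrix` are methods of `weka.classifiers.Evaluation`, while `acc` and `cm` are names introduced here.

```matlab
%# pull evaluation results into MATLAB variables
acc = eval.pctCorrect();          %# percentage of correctly classified instances
cm  = eval.confusionMatrix();     %# rows: actual class, columns: predicted class

fprintf('Accuracy: %.2f%%\n', acc);
disp(cm)
```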

Output of the above example in MATLAB:

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: train.txt-weka.filters.unsupervised.attribute.NumericToNominal-Rlast
Instances: 100
Attributes: 5

=== Classifier model ===

J48 pruned tree
------------------

att_4 <= 0.6: 1 (33.0)
att_4 > 0.6
|   att_3 <= 4.8
|   |   att_4 <= 1.6: 2 (32.0)
|   |   att_4 > 1.6: 3 (3.0/1.0)
|   att_3 > 4.8: 3 (32.0)

Number of Leaves  :     4

Size of the tree :  7

=== Summary ===

Correctly Classified Instances          46               92      %
Incorrectly Classified Instances         4                8      %
Kappa statistic                          0.8802
Mean absolute error                      0.0578
Root mean squared error                  0.2341
Relative absolute error                 12.9975 %
Root relative squared error             49.6536 %
Coverage of cases (0.95 level)          92      %
Mean rel. region size (0.95 level)      34      %
Total Number of Instances               50     

=== Detailed Accuracy By Class ===

             TP Rate  FP Rate  Precision   Recall  F-Measure   ROC Area  Class
              1        0         1         1         1          1        1
              0.765    0         1         0.765     0.867      0.879    2
              1        0.118     0.8       1         0.889      0.938    3
Weighted Avg. 0.92     0.038     0.936     0.92      0.919      0.939

=== Confusion Matrix ===

  a  b  c   <-- classified as
 17  0  0 |  a = 1
  0 13  4 |  b = 2
  0  0 16 |  c = 3

Source: "matlab - How to retrieve class values from WEKA using MATLAB", based on a similar question on Stack Overflow: https://stackoverflow.com/questions/7535689/
