weka - Where can I find a practical example of KNN using Weka in Java

Tags: weka knn

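I've been looking for a practical example of implementing KNN with Weka, but everything I've found is too generic for me to understand what data it needs in order to work (or how to build the objects it needs) and what results it shows. Maybe someone who has used it before has a better example, with realistic things (products, movies, books, etc.) rather than the typical letters you see in algebra.

That way I could figure out how to implement it in my case (which is recommending dishes to active users with KNN). Any help would be much appreciated, thanks.

I tried to understand it from this link https://www.ibm.com/developerworks/library/os-weka3/index.html but I don't even understand how they got those results or how they got the formula:

[Image: knn]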

Step 1: Determine the distance formula

Distance = SQRT( ((58 - Age)/(69-35))^2 + ((51000 - Income)/(150000-38000))^2 )

Why is it always /(69-35) and /(150000-38000)?

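Edit:

Here is the code I tried without success. If someone could clear it up for me I'd appreciate it. I put this code together by combining these two answers:

This answer shows how to get the knn:

How to get the nearest neighbor in weka using java

This one tells me how to create instances (I don't really know what they are for Weka): Adding a new Instance in weka

So I came up with this: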

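import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.neighboursearch.LinearNNSearch;

public class Wekatest {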

    public static void main(String[] args) {

        ArrayList<Attribute> atts = new ArrayList<>();
        ArrayList<String> classVal = new ArrayList<>();
        // I don't really understand what's happening here
        classVal.add("A");
        classVal.add("B");
        classVal.add("C");
        classVal.add("D");
        classVal.add("E");
        classVal.add("F");

        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));

        // Here in my case the data to evaluate are dishes ("plato" means "dish" in Spanish)
        Instances dataRaw = new Instances("TestInstancesPlatos", atts, 0);

        // I imagine that every instance is like an object that will be compared with the other instances to get its nearest neighbours (so an instance is like a dish for me).

        double[] instanceValue1 = new double[dataRaw.numAttributes()];

        instanceValue1[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue1[1] = 0;

        dataRaw.add(new DenseInstance(1.0, instanceValue1));

        double[] instanceValue2 = new double[dataRaw.numAttributes()];

        instanceValue2[0] = dataRaw.attribute(0).addStringValue("Tunas");
        instanceValue2[1] = 1;

        dataRaw.add(new DenseInstance(1.0, instanceValue2));

        double[] instanceValue3 = new double[dataRaw.numAttributes()];

        instanceValue3[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue3[1] = 2;

        dataRaw.add(new DenseInstance(1.0, instanceValue3));

        double[] instanceValue4 = new double[dataRaw.numAttributes()];

        instanceValue4[0] = dataRaw.attribute(0).addStringValue("Hamburguers");
        instanceValue4[1] = 3;

        dataRaw.add(new DenseInstance(1.0, instanceValue4));

        double[] instanceValue5 = new double[dataRaw.numAttributes()];

        instanceValue5[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue5[1] = 4;

        dataRaw.add(new DenseInstance(1.0, instanceValue5));

        System.out.println("---------------------");

        weka.core.neighboursearch.LinearNNSearch knn = new LinearNNSearch(dataRaw);
        try {

            // This method receives the target instance whose neighbours you want to know, and N (I don't really know what N is, but I imagine it is the number of neighbours I want)
            Instances nearestInstances = knn.kNearestNeighbours(dataRaw.get(0), 1);
            // I expected the output to be the closest neighbour to dataRaw.get(0), which would be Pizzas, but instead I got some data that I don't really understand.


            System.out.println(nearestInstances);

        } catch (Exception e) {

            e.printStackTrace();
        }

    }

}

OUTPUT:

---------------------
@relation TestInstancesPlatos

@attribute content string
@attribute @@class@@ {A,B,C,D,E,F}

@data
Pizzas,A
Tunas,B
Pizzas,C
Hamburguers,D

Weka dependency used:

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-stable</artifactId>
    <version>3.8.0</version>
</dependency>

Best answer

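KNN is a machine learning technique usually classified as an "instance-based predictor". It takes all the instances of the classified samples and plots them in an n-dimensional space.

Using a distance measure such as Euclidean distance, KNN looks for the nearest points in this n-dimensional space and estimates which class a new point belongs to based on those neighbors. If it is closer to the blue points, it is blue; if it is closer to the red points, it is red...

But now, how would we apply it to your problem?

Imagine that you only have two attributes: price and calories (a 2-dimensional space). You want to classify customers into three classes: fit, junk food and gourmet. With this, you can make offers in a restaurant that are similar to the customer's preferences.

You have the following data: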

+-------+----------+-----------+
| Price | Calories | Food Type |
+-------+----------+-----------+
| $2    |    350   | Junk Food |
+-------+----------+-----------+
| $5    |    700   | Junk Food |
+-------+----------+-----------+
| $10   |    200   | Fit       |
+-------+----------+-----------+
| $3    |    400   | Junk Food |
+-------+----------+-----------+
| $8    |    150   | Fit       |
+-------+----------+-----------+
| $7    |    650   | Junk Food |
+-------+----------+-----------+
| $5    |    120   | Fit       |
+-------+----------+-----------+
| $25   |    230   | Gourmet   |
+-------+----------+-----------+
| $12   |    210   | Fit       |
+-------+----------+-----------+
| $40   |    475   | Gourmet   |
+-------+----------+-----------+
| $37   |    600   | Gourmet   |
+-------+----------+-----------+

Now, let's see it plotted in a 2D space:

[Image: Plot]

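What happens next?

For every new entry, the algorithm calculates the distance to all the points (instances) and finds the k nearest ones. From the classes of these k nearest points, it determines the class of the new entry.

Take k = 3 and a new entry of $15 and 165 calories. Let's find its 3 nearest neighbors:

[Image: New classif]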

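This is where the distance formula comes in. It is actually computed against every point. These distances are then "ranked", and the k closest ones determine the final class.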
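To make the ranking concrete, here is a minimal plain-Java sketch of the procedure described above, using the rows of the table and the new entry ($15, 165 calories). It is not Weka's API: the class name KnnTableSketch and its methods are hypothetical, the distance is the raw (unnormalized) Euclidean distance, and ties in the vote are broken in favor of the label that took the lead first, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of Weka.
class KnnTableSketch {
    // Training rows copied from the table: price ($) and calories.
    static final double[][] FEATURES = {
        {2, 350}, {5, 700}, {10, 200}, {3, 400}, {8, 150}, {7, 650},
        {5, 120}, {25, 230}, {12, 210}, {40, 475}, {37, 600}
    };
    // Food type of each row, in the same order.
    static final String[] LABELS = {
        "Junk Food", "Junk Food", "Fit", "Junk Food", "Fit", "Junk Food",
        "Fit", "Gourmet", "Fit", "Gourmet", "Gourmet"
    };

    // Raw Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Majority label among the k rows closest to the query (k <= number of rows).
    static String classify(double[] query, int k) {
        // Rank all row indices by their distance to the query.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < FEATURES.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(i -> distance(FEATURES[i], query)));

        // Count the labels of the k closest rows and keep the leader.
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (int n = 0; n < k; n++) {
            String label = LABELS[order.get(n)];
            int count = votes.merge(label, 1, Integer::sum);
            if (best == null || count > votes.get(best)) {
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify(new double[] {15, 165}, 3)); // prints "Fit"
    }
}
```

With these rows, the three closest points to ($15, 165) are ($8, 150), ($10, 200) and ($12, 210), all labelled Fit, so the vote is unanimous.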

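Now, why the /(69-35) and /(150000-38000)? As mentioned in the other answers, this is due to normalization: each difference is divided by the range of its attribute (presumably the maximum minus the minimum of age and of income in that dataset), so that every attribute contributes on a comparable scale. Our example uses price and calories. As you can see, calories are on a much larger scale than price (more units per value). To avoid an imbalance in which calories would weigh more than price when determining the class (which would kill the Gourmet class, for example), all the attributes need to be made similarly important, hence the normalization.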
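As a rough illustration of what that division does, the sketch below compares the raw Euclidean distance with a range-normalized one on the price and calories table. It is plain Java, not Weka's own implementation: the class NormalizedDistanceSketch and its methods are hypothetical, and it assumes every attribute has a non-zero range (otherwise the division would fail).

```java
// Hypothetical illustration class, not part of Weka.
class NormalizedDistanceSketch {
    // Price ($) and calories, copied from the table.
    static final double[][] ROWS = {
        {2, 350}, {5, 700}, {10, 200}, {3, 400}, {8, 150}, {7, 650},
        {5, 120}, {25, 230}, {12, 210}, {40, 475}, {37, 600}
    };

    // Range (max - min) of each attribute, like 69-35 for Age in the formula.
    static double[] ranges(double[][] rows) {
        int dims = rows[0].length;
        double[] result = new double[dims];
        for (int j = 0; j < dims; j++) {
            double min = rows[0][j];
            double max = rows[0][j];
            for (double[] row : rows) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            result[j] = max - min;
        }
        return result;
    }

    // Euclidean distance with each difference divided by its attribute's range.
    // Passing a range of 1 for every attribute gives the raw distance.
    static double distance(double[] a, double[] b, double[] ranges) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            double d = (a[j] - b[j]) / ranges[j];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the row in ROWS that is closest to the query.
    static int nearest(double[] query, double[] ranges) {
        int best = 0;
        for (int i = 1; i < ROWS.length; i++) {
            if (distance(ROWS[i], query, ranges) < distance(ROWS[best], query, ranges)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] query = {15, 165};
        int raw = nearest(query, new double[] {1, 1});
        int scaled = nearest(query, ranges(ROWS));
        System.out.println("Raw nearest row:        " + ROWS[raw][0] + ", " + ROWS[raw][1]);
        System.out.println("Normalized nearest row: " + ROWS[scaled][0] + ", " + ROWS[scaled][1]);
    }
}
```

For the entry ($15, 165), the raw distance picks ($8, 150) as the closest row because the 15-calorie gap dominates, while the normalized distance picks ($12, 210), where the small price gap now counts for more. In this particular case both neighbors are Fit, so the predicted class does not change, but the ranking does.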

Weka abstracts this for you, but you can also visualize it. See an example of a visualization from a project I made for a Weka ML course:

[Image: WekaVisualize]

Note that since there are many more than 2 dimensions, there are many plots, but the idea is similar.

Explaining the code:

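import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.neighboursearch.LinearNNSearch;

public class Wekatest {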

    public static void main(String[] args) {
        // These two ArrayLists describe the inputs of your algorithm.
        // atts holds the attribute definitions you are going to pass for training (the features are usually called X).
        // classVal holds the possible values of the target class that is to be predicted, usually called y.
        ArrayList<Attribute> atts = new ArrayList<>();
        ArrayList<String> classVal = new ArrayList<>();
        // Here you build a "dictionary" of all the distinct classes (for example, types of restaurant) that can be predicted.
        classVal.add("A");
        classVal.add("B");
        classVal.add("C");
        classVal.add("D");
        classVal.add("E");
        classVal.add("F");
        // The next two lines create the attributes: a string attribute named "content" and a nominal attribute holding the class of the already labelled values.
        atts.add(new Attribute("content", (ArrayList<String>) null));
        atts.add(new Attribute("@@class@@", classVal));

        // This creates an empty Weka dataset named "TestInstancesPlatos" with the attributes above and an initial capacity of 0; nothing is loaded from a file.
        // dataRaw will hold the previously labelled instances that are used to "train the model" (kNN doesn't actually train anything, it uses all the data for predictions).
        Instances dataRaw = new Instances("TestInstancesPlatos", atts, 0);


        // Here you're creating the values of a new instance. This is where you can substitute your own inputs.
        double[] instanceValue1 = new double[dataRaw.numAttributes()];

        // It looks like you only have 2 attributes: a food product and a class (a rating, maybe). The class is stored as an index into classVal, so 0 means "A".
        instanceValue1[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue1[1] = 0;

        // You're appending this new instance to the dataset.
        dataRaw.add(new DenseInstance(1.0, instanceValue1));

        double[] instanceValue2 = new double[dataRaw.numAttributes()];

        instanceValue2[0] = dataRaw.attribute(0).addStringValue("Tunas");
        instanceValue2[1] = 1;

        dataRaw.add(new DenseInstance(1.0, instanceValue2));

        double[] instanceValue3 = new double[dataRaw.numAttributes()];

        instanceValue3[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue3[1] = 2;

        dataRaw.add(new DenseInstance(1.0, instanceValue3));

        double[] instanceValue4 = new double[dataRaw.numAttributes()];

        instanceValue4[0] = dataRaw.attribute(0).addStringValue("Hamburguers");
        instanceValue4[1] = 3;

        dataRaw.add(new DenseInstance(1.0, instanceValue4));

        double[] instanceValue5 = new double[dataRaw.numAttributes()];

        instanceValue5[0] = dataRaw.attribute(0).addStringValue("Pizzas");
        instanceValue5[1] = 4;

        dataRaw.add(new DenseInstance(1.0, instanceValue5));

        // After adding 5 instances, time to test:
        System.out.println("---------------------");

        // Give the nearest-neighbour search the data.
        weka.core.neighboursearch.LinearNNSearch knn = new LinearNNSearch(dataRaw);
        // You're asking for the neighbours of instance 0 of dataRaw. The second argument is k, the number of neighbours requested (here 1).
        try {
            Instances nearestInstances = knn.kNearestNeighbours(dataRaw.get(0), 1);
            // kNearestNeighbours returns the neighbouring instances themselves (an Instances object), not a single class from A to F, so this prints them in ARFF format.
            System.out.println(nearestInstances);

        } catch (Exception e) {

            e.printStackTrace();
        }

    }

}

What should you do?

-> Gather data.
-> Define a set of attributes that help you predict which cuisine you have (e.g. prices, dishes or ingredients, with one attribute for each dish or ingredient).
-> Organize this data.
-> Define a set of labels.
-> Manually label a set of data.
-> Load the labelled data into KNN.
-> Label new instances by passing their attributes to KNN. It will return the majority label among the k nearest neighbors (good values for k are 3 or 5, but you have to test).
-> Have fun!

Regarding "weka - Where can I find a practical example of KNN using Weka in Java", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57739084/
