java - How to use linear regression in Apache Spark's MLlib?

Tags: java apache-spark apache-spark-mllib

I am new to Apache Spark. In the MLlib documentation I found a Scala example, but I really don't know Scala. Does anyone know of an example in Java? Thanks! The example code is:

import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Load and parse the data
val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
}

// Building the model
val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.reduce(_ + _) / valuesAndPreds.count
println("training Mean Squared Error = " + MSE)

Taken from the MLlib documentation. Thanks!

Best answer

As the documentation states:

All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object.

This is not exactly easy, since you still have to reproduce the Scala code in Java, but it works (at least in this case).
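
In practice that conversion is a single call on the JavaRDD. A minimal sketch of just that step (the class and helper method are only illustrative, not part of the answer below):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.rdd.RDD;

public class RddConversionSketch {
    // Hypothetical helper: training data is built through the Java API,
    // then handed to the Scala-typed MLlib method via .rdd().
    static LinearRegressionModel train(JavaRDD<LabeledPoint> javaData, int numIterations) {
        RDD<LabeledPoint> scalaData = javaData.rdd(); // JavaRDD -> Scala RDD
        return LinearRegressionWithSGD.train(scalaData, numIterations);
    }
}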

That said, here is a Java implementation:

public void linReg() {
    String master = "local";
    SparkConf conf = new SparkConf().setAppName("csvParser").setMaster(master);
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> data = sc.textFile("mllib/data/ridge-data/lpsa.data");
    JavaRDD<LabeledPoint> parseddata = data
            .map(new Function<String, LabeledPoint>() {
                // Anonymous Function class; with Java 8 (and Spark 1.0+) a lambda
                // also works here (see the sketch below), but either way this is
                // more verbose than the Scala version.
                @Override
                public LabeledPoint call(String line) throws Exception {
                    String[] parts = line.split(",");
                    String[] pointsStr = parts[1].split(" ");
                    double[] points = new double[pointsStr.length];
                    for (int i = 0; i < pointsStr.length; i++) {
                        points[i] = Double.valueOf(pointsStr[i]);
                    }
                    return new LabeledPoint(Double.valueOf(parts[0]),
                            Vectors.dense(points));
                }
            });

    // Building the model
    int numIterations = 20;
    LinearRegressionModel model = LinearRegressionWithSGD.train(
            parseddata.rdd(), numIterations); // notice the .rdd()

    // Evaluate model on training examples and compute training error
    JavaRDD<Tuple2<Double, Double>> valuesAndPred = parseddata
            .map(point -> new Tuple2<Double, Double>(point.label(),
                    model.predict(point.features())));
    // The important point here is the explicit Tuple2 creation.

    double MSE = valuesAndPred.mapToDouble(
            tuple -> Math.pow(tuple._1 - tuple._2, 2)).mean();
    // mapToDouble returns a JavaDoubleRDD, whose mean() replaces the manual
    // reduce/count from the Scala example.
    System.out.println("training Mean Squared Error = " + String.valueOf(MSE));
}

It is far from perfect, but I hope it gives you a better idea of how to use the Scala examples from the MLlib documentation.
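
A side note on the comment in the parsing step above: in Spark 1.0 and later, org.apache.spark.api.java.function.Function is an interface with a single call method, so with Java 8 the whole program can be written with lambdas. A minimal sketch under that assumption (same data path and iteration count as above; the class name is only illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

public class LinRegLambdaSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("linRegLambda").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> data = sc.textFile("mllib/data/ridge-data/lpsa.data");

        // Same parsing as above, written as a lambda instead of an anonymous class.
        JavaRDD<LabeledPoint> parsedData = data.map(line -> {
            String[] parts = line.split(",");
            String[] featureStr = parts[1].split(" ");
            double[] features = new double[featureStr.length];
            for (int i = 0; i < featureStr.length; i++) {
                features[i] = Double.parseDouble(featureStr[i]);
            }
            return new LabeledPoint(Double.parseDouble(parts[0]), Vectors.dense(features));
        });

        // .rdd() converts the JavaRDD into the Scala RDD that MLlib expects.
        LinearRegressionModel model = LinearRegressionWithSGD.train(parsedData.rdd(), 20);

        // Training error, computed with mapToDouble/mean as in the answer above.
        double mse = parsedData.mapToDouble(
                p -> Math.pow(p.label() - model.predict(p.features()), 2)).mean();
        System.out.println("training Mean Squared Error = " + mse);

        sc.stop();
    }
}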

About "java - How to use linear regression in Apache Spark's MLlib?": a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23942043/
