java - How to decompose a feature vector in Java?

I have a DataFrame that looks like this:

+---------------+--------------------+
|IndexedArtistID|     recommendations|
+---------------+--------------------+
|           1580|[[919, 0.00249262...|
|           4900|[[41749, 7.143963...|
|           5300|[[0, 2.0147272E-4...|
|           6620|[[208780, 9.81092...|
+---------------+--------------------+

I want to split the recommendations column so that I get a DataFrame like the following:

+---------------+--------------------+
|IndexedArtistID|     recommendations|
+---------------+--------------------+
|           1580|919                 |
|           1580|0.00249262          |
|           4900|41749               |
|           4900|7.143963            |
|           5300|0                   |
|           5300|2.0147272E-4        |
|           6620|208780              |
|           6620|9.81092             |
+---------------+--------------------+

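Basically, I want to split the feature vector into columns and then merge those columns into a single column. The merging part is described in: How to split single row into multiple rows in Spark DataFrame using Java. So how do I do the splitting part in Java? For Scala, it is explained in: Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)], but I could not find a way to do the same in Java along the lines given in that link.

The schema of the DataFrame is shown below; the IndexedUserID values are the ones that should end up in the newly created recommendations column: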

root
 |-- IndexedArtistID: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- IndexedUserID: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

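Best answer

I looked for a solution to this problem, and I have to say that while there is plenty of material covering problems people run into with Spark in Python and Scala, very little is available for Java. So, the solution is as follows: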

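import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema;

// Step 1: bring each row's artist ID and recommendation list to the driver.
// collect() loads everything into driver memory, so this suits small results only.
List<ElementStruct> structElements = dataFrameWithFeatures.javaRDD().map(row -> {
    int artistId = row.getInt(0);
    List<Object> recommendations = row.getList(1);
    return new ElementStruct(artistId, recommendations);
}).collect();

// Step 2: emit one Recommendation per (artist, struct element) pair.
List<Recommendation> recommendations = new ArrayList<>();
for (ElementStruct element : structElements) {
    List<Object> features = element.getFeatures();
    int artistId = element.getArtistId();
    for (int i = 0; i < features.size(); i++) {
        // Field 0 of each struct is IndexedUserID; the rating (field 1) is not read here.
        Object o = ((GenericRowWithSchema) features.get(i)).get(0);
        recommendations.add(new Recommendation(artistId, (int) o));
    }
}

// Step 3: build a DataFrame from the flattened list.
// SessionCreator is the answer author's own helper; its code is not shown in the answer.
SparkSession sparkSession = SessionCreator.getOrCreateSparkSession();
Dataset<Row> decomposedDataframe = sparkSession.createDataFrame(recommendations, Recommendation.class);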

The ElementStruct class

import java.io.Serializable;
import java.util.List;

public class ElementStruct implements Serializable {
    private int artistId;
    private List<Object> features;

    public ElementStruct(int artistId, List<Object> features) {
        this.artistId = artistId;
        this.features = features;
    }

    public int getArtistId() {
        return artistId;
    }

    public void setArtistId(int artistId) {
        this.artistId = artistId;
    }

    public List<Object> getFeatures() {
        return features;
    }

    public void setFeatures(List<Object> features) {
        this.features = features;
    }
}

The Recommendation class

import java.io.Serializable;

public class Recommendation implements Serializable {
    private int artistId;
    private int userId;

    public Recommendation(int artistId, int userId){
        this.artistId = artistId;
        this.userId = userId;
    }

    public int getArtistId() {
        return artistId;
    }

    public void setArtistId(int artistId) {
        this.artistId = artistId;
    }

    public int getUserId() {
        return userId;
    }

    public void setUserId(int userId) {
        this.userId = userId;
    }
}

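Explanation:

1. For each row of the DataFrame, read the artist ID and the list of features (the recommendations) and store them together in a Java object (here, ElementStruct) to make further processing easier.

2. For each artist and each element of its feature list, create a new object (here, Recommendation) and add it to a list.

3. Finally, create a DataFrame from the list of objects obtained in step 2.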
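The core of these steps is flattening a nested list into one row per element. Below is a minimal, Spark-free sketch of that flattening in plain Java. The names `FlattenSketch`, `Rec`, `Pair`, and `flatten` are hypothetical stand-ins introduced for illustration (they are not part of the answer or of Spark), and the rating values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlattenSketch {
    /** Hypothetical stand-in for one struct element (IndexedUserID, rating). */
    static final class Rec {
        final int userId;
        final float rating;
        Rec(int userId, float rating) {
            this.userId = userId;
            this.rating = rating;
        }
    }

    /** Hypothetical stand-in for one output row (artistId, userId). */
    static final class Pair {
        final int artistId;
        final int userId;
        Pair(int artistId, int userId) {
            this.artistId = artistId;
            this.userId = userId;
        }
        @Override
        public String toString() {
            return "(" + artistId + ", " + userId + ")";
        }
    }

    /** Emits one Pair per (artist, recommendation) combination, in input order. */
    static List<Pair> flatten(Map<Integer, List<Rec>> recsByArtist) {
        List<Pair> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Rec>> entry : recsByArtist.entrySet()) {
            for (Rec rec : entry.getValue()) {
                out.add(new Pair(entry.getKey(), rec.userId));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input only; the ratings are invented for this sketch.
        Map<Integer, List<Rec>> input = new LinkedHashMap<>();
        input.put(1580, Arrays.asList(new Rec(919, 0.5f), new Rec(7, 0.25f)));
        input.put(4900, Arrays.asList(new Rec(41749, 0.75f)));
        System.out.println(flatten(input));
    }
}
```

On large data, collecting everything to the driver may not be practical. Spark's `org.apache.spark.sql.functions.explode`, applied to the `recommendations` column, produces one row per array element without leaving the cluster, so it may be worth trying first.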

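Result (note that the code above reads only field 0, IndexedUserID, of each struct, so rating values such as 0.00249262 would not appear in the integer userId column unless the rating field is extracted as well):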

root
 |-- artistId: integer (nullable = false)
 |-- userId: integer (nullable = false)

+---------------+----------------+
|       artistId|          userId|
+---------------+----------------+
|           1580|919             |
|           1580|0.00249262      |
|           4900|41749           |
|           4900|7.143963        |
|           5300|0               |
|           5300|2.0147272E-4    |
|           6620|208780          |
|           6620|9.81092         |
+---------------+----------------+
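
The desired output in the question puts both the user ID and the rating into the same column, two rows per recommendation. Since those fields have different types (integer and float), a single column needs a common type. The following is a rough sketch of that stacking in plain Java, assuming String as the common type; `StackSketch` and `stack` are hypothetical names, and the input values are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class StackSketch {
    /**
     * For one artist, emits two rows per recommendation: first the user ID,
     * then the rating, both rendered as strings so they fit a single column.
     * Each row is a String[] of the form {artistId, value}.
     */
    static List<String[]> stack(int artistId, int[] userIds, float[] ratings) {
        if (userIds.length != ratings.length) {
            throw new IllegalArgumentException("userIds and ratings must have equal length");
        }
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < userIds.length; i++) {
            rows.add(new String[] {String.valueOf(artistId), String.valueOf(userIds[i])});
            rows.add(new String[] {String.valueOf(artistId), String.valueOf(ratings[i])});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from the DataFrame above.
        for (String[] row : stack(1580, new int[] {919}, new float[] {0.5f})) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Keeping the user ID and rating in two separate typed columns is usually easier to work with downstream; the single-column form is shown only because the question asks for it.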

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59755022/
